[00:19:41] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:20:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:11] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:34:41] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:35:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:11] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:59:41] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:59:53] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:15] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:03:11] (SystemdUnitFailed) resolved: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:33:56] (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[03:38:29] (CR) Jdlrobson: [C: +2] Adds skin field in mobilewebuiactions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/968674 (https://phabricator.wikimedia.org/T350205) (owner: Kimberly Sarabia)
[03:39:05] (Merged) jenkins-bot: Adds skin field in mobilewebuiactions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/968674 (https://phabricator.wikimedia.org/T350205) (owner: Kimberly Sarabia)
[04:34:41] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:35:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:48:12] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:33:56] (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[06:36:26] Data-Engineering, Tool-Pageviews, Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (MusikAnimal) Open→Resolved Also fixed for the original report https://pageviews.wmcloud.org/mediaviews/?pro...
[07:38:41] (DruidSegmentsUnavailable) firing: (5) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[07:45:15] (EventgateValidationErrors) firing: ...
[07:45:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[07:53:41] (DruidSegmentsUnavailable) firing: (5) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[08:43:15] * brouberol waves good morning
[08:47:46] Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (brouberol) So do I @gmodena, even after the last sync, which seems wrong, as I was expecting to see a `max.message.byte` property here with a value of `1000013`. So I went digging, and found 2 inter...
[09:17:39] elukey: if you're around, I have a question related to kafka ACLs and datahub
[09:17:57] (context: https://phabricator.wikimedia.org/T344989#9311615)
[09:28:29] Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (brouberol) From what I've gathered, datahub does not log to Kafka, but merely uses an anonymous (aka unlogged) user. I think we're missing the following ACL from ` Current ACLs for resource `Topic:...
[09:40:03] brouberol: o/ I am out sick for a few days (sigh) but the proposal makes sense
[09:41:29] sorry, I didn't know! Please go back to getting better
[09:45:56] thanks!
[09:46:14] the new ACL (assuming it is for jumbo) looks sane
[09:46:22] so you can go ahead in my opinion
[09:49:40] Thanks
[09:50:17] Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (brouberol) After getting approval from @elukey, I ran ` brouberol@kafka-jumbo1010:~$ kafka acls --add --allow-principal User:ANONYMOUS --allow-host '*' --operation DescribeConfigs --topic '*' kafka-...
[09:53:50] Does anyone know a) what pod runs the datahub kafka topic metadata collection job, and b) how I could run it manually?
[09:56:44] or can we run it manually from the datahub UI?
[09:57:17] brouberol: It's an airflow job. I can show you.
[09:57:34] We don't have any jobs that are runnable from the datahub UI at the moment.
[09:58:00] great! Let's talk about that during the sync :+1
[09:58:07] 👍
[10:40:42] !log re-running the kafka_jumbo_ingestion in analytics airflow
[10:40:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:43:12] gmodena: it worked! You can now see metadata for kafka topics, cf https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:kafka,staging.mediawiki.cirrussearch-request,PROD)/Properties
[10:44:25] brouberol woot! That's terrific
[10:45:15] brouberol this makes my life soo much easier! Many thanks for this.
[10:46:13] \o/
[10:49:25] Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (brouberol) The ACL fix worked! After re-running the ingestion airflow job, we can now see the following for all kafka topics {F41464828} We still don't see the schema though, which I'd like to inv...
[10:58:43] (PS1) Phuedx: Add trv.wikisource and ab.wikibooks to Sqoop and pageview allowlists [analytics/refinery] - https://gerrit.wikimedia.org/r/972348
[11:08:34] (CR) Aqu: [C: +1] "LGTM" [analytics/refinery] - https://gerrit.wikimedia.org/r/972348 (owner: Phuedx)
[11:36:06] !log deploying datahub to staging to start using pki certificates - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/969345/
[11:36:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:43:34] Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for zghwiki - https://phabricator.wikimedia.org/T350240 (BTullis) Open→Resolved I've run the cookbook again and the DNS step has now completed, so it must have been a transient failure. Resolving this ticket.
[11:44:17] Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for bbcwiki - https://phabricator.wikimedia.org/T350372 (BTullis) a:BTullis
[11:44:27] Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for bjnwikiquote - https://phabricator.wikimedia.org/T350234 (BTullis) a:BTullis
[11:44:36] Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for dgawiki - https://phabricator.wikimedia.org/T350228 (BTullis) a:BTullis
[11:47:12] Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for bbcwiki - https://phabricator.wikimedia.org/T350372 (BTullis) Open→Resolved This is now working, including the DNS alias. ` btullis@tools-sgebastion-10:~$ sql bbcwiki Reading table information for completion of table a...
[11:48:21] Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for bjnwikiquote - https://phabricator.wikimedia.org/T350234 (BTullis) Open→Resolved This is now complete. ` btullis@tools-sgebastion-10:~$ sql bjnwikiquote Reading table information for completion of table and column name...
[11:49:46] Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for dgawiki - https://phabricator.wikimedia.org/T350228 (BTullis) Open→Resolved This is now complete. ` btullis@tools-sgebastion-10:~$ sql dgawiki Reading table information for completion of table and column names You can...
[11:53:56] (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[12:05:14] !log deploying datahub to prod for the pki certificates.
[12:05:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:22:14] Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (pfischer) @brouberol, thank you for looking into this. I can confirm, that I'm able to see those properties now, too! 👍
[12:41:54] Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (brouberol) Excellent @pfischer ! I purposefully did not add you to the `wmf` group, to see whether it was a datahub ACL issue or not, and it seems like it's not.
[12:44:55] Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (brouberol) Open→Resolved I investigated the missing schema metadata in the `Schema` tab for kafka topics, and I found that Karapace only knows about 3 topics: ` brouberol@kafka-jumbo1010:~$...
[12:52:44] ottomata: do you know anything about Karapace? I found that we don't see schema metadata in datahub, for kafka topics. It seems it's because datahub expects to be able to talk to a kafka registry, and that we use Karapace (a drop-in replacement for the Confluent registry) due to its Apache licensing. That being said, it doesn't seem that we _use_ it as more than a stub?
[12:55:15] My understanding is that we store schemas in repositories, such as https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/event-schemas/+/refs/heads/master/avro/mediawiki/CirrusSearchRequestSet/101446746400.avsc, and that we use these schemas in our code
[12:55:57] that being said, do we have a static mapping of topic <-> schema somewhere? If so, we could populate Karapace with it and get extra information in datahub
[13:31:57] brouberol: yes this mapping (stream_name -> {topics, schema}) is hosted by MW: https://meta.wikimedia.org/w/api.php?action=streamconfigs&all_settings (code: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/ext-EventStreamConfig.php)
[13:32:52] schemas are at https://schema.wikimedia.org/#!/ the avro stuff you found is obsolete
[13:33:24] Thanks dcausse!
[13:54:59] (PuppetFailure) firing: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:05:59] Data-Platform-SRE: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Stevemunene)
[14:49:59] (PuppetFailure) resolved: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:52:11] !log roll-restarting hadoop masters on the test cluster, after upgrading to puppet 7
[14:52:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:06:15] (PS21) Phuedx: Add analytics/product_metrics/{app,web}/base schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[15:35:46] !log roll-restarting hadoop workers in test, to test new puppet 7 CA settings.
[15:35:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:36:37] btullis could I request a quick review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/972397 ? My previous related CR didn't have the expected effect /facepalms
[15:40:26] brouberol: Done :-)
[15:41:38] (PS11) Phuedx: Add product metrics fragments by platform only [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[15:41:42] (PS22) Phuedx: Add analytics/product_metrics/{app,web}/base schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[15:41:48] thank you!
[15:42:06] Data-Platform-SRE, Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (CodeReviewBot) xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/533 switch druid host to index to the druid-public cluster and datahub inje...
[15:48:10] !log restart hive-server2 and hive-metastore services on an-test-coord1001 to pick up new puppet 7 CA settings.
[15:48:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:49:08] !log restart presto-server service on an-test-coord1001 and an-test-presto1001 to pick up new puppet 7 CA settings
[15:49:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:49:51] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[15:50:10] !log restart mariadb service on an-test-coord100
[15:50:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:50:14] !log restart mariadb service on an-test-coord1001
[15:50:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:52:23] !log restart airflow-scheduler and airflow-webserver services on an-test-client1002
[15:52:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:52:50] (CR) Clare Ming: [C: +1] Add analytics/product_metrics/{app,web}/base schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[15:53:56] (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[16:04:58] Data-Platform-SRE, Data Engineering and Event Platform Team, Privacy Engineering, Patch-For-Review, SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (BTullis) >>! In T349910#9291873, @sbassett wrote: >>>! In T349910#9288036, @BTullis wrote...
[16:09:43] Data-Platform-SRE, Epic: [Epic] define a strategy around alerting for Data Platform SRE and implement it - https://phabricator.wikimedia.org/T345698 (BTullis) Is this a duplicate of {T346438}?
[16:16:58] Data-Platform-SRE: Restart Elasticsearch services for java 11 updates - https://phabricator.wikimedia.org/T350703 (bking)
[16:19:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[16:19:51] (HdfsFSImageAge) resolved: The HDFS FSImage on analytics-test-hadoop:an-test-master1002:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[16:24:51] (HdfsFSImageAge) resolved: The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[16:54:45] (PS12) Phuedx: Add product metrics fragments and schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[16:55:34] (Abandoned) Phuedx: Add analytics/product_metrics/{app,web}/base schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[17:00:26] (CR) Phuedx: "I squashed my follow-on patch into this one to avoid rebase hell. I'll note that we've resolved the naming issue that was raised against t" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[17:02:53] (CR) Clare Ming: [C: +1] "thanks for taking care of this - lgtm" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[17:06:33] Data-Engineering, serviceops, Event-Platform: Increase k8s namespace limits for eventgate-analytics - https://phabricator.wikimedia.org/T350707 (Ottomata)
[17:15:30] Data-Engineering, serviceops, Data Engineering and Event Platform Team (Sprint 4), Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (Ottomata) Did another eventgate-main deployment just now. I don't see any flo...
[17:22:18] Data-Engineering, Data Pipelines, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to remote HTTP fetch errors - https://phabricator.wikimedia.org/T326002 (Ottomata)
[17:34:41] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:36:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:45:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:48:12] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:52:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:53:13] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:00:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:03:12] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:07:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:08:13] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:16:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:18:12] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:19:27] Data-Engineering, Data Pipelines, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to remote HTTP fetch errors - https://phabricator.wikimedia.org/T326002 (Ottomata)
[18:20:06] Data-Engineering, Data Pipelines, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to remote HTTP fetch errors - https://phabricator.wikimedia.org/T326002 (Ottomata)
[18:20:21] Data-Engineering, Data Pipelines, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config fetch errors - https://phabricator.wikimedia.org/T326002 (Ottomata)
[18:29:28] (CR) Clare Ming: [C: +1] "tested with local eventgate -- events validated against new schema ID in java client" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[18:37:21] Data-Engineering, serviceops, Data Engineering and Event Platform Team (Sprint 4), Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (Ottomata)
[18:37:34] Data-Engineering, serviceops, Data Engineering and Event Platform Team (Sprint 4), Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (Ottomata)
[18:37:43] Data-Engineering, serviceops, Data Engineering and Event Platform Team (Sprint 4), Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (Ottomata)
[18:41:58] Data-Engineering, Data Pipelines, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config fetch errors - https://phabricator.wikimedia.org/T326002 (Ottomata) The stream config...
[18:49:41] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:50:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:00:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:03:01] Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (CodeReviewBot) otto merged https://gitlab.wikimedia....
[19:03:10] Data-Engineering-Planning, Event-Platform (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/...
[19:03:12] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:33:12] (SystemdUnitFailed) firing: aqs.service Failed on aqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:33:23] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:37:37] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:38:12] (SystemdUnitFailed) resolved: aqs.service Failed on aqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:43:08] Data-Platform-SRE, Cassandra: AQS fails on Debian Bullseye (Node 12) - https://phabricator.wikimedia.org/T349228 (Eevans) I'm not sure //why// this is the case (i.e. what changed to require this), but it is enough to add `/srv/deployment/analytics/aqs/deploy/src` to `NODE_PATH`.
[19:53:56] (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[19:54:31] Data-Platform-SRE: Restart Elasticsearch services for java 11 updates - https://phabricator.wikimedia.org/T350703 (Gehel) p:Triage→High
[19:54:41] (SystemdUnitFailed) firing: aqs.service Failed on aqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:58:12] (SystemdUnitFailed) resolved: aqs.service Failed on aqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:00:32] Data-Engineering, Data Engineering and Event Platform Team: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (Ottomata)
[20:00:52] Data-Platform-SRE, Data Products, superset.wikimedia.org, Wikimedia-production-error: Prod Superset down, showing HTTP 500 instead - https://phabricator.wikimedia.org/T350718 (BTullis)
[20:00:54] Data-Engineering, Data Engineering and Event Platform Team (Sprint 4): workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (Ottomata)
[20:01:06] Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform: [Event Platform] mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (Ottomata) Some CI release logic is broken: {T350732} Working on it...
[20:01:38] 10Data-Platform-SRE, 10Data Products, 10superset.wikimedia.org, 10Wikimedia-production-error: Prod Superset down, showing HTTP 500 instead - https://phabricator.wikimedia.org/T350718 (10Jdforrester-WMF) Now seemingly fixed. [20:02:10] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4): workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10Ottomata) a:03Ottomata [20:02:22] 10Data-Platform-SRE, 10Data Products, 10superset.wikimedia.org, 10Wikimedia-production-error: Prod Superset down, showing HTTP 500 instead - https://phabricator.wikimedia.org/T350718 (10BTullis) I restarted the superset service on an-tool1010 with: ` btullis@an-tool1010:~$ sudo systemctl restart superset.s... [20:02:35] 10Data-Platform-SRE, 10Data Products, 10superset.wikimedia.org, 10Wikimedia-production-error: Prod Superset down, showing HTTP 500 instead - https://phabricator.wikimedia.org/T350718 (10BTullis) p:05Triage→03High [20:06:18] 10Data-Platform-SRE, 10Data Products, 10superset.wikimedia.org, 10Wikimedia-production-error: Prod Superset down, showing HTTP 500 instead - https://phabricator.wikimedia.org/T350718 (10BTullis) I believe that the reason why it failed is the upgrade to puppet7, combined with this necessary patch: https://... 
[20:10:15] (03PS2) 10Kimberly Sarabia: Adds new readme [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/968714 (https://phabricator.wikimedia.org/T349729) [20:10:50] (03CR) 10Kimberly Sarabia: Adds new readme (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/968714 (https://phabricator.wikimedia.org/T349729) (owner: 10Kimberly Sarabia) [20:11:41] 10Data-Platform-SRE, 10Data Products, 10superset.wikimedia.org, 10Wikimedia-production-error: Prod Superset down, showing HTTP 500 instead - https://phabricator.wikimedia.org/T350718 (10BTullis) I also need to update the database connection HTTPS CA parameters, so that it knows how to validate the TLS cert... [20:13:13] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4): workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) otto updated https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/39 fixes for conda ci templa... [20:13:25] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4): workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/39 fixes for conda ci templates [20:19:27] 10Data-Platform-SRE, 10Data Products, 10superset.wikimedia.org, 10Wikimedia-production-error: Prod Superset down, showing HTTP 500 instead - https://phabricator.wikimedia.org/T350718 (10BTullis) I believe that this is resolved now, as all charts and dashboards are working for me. I've requested that any af... [20:23:18] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) otto opened https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_reque... 
[20:23:29] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_reque... [20:30:01] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10xcollazo) For completeness, I reproed this issue on `mediawiki-content-dumps`: https://gitlab.wikimedia.org/repos/data-... [20:33:20] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) otto opened https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_reque... [20:34:45] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ottomata) Was hoping to deploy this today, but had t... [20:37:20] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) xcollazo opened https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge... [20:46:26] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) xcollazo opened https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/37... 
[20:48:24] !log Ran 'kerberos-run-command hdfs hdfs dfs -chmod -R g+w /wmf/data/wmf_dumps/wikitext_raw_rc2' to ease experimentation on this release candidate table. [20:48:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:51:56] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge... [20:53:21] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/37... [20:54:11] (03PS7) 10Ottomata: Use eventutilities-spark JsonSchemaSparkConverter in Refine and elsewhere [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/936293 (https://phabricator.wikimedia.org/T321854) [21:04:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:06:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:16:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:17:45] (EventgateValidationErrors) resolved: ... 
[21:17:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [21:18:13] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:34:41] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:35:35] 10Data-Platform-SRE: Restart Elasticsearch services for java 11 updates - https://phabricator.wikimedia.org/T350703 (10bking) a:03bking [21:35:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:44] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:48:13] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:00:38] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 
10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ahoelzl) Announcement: https://wikimedia.slack.com/archives/C01R06P8... [22:20:54] 10Data-Platform-SRE, 10Cassandra, 10Patch-For-Review: AQS fails on Debian Bullseye (Node 12) - https://phabricator.wikimedia.org/T349228 (10Eevans) >>! In T349228#9314266, @gerritbot wrote: > Change 972461 had a related patch set uploaded (by Eevans; author: Eevans): > %%%[operations/puppet@production] aqs:... [22:34:41] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:36:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:45:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:48:12] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:48:22] 10Data-Platform-SRE: Restart Elasticsearch services for java 11 updates - https://phabricator.wikimedia.org/T350703 (10bking) [22:52:44] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking) [22:53:49] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 
(10bking) >>! In T347504#9307835, @dcausse wrote: > @bking thanks for triggering the import, could you update the task description with the dump files y... [23:53:56] (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable