[00:19:41] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:20:31] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:17] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:11] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:34:41] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:35:51] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:41] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:11] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:59:41] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:59:53] <icinga-wm>	 PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:15] <icinga-wm>	 RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:03:11] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:33:56] <jinxer-wm>	 (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[03:38:29] <wikibugs>	 (CR) Jdlrobson: [C: +2] Adds skin field in mobilewebuiactions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/968674 (https://phabricator.wikimedia.org/T350205) (owner: Kimberly Sarabia)
[03:39:05] <wikibugs>	 (Merged) jenkins-bot: Adds skin field in mobilewebuiactions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/968674 (https://phabricator.wikimedia.org/T350205) (owner: Kimberly Sarabia)
[04:34:41] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:35:23] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:09] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:48:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:33:56] <jinxer-wm>	 (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[06:36:26] <wikibugs>	 Data-Engineering, Tool-Pageviews, Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (MusikAnimal) Open→Resolved Also fixed for the original report https://pageviews.wmcloud.org/mediaviews/?pro...
[07:38:41] <jinxer-wm>	 (DruidSegmentsUnavailable) firing: (5) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[07:45:15] <jinxer-wm>	 (EventgateValidationErrors) firing: ...
[07:45:16] <jinxer-wm>	 eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[07:53:41] <jinxer-wm>	 (DruidSegmentsUnavailable) firing: (5) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[08:43:15] * brouberol waves good morning
[08:47:46] <wikibugs>	 Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (brouberol) So do I @gmodena, even after the last sync, which seems wrong, as I was expecting to see a `max.message.byte` property here with a value of `1000013`. So I went digging, and found 2 inter...
[09:17:39] <brouberol>	 elukey: if you're around, I have a question related to kafka ACLs and datahub
[09:17:57] <brouberol>	 (context: https://phabricator.wikimedia.org/T344989#9311615)
[09:28:29] <wikibugs>	 Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (brouberol) From what I've gathered, datahub does not log to Kafka, but merely uses an anonymous (aka unlogged) user. I think we're missing the following ACL from  ` Current ACLs for resource `Topic:...
[09:40:03] <elukey>	 brouberol: o/ I am out sick for a few days (sigh) but the proposal makes sense 
[09:41:29] <brouberol>	 sorry, I didn't know! Please go back to getting better
[09:45:56] <elukey>	 thanks!
[09:46:14] <elukey>	 the new ACL (assuming it is for jumbo) looks sane
[09:46:22] <elukey>	 so you can go ahead in my opinion
[09:49:40] <brouberol>	 Thanks
[09:50:17] <wikibugs>	 Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (brouberol) After getting approval from @elukey, I ran ` brouberol@kafka-jumbo1010:~$ kafka acls --add --allow-principal User:ANONYMOUS --allow-host '*' --operation DescribeConfigs --topic '*' kafka-...
[09:53:50] <brouberol>	 Does anyone know a) what pod runs the datahub kafka topic metadata collection job, and b) how I could run it manually?
[09:56:44] <brouberol>	 or can we run it manually from the datahub UI?
[09:57:17] <btullis>	 brouberol: It's an airflow job. I can show you.
[09:57:34] <btullis>	 We don't have any jobs that are runnable from the datahub UI at the moment.
[09:58:00] <brouberol>	 great! Let's talk about that during the sync :+1
[09:58:07] <brouberol>	 👍
[10:40:42] <btullis>	 !log re-running the kafka_jumbo_ingestion in analytics airflow
[10:40:45] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:43:12] <brouberol>	 gmodena: it worked! You can now see metadata for kafka topics, cf https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:kafka,staging.mediawiki.cirrussearch-request,PROD)/Properties
[10:44:25] <gmodena>	 brouberol woot! That's terrific 
[10:45:15] <gmodena>	 brouberol this makes my life soo much easier! Many thanks for this.
[10:46:13] <brouberol>	 \o/
[10:49:25] <wikibugs>	 Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (brouberol) The ACL fix worked! After re-running the ingestion airflow job, we can now see the following for all kafka topics  {F41464828}  We still don't see the schema though, which I'd like to inv...
[10:58:43] <wikibugs>	 (PS1) Phuedx: Add trv.wikisource and ab.wikibooks to Sqoop and pageview allowlists [analytics/refinery] - https://gerrit.wikimedia.org/r/972348
[11:08:34] <wikibugs>	 (CR) Aqu: [C: +1] "LGTM" [analytics/refinery] - https://gerrit.wikimedia.org/r/972348 (owner: Phuedx)
[11:36:06] <btullis>	 !log deploying datahub to staging to start using pki certificates - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/969345/
[11:36:07] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:43:34] <wikibugs>	 Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for zghwiki - https://phabricator.wikimedia.org/T350240 (BTullis) Open→Resolved I've run the cookbook again and the DNS step has now completed, so it must have been a transient failure. Resolving this ticket.
[11:44:17] <wikibugs>	 Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for bbcwiki - https://phabricator.wikimedia.org/T350372 (BTullis) a:BTullis
[11:44:27] <wikibugs>	 Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for bjnwikiquote - https://phabricator.wikimedia.org/T350234 (BTullis) a:BTullis
[11:44:36] <wikibugs>	 Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for dgawiki - https://phabricator.wikimedia.org/T350228 (BTullis) a:BTullis
[11:47:12] <wikibugs>	 Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for bbcwiki - https://phabricator.wikimedia.org/T350372 (BTullis) Open→Resolved This is now working, including the DNS alias. ` btullis@tools-sgebastion-10:~$ sql bbcwiki Reading table information for completion of table a...
[11:48:21] <wikibugs>	 Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for bjnwikiquote - https://phabricator.wikimedia.org/T350234 (BTullis) Open→Resolved This is now complete. ` btullis@tools-sgebastion-10:~$ sql bjnwikiquote Reading table information for completion of table and column name...
[11:49:46] <wikibugs>	 Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for dgawiki - https://phabricator.wikimedia.org/T350228 (BTullis) Open→Resolved This is now complete. ` btullis@tools-sgebastion-10:~$ sql dgawiki Reading table information for completion of table and column names You can...
[11:53:56] <jinxer-wm>	 (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[12:05:14] <btullis>	 !log deploying datahub to prod for the pki certificates.
[12:05:15] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:22:14] <wikibugs>	 Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (pfischer) @brouberol, thank you for looking into this. I can confirm, that I'm able to see those properties now, too! 👍
[12:41:54] <wikibugs>	 Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (brouberol) Excellent @pfischer ! I purposefully did not add you to the `wmf` group, to see whether it was a datahub ACL issue or not, and it seems like it's not.
[12:44:55] <wikibugs>	 Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (brouberol) Open→Resolved I investigated the missing schema metadata in the `Schema` tab for kafka topics, and I found that Karapace only knows about 3 topics:  ` brouberol@kafka-jumbo1010:~$...
[12:52:44] <brouberol>	 ottomata: do you know anything about Karapace? I found that we don't see schema metadata in datahub for kafka topics. It seems it's because datahub expects to be able to talk to a kafka registry, and we use Karapace (a drop-in replacement for the Confluent registry) due to its Apache licensing. That being said, it doesn't seem that we _use_ it as more than a stub?
[12:55:15] <brouberol>	 My understanding is that we store schemas in repositories, such as https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/event-schemas/+/refs/heads/master/avro/mediawiki/CirrusSearchRequestSet/101446746400.avsc, and that we use these schemas in our code
[12:55:57] <brouberol>	 that being said, do we have a static mapping of topic <-> schema somewhere? If so, we could populate Karapace with it and get extra information in datahub
[13:31:57] <dcausse>	 brouberol: yes this mapping (stream_name -> {topics, schema}) is hosted by MW: https://meta.wikimedia.org/w/api.php?action=streamconfigs&all_settings (code: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/ext-EventStreamConfig.php)
[13:32:52] <dcausse>	 schemas are at https://schema.wikimedia.org/#!/ the avro stuff you found is obsolete
[13:33:24] <brouberol>	 Thanks dcausse!
[13:54:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:05:59] <wikibugs>	 Data-Platform-SRE: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Stevemunene)
[14:49:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:52:11] <btullis>	 !log roll-restarting hadoop masters on the test cluster, after upgrading to puppet 7
[14:52:13] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:06:15] <wikibugs>	 (PS21) Phuedx: Add analytics/product_metrics/{app,web}/base schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[15:35:46] <btullis>	 !log roll-restarting hadoop workers in test, to test new puppet 7 CA settings.
[15:35:47] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:36:37] <brouberol>	 btullis could I request a quick review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/972397 ? My previous related CR didn't have the expected effect /facepalms
[15:40:26] <btullis>	 brouberol: Done :-)
[15:41:38] <wikibugs>	 (PS11) Phuedx: Add product metrics fragments by platform only [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[15:41:42] <wikibugs>	 (PS22) Phuedx: Add analytics/product_metrics/{app,web}/base schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[15:41:48] <brouberol>	 thank you!
[15:42:06] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (CodeReviewBot) xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/533  switch druid host to index to the druid-public cluster and datahub inje...
[15:48:10] <btullis>	 !log restart hive-server2 and hive-metastore services on an-test-coord1001 to pick up new puppet 7 CA settings.
[15:48:17] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:49:08] <btullis>	 !log restart presto-server service on an-test-coord1001 and an-test-presto1001 to pick up new puppet 7 CA settings
[15:49:10] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:49:51] <jinxer-wm>	 (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[15:50:10] <btullis>	 !log restart mariadb service on an-test-coord100
[15:50:14] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:50:14] <btullis>	 !log restart mariadb service on an-test-coord1001
[15:50:15] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:52:23] <btullis>	 !log restart airflow-scheduler and airflow-webserver services on an-test-client1002
[15:52:25] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:52:50] <wikibugs>	 (CR) Clare Ming: [C: +1] Add analytics/product_metrics/{app,web}/base schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[15:53:56] <jinxer-wm>	 (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[16:04:58] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Privacy Engineering, Patch-For-Review, SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (BTullis) >>! In T349910#9291873, @sbassett wrote: >>>! In T349910#9288036, @BTullis wrote...
[16:09:43] <wikibugs>	 Data-Platform-SRE, Epic: [Epic] define a strategy around alerting for Data Platform SRE and implement it - https://phabricator.wikimedia.org/T345698 (BTullis) Is this a duplicate of {T346438}?
[16:16:58] <wikibugs>	 Data-Platform-SRE: Restart Elasticsearch services for java 11 updates - https://phabricator.wikimedia.org/T350703 (bking)
[16:19:51] <jinxer-wm>	 (HdfsFSImageAge) firing: The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[16:19:51] <jinxer-wm>	 (HdfsFSImageAge) resolved: The HDFS FSImage on analytics-test-hadoop:an-test-master1002:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[16:24:51] <jinxer-wm>	 (HdfsFSImageAge) resolved: The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[16:54:45] <wikibugs>	 (PS12) Phuedx: Add product metrics fragments and schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[16:55:34] <wikibugs>	 (Abandoned) Phuedx: Add analytics/product_metrics/{app,web}/base schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[17:00:26] <wikibugs>	 (CR) Phuedx: "I squashed my follow-on patch into this one to avoid rebase hell. I'll note that we've resolved the naming issue that was raised against t" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[17:02:53] <wikibugs>	 (CR) Clare Ming: [C: +1] "thanks for taking care of this - lgtm" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[17:06:33] <wikibugs>	 Data-Engineering, serviceops, Event-Platform: Increase k8s namespace limits for eventgate-analytics - https://phabricator.wikimedia.org/T350707 (Ottomata)
[17:15:30] <wikibugs>	 Data-Engineering, serviceops, Data Engineering and Event Platform Team (Sprint 4), Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (Ottomata) Did another eventgate-main deployment just now.  I don't see any flo...
[17:22:18] <wikibugs>	 Data-Engineering, Data Pipelines, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to remote HTTP fetch errors - https://phabricator.wikimedia.org/T326002 (Ottomata)
[17:34:41] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:36:42] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:45:10] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:48:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:52:16] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:53:13] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:00:46] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:03:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:07:52] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:08:13] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:16:24] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:18:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:19:27] <wikibugs>	 Data-Engineering, Data Pipelines, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to remote HTTP fetch errors - https://phabricator.wikimedia.org/T326002 (Ottomata)
[18:20:06] <wikibugs>	 Data-Engineering, Data Pipelines, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to remote HTTP fetch errors - https://phabricator.wikimedia.org/T326002 (Ottomata)
[18:20:21] <wikibugs>	 Data-Engineering, Data Pipelines, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config fetch errors - https://phabricator.wikimedia.org/T326002 (Ottomata)
[18:29:28] <wikibugs>	 (CR) Clare Ming: [C: +1] "tested with local eventgate -- events validated against new schema ID in java client" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[18:37:21] <wikibugs>	 Data-Engineering, serviceops, Data Engineering and Event Platform Team (Sprint 4), Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (Ottomata)
[18:37:34] <wikibugs>	 Data-Engineering, serviceops, Data Engineering and Event Platform Team (Sprint 4), Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (Ottomata)
[18:37:43] <wikibugs>	 Data-Engineering, serviceops, Data Engineering and Event Platform Team (Sprint 4), Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (Ottomata)
[18:41:58] <wikibugs>	 Data-Engineering, Data Pipelines, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config fetch errors - https://phabricator.wikimedia.org/T326002 (Ottomata) The stream config...
[18:49:41] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:50:26] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:00:20] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:03:01] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (CodeReviewBot) otto merged https://gitlab.wikimedia....
[19:03:10] <wikibugs>	 Data-Engineering-Planning, Event-Platform (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/...
[19:03:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:33:12] <jinxer-wm>	 (SystemdUnitFailed) firing: aqs.service Failed on aqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:33:23] <icinga-wm>	 PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:37:37] <icinga-wm>	 RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:38:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: aqs.service Failed on aqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:43:08] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: AQS fails on Debian Bullseye (Node 12) - https://phabricator.wikimedia.org/T349228 (10Eevans) I'm not sure //why// this is the case (i.e. what changed to require this), but it is enough to add `/srv/deployment/analytics/aqs/deploy/src` to `NODE_PATH`.
[19:53:56] <jinxer-wm>	 (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[19:54:31] <wikibugs>	 10Data-Platform-SRE: Restart Elasticsearch services for java 11 updates - https://phabricator.wikimedia.org/T350703 (10Gehel) p:05Triage→03High
[19:54:41] <jinxer-wm>	 (SystemdUnitFailed) firing: aqs.service Failed on aqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:58:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: aqs.service Failed on aqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:00:32] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10Ottomata)
[20:00:52] <wikibugs>	 10Data-Platform-SRE, 10Data Products, 10superset.wikimedia.org, 10Wikimedia-production-error: Prod Superset down, showing HTTP 500 instead - https://phabricator.wikimedia.org/T350718 (10BTullis)
[20:00:54] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4): workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10Ottomata)
[20:01:06] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ottomata) Some CI release logic is broken: {T350732}  Working on it...
[20:01:38] <wikibugs>	 10Data-Platform-SRE, 10Data Products, 10superset.wikimedia.org, 10Wikimedia-production-error: Prod Superset down, showing HTTP 500 instead - https://phabricator.wikimedia.org/T350718 (10Jdforrester-WMF) Now seemingly fixed.
[20:02:10] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4): workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10Ottomata) a:03Ottomata
[20:02:22] <wikibugs>	 10Data-Platform-SRE, 10Data Products, 10superset.wikimedia.org, 10Wikimedia-production-error: Prod Superset down, showing HTTP 500 instead - https://phabricator.wikimedia.org/T350718 (10BTullis) I restarted the superset service on an-tool1010 with: ` btullis@an-tool1010:~$ sudo systemctl restart superset.s...
[20:02:35] <wikibugs>	 10Data-Platform-SRE, 10Data Products, 10superset.wikimedia.org, 10Wikimedia-production-error: Prod Superset down, showing HTTP 500 instead - https://phabricator.wikimedia.org/T350718 (10BTullis) p:05Triage→03High
[20:06:18] <wikibugs>	 10Data-Platform-SRE, 10Data Products, 10superset.wikimedia.org, 10Wikimedia-production-error: Prod Superset down, showing HTTP 500 instead - https://phabricator.wikimedia.org/T350718 (10BTullis) I believe that the reason why it failed is the upgrade to puppet7, combined with this necessary patch: https://...
[20:10:15] <wikibugs>	 (03PS2) 10Kimberly Sarabia: Adds new readme [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/968714 (https://phabricator.wikimedia.org/T349729)
[20:10:50] <wikibugs>	 (03CR) 10Kimberly Sarabia: Adds new readme (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/968714 (https://phabricator.wikimedia.org/T349729) (owner: 10Kimberly Sarabia)
[20:11:41] <wikibugs>	 10Data-Platform-SRE, 10Data Products, 10superset.wikimedia.org, 10Wikimedia-production-error: Prod Superset down, showing HTTP 500 instead - https://phabricator.wikimedia.org/T350718 (10BTullis) I also need to update the database connection HTTPS CA parameters, so that it knows how to validate the TLS cert...
[20:13:13] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4): workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) otto updated https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/39  fixes for conda ci templa...
[20:13:25] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4): workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/39  fixes for conda ci templates
[20:19:27] <wikibugs>	 10Data-Platform-SRE, 10Data Products, 10superset.wikimedia.org, 10Wikimedia-production-error: Prod Superset down, showing HTTP 500 instead - https://phabricator.wikimedia.org/T350718 (10BTullis) I believe that this is resolved now, as all charts and dashboards are working for me. I've requested that any af...
[20:23:18] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) otto opened https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_reque...
[20:23:29] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_reque...
[20:30:01] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10xcollazo) For completeness, I reproed this issue on `mediawiki-content-dumps`: https://gitlab.wikimedia.org/repos/data-...
[20:33:20] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) otto opened https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_reque...
[20:34:45] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ottomata) Was hoping to deploy this today, but had t...
[20:37:20] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) xcollazo opened https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge...
[20:46:26] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) xcollazo opened https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/37...
[20:48:24] <xcollazo>	 !log Ran 'kerberos-run-command hdfs hdfs dfs -chmod -R g+w /wmf/data/wmf_dumps/wikitext_raw_rc2' to ease experimentation on this release candidate table.
[20:48:25] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:51:56] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge...
[20:53:21] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/37...
[20:54:11] <wikibugs>	 (03PS7) 10Ottomata: Use eventutilities-spark JsonSchemaSparkConverter in Refine and elsewhere [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/936293 (https://phabricator.wikimedia.org/T321854)
[21:04:42] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:06:08] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:16:04] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:17:45] <jinxer-wm>	 (EventgateValidationErrors) resolved: ...
[21:17:46] <jinxer-wm>	 eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[21:18:13] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:34:41] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:35:35] <wikibugs>	 10Data-Platform-SRE: Restart Elasticsearch services for java 11 updates - https://phabricator.wikimedia.org/T350703 (10bking) a:03bking
[21:35:54] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:45:44] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:48:13] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:00:38] <wikibugs>	 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ahoelzl) Announcement: https://wikimedia.slack.com/archives/C01R06P8...
[22:20:54] <wikibugs>	 10Data-Platform-SRE, 10Cassandra, 10Patch-For-Review: AQS fails on Debian Bullseye (Node 12) - https://phabricator.wikimedia.org/T349228 (10Eevans) >>! In T349228#9314266, @gerritbot wrote: > Change 972461 had a related patch set uploaded (by Eevans; author: Eevans): > %%%[operations/puppet@production] aqs:...
[22:34:41] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:36:36] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:45:04] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:48:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:48:22] <wikibugs>	 10Data-Platform-SRE: Restart Elasticsearch services for java 11 updates - https://phabricator.wikimedia.org/T350703 (10bking)
[22:52:44] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking)
[22:53:49] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking) >>! In T347504#9307835, @dcausse wrote: > @bking thanks for triggering the import, could you update the task description with the dump files y...
[23:53:56] <jinxer-wm>	 (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable