[00:16:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ... [00:16:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [00:17:50] (SystemdUnitFailed) firing: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:18:03] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:21:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ... [00:21:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [00:30:37] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:50] (SystemdUnitFailed) resolved: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:37:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ... [00:37:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [00:42:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ... [00:42:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [02:45:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [06:42:11] 10Data-Engineering, 10Trust and Safety Product Team: Mitigate unwanted side effects to anti abuse work from Google Chrome's IP Protection rollout - https://phabricator.wikimedia.org/T350428 (10kostajh) [06:45:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [07:52:15] (EventgateValidationErrors) firing: ... [07:52:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [07:57:15] (EventgateValidationErrors) resolved: ... [07:57:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [07:58:15] (EventgateValidationErrors) firing: ... [07:58:15] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:02:30] (EventgateValidationErrors) resolved: ... [08:02:30] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:06:16] 10Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (10rook) Looks like it happened again: ` [2023-11-02 17:02:06 +0000] [11] [INFO] Booting worker with pid: 11 [2023-11-02 23:03:43 +0000] [11] [ERROR] Error handling request / Traceback (most recent call last): File "/usr/loc... [08:11:45] (EventgateValidationErrors) firing: ... [08:11:45] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:15:29] 10Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (10rook) Perhaps gunicorn maintains connections differently than uwsgi? [08:19:25] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10serviceops, 10Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (10JMeybohm) >>! In T349823#9303470, @Ottomata wrote: > Also q: `per_try_timeout: "20s"` is... [09:37:00] o/ brouberol could we enable puppet on an-test-client1002 [09:38:50] o/ ottomata any help with the eventgate-analytics-external stream eventlogging__WMDEBannerSizeIssue validation errors [10:21:32] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 (10Gehel) [10:22:34] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10Gehel) [10:22:44] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Prepare new WDQS hosts for graph splitting - https://phabricator.wikimedia.org/T347505 (10Gehel) [10:23:19] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Prepare new WDQS hosts for graph splitting - https://phabricator.wikimedia.org/T347505 (10Gehel) [10:25:54] 10Data-Platform-SRE, 10Discovery-Search (Current work): Load Wikidata split graphs into test servers - https://phabricator.wikimedia.org/T350465 (10Gehel) [10:27:22] 10Data-Platform-SRE, 10Discovery-Search (Current work): Load Wikidata split graphs into test servers - https://phabricator.wikimedia.org/T350465 (10Gehel) p:05Triage→03High [10:27:26] 10Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (10Gehel) p:05Triage→03Medium [10:27:35] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 4): [Data Platform] Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 (10Gehel) p:05Triage→03Medium [10:27:49] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Allow federated queries with the NFDI4Culture Knowledge Graph - https://phabricator.wikimedia.org/T346455 (10Gehel) p:05Triage→03Medium [10:28:15] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Migrate staging rdf-streaming-updater to flink operator - https://phabricator.wikimedia.org/T349095 (10Gehel) p:05Triage→03Medium [10:28:22] 10Data-Platform-SRE, 10Patch-For-Review: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (10Gehel) p:05Triage→03Medium [10:28:29] 10Data-Platform-SRE, 10Patch-For-Review: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 (10Gehel) p:05Triage→03Low [10:29:23] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10Gehel) p:05Triage→03High [10:29:29] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10[DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10Gehel) p:05Triage→03Medium [10:29:33] 10Data-Platform-SRE, 10Discovery-Search (Current work): Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10Gehel) p:05Triage→03Medium [10:29:37] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Gehel) p:05Triage→03Medium [10:29:41] 10Data-Platform-SRE, 10Patch-For-Review: Refactor sre.wdqs.data-transfer to use new spicerack class api - https://phabricator.wikimedia.org/T347624 (10Gehel) p:05Triage→03Medium [10:29:49] 10Data-Platform-SRE, 10Discovery-Search, 10Epic: Cirrus-streaming-updater test: validate relforge indices are correctly updated - https://phabricator.wikimedia.org/T350186 (10Gehel) p:05Triage→03High [10:30:03] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar, 10Epic: Estimate cirrus streaming updater's usage of MWAPI - https://phabricator.wikimedia.org/T350185 (10Gehel) p:05Triage→03High [10:32:38] 10Data-Platform-SRE, 10DBA, 10Data Engineering and Event Platform Team, 10Data-Services: Prepare and check storage layer for fonwiki - https://phabricator.wikimedia.org/T347938 (10Gehel) p:05Triage→03High [10:33:26] 10Data-Platform-SRE, 10serviceops-radar, 10Discovery-Search (Current work), 10Epic: Estimate cirrus streaming updater's usage of MWAPI - https://phabricator.wikimedia.org/T350185 (10Gehel) [10:36:37] 10Data-Platform-SRE: Migrate archiva to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349292 (10Gehel) [10:36:39] 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10Gehel) [10:45:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [13:08:12] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] mw-page-content-change-enrich should (re)produce kafka keys - https://phabricator.wikimedia.org/T338231 (10gmodena) > No, but I think the Flink library code should avoid caring (I'm sure we violate t... [13:36:42] (SystemdUnitFailed) resolved: aqs.service Failed on aqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:23:22] (03CR) 10Xcollazo: [C: 03+1] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/970404 (owner: 10Clare Ming) [14:31:21] (03CR) 10Phuedx: Add analytics/metrics_platform/{app,web}/base schemas (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: 10Phuedx) [14:32:27] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Data Quality] [SPIKE] Can we identify indicators to inform an SLO for event emission and intake? - https://phabricator.wikimedia.org/T345195 (10gmodena) @Ottomata shall we park this task for now? I don't think the S... [14:45:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [14:59:45] 10Data-Engineering, 10Projects-Cleanup, 10Wikimedia-GitHub: Clean up wikimetrics - https://phabricator.wikimedia.org/T318193 (10Kizule) [15:17:46] (EventgateValidationErrors) resolved: ... [15:17:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [15:20:58] 10Data-Platform-SRE, 10Epic: [Epic] define a strategy around alerting for Data Platform SRE and implement it - https://phabricator.wikimedia.org/T345698 (10bking) [15:21:00] 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10bking) [15:46:34] (03CR) 10Ottomata: Add analytics/metrics_platform/{app,web}/base schemas (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: 10Phuedx) [15:48:15] (EventgateValidationErrors) firing: ... [15:48:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [15:53:15] (EventgateValidationErrors) resolved: ... [15:53:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:32:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:34:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:51:03] 10Data-Platform-SRE, 10Discovery-Search: Search Platform Airflow jobs: Identify dependencies and configure alerts - https://phabricator.wikimedia.org/T350499 (10bking) [17:00:20] 10Data-Platform-SRE, 10Discovery-Search: Search Platform Airflow jobs: Identify dependencies and configure alerts - https://phabricator.wikimedia.org/T350499 (10bking) [17:01:17] (03CR) 10Jdlrobson: [C: 03+1] Adds skin field in mobilewebuiactions [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/968674 (https://phabricator.wikimedia.org/T350205) (owner: 10Kimberly Sarabia) [17:04:41] 10Data-Platform-SRE, 10Discovery-Search: Search Platform Airflow jobs: Identify dependencies and configure alerts - https://phabricator.wikimedia.org/T350499 (10bking) Tagging @BTullis as we may beable to leverage his work in T343234 towards this goal. [17:21:15] (EventgateValidationErrors) firing: ... [17:21:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [17:36:15] (EventgateValidationErrors) resolved: ... [17:36:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [17:54:14] 10Data-Platform-SRE, 10Discovery-Search, 10Epic: Cirrus-streaming-updater test: validate relforge indices are correctly updated - https://phabricator.wikimedia.org/T350186 (10bking) a:03bking [18:18:15] (EventgateValidationErrors) firing: ... [18:18:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [18:23:15] (EventgateValidationErrors) resolved: ... [18:23:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [18:45:16] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [18:49:17] (EventgateValidationErrors) firing: ... [18:49:17] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [19:02:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:04:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:44:16] (EventgateValidationErrors) resolved: ... [20:44:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [20:47:57] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:48:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:50:45] (EventgateValidationErrors) firing: ... [20:50:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [21:00:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:15:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:15:46] (EventgateValidationErrors) resolved: ... [21:15:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [21:17:57] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:20:30] (EventgateValidationErrors) firing: ... [21:20:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [21:25:30] (EventgateValidationErrors) resolved: ... [21:25:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [22:50:01] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange