[00:17:50] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:19:09] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:25] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:50] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:02:50] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:15:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:17:50] (SystemdUnitFailed) resolved: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:45:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [06:23:49] 10Data-Platform-SRE, 10Discovery-Search (Current work): Create dashboards for Search SLOs - https://phabricator.wikimedia.org/T338009 (10RKemper) Balthazar and I met last week. We took a look at the [[ https://grafana.wikimedia.org/d/H6f-bA7Sk/rkemper-search-sli-test?forceLogin&from=1693181669857&orgId=1&to=16... [06:45:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [07:36:50] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [07:37:50] (SystemdUnitFailed) firing: (3) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:51:15] (EventgateValidationErrors) firing: ... [07:51:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:12:32] 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ec8c920c-26ba-48f2-b8e1-6bc3a415578a) set by stevemunene@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their servic... [08:21:44] * brouberol waves good morning! [08:54:17] good morning brouberol [08:56:30] Heads up brouberol , I rebooted stat1008 yesterday so you can skip it for the maintenance window you had set for later on today. [09:09:03] Thanks, I just saw that! [09:09:12] good morning! [09:13:26] 10Data-Engineering, 10serviceops, 10Event-Platform: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10elukey) [09:20:03] 10Data-Platform-SRE, 10Patch-For-Review: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (10brouberol) Looking at the airflow logs, it seems that everytime I change the skein certificate to the PKI-generated one, we suddenly can't seem to find the appropriate Hive data partit... [09:23:38] xcollazo: when you're around, would you have a bit of time to pair on an investigation related to Skein? It seems that whenever I swap the self-issued skein certificate by one issued by our PKI, an hourly DAG starts failing. However, the weird part is that it fails at the NamedHivePartitionSensor part [09:23:38] (https://phabricator.wikimedia.org/T329398#9300751) which makes no sense to me. Thanks! [09:26:21] (03CR) 10Michael Große: [C: 03+2] "Thanks!" [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/970816 (https://phabricator.wikimedia.org/T348644) (owner: 10Lucas Werkmeister (WMDE)) [09:27:43] (03Merged) 10jenkins-bot: Add or update some more Grafana links [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/970816 (https://phabricator.wikimedia.org/T348644) (owner: 10Lucas Werkmeister (WMDE)) [09:53:08] 10Data-Platform-SRE, 10Patch-For-Review: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (10brouberol) This is what `openssl s_client` outputs with the Skein self-signed certificate: ` (airflow) analytics@an-test-client1002:/srv/airflow-analytics_test/.skein$ openssl s_client... [10:16:49] (03PS1) 10Michael Große: Add or update some more Grafana links [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/970763 (https://phabricator.wikimedia.org/T348644) [10:17:42] (03CR) 10Michael Große: [C: 03+2] Add or update some more Grafana links [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/970763 (https://phabricator.wikimedia.org/T348644) (owner: 10Michael Große) [10:18:13] (03Merged) 10jenkins-bot: Add or update some more Grafana links [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/970763 (https://phabricator.wikimedia.org/T348644) (owner: 10Michael Große) [10:45:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [10:57:10] 10Data-Platform-SRE, 10Patch-For-Review: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (10brouberol) I've tried to import intermediate CA into a JKS truststore, and pass options to the Skein driver pointing to this truststore, to no avail. ` brouberol@an-test-client1002:~$... [11:51:30] (EventgateValidationErrors) firing: ... [11:51:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [12:08:04] (03PS1) 10Gerrit maintenance bot: Add bbc.wikipedia to pageview allowlist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/970838 (https://phabricator.wikimedia.org/T350373) [13:25:14] 10Data-Engineering, 10WMDE-Analytics-Engineering, 10Wikidata, 10Wikidata-Campsite, 10Event-Platform: Validation Error for eventlogging_WMDEBannerSizeIssue - https://phabricator.wikimedia.org/T344027 (10Ottomata) [13:50:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:01:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:15:22] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) [14:15:36] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) [14:45:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [14:47:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:49:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:49] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:17:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:18:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:32:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:51:30] (EventgateValidationErrors) firing: ... [15:51:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [15:58:36] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking) [15:58:40] 10Data-Platform-SRE: Improve data-reload cookbook based on graph split needs - https://phabricator.wikimedia.org/T349011 (10bking) 05Open→03Resolved With the above merge, we believe this issue has been addressed. Closing... [16:00:45] 10Data-Platform-SRE: Standardize/document Elastic snapshot configuration - https://phabricator.wikimedia.org/T348686 (10bking) a:03bking [16:14:32] 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10bking) This approach discussed with @Gehel at our 1x1 meeting: - Start with an audit of current alerts - specifics not required, just types of alerts - Tie t... [16:36:16] (EventgateValidationErrors) resolved: ... [16:36:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:37:15] (EventgateValidationErrors) firing: ... [16:37:15] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:51:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:51] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:00:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:05:08] 10Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (10rook) Looks like the web pod had some db connection issues a little after it started. Restarting seems to have cleared it, though let's see if it comes back. ` [2023-11-01 12:20:32 +0000] [1] [INFO] Starting gunicorn 21.2.0... [17:05:41] 10Data-Engineering, 10EventStreams, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10Ottomata) Luca and I met today, and after reading https://wikitech.wikimedia.org/wi... [17:06:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:15:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:05] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:21:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:31:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:37:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:38:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:51] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:50:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:52:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:00:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:09] 10Data-Platform-SRE: Standardize/document Elastic snapshot configuration - https://phabricator.wikimedia.org/T348686 (10bking) You can verify whether or not snapshots are configured in ES from any Elastic host: `curl -XGET http://127.0.0.1:9200/_snapshot`. Example response: ` { "elastic_snaps" : { "type"... [18:02:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:02:52] 10Data-Platform-SRE: Standardize/document Elastic snapshot configuration - https://phabricator.wikimedia.org/T348686 (10bking) [18:04:46] 10Data-Platform-SRE: Standardize/document Elastic snapshot configuration - https://phabricator.wikimedia.org/T348686 (10bking) [18:09:35] 10Data-Engineering, 10Trust and Safety Product Team: Mitigate unwanted side effects to anti abuse work from Google Chrome's IP Protection rollout - https://phabricator.wikimedia.org/T350428 (10kostajh) [18:13:40] 10Data-Engineering, 10Trust and Safety Product Team: Mitigate unwanted side effects to anti abuse work from Google Chrome's IP Protection rollout - https://phabricator.wikimedia.org/T350428 (10kostajh) [18:15:33] 10Data-Engineering, 10Trust and Safety Product Team: Mitigate unwanted side effects to anti abuse work from Google Chrome's IP Protection rollout - https://phabricator.wikimedia.org/T350428 (10kostajh) [18:18:08] 10Data-Platform-SRE: Standardize/document Elastic snapshot configuration - https://phabricator.wikimedia.org/T348686 (10bking) [18:45:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [18:49:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:52:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:00:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:02:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:07:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:07:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:15:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:44:23] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:47:50] (SystemdUnitFailed) firing: aqs.service Failed on aqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:50:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:50] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:00:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:15] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:50] (SystemdUnitFailed) resolved: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:05:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:50] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:15:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:05] (SystemdUnitFailed) resolved: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:19:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:22:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:26:19] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata) @REsquito-WMF @Chlod, we'd like to make an announcement, a... [20:30:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:32:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:34:26] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ottomata) Discussed with @gmodena in slack. We agre... [20:37:15] (EventgateValidationErrors) firing: ... [20:37:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [20:50:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:52:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:00:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:02:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:09:55] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10serviceops, 10Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (10Ottomata) > But also - why are we not retrying to enqueue jobs if it fails? We should pro... [21:10:50] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10serviceops, 10Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (10Ottomata) envoy proxy request timeout to eventgate-main is currently 61 seconds though, a... [21:11:31] (EventgateValidationErrors) resolved: ... [21:11:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [21:20:35] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10serviceops, 10Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (10Ottomata) Also q: `per_try_timeout: "20s" is set on eventgate-main services_proxy, but I... [21:50:00] 10Data-Platform-SRE: Standardize/document Elastic snapshot configuration - https://phabricator.wikimedia.org/T348686 (10bking) I've registered a snapshot repository for all clusters (chi, omega, psi) for all environments (cloudelastic, codfw, eqiad, and relforge). Before closing this ticket, we should verify op... [22:45:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [23:35:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ... [23:35:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [23:40:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ... [23:40:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag