[00:37:05] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[00:50:01] (PuppetConstantChange) firing: (4) Puppet performing a change on every puppet run on druid1006:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[01:30:02] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:26] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:32:50] (SystemdUnitFailed) firing: (4) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:37:49] (SystemdUnitFailed) firing: (4) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:32:49] (SystemdUnitFailed) firing: (3) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:37:05] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:50:16] (PuppetConstantChange) firing: (4) Puppet performing a change on every puppet run on druid1006:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:51:16] Data-Platform-SRE, DBA, Data Engineering and Event Platform Team, Data-Services: Prepare and check storage layer for fonwiki - https://phabricator.wikimedia.org/T347938 (Marostegui) @BTullis this is also ready
[06:58:42] Data-Engineering, Product-Analytics, Wikidata, Wmfdata-Python, Wikidata Analytics (Kanban): Add linter and formatter to wmfdata-python (and link check) - https://phabricator.wikimedia.org/T348999 (nshahquinn-wmf) > The Product Analytics Styleguide suggests PEP 8, but maybe we want to consider...
[07:29:50] good morning folks!
[07:29:57] I noticed a lot of alarms for DE:
[07:29:58] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=team%3Ddata-engineering
[07:30:18] and also some druid-public related issues, but I see that stevemunene is doing some maintenance
[07:30:25] can you check the alerts when you have a moment?
[07:31:41] good morning
[07:31:50] having a look elukey
[07:33:04] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:35:24] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=60f9f206-7461-495b-8373-fa1c60ddaf2e) set by stevemunene@cumin1001 for 3 days, 0:00:00 on 1 host(s) and their servic...
[07:37:59] stevemunene: <3 thanks!
[07:38:26] Data-Engineering, serviceops, Event-Platform, Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (elukey) After reading the always enlightening page https://www.brendangregg.com/perf.html I tried to run perf with the LBR (`--call-graph lbr`), as a...
[07:55:01] (PuppetConstantChange) firing: (4) Puppet performing a change on every puppet run on druid1006:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[08:00:01] (PuppetConstantChange) firing: (4) Puppet performing a change on every puppet run on druid1006:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[08:10:01] (PuppetConstantChange) resolved: (3) Puppet performing a change on every puppet run on druid1006:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[08:14:21] --^ host appears okay and is not visible on https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange looking into why we're getting the alerts
[08:40:39] * brouberol waves good morning!
[10:23:25] Data-Engineering, Product-Analytics, Wikidata, Wmfdata-Python, Wikidata Analytics (Kanban): Add linter and formatter to wmfdata-python (and link check) - https://phabricator.wikimedia.org/T348999 (AndrewTavis_WMDE) Ruff also just brought out [[ https://github.com/astral-sh/ruff/blob/main/crat...
[10:25:32] Data-Engineering, Product-Analytics, Wikidata, Wmfdata-Python, Wikidata Analytics (Kanban): Add linter and formatter to wmfdata-python (and link check) - https://phabricator.wikimedia.org/T348999 (AndrewTavis_WMDE) We could further make use of [[ https://pre-commit.com/ | pre-commit ]] to che...
[10:32:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:34:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:45:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:47:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:11:16] Data-Engineering, Data-Platform-SRE, Observability-Metrics, Data Engineering and Event Platform Team (Sprint 4), Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (fgiunchedi) >>! In T343232#9287406, @Antoine_Quhen wrote: > From Airf...
[11:56:11] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:50] (SystemdUnitFailed) firing: aqs.service Failed on aqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:01:08] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:07:50] (SystemdUnitFailed) resolved: aqs.service Failed on aqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:33:43] elukey: do you still think we should hold on upgrading eventgate based on your current investigations in changeprop?
[13:34:39] ottomata: o/ The eventgate metrics didn't look that good IIRC, cpu and latency increased, I'd try to check if there is anything going on before reaching main
[13:35:09] we are jumping to librdkafka 2.x so a ton of changes :(
[13:37:37] elukey: yeah, the lil cpu increase i don't mind so much. latency not good though. at least mem usage went down :)
[13:38:08] it's strange that the major latency increase is mostly for static and irrelevant GET requests, not for the events POSTing
[13:38:28] but, those always took longer anyway? no idea why a GET to e.g. robots.txt takes longer than POSTing to /v1/events
[13:38:44] but, even so, the latency is still very low, no?
[13:39:05] average 150 microseconds
[13:44:45] there is also the 404 increase, etc.. it is just to be very cautious, it is a big jump from node 10 to 18 + librdkafka etc..
[13:52:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[13:52:47] elukey: I looked into the 404, https://phabricator.wikimedia.org/T347477#9281056
[13:53:06] i think they are not new, just newly instrumented somehow?
[13:54:06] okok
[14:15:57] Data-Engineering, serviceops, Event-Platform: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (elukey) Deployed the new image to staging, and added the `debug:all` settings for consumer and producer. What I see is that librdkafka keeps trying to fetch data from vari...
[14:24:35] Data-Engineering, serviceops, Event-Platform, Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (Ottomata) > last release of Bluebird is in fact ~4y old, and I don't think it natively supports nodejs 18 FWIW, swapping out any direct usages of Bl...
[14:36:03] Data-Engineering, serviceops, Event-Platform, Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (elukey) >>! In T348950#9291443, @Ottomata wrote: >> last release of Bluebird is in fact ~4y old, and I don't think it natively supports nodejs 18 >...
[14:38:41] Data-Engineering, Product-Analytics, Wikidata, Wmfdata-Python, Wikidata Analytics (Kanban): Add linter and formatter to wmfdata-python (and link check) - https://phabricator.wikimedia.org/T348999 (xcollazo) > what other Wikimedia Python projects do FWIW, in Data Engineering land, [[ https://g...
[14:43:36] Data-Engineering, serviceops, Event-Platform, Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (Ottomata) https://www.npmjs.com/package/util-promisifyall ? Doesn't really do anything except iterate of object props and call [[ https://nodejs.or...
[14:55:02] Data-Platform-SRE, Discovery-Search: Migrate MjoLniR deploy repo to Gitlab - https://phabricator.wikimedia.org/T350043 (bking)
[15:04:09] (CR) Sbisson: "Should we rebase this patch on top of master so it can be merged sooner?" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/966658 (https://phabricator.wikimedia.org/T348613) (owner: Conniecc1)
[15:11:16] (EventgateValidationErrors) firing: ...
[15:11:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[15:20:37] Data-Platform-SRE, Cloud-VPS, SRE, cloud-services-team, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (VRiley-WMF) cloudvirt-wdqs1003 has been relocated cloudvirt-wdqs1003 - C 8. U 21. port 18. CableID 4015 Side note, we had to use a 1 Gig connection sinc...
[15:22:57] Data-Engineering, serviceops, Event-Platform, Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (Ottomata) > The promisifyAll(consumer) call IIUC creates some Async functions Looks like only: - `consumer.consumeAsync` - `consumer.commitMessageAs...
[15:43:04] Data-Platform-SRE, Cloud-VPS, SRE, cloud-services-team, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1003.eqiad.wmnet with OS bookworm
[16:02:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:03:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:06:30] Quarry: Move away from nfs? - https://phabricator.wikimedia.org/T349690 (rook) Some initial tinkering suggests this may not be in reach in WMCS at the moment: Making a pv: ` apiVersion: v1 kind: PersistentVolume metadata: name: results spec: storageClassName: manual capacity: storage: 1Gi accessM...
[16:07:02] Quarry: Move away from nfs? - https://phabricator.wikimedia.org/T349690 (rook)
[16:07:21] Quarry: Find somewhere else (not NFS) to store Quarry's resultsets - https://phabricator.wikimedia.org/T178520 (rook)
[16:09:27] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (Gehel) We might still want to test a k8s upgrade, but this should not be blocking moving to prod...
[16:11:44] Data-Platform-SRE, Data Engineering and Event Platform Team, Privacy Engineering, Patch-For-Review, SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (sbassett) >>! In T349910#9288036, @BTullis wrote: > Adding #security-team to request thei...
[16:15:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:17:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:22:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[18:26:00] Data-Platform-SRE: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (bking) Current state of operations: I've suppressed alerts for all 4 search-loader hosts for the next day. I have deliberately NOT run python on `search-loader1001` just in case we have to roll...
[19:02:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:04:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:11:31] (EventgateValidationErrors) firing: ...
[19:11:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[19:15:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:17:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:36:30] (PS1) DLynch: Add missing `new-sticky-header` init_mechanism to editattemptstep [schemas/event/secondary] - https://gerrit.wikimedia.org/r/969958 (https://phabricator.wikimedia.org/T237063)
[19:51:29] (CR) Bartosz Dziewoński: [C: +2] Add missing `new-sticky-header` init_mechanism to editattemptstep [schemas/event/secondary] - https://gerrit.wikimedia.org/r/969958 (https://phabricator.wikimedia.org/T237063) (owner: DLynch)
[19:52:05] (Merged) jenkins-bot: Add missing `new-sticky-header` init_mechanism to editattemptstep [schemas/event/secondary] - https://gerrit.wikimedia.org/r/969958 (https://phabricator.wikimedia.org/T237063) (owner: DLynch)
[20:02:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:03:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:29:21] Data-Platform-SRE: Prometheus unable to scrape search-loader[12]002 - https://phabricator.wikimedia.org/T348222 (bking) Open→Resolved a:bking We've confirmed that these hosts can still be useful (see T346039 ), so I'm closing this one out.
[20:45:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:49:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:00:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:02:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:17:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:19:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:21:45] Data-Platform-SRE: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (bking) After the firewall patch directly above this comment, [[ https://logstash.wikimedia.org/goto/ca2dd480835e60da6edb024f35e079e6 | it appears that the connection errors for search-loader1002 a...
[21:21:55] Data-Platform-SRE, Discovery-Search: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (bking)
[21:22:33] Data-Platform-SRE, Discovery-Search (Current work): Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (bking)
[21:24:05] Data-Platform-SRE, Discovery-Search: Decom search-loader VMs still using Buster - https://phabricator.wikimedia.org/T350078 (bking)
[21:24:44] Data-Platform-SRE, Discovery-Search (Current work): Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (bking)
[21:24:46] Data-Platform-SRE, Discovery-Search: Decom search-loader VMs still using Buster - https://phabricator.wikimedia.org/T350078 (bking)
[21:30:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:32:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:02:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:03:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:11:16] (EventgateValidationErrors) resolved: ...
[22:11:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[22:16:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:17:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed