[00:11:03] (03PS5) 10Kimberly Sarabia: Adds new web fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960141 (https://phabricator.wikimedia.org/T346106) [00:47:58] (SystemdUnitFailed) firing: (2) kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service Failed on kafka-jumbo1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:51:54] (03PS7) 10Sharvaniharan: New Event schema for mobile apps [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960100 [02:00:46] (03CR) 10Sharvaniharan: New Event schema for mobile apps (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960100 (owner: 10Sharvaniharan) [02:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [04:47:59] (SystemdUnitFailed) firing: (2) kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service Failed on kafka-jumbo1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:24:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [06:24:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [06:34:28] Back in the game after 12h of silencing --^ [06:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [07:04:06] 10Data-Engineering, 10Data-Catalog, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10CodeReviewBot) tchin updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/5... [08:31:08] brouberol: o/ [08:31:12] back if you need me [08:31:30] elukey: o/ [08:34:41] Those HDFS corrupt blocks are a bit persistent, aren't they? It's been stuck at 70 since yesterday. [08:35:38] btullis: I suspect it is the jmx staleness issue, fsck should be clean in theory [08:41:12] elukey: Ah yes, I remember that now. Thanks. [08:42:20] Note #3 from here. https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - I will restart the prometheus exporters. [08:42:48] I'll check fsck first though. [08:45:23] 0 corrupt blocks, but of course we can't restart the exporters independently, because they're not a separate process. Hmm. 
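The fsck check mentioned at 08:42:48 and 08:45:23 is roughly the following; a minimal sketch, assuming a kerberized Hadoop client node and the usual run-as-hdfs wrapper (the wrapper name and privileges are assumptions, not taken from this log):

    # count corrupt blocks as reported by the NameNode; compare with the jmx metric behind the alert
    sudo -u hdfs kerberos-run-command hdfs hdfs fsck / | grep -i 'corrupt blocks'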
[08:47:09] yeah :( [08:47:59] (SystemdUnitFailed) firing: (2) kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service Failed on kafka-jumbo1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:48:23] oh noes [08:48:29] again? [08:48:33] ^ Oh, this is expired downtime [08:48:38] ahhh right 1001 [08:48:39] fiuuu [08:48:40] We need to fix the alert to match the config [08:48:41] okok :D [08:49:52] brouberol: Have you met the alertmanager config repository yet? This could be a good opportunity. [08:50:04] I haven [08:51:14] 't. Let me silence the alert for now, we're looking into how to resolve the issue for mw_page_content_change_enrich with elukey, as it's caused by the reassignment [08:51:32] Yes, absolutely. [09:02:21] o/ any objections to restarting the rdf-streaming-updater flink job (reading & pushing to kafka-jumbo)? [09:02:57] dcausse: None from me. Are you doing the re-deploy yourself? [09:03:05] yes [09:03:26] Ack, all good on my side. [09:04:11] thanks! [09:15:54] joal: o/ [09:15:57] do you have a min? [09:34:58] joal was off for a couple of hours [09:38:20] I see a consume rate and a produce rate of zero in the codfw deployment of mw_page_content_change_enrich here: https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?from=1686676660402&orgId=1&to=1686687460402&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=mw-page-content-change-enrich&var-operator_name=All&var-helm_release=main&var-flink_job_name=mw_page_content_change_enrich [09:39:11] I'm confused by elukey's comment here: https://phabricator.wikimedia.org/T347676#9209807 [09:40:00] ...because I thought that there were running flink clusters for mw_page_content_change_enrich in the wikikube clusters (as opposed to dse-k8s) [09:40:59] Oh but this: [09:41:02] https://www.irccloud.com/pastebin/LyIlMj6U/ [10:12:31] I understand the cause of my confusion above and it was a pebcak. There was no mention of dse-k8s in the comment, I just imagined it. [10:14:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ... [10:14:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [10:37:12] PROBLEM - Host an-worker1086 is DOWN: PING CRITICAL - Packet loss = 100% [10:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster.
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [10:40:18] ACKNOWLEDGEMENT - SSH on an-worker1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Btullis T347287 - Not booting with failed disk https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:40:18] ACKNOWLEDGEMENT - Host an-worker1086 is DOWN: PING CRITICAL - Packet loss = 100% Btullis T347287 - Not booting with failed disk [10:50:48] RECOVERY - Host an-worker1086 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [10:55:20] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:18:40] PROBLEM - Host an-worker1085 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [11:26:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [11:29:26] RECOVERY - Host an-worker1085 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [11:37:26] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:06:37] !log Various restarts of mw_page_content_change_enrich k8s app since yesterday - the app is failing to send data to kafka [12:06:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:07:06] !log mw_page_content_change_enrich alert silenced for the weekend, the app is down, more investigation next week [12:07:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:17:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:19:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:32:59] 10Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (10Gehel) [12:46:32] 10Data-Platform-SRE: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 (10elukey) We are going to re-deploy the flink app in codfw when the kafka partitions re-assignment task (on Jumbo) is 
done, to verify if it is has anything to... [13:08:52] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2): Provide overview of best DQ practices and system design - https://phabricator.wikimedia.org/T346283 (10lbowmaker) 05Open→03Resolved [13:09:54] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: [BUG] eventutilites-python: fix type checking CI job - https://phabricator.wikimedia.org/T346085 (10lbowmaker) 05Open→03Resolved [13:10:15] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: eventutilities-python: cookicutter template example should be updated - https://phabricator.wikimedia.org/T345390 (10lbowmaker) 05Open→03Resolved [13:10:35] RECOVERY - Check systemd state on kafka-jumbo1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:37] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2): Document data pipeline and data set ownership - https://phabricator.wikimedia.org/T346295 (10lbowmaker) 05Open→03Resolved [13:10:49] !log systemctl reset-failed on kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service on kafka-jumbo1001 [13:10:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:10:53] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, 10Patch-For-Review: Enum with an entry of `null` should fail jsonschema-tools validation - https://phabricator.wikimedia.org/T344511 (10lbowmaker) 05Open→03Resolved [13:11:19] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: [BUG] mediawi-page-content-change-enrich: cross DC network calls to swift are failing - https://phabricator.wikimedia.org/T346877 (10lbowmaker) 05Open→03Resolved [13:12:19] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: eventutilities-python: Gitlab CI pipeline should use memory optimized runners. - https://phabricator.wikimedia.org/T346084 (10lbowmaker) 05Open→03Resolved [13:12:37] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: [SPIKE] Should we enable compression on kafka jumbo? - https://phabricator.wikimedia.org/T345657 (10lbowmaker) 05Open→03Resolved [13:12:43] (SystemdUnitFailed) firing: (2) kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service Failed on kafka-jumbo1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:12:51] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: Document the onboarding journey on Event Platfrom - https://phabricator.wikimedia.org/T345193 (10lbowmaker) 05Open→03Resolved [13:15:47] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (10lbowmaker) [13:17:13] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: [SPIKE] Can we identify indicators to inform an SLO for event emission and intake? 
- https://phabricator.wikimedia.org/T345195 (10lbowmaker) [13:17:51] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: Define Flink k8s operator SLO - https://phabricator.wikimedia.org/T345914 (10lbowmaker) [13:18:00] 10Data-Engineering, 10Data-Catalog, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10lbowmaker) [13:20:24] joal: https://stream-beta.wmflabs.org/ seems to work, still a few nits [13:20:24] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, 10Patch-For-Review: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (10lbowmaker) [13:20:38] and I also need to find how to trigger new events, to then be displayed [13:20:38] Awesome job elukey :) [13:20:46] not straightforward in deployment-prep [13:20:52] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10elukey) Fixed the last issue, I needed to add an explicit `stream_ui_enabled: true` to the config. All works, there... [13:26:35] 10Data-Engineering, 10Data-Platform-SRE, 10SRE Observability, 10Data Engineering and Event Platform Team (Sprint 3): Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430 (10lbowmaker) [13:26:43] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3): Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 (10lbowmaker) [13:27:12] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3), 10Patch-For-Review: Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10lbowmaker) [13:27:38] 10Data-Engineering, 10Machine-Learning-Team, 10Wikimedia Enterprise, 10Data Engineering and Event Platform Team (Sprint 3), and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10lbowmaker) [13:29:06] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (10lbowmaker) [13:29:29] 10Data-Engineering, 10Discovery-Search, 10serviceops-radar, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: [NEEDS GROOMING] Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (10lbowmaker) [13:34:00] 10Data-Platform-SRE, 10Cloud-VPS, 10cloud-services-team, 10User-aborrero: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10bking) >>! In T346948#9184440, @Andrew wrote: > @bking two questions > > 1) (a repeat of T324147) do y'all still want these servers to do things that can't be... [13:35:51] 10Data-Platform-SRE: Investigate failure of Hadoop namenode coinciding with krb1001 reboot - https://phabricator.wikimedia.org/T346135 (10BTullis) I've found it impossible to find time to work on this so far, given a number of other incidents we observed this week and overall workload. Maybe we will have a quiet... 
[14:00:00] 10Data-Engineering, 10Data-Platform-SRE: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (10BTullis) [14:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [14:41:10] 10Data-Platform-SRE: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (10Jclark-ctr) [14:50:04] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10Jclark-ctr) [15:12:35] 10Data-Platform-SRE, 10Cloud-VPS, 10cloud-services-team, 10User-aborrero: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10EBernhardson) >>! In T346948#9210654, @bking wrote: > >>>! In T346948#9184440, @Andrew wrote: >> @bking two questions >> >> 1) (a repeat of T324147) do y'all... [15:16:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [15:22:20] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BTullis) I found some sources of the `/srv/dumps/xmldatadumps/public/other/misc` folder that were being rsynced from... [15:26:42] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [15:26:42] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-operator_name=All&var-helm_release=main&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [15:31:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [15:32:44] 10Data-Engineering, 10Data-Platform-SRE, 10Epic: Data Catalog MVP - https://phabricator.wikimedia.org/T299910 (10BTullis) 05Open→03Resolved I'm going to be bold and resolve this epic. We have completed all of the tasks that were originally defined to get DataHub to an MVP stage. There are improvements to... [15:34:23] ^^ I'm looking into the mw_page_content_change_enrich alert [15:35:52] inflatador: I think that joal and elukey and brouberol have been working on it for much of the day, under: T347676 You may want to check with them. [15:35:53] T347676: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 [15:38:42] btullis ACK, I did a helmfile apply to bring the service back up, but destroyed it after I saw your msg.
Wonder if there's a way to suppress these alerts [15:40:10] Should be able to do it from the alertmanager web gui: https://wikitech.wikimedia.org/wiki/Alertmanager#Silences_&_acknowledgements [15:41:02] 10Data-Platform-SRE: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 (10bking) Hey all, I just saw the alert for mw-page-content-change-enrich and did a `helmfile apply` to get the service back up. @BTullis messaged me in IRC a... [15:41:27] may have already just been suppressed? Or maybe the suppression doesn't stop the IRC msgs? [15:41:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ... [15:41:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [15:41:46] Or...it resolved when I started the service and will fire again shortly [15:45:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [15:45:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [15:52:25] OK, it's silenced for the next 3 days [15:52:55] 10Data-Platform-SRE: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 (10bking) I just silenced `mw_page_content_change_enrich in codfw is not running` alerts for the next 3 days. [16:05:57] (03PS8) 10Sharvaniharan: New Event schema for mobile apps [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960100 (https://phabricator.wikimedia.org/T347729) [16:09:40] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BTullis) 05Open→03Resolved I think it's fixed. {F37853141,width=50%} [16:17:02] 10Data-Platform-SRE: Root cause Archiva outage from 2023-09-24 - https://phabricator.wikimedia.org/T347343 (10LSobanski) [16:21:42] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [16:22:04] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) p:05Triage→03Medium [16:23:21] 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10Eevans) [16:29:18] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10elukey) All right the code worked! I was able to see [[ https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Pola... 
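The 3-day silence added at 15:52:25 through the web GUI can also be set from the command line with amtool; a sketch, where the alertmanager URL is a placeholder and the author/comment are taken from the ticket and log:

    # silence the codfw jobmanager alert for 3 days (alertmanager URL below is a placeholder)
    amtool silence add \
      --alertmanager.url=https://alertmanager.example.org \
      --author=bking --duration=3d \
      --comment='T347676: mw_page_content_change_enrich down pending kafka-jumbo partition reassignment' \
      alertname=MediawikiPageContentChangeEnrichJobManagerNotRunning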
[17:12:59] (SystemdUnitFailed) firing: wikidatardf-truthy-dumps.service Failed on snapshot1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:47:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:49:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [19:10:17] 10Data-Engineering, 10Event-Platform, 10Platform Team Initiatives (New Hook System): Update EventBus to use the new HookContainer/HookRunner system - https://phabricator.wikimedia.org/T346539 (10Umherirrender) a:03Umherirrender [19:10:20] 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10bking) [[ https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit#heading=h.fs3knmjt7fjy | The essay "My Philosophy on Alerting" ]] w... 
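For the produce_canary_events.service failures above, the follow-up mirrors what was done for the mirror maker unit at 13:10:49: inspect the unit on an-launcher1002, then clear its failed state once the cause is understood (a sketch; the journal time window is arbitrary):

    # on an-launcher1002: see why the unit failed, then clear the failed state
    sudo systemctl status produce_canary_events.service
    sudo journalctl -u produce_canary_events.service --since '1 hour ago'
    sudo systemctl reset-failed produce_canary_events.service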
[19:13:44] (SystemdUnitCrashLoop) firing: crashloop on kafka-jumbo1005:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:18:44] (SystemdUnitCrashLoop) firing: (3) crashloop on kafka-jumbo1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:23:44] (SystemdUnitCrashLoop) firing: (4) crashloop on kafka-jumbo1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:28:44] (SystemdUnitCrashLoop) firing: (5) crashloop on kafka-jumbo1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:31:34] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 6.592e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [19:37:34] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10Jclark-ctr) [19:50:45] (03CR) 10Tsevener: "Thanks! It's looking good, all of my feedback is just documentation-related on the README." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960100 (https://phabricator.wikimedia.org/T347729) (owner: 10Sharvaniharan) [19:58:06] (03PS13) 10Milimetric: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu) [20:01:07] (03CR) 10Milimetric: Create a job to dump XML/SQL MW history files to HDFS (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu) [20:01:33] (03CR) 10Milimetric: Create a job to dump XML/SQL MW history files to HDFS (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu) [20:30:00] (03CR) 10Jdlrobson: [C: 03+2] Adds new web fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960141 (https://phabricator.wikimedia.org/T346106) (owner: 10Kimberly Sarabia) [20:30:53] (03Merged) 10jenkins-bot: Adds new web fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960141 (https://phabricator.wikimedia.org/T346106) (owner: 10Kimberly Sarabia) [22:02:59] (SystemdUnitFailed) firing: wikidatardf-truthy-dumps.service Failed on snapshot1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [23:07:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [23:28:45] (SystemdUnitCrashLoop) firing: (5) crashloop on kafka-jumbo1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
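A sketch of how the MirrorMaker crash loops and lag above could be inspected from a kafka-jumbo broker; the consumer group name and the main-eqiad bootstrap host are guesses, and the exact Kafka CLI wrapper varies by install:

    # why is the mirror unit crash-looping? (unit name taken from the earlier SystemdUnitFailed alerts)
    sudo journalctl -u kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service --since '1 hour ago'
    # consumer lag is measured against the source cluster (kafka main-eqiad); host and group are guesses
    kafka-consumer-groups.sh --bootstrap-server kafka-main1001.eqiad.wmnet:9092 \
      --describe --group kafka-mirror-main-eqiad_to_jumbo-eqiad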