[00:11:03] (03PS5) 10Kimberly Sarabia: Adds new web fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960141 (https://phabricator.wikimedia.org/T346106) [00:47:58] (SystemdUnitFailed) firing: (2) kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service Failed on kafka-jumbo1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:51:54] (03PS7) 10Sharvaniharan: New Event schema for mobile apps [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960100 [02:00:46] (03CR) 10Sharvaniharan: New Event schema for mobile apps (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960100 (owner: 10Sharvaniharan) [02:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [04:47:59] (SystemdUnitFailed) firing: (2) kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service Failed on kafka-jumbo1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:24:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [06:24:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [06:34:28] Back in the game after 12h of silencing --^ [06:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [07:04:06] 10Data-Engineering, 10Data-Catalog, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10CodeReviewBot) tchin updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/5... [08:31:08] brouberol: o/ [08:31:12] back if you need me [08:31:30] elukey: o/ [08:34:41] Those HDFS corrupt blocks are a bit persistent, aren't they? It's been stuck at 70 since yesterday. [08:35:38] btullis: I suspect it is the jmx staleness issue, fsck should be clean in theory [08:41:12] elukey: Ah yes, I remember that now. Thanks. [08:42:20] Note #3 from here. https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - I will restart the prometheus exporters. [08:42:48] I'll check fsck first though. [08:45:23] 0 corrupt blocks, but of course we can't restart the exporters independently, because they're not a separate process. Hmm. 
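The fsck check mentioned at 08:42:48 and 08:45:23 is roughly the following; a minimal sketch, assuming a kerberized Hadoop client node and the usual run-as-hdfs wrapper (the wrapper name and privileges are assumptions, not taken from this log):

    # count corrupt blocks as reported by the NameNode; compare with the jmx metric behind the alert
    sudo -u hdfs kerberos-run-command hdfs hdfs fsck / | grep -i 'corrupt blocks'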
[08:47:09] yeah :( [08:47:59] (SystemdUnitFailed) firing: (2) kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service Failed on kafka-jumbo1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:48:23] oh noes [08:48:29] again? [08:48:33] ^ Oh, this is expired downtime [08:48:38] ahhh right 1001 [08:48:39] fiuuu [08:48:40] We need to fix the alert to match the config [08:48:41] okok :D [08:49:52] brouberol: Have you met the alertmanager config repository yet? This could be a good opportunity. [08:50:04] I haven [08:51:14] 't. Let me silence the alert for now, we're looking into how to resolve the issue for mw_page_content_change_enrich with elukey, as it's caused by the reassignment [08:51:32] Yes, absolutely. [09:02:21] o/ any objections to restarting the rdf-streaming-updater flink job (reading & pushing to kafka-jumbo)? [09:02:57] dcausse: None from me. Are you doing the re-deploy yourself? [09:03:05] yes [09:03:26] Ack, all good on my side. [09:04:11] thanks! [09:15:54] joal: o/ [09:15:57] do you have a min? [09:34:58] joal was off for a couple of hours [09:38:20] I see a consume rate and a produce rate of zero in the codfw deployment of mw_page_content_change_enrich here: https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?from=1686676660402&orgId=1&to=1686687460402&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=mw-page-content-change-enrich&var-operator_name=All&var-helm_release=main&var-flink_job_name=mw_page_content_change_enrich [09:39:11] I'm confused by elukey's comment here: https://phabricator.wikimedia.org/T347676#9209807 [09:40:00] ...because I thought that there were running flink clusters for mw_page_content_change_enrich in the wikikube clusters (as opposed to dse-k8s) [09:40:59] Oh but this: [09:41:02] https://www.irccloud.com/pastebin/LyIlMj6U/ [10:12:31] I understand the cause of my confusion above and it was a pebcak. There was no mention of dse-k8s in the comment, I just imagined it. [10:14:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ... [10:14:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [10:37:12] PROBLEM - Host an-worker1086 is DOWN: PING CRITICAL - Packet loss = 100% [10:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster.
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [10:40:18] ACKNOWLEDGEMENT - SSH on an-worker1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Btullis T347287 - Not booting with failed disk https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:40:18] ACKNOWLEDGEMENT - Host an-worker1086 is DOWN: PING CRITICAL - Packet loss = 100% Btullis T347287 - Not booting with failed disk [10:50:48] RECOVERY - Host an-worker1086 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [10:55:20] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:18:40] PROBLEM - Host an-worker1085 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [11:26:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [11:29:26] RECOVERY - Host an-worker1085 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [11:37:26] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:06:37] !log Various restarts of mw_page_content_change_enrich k8s app since yesterday - the app is failing to send data to kafka [12:06:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:07:06] !log mw_page_content_change_enrich alert silenced for the weekend, the app is down, more investigation next week [12:07:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:17:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:19:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:32:59] 10Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (10Gehel) [12:46:32] 10Data-Platform-SRE: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 (10elukey) We are going to re-deploy the flink app in codfw when the kafka partitions re-assignment task (on Jumbo) is 
done, to verify if it is has anything to... [13:08:52] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2): Provide overview of best DQ practices and system design - https://phabricator.wikimedia.org/T346283 (10lbowmaker) 05Open→03Resolved [13:09:54] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: [BUG] eventutilites-python: fix type checking CI job - https://phabricator.wikimedia.org/T346085 (10lbowmaker) 05Open→03Resolved [13:10:15] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: eventutilities-python: cookicutter template example should be updated - https://phabricator.wikimedia.org/T345390 (10lbowmaker) 05Open→03Resolved [13:10:35] RECOVERY - Check systemd state on kafka-jumbo1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:37] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2): Document data pipeline and data set ownership - https://phabricator.wikimedia.org/T346295 (10lbowmaker) 05Open→03Resolved [13:10:49] !log systemctl reset-failed on kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service on kafka-jumbo1001 [13:10:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:10:53] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, 10Patch-For-Review: Enum with an entry of `null` should fail jsonschema-tools validation - https://phabricator.wikimedia.org/T344511 (10lbowmaker) 05Open→03Resolved [13:11:19] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: [BUG] mediawi-page-content-change-enrich: cross DC network calls to swift are failing - https://phabricator.wikimedia.org/T346877 (10lbowmaker) 05Open→03Resolved [13:12:19] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: eventutilities-python: Gitlab CI pipeline should use memory optimized runners. - https://phabricator.wikimedia.org/T346084 (10lbowmaker) 05Open→03Resolved [13:12:37] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: [SPIKE] Should we enable compression on kafka jumbo? - https://phabricator.wikimedia.org/T345657 (10lbowmaker) 05Open→03Resolved [13:12:43] (SystemdUnitFailed) firing: (2) kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service Failed on kafka-jumbo1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:12:51] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: Document the onboarding journey on Event Platfrom - https://phabricator.wikimedia.org/T345193 (10lbowmaker) 05Open→03Resolved [13:15:47] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (10lbowmaker) [13:17:13] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: [SPIKE] Can we identify indicators to inform an SLO for event emission and intake? 
- https://phabricator.wikimedia.org/T345195 (10lbowmaker) [13:17:51] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: Define Flink k8s operator SLO - https://phabricator.wikimedia.org/T345914 (10lbowmaker) [13:18:00] 10Data-Engineering, 10Data-Catalog, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10lbowmaker) [13:20:24] joal: https://stream-beta.wmflabs.org/ seems to work, still a few nits [13:20:24] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, 10Patch-For-Review: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (10lbowmaker) [13:20:38] and I also need to find how to trigger new events, to then be displayed [13:20:38] Awesome job elukey :) [13:20:46] not straightforward in deployment-prep [13:20:52] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10elukey) Fixed the last issue, I needed to add an explicit `stream_ui_enabled: true` to the config. All works, there... [13:26:35] 10Data-Engineering, 10Data-Platform-SRE, 10SRE Observability, 10Data Engineering and Event Platform Team (Sprint 3): Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430 (10lbowmaker) [13:26:43] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3): Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 (10lbowmaker) [13:27:12] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3), 10Patch-For-Review: Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10lbowmaker) [13:27:38] 10Data-Engineering, 10Machine-Learning-Team, 10Wikimedia Enterprise, 10Data Engineering and Event Platform Team (Sprint 3), and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10lbowmaker) [13:29:06] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (10lbowmaker) [13:29:29] 10Data-Engineering, 10Discovery-Search, 10serviceops-radar, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: [NEEDS GROOMING] Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (10lbowmaker) [13:34:00] 10Data-Platform-SRE, 10Cloud-VPS, 10cloud-services-team, 10User-aborrero: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10bking) >>! In T346948#9184440, @Andrew wrote: > @bking two questions > > 1) (a repeat of T324147) do y'all still want these servers to do things that can't be... [13:35:51] 10Data-Platform-SRE: Investigate failure of Hadoop namenode coinciding with krb1001 reboot - https://phabricator.wikimedia.org/T346135 (10BTullis) I've found it impossible to find time to work on this so far, given a number of other incidents we observed this week and overall workload. Maybe we will have a quiet... 
[14:00:00] 10Data-Engineering, 10Data-Platform-SRE: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (10BTullis) [14:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [14:41:10] 10Data-Platform-SRE: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (10Jclark-ctr) [14:50:04] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10Jclark-ctr) [15:12:35] 10Data-Platform-SRE, 10Cloud-VPS, 10cloud-services-team, 10User-aborrero: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10EBernhardson) >>! In T346948#9210654, @bking wrote: > >>>! In T346948#9184440, @Andrew wrote: >> @bking two questions >> >> 1) (a repeat of T324147) do y'all... [15:16:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [15:22:20] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BTullis) I found some sources of the `/srv/dumps/xmldatadumps/public/other/misc` folder that were being rsynced from... [15:26:42] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [15:26:42] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-operator_name=All&var-helm_release=main&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [15:31:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [15:32:44] 10Data-Engineering, 10Data-Platform-SRE, 10Epic: Data Catalog MVP - https://phabricator.wikimedia.org/T299910 (10BTullis) 05Open→03Resolved I'm going to be bold and resolve this epic. We have completed all of the tasks that were originally defined to get DataHub to an MVP stage. There are improvements to... [15:34:23] ^^ I'm looking into the mw_page_content_change_enrich alert [15:35:52] inflatador: I think that joal and elukey and brouberol have been working on it for much of the day, under: T347676 You may want to check with them. [15:35:53] T347676: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 [15:38:42] btullis ACK, I did a helmfile apply to bring the service back up, but destroyed it after I saw your msg.
Wonder if there's a way to suppress these alerts [15:40:10] Should be able to do it from the alertmanager web gui: https://wikitech.wikimedia.org/wiki/Alertmanager#Silences_&_acknowledgements [15:41:02] 10Data-Platform-SRE: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 (10bking) Hey all, I just saw the alert for mw-page-content-change-enrich and did a `helmfile apply` to get the service back up. @BTullis messaged me in IRC a... [15:41:27] may have already just been suppressed? Or maybe the suppression doesn't stop the IRC msgs? [15:41:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ... [15:41:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [15:41:46] Or...it resolved when I started the service and will fire again shortly [15:45:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [15:45:27] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [15:52:25] OK, it's silenced for the next 3 days [15:52:55] 10Data-Platform-SRE: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 (10bking) I just silenced `mw_page_content_change_enrich in codfw is not running` alerts for the next 3 days. [16:05:57] (03PS8) 10Sharvaniharan: New Event schema for mobile apps [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960100 (https://phabricator.wikimedia.org/T347729) [16:09:40] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BTullis) 05Open→03Resolved I think it's fixed. {F37853141,width=50%} [16:17:02] 10Data-Platform-SRE: Root cause Archiva outage from 2023-09-24 - https://phabricator.wikimedia.org/T347343 (10LSobanski) [16:21:42] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [16:22:04] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) p:05Triage→03Medium [16:23:21] 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10Eevans) [16:29:18] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10elukey) All right the code worked! I was able to see [[ https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Pola... 
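The 3-day silence added at 15:52:25 through the web GUI can also be set from the command line with amtool; a sketch, where the alertmanager URL is a placeholder and the author/comment are taken from the ticket and log:

    # silence the codfw jobmanager alert for 3 days (alertmanager URL below is a placeholder)
    amtool silence add \
      --alertmanager.url=https://alertmanager.example.org \
      --author=bking --duration=3d \
      --comment='T347676: mw_page_content_change_enrich down pending kafka-jumbo partition reassignment' \
      alertname=MediawikiPageContentChangeEnrichJobManagerNotRunning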
[17:12:59] (SystemdUnitFailed) firing: wikidatardf-truthy-dumps.service Failed on snapshot1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:47:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:49:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [19:10:17] 10Data-Engineering, 10Event-Platform, 10Platform Team Initiatives (New Hook System): Update EventBus to use the new HookContainer/HookRunner system - https://phabricator.wikimedia.org/T346539 (10Umherirrender) a:03Umherirrender [19:10:20] 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10bking) [[ https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit#heading=h.fs3knmjt7fjy | The essay "My Philosophy on Alerting" ]] w... 
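For the produce_canary_events.service failures above, the follow-up mirrors what was done for the mirror maker unit at 13:10:49: inspect the unit on an-launcher1002, then clear its failed state once the cause is understood (a sketch; the journal time window is arbitrary):

    # on an-launcher1002: see why the unit failed, then clear the failed state
    sudo systemctl status produce_canary_events.service
    sudo journalctl -u produce_canary_events.service --since '1 hour ago'
    sudo systemctl reset-failed produce_canary_events.service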
[19:13:44] (SystemdUnitCrashLoop) firing: crashloop on kafka-jumbo1005:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:18:44] (SystemdUnitCrashLoop) firing: (3) crashloop on kafka-jumbo1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:23:44] (SystemdUnitCrashLoop) firing: (4) crashloop on kafka-jumbo1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:28:44] (SystemdUnitCrashLoop) firing: (5) crashloop on kafka-jumbo1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:31:34] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 6.592e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [19:37:34] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10Jclark-ctr) [19:50:45] (03CR) 10Tsevener: "Thanks! It's looking good, all of my feedback is just documentation-related on the README." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960100 (https://phabricator.wikimedia.org/T347729) (owner: 10Sharvaniharan) [19:58:06] (03PS13) 10Milimetric: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu) [20:01:07] (03CR) 10Milimetric: Create a job to dump XML/SQL MW history files to HDFS (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu) [20:01:33] (03CR) 10Milimetric: Create a job to dump XML/SQL MW history files to HDFS (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu) [20:30:00] (03CR) 10Jdlrobson: [C: 03+2] Adds new web fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960141 (https://phabricator.wikimedia.org/T346106) (owner: 10Kimberly Sarabia) [20:30:53] (03Merged) 10jenkins-bot: Adds new web fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960141 (https://phabricator.wikimedia.org/T346106) (owner: 10Kimberly Sarabia) [22:02:59] (SystemdUnitFailed) firing: wikidatardf-truthy-dumps.service Failed on snapshot1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:39:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [23:07:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [23:28:45] (SystemdUnitCrashLoop) firing: (5) crashloop on kafka-jumbo1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
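A sketch of how the MirrorMaker crash loops and lag above could be inspected from a kafka-jumbo broker; the consumer group name and the main-eqiad bootstrap host are guesses, and the exact Kafka CLI wrapper varies by install:

    # why is the mirror unit crash-looping? (unit name taken from the earlier SystemdUnitFailed alerts)
    sudo journalctl -u kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service --since '1 hour ago'
    # consumer lag is measured against the source cluster (kafka main-eqiad); host and group are guesses
    kafka-consumer-groups.sh --bootstrap-server kafka-main1001.eqiad.wmnet:9092 \
      --describe --group kafka-mirror-main-eqiad_to_jumbo-eqiad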