[01:16:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:25:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:02:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[08:07:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[09:18:23] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (nfraison) a:nfraison
[09:24:30] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (nfraison) Impacted nodes are ` node /an-conf100[1-3]\.eqiad\.wmnet/ { role(analytics_cluster::zookeeper)...
[09:37:28] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (nfraison) @BTullis even if nodes will be able to rejoin the cluster if data is deleted I would be in favor of...
[09:49:29] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (nfraison) Current disk configuration 2 raid devices md0 and md1 ` Device Boot Start End Sector...
[09:51:54] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (elukey) +1 on keeping /var/lib/zookeeper when doing the reimages, seems the safest bet. IIRC zookeeper is not...
[10:03:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:13:56] PROBLEM - Check systemd state on an-airflow1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-kerberos@search.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[13:52:37] Analytics-Radar, Data-Engineering-Icebox, ChangeProp, MediaWiki-Core-JobQueue, and 2 others: Consider the possibility of separating ChangeProp and JobQueue on Kafka level - https://phabricator.wikimedia.org/T199431 (LSobanski) I don't see any specific actions for #SRE, removing the tag, please ad...
[14:46:35] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09), Patch-For-Review: Streaming services errors should be routed to an error event topic. - https://phabricator.wikimedia.org/T326536 (lbowmaker)
[14:47:05] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09), Patch-For-Review: Flink + Event Platform integration for writing into streams via Table API - https://phabricator.wikimedia.org/T324114 (lbowmaker)
[14:47:09] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09), Patch-For-Review: Flink SQL queries should access Kafka topics from a Catalog - https://phabricator.wikimedia.org/T322022 (lbowmaker)
[14:48:25] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): Deploy mediawiki-event-enrichment flink app to DSE k8s - https://phabricator.wikimedia.org/T325305 (lbowmaker) Open→Resolved
[14:48:28] Data-Engineering-Planning, Event-Platform Value Stream, Epic: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (lbowmaker)
[14:49:06] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): Remove hardcoded kafka parameters - https://phabricator.wikimedia.org/T329061 (lbowmaker) Open→Resolved
[14:49:09] Data-Engineering-Planning, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (lbowmaker)
[14:49:28] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jclark-ctr)
[14:50:25] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08), MW-1.40-notes (1.40.0-wmf.23; 2023-02-13): mediawiki.page-undelete stream is empty - https://phabricator.wikimedia.org/T329064 (lbowmaker) Open→Resolved
[14:50:54] Data-Engineering-Planning, Event-Platform Value Stream, Epic: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (lbowmaker)
[14:50:57] Data-Engineering, Event-Platform Value Stream (Sprint 08), MW-1.40-notes (1.40.0-wmf.23; 2023-02-13), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (lbowmaker) Open→Resolved
[14:51:10] Data-Engineering-Planning, Event-Platform Value Stream: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (lbowmaker)
[14:51:12] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08), Patch-For-Review: Tests for mediawiki-stream-enrichment-python flink job via eventutilities-python - https://phabricator.wikimedia.org/T326565 (lbowmaker) Open→Resolved
[16:13:54] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): eventutilities-python source and destination stream must be versioned - https://phabricator.wikimedia.org/T327866 (lbowmaker) Open→Resolved
[16:14:18] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): eventutilities-python should support nested row type info - https://phabricator.wikimedia.org/T327900 (lbowmaker) Open→Resolved
[18:20:54] Data-Engineering-Planning: Check home/HDFS leftovers of mepps - https://phabricator.wikimedia.org/T329820 (lbowmaker)
[18:23:24] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): Document Flink job deployment to k8s - https://phabricator.wikimedia.org/T329629 (lbowmaker)
[18:24:10] Data-Engineering-Planning, Event-Platform Value Stream: Document Flink job deployment to k8s - https://phabricator.wikimedia.org/T329629 (lbowmaker)
[18:26:20] Data-Engineering-Planning: User-centric documentation links - https://phabricator.wikimedia.org/T329550 (lbowmaker)
[18:29:54] Data-Engineering-Planning: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (lbowmaker)
[18:30:30] Data-Engineering-Planning, Data Pipelines: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (lbowmaker)
[18:33:25] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics-Platform-Planning: Determine what UserBucketService::getUserEditCountBucket should return for anons - https://phabricator.wikimedia.org/T329292 (lbowmaker)
[18:35:44] Analytics-Wikistats, Data-Engineering: String found too soon, while searching for ' Data-Engineering-Planning, Event-Platform Value Stream, Epic: Event Platform Value Stream Documentation Tasks - https://phabricator.wikimedia.org/T329628 (JArguello-WMF)
[18:37:55] Data-Engineering-Planning, Event-Platform Value Stream: Refactor Image Suggestions Feedback > Cassandra Flink Job and Deploy to DSE k8s - https://phabricator.wikimedia.org/T329524 (JArguello-WMF)
[18:37:57] Data-Engineering-Planning, Event-Platform Value Stream: jsonschema-tools tests - ensure that array items type is set - https://phabricator.wikimedia.org/T329515 (JArguello-WMF)
[18:37:59] Data-Engineering-Planning, Event-Platform Value Stream, MediaWiki-Vagrant: EventBus should not blackhole undeclared streams - https://phabricator.wikimedia.org/T329480 (JArguello-WMF)
[18:39:20] Analytics-Radar, Analytics-Wikistats, Data-Engineering: German derivative of Wikistats report shows marked difference for new editors in Aug vs Sep - https://phabricator.wikimedia.org/T178891 (odimitrijevic) Open→Declined
[19:14:47] Data-Engineering-Planning, Data Pipelines (Sprint 08): [Airflow] Build Druid Operator - https://phabricator.wikimedia.org/T309996 (JArguello-WMF) Open→Resolved
[21:14:48] Data-Engineering, Data Pipelines: Delete empty tables unique_devices_*_wide_* - https://phabricator.wikimedia.org/T329978 (odimitrijevic)
[21:37:39] Data-Engineering-Planning, Data Pipelines: Delete empty tables unique_devices_*_wide_* - https://phabricator.wikimedia.org/T329978 (lbowmaker)
[22:48:37] Data-Engineering-Planning, Data Pipelines: Deprecate old mobile datasets - https://phabricator.wikimedia.org/T329310 (odimitrijevic) Consider doing this at the same time as https://phabricator.wikimedia.org/T329978