[00:20:51] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:36:39] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:19:51] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:24:37] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:57:05] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:12:53] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:45:21] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:44:50] (CR) Gergő Tisza: [C: +1] homepagemodule: Document total_pageviews_count in action_data [schemas/event/secondary] - https://gerrit.wikimedia.org/r/886355 (https://phabricator.wikimedia.org/T328391) (owner: Kosta Harlan)
[07:20:07] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[07:31:00] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:16:47] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:17:21] (PS1) Aqu: Java Hive UDF thread safety [analytics/refinery/source] - https://gerrit.wikimedia.org/r/886800 (https://phabricator.wikimedia.org/T327072)
[08:30:53] (CR) Aqu: Remove Guava from dependency (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[08:32:41] Data-Engineering-Planning, Event-Platform Value Stream: Replace refinery-source Guava caches by Caffeine - https://phabricator.wikimedia.org/T325266 (Antoine_Quhen) I've performed some tests with Hive and Spark to make sure `geocoded_data` generates the same output as before: * https://phabricator.wikime...
[08:35:46] (PS2) Aqu: Java Hive UDF thread safety [analytics/refinery/source] - https://gerrit.wikimedia.org/r/886800 (https://phabricator.wikimedia.org/T327072)
[08:43:54] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi)
[08:49:13] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:49:38] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (MoritzMuehlenhoff)
[08:59:51] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): Flink SQL queries should access Kafka topics from a Catalog - https://phabricator.wikimedia.org/T322022 (tchin) This is so hard to describe through text I just made a [[ https://miro.com/app/board/uXjVPqmlfFA=/?share_link_id=532196066529...
[09:10:51] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:39:12] Data-Engineering, API Platform (Sprint 04), AQS2.0, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (SGupta-WMF) @BPirkle Yes , your analysis is correct . I will be creating separate bugs for these tickets
[09:57:50] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (akosiaris) >>! In T327925#8587186, @Marostegui wrote: >>>! In T327925#8587104, @Joe wrote: >> I would suggest that instead of hand...
[10:05:57] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (EChetty) Hey @Marostegui, Is this work blocking anything at the moment? It currently on our radar, but not prioritised. Please let me know if its a blocker or urgent and we can...
[10:08:52] Data-Engineering-Planning, Shared-Data-Infrastructure: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150 (EChetty)
[10:10:22] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (Marostegui) It is not blocking anything specifically, but this host needs to _at least_ be rebooted soon as there are several kernel upgrades it has missed as it has been up for...
[10:11:29] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui) Cool! I am going to repool the hosts then :)
[11:08:32] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:45:27] Data-Engineering, Event-Platform Value Stream, Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (Aklapper)
[11:47:02] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (ops-monitoring-bot) Host rebooted by btullis@cumin1001 with reason: picking up new kernel
[11:50:05] Data-Engineering-Planning, Epic, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Create puppet profiles for the new ceph cluster - https://phabricator.wikimedia.org/T328123 (BTullis) @EChetty this needs to be in the current sprint because it's the next logical piece of work on this project....
[11:58:36] PROBLEM - mysqld processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[12:01:52] ACKNOWLEDGEMENT - mysqld processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Btullis T304492 Issues after rebooting. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[12:03:37] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (BTullis) I rebooted db1108, but the systemd service definitions for the analytics_meta and matomo database instances no longer exist for some reason. Checking the cause now.
[12:19:30] Data-Engineering, Event-Platform Value Stream, Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (achou)
[12:43:28] RECOVERY - mysqld processes on db1108 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[12:47:34] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (dcaro)
[12:51:42] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui) I am repooling all the databases since we are going to fully depool codfw for reads.
[12:53:38] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (BTullis) >>! In T304492#8589207, @BTullis wrote: > I rebooted db1108, but the systemd service definitions for the analytics_meta and matomo database instances no longer exist for...
[12:55:45] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (EChetty)
[13:00:02] Data-Engineering, Product-Analytics (Kanban), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-07)): Superset Date Filter fix needed - https://phabricator.wikimedia.org/T318299 (EChetty) Open→Resolved
[13:05:20] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (Marostegui) I have also upgraded mariadb from 10.4.18 to 10.4.22
[13:30:25] (CR) DCausse: Remove Guava from dependency (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[13:58:13] Analytics, Data-Engineering-Planning, Data Pipelines: Add cawiki to clickstream dataset - https://phabricator.wikimedia.org/T327982 (JAllemandou)
[13:58:16] Data-Engineering-Planning, Data Pipelines, Privacy Engineering, Research, Epic: Add more languages to Wikipedia Clickstream - https://phabricator.wikimedia.org/T289532 (JAllemandou)
[14:09:47] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ssingh)
[14:15:35] (CR) Snwachukwu: [C: +2] Refactor and Expand External referer classification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769) (owner: Snwachukwu)
[14:15:41] (VarnishkafkaNoMessages) firing: varnishkafka on cp4052 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4052%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:15:42] (VarnishkafkaNoMessages) firing: varnishkafka on cp4052 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4052%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:16:46] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): Deploy mediawiki-event-enrichment flink app to DSE k8s - https://phabricator.wikimedia.org/T325305 (JArguello-WMF)
[14:18:11] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): Deploy mediawiki-event-enrichment flink app to DSE k8s - https://phabricator.wikimedia.org/T325305 (JArguello-WMF) Monitoring needs a task of its own @Ottomata can you help me with that?
[14:18:41] (VarnishkafkaNoMessages) firing: varnishkafka on cp4044 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4044%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:18:41] (VarnishkafkaNoMessages) firing: varnishkafka on cp4044 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4044%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:20:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:20:41] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4052 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4052%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:20:42] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4052 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4052%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:21:07] (CR) Gehel: "I've added a bunch of comments inline. This is mostly about style, but there are a few things about concurrency." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/886800 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[14:22:34] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:23:41] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4044 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4044%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:23:42] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4044 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4044%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:24:02] (Merged) jenkins-bot: Refactor and Expand External referer classification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769) (owner: Snwachukwu)
[14:39:40] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (MatthewVernon) If we're "just" depooling codfw it's worth noting we will still need to depool the affected ms-fe* nodes (since mw...
[14:45:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:48:50] Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (Ottomata) > I doubt that helmfile will be able to properly merge this array while applying the different levels of values files Tr...
[14:50:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:54:52] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:59:37] Data-Engineering-Planning, Event-Platform Value Stream: Flink Enrichment monitoring - https://phabricator.wikimedia.org/T328925 (Ottomata)
[15:00:08] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): Deploy mediawiki-event-enrichment flink app to DSE k8s - https://phabricator.wikimedia.org/T325305 (Ottomata) Done: {T328925}
[15:08:20] (PS2) Mforns: Support snapshot partitioning in HiveToDruid and DataFrameToDruid [analytics/refinery/source] - https://gerrit.wikimedia.org/r/886114 (https://phabricator.wikimedia.org/T324485)
[15:13:31] (PS3) Mforns: Support snapshot partitioning in HiveToDruid and DataFrameToDruid [analytics/refinery/source] - https://gerrit.wikimedia.org/r/886114 (https://phabricator.wikimedia.org/T324485)
[15:13:43] (CR) Mforns: Support snapshot partitioning in HiveToDruid and DataFrameToDruid (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/886114 (https://phabricator.wikimedia.org/T324485) (owner: Mforns)
[15:14:41] (VarnishkafkaNoMessages) firing: varnishkafka on cp4037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:14:42] (VarnishkafkaNoMessages) firing: varnishkafka on cp4037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:16:26] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:19:41] (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp4037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:19:42] (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp4037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:20:41] (VarnishkafkaNoMessages) firing: varnishkafka on cp4039 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4039%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:21:47] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work), Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (Antoine_Quhen) We are about to move our base config to Airflow 2.5 with Postgres: https://...
[15:25:41] (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp4037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:30:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:35:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:41:43] (CR) Ottomata: "Started adding inline comments about adapting HivePartition to support `snapshot` as a possible date time key. If that were done, then I " [analytics/refinery/source] - https://gerrit.wikimedia.org/r/886114 (https://phabricator.wikimedia.org/T324485) (owner: Mforns)
[15:45:16] Data-Engineering-Planning, Data Pipelines (Sprint 07), Product-Analytics (Kanban): Include EU Registered Country in the canonical country database - https://phabricator.wikimedia.org/T324995 (mforns)
[15:46:42] (VarnishkafkaNoMessages) firing: varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4047%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:46:42] (VarnishkafkaNoMessages) firing: varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4047%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:48:48] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:51:41] (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:51:42] (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:54:41] (VarnishkafkaNoMessages) firing: varnishkafka on cp4050 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4050%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:54:41] (VarnishkafkaNoMessages) firing: varnishkafka on cp4050 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4050%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:55:12] Data-Engineering-Planning, Data Pipelines (Sprint 07): [Airflow] Build Druid Operator - https://phabricator.wikimedia.org/T309996 (mforns)
[15:56:09] Data-Engineering-Planning, Data Pipelines (Sprint 07): [Airflow] Build Druid Operator - https://phabricator.wikimedia.org/T309996 (mforns) This is the related merge request: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/207
[15:59:41] (VarnishkafkaNoMessages) resolved: (4) varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:59:41] (VarnishkafkaNoMessages) resolved: (4) varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[16:01:37] Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (dcausse) >>! In T328478#8589620, @Ottomata wrote: >> I wish that the flink-app chart provided some tooling to help with that. > >...
[16:09:17] Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (Ottomata) Interesting, and let the helm dict merging of e.g. `config_files.my_app_config.content` handle the creation of merged con...
[16:10:24] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:15:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:20:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:22:16] Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (Ottomata) > This might be a good reason to rely on a JVM based one for python apps too (unfortunetly). Maybe we should try refinery...
[16:33:35] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly: Spike: Pageview Anomaly Analysis - https://phabricator.wikimedia.org/T328935 (EChetty)
[16:33:49] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work), Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (Gehel) a:bking
[16:33:56] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work), Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (MPhamWMF)
[16:38:46] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): [Tracking] Migrate Search Airflow jobs to Airflow 2 and use shared supporting code from the data engineering Airflow - https://phabricator.wikimedia.org/T318414 (Gehel)
[16:42:48] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:51:40] Analytics, Analytics-Wikistats, Data-Engineering-Planning, Data Pipelines: Merge Ks-Arab and Ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (EChetty)
[16:53:34] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:00:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:05:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:07:15] Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (dcausse) >>! In T328478#8590013, @Ottomata wrote: > Interesting, and let the helm dict merging of e.g. `config_files.my_app_config....
[17:21:55] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (jbond)
[17:25:54] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:30:53] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly, Wikipedia-iOS-App-Backlog, and 5 others: Analyze possible bot traffic for enwiki article Index (statistics), Index & XXX:_Return_of_Xander_Cage - https://phabricator.wikimedia.org/T328127 (Seddon) @SNowick_WMF Where necessary from the...
[17:45:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:50:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:30:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:35:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:43:13] Data-Engineering-Planning, Epic, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Create puppet profiles for the new ceph cluster - https://phabricator.wikimedia.org/T328123 (JArguello-WMF) @EChetty I see that this is tagged as an Epic. Is this an Epic? If so, this one should be broken down...
[18:59:31] Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (gmodena) >>! In T328478#8590055, @Ottomata wrote: > > Or, we just use a different solution for python than for JVM. As long as th...
[19:14:24] Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (gmodena) >>! In T328478#8590253, @dcausse wrote: > yes, the app would then be forced to have a feature to load options from a confi...
[19:15:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:20:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:45:32] Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (Ottomata) > In this scenario, would helm be responsible of parsing my_app_config.properties and merging output in its own dict? I...
[19:53:04] Data-Engineering, Equity-Landscape: Population output rank metrics - https://phabricator.wikimedia.org/T306624 (JAnstee_WMF) Looking good for labels and the transformations with the exception of two adjustments for column labels in the outputs: FROM: population_presence TO: population FR...
[19:56:45] Data-Engineering, Equity-Landscape: Access input metrics - https://phabricator.wikimedia.org/T324968 (JAnstee_WMF) @ntsako and @KCVelaga_WMF - looking on track, but as I commented in the QA workbook - we have a dropped input metric - we should be extracting both the connectivity_index score as well as i...
[20:00:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:02:14] Data-Engineering, Event-Platform Value Stream: Flink Enrichment monitoring - https://phabricator.wikimedia.org/T328925 (Ottomata) Starting a [[ https://grafana-rw.wikimedia.org/d/xp9E_EA4z/flink-enrichment-app-wip?orgId=1&var-datasource=eqiad+prometheus%2Fk8s-dse&var-namespace=stream-enrichment-poc&from...
[20:05:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:46] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:21:48] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work), 10Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10bking) [20:22:09] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work), 10Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10bking) [20:34:56] 10Data-Engineering, 10Equity-Landscape: Access input metrics - https://phabricator.wikimedia.org/T324968 (10JAnstee_WMF) @KCVelaga_WMF Also one change to the output labels FROM: access_presence_growth TO access [20:40:06] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:45:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:50:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:35:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:55:36] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:27:54] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:45:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:50:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:51:51] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e0e96453-af13-467f-a75e-ebd1c4122a32) set by bking@cumin2002 for... [23:30:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:35:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state