[00:20:51] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:36:39] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:19:51] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:24:37] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:57:05] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:12:53] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:45:21] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:44:50] <wikibugs>	 (CR) Gergő Tisza: [C: +1] homepagemodule: Document total_pageviews_count in action_data [schemas/event/secondary] - https://gerrit.wikimedia.org/r/886355 (https://phabricator.wikimedia.org/T328391) (owner: Kosta Harlan)
[07:20:07] <wikibugs>	 Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[07:31:00] <wikibugs>	 Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:16:47] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:17:21] <wikibugs>	 (PS1) Aqu: Java Hive UDF thread safety [analytics/refinery/source] - https://gerrit.wikimedia.org/r/886800 (https://phabricator.wikimedia.org/T327072)
[08:30:53] <wikibugs>	 (CR) Aqu: Remove Guava from dependency (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[08:32:41] <wikibugs>	 Data-Engineering-Planning, Event-Platform Value Stream: Replace refinery-source Guava caches by Caffeine - https://phabricator.wikimedia.org/T325266 (Antoine_Quhen) I've performed some tests with Hive and Spark to make sure `geocoded_data` generates the same output as before: * https://phabricator.wikime...
[08:35:46] <wikibugs>	 (PS2) Aqu: Java Hive UDF thread safety [analytics/refinery/source] - https://gerrit.wikimedia.org/r/886800 (https://phabricator.wikimedia.org/T327072)
[08:43:54] <wikibugs>	 Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi)
[08:49:13] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:49:38] <wikibugs>	 Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (MoritzMuehlenhoff)
[08:59:51] <wikibugs>	 Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): Flink SQL queries should access Kafka topics from a Catalog - https://phabricator.wikimedia.org/T322022 (tchin) This is so hard to describe through text I just made a [[ https://miro.com/app/board/uXjVPqmlfFA=/?share_link_id=532196066529...
[09:10:51] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:39:12] <wikibugs>	 Data-Engineering, API Platform (Sprint 04), AQS2.0, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (SGupta-WMF) @BPirkle Yes, your analysis is correct. I will be creating separate bugs for these tickets
[09:57:50] <wikibugs>	 Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (akosiaris) >>! In T327925#8587186, @Marostegui wrote: >>>! In T327925#8587104, @Joe wrote: >> I would suggest that instead of hand...
[10:05:57] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (EChetty) Hey @Marostegui, is this work blocking anything at the moment? It is currently on our radar, but not prioritised. Please let me know if it's a blocker or urgent and we can...
[10:08:52] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150 (EChetty)
[10:10:22] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (Marostegui) It is not blocking anything specifically, but this host needs to _at least_ be rebooted soon as there are several kernel upgrades it has missed as it has been up for...
[10:11:29] <wikibugs>	 Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui) Cool! I am going to repool the hosts then :)
[11:08:32] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:45:27] <wikibugs>	 Data-Engineering, Event-Platform Value Stream, Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (Aklapper)
[11:47:02] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (ops-monitoring-bot) Host rebooted by btullis@cumin1001 with reason: picking up new kernel
[11:50:05] <wikibugs>	 Data-Engineering-Planning, Epic, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Create puppet profiles for the new ceph cluster - https://phabricator.wikimedia.org/T328123 (BTullis) @EChetty this needs to be in the current sprint because it's the next logical piece of work on this project....
[11:58:36] <icinga-wm>	 PROBLEM - mysqld processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[12:01:52] <icinga-wm>	 ACKNOWLEDGEMENT - mysqld processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Btullis T304492 Issues after rebooting. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[12:03:37] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (BTullis) I rebooted db1108, but the systemd service definitions for the analytics_meta and matomo database instances no longer exist for some reason. Checking the cause now.
[12:19:30] <wikibugs>	 Data-Engineering, Event-Platform Value Stream, Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (achou)
[12:43:28] <icinga-wm>	 RECOVERY - mysqld processes on db1108 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[12:47:34] <wikibugs>	 Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (dcaro)
[12:51:42] <wikibugs>	 Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui) I am repooling all the databases since we are going to fully depool codfw for reads.
[12:53:38] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (BTullis) >>! In T304492#8589207, @BTullis wrote: > I rebooted db1108, but the systemd service definitions for the analytics_meta and matomo database instances no longer exist for...
[12:55:45] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (EChetty)
[13:00:02] <wikibugs>	 Data-Engineering, Product-Analytics (Kanban), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-07)): Superset Date Filter fix needed - https://phabricator.wikimedia.org/T318299 (EChetty) Open→Resolved
[13:05:20] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (Marostegui) I have also upgraded mariadb from 10.4.18 to 10.4.22
[13:30:25] <wikibugs>	 (CR) DCausse: Remove Guava from dependency (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[13:58:13] <wikibugs>	 Analytics, Data-Engineering-Planning, Data Pipelines: Add cawiki to clickstream dataset - https://phabricator.wikimedia.org/T327982 (JAllemandou)
[13:58:16] <wikibugs>	 Data-Engineering-Planning, Data Pipelines, Privacy Engineering, Research, Epic: Add more languages to Wikipedia Clickstream - https://phabricator.wikimedia.org/T289532 (JAllemandou)
[14:09:47] <wikibugs>	 Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ssingh)
[14:15:35] <wikibugs>	 (CR) Snwachukwu: [C: +2] Refactor and Expand External referer classification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769) (owner: Snwachukwu)
[14:15:41] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp4052 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4052%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:16:46] <wikibugs>	 Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): Deploy mediawiki-event-enrichment flink app to DSE k8s - https://phabricator.wikimedia.org/T325305 (JArguello-WMF)
[14:18:11] <wikibugs>	 Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): Deploy mediawiki-event-enrichment flink app to DSE k8s - https://phabricator.wikimedia.org/T325305 (JArguello-WMF) Monitoring needs a task of its own @Ottomata can you help me with that?
[14:18:41] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp4044 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4044%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:20:30] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:20:41] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp4052 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4052%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:21:07] <wikibugs>	 (CR) Gehel: "I've added a bunch of comments inline. This is mostly about style, but there are a few things about concurrency." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/886800 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[14:22:34] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:23:41] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp4044 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4044%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:24:02] <wikibugs>	 (Merged) jenkins-bot: Refactor and Expand External referer classification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769) (owner: Snwachukwu)
[14:39:40] <wikibugs>	 Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (MatthewVernon) If we're "just" depooling codfw it's worth noting we will still need to depool the affected ms-fe* nodes (since mw...
[14:45:18] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:48:50] <wikibugs>	 Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (Ottomata) > I doubt that helmfile will be able to properly merge this array while applying the different levels of values files  Tr...
[14:50:44] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:54:52] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:59:37] <wikibugs>	 Data-Engineering-Planning, Event-Platform Value Stream: Flink Enrichment monitoring - https://phabricator.wikimedia.org/T328925 (Ottomata)
[15:00:08] <wikibugs>	 Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): Deploy mediawiki-event-enrichment flink app to DSE k8s - https://phabricator.wikimedia.org/T325305 (Ottomata) Done: {T328925}
[15:08:20] <wikibugs>	 (PS2) Mforns: Support snapshot partitioning in HiveToDruid and DataFrameToDruid [analytics/refinery/source] - https://gerrit.wikimedia.org/r/886114 (https://phabricator.wikimedia.org/T324485)
[15:13:31] <wikibugs>	 (PS3) Mforns: Support snapshot partitioning in HiveToDruid and DataFrameToDruid [analytics/refinery/source] - https://gerrit.wikimedia.org/r/886114 (https://phabricator.wikimedia.org/T324485)
[15:13:43] <wikibugs>	 (CR) Mforns: Support snapshot partitioning in HiveToDruid and DataFrameToDruid (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/886114 (https://phabricator.wikimedia.org/T324485) (owner: Mforns)
[15:14:41] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp4037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:16:26] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:19:41] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp4037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:20:41] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp4039 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4039%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:21:47] <wikibugs>	 Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work), Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (Antoine_Quhen) We are about to move our base config to Airflow 2.5 with Postgres: https://...
[15:25:41] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp4037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:30:26] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:35:50] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:41:43] <wikibugs>	 (CR) Ottomata: "Started adding inline comments about adapting HivePartition to support `snapshot` as a possible date time key.  If that were done, then I " [analytics/refinery/source] - https://gerrit.wikimedia.org/r/886114 (https://phabricator.wikimedia.org/T324485) (owner: Mforns)
[15:45:16] <wikibugs>	 Data-Engineering-Planning, Data Pipelines (Sprint 07), Product-Analytics (Kanban): Include EU Registered Country in the canonical country database - https://phabricator.wikimedia.org/T324995 (mforns)
[15:46:42] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4047%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:48:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:51:41] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:54:41] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp4050 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4050%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:55:12] <wikibugs>	 Data-Engineering-Planning, Data Pipelines (Sprint 07): [Airflow] Build Druid Operator - https://phabricator.wikimedia.org/T309996 (mforns)
[15:56:09] <wikibugs>	 Data-Engineering-Planning, Data Pipelines (Sprint 07): [Airflow] Build Druid Operator - https://phabricator.wikimedia.org/T309996 (mforns) This is the related merge request: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/207
[15:59:41] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (4) varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[16:01:37] <wikibugs>	 Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (dcausse) >>! In T328478#8589620, @Ottomata wrote: >> I wish that the flink-app chart provided some tooling to help with that. >  >...
[16:09:17] <wikibugs>	 Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (Ottomata) Interesting, and let the helm dict merging of e.g. `config_files.my_app_config.content` handle the creation of merged con...
[16:10:24] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:15:30] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:20:54] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:22:16] <wikibugs>	 Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (Ottomata) > This might be a good reason to rely on a JVM based one for python apps too (unfortunately). Maybe we should try refinery...
[16:33:35] <wikibugs>	 Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly: Spike: Pageview Anomaly Analysis - https://phabricator.wikimedia.org/T328935 (EChetty)
[16:33:49] <wikibugs>	 Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work), Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (Gehel) a:bking
[16:33:56] <wikibugs>	 Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work), Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (MPhamWMF)
[16:38:46] <wikibugs>	 Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): [Tracking] Migrate Search Airflow jobs to Airflow 2 and use shared supporting code from the data engineering Airflow - https://phabricator.wikimedia.org/T318414 (Gehel)
[16:42:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:51:40] <wikibugs>	 Analytics, Analytics-Wikistats, Data-Engineering-Planning, Data Pipelines: Merge Ks-Arab and Ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (EChetty)
[16:53:34] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:00:30] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:05:44] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:07:15] <wikibugs>	 Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (dcausse) >>! In T328478#8590013, @Ottomata wrote: > Interesting, and let the helm dict merging of e.g. `config_files.my_app_config....
[17:21:55] <wikibugs>	 Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (jbond)
[17:25:54] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:30:53] <wikibugs>	 Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly, Wikipedia-iOS-App-Backlog, and 5 others: Analyze possible bot traffic for enwiki article Index (statistics), Index & XXX:_Return_of_Xander_Cage - https://phabricator.wikimedia.org/T328127 (Seddon) @SNowick_WMF  Where necessary from the...
[17:45:20] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:50:44] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:30:24] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:35:48] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:43:13] <wikibugs>	 Data-Engineering-Planning, Epic, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Create puppet profiles for the new ceph cluster - https://phabricator.wikimedia.org/T328123 (JArguello-WMF) @EChetty I see that this is tagged as an Epic. Is this an Epic? If so, this one should be broken down...
[18:59:31] <wikibugs>	 Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (gmodena) >>! In T328478#8590055, @Ottomata wrote: >  > Or, we just use a different solution for python than for JVM.  As long as th...
[19:14:24] <wikibugs>	 Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (gmodena) >>! In T328478#8590253, @dcausse wrote: > yes, the app would then be forced to have a feature to load options from a confi...
[19:15:24] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:20:48] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:45:32] <wikibugs>	 Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (Ottomata) > In this scenario, would helm be responsible of parsing my_app_config.properties and merging output in its own dict?  I...
[19:53:04] <wikibugs>	 Data-Engineering, Equity-Landscape: Population output rank metrics - https://phabricator.wikimedia.org/T306624 (JAnstee_WMF) Looking good for labels and the transformations with the exception of two adjustments for column labels in the outputs:    FROM:    population_presence         TO: population    FR...
[19:56:45] <wikibugs>	 Data-Engineering, Equity-Landscape: Access input metrics - https://phabricator.wikimedia.org/T324968 (JAnstee_WMF) @ntsako  and @KCVelaga_WMF - looking on track, but as I commented in the QA workbook - we have a dropped input metric - we should be extracting both the connectivity_index score as well as i...
[20:00:14] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:02:14] <wikibugs>	 Data-Engineering, Event-Platform Value Stream: Flink Enrichment monitoring - https://phabricator.wikimedia.org/T328925 (Ottomata)  Starting a [[ https://grafana-rw.wikimedia.org/d/xp9E_EA4z/flink-enrichment-app-wip?orgId=1&var-datasource=eqiad+prometheus%2Fk8s-dse&var-namespace=stream-enrichment-poc&from...
[20:05:38] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:07:46] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:21:48] <wikibugs>	 Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work), Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (bking)
[20:22:09] <wikibugs>	 Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work), Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (bking)
[20:34:56] <wikibugs>	 Data-Engineering, Equity-Landscape: Access input metrics - https://phabricator.wikimedia.org/T324968 (JAnstee_WMF) @KCVelaga_WMF Also one change to the output labels FROM: access_presence_growth      TO    access
[20:40:06] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:45:06] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:50:30] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:30:14] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:35:38] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:55:36] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:27:54] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:45:24] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:50:48] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:51:51] <wikibugs>	 Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e0e96453-af13-467f-a75e-ebd1c4122a32) set by bking@cumin2002 for...
[23:30:14] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:35:36] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state