[04:35:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp5022 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5022%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[04:40:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5022 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5022%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[08:56:10] (CR) Gergő Tisza: [C: +2] homepagemodule: Add support for newimpact drawer/tour events [schemas/event/secondary] - https://gerrit.wikimedia.org/r/863385 (https://phabricator.wikimedia.org/T323619) (owner: Kosta Harlan)
[08:57:04] (Merged) jenkins-bot: homepagemodule: Add support for newimpact drawer/tour events [schemas/event/secondary] - https://gerrit.wikimedia.org/r/863385 (https://phabricator.wikimedia.org/T323619) (owner: Kosta Harlan)
[09:05:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:06:12] Quarry, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro, cloud-services-team (Kanban): [quarry] quarry-web-02 out of memory - https://phabricator.wikimedia.org/T324438 (dcaro)
[09:06:23] Quarry, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro, cloud-services-team (Kanban): [quarry] quarry-web-02 out of memory - https://phabricator.wikimedia.org/T324438 (dcaro) Open→Resolved
[09:31:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:33:19] Data-Engineering-Planning, Event-Platform Value Stream: Flink Tables should have a default ROWTIME column. - https://phabricator.wikimedia.org/T324144 (gmodena) a:gmodena
[10:16:59] Quarry, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro, cloud-services-team (Kanban): [quarry] worker-04 down - https://phabricator.wikimedia.org/T324402 (dcaro) Might be related to T324438
[11:26:39] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): [SPIKE] Evaluate a pyflink version of Mediawiki Stream Enrichment - https://phabricator.wikimedia.org/T323217 (gmodena)
[11:43:40] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (Stevemunene) @Ottomata This might affect the rare packages using python2 or the deployments that had already set up symlinks to pyt...
[11:45:43] !log restarting presto-server.service on an-presto1007 T323783
[11:45:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:45:46] T323783: Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783
[11:48:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5026 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5026%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[11:53:13] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5026 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5026%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[12:06:57] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (Stevemunene) an-presto1007 is now part of the cluster. the delay in joining the cluster was caused by the timing between the puppet...
[13:49:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:17] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Flink + Event Platform integration for writing into streams via Table API - https://phabricator.wikimedia.org/T324114 (JArguello-WMF)
[14:08:59] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Flink Tables should have a default ROWTIME column. - https://phabricator.wikimedia.org/T324144 (JArguello-WMF)
[14:17:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp5025 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5025%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:22:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5025 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5025%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:35:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5021 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5021%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:40:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5021 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5021%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:43:12] (PS1) Snwachukwu: Refactor and Expand External referer classification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769)
[14:46:19] (CR) CI reject: [V: -1] Refactor and Expand External referer classification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769) (owner: Snwachukwu)
[14:46:26] Quarry, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro, cloud-services-team (Kanban): [quarry] worker-04 down - https://phabricator.wikimedia.org/T324402 (rook) Thanks for dealing with that. The workers have a memory leak and eventually run out. Originally the idea...
[14:54:22] (PS12) Mazevedo: Add ios talk page interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/857759 (https://phabricator.wikimedia.org/T321841)
[14:55:01] (CR) CI reject: [V: -1] Add ios talk page interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/857759 (https://phabricator.wikimedia.org/T321841) (owner: Mazevedo)
[14:56:13] (PS13) Mazevedo: Add ios talk page interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/857759 (https://phabricator.wikimedia.org/T321841)
[15:39:44] Analytics-Clusters, Analytics-Kanban: Add automata value in agent_type field of the refined table {hawk} - https://phabricator.wikimedia.org/T95693 (Aklapper)
[15:50:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:32:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:40:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5027 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[16:45:13] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5027 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[16:45:34] Analytics-Clusters, Analytics-Radar, Data-Engineering-Planning, Event-Platform Value Stream, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (akosiaris)
[16:55:55] Analytics-Clusters, Analytics-Radar, Data-Engineering-Planning, Event-Platform Value Stream, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (akosiaris) @Ottomata, @elukey any updates on this?...
[17:02:34] Analytics-Clusters, Analytics-Radar, Data-Engineering-Planning, Event-Platform Value Stream, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (Ottomata) I would like to see config management for...
[17:07:12] (PS2) Snwachukwu: Refactor and Expand External referer classification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769)
[17:09:12] (CR) CI reject: [V: -1] Refactor and Expand External referer classification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769) (owner: Snwachukwu)
[17:19:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:28:43] Analytics-Clusters, Analytics-Radar, Data-Engineering-Planning, Event-Platform Value Stream, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (akosiaris) Open→Stalled Cool, thanks for th...
[17:30:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5024 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5024%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[17:30:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:35:13] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5024 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5024%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[19:23:21] Analytics-Radar, Privacy Engineering, Product-Analytics: Clarify the data retention extension process - https://phabricator.wikimedia.org/T256776 (kzimmerman) Open→Declined Both pages have been updated since this task was created, and this has not been a pressing issue for our team.
[19:31:37] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Flink Tables should have a default ROWTIME column. - https://phabricator.wikimedia.org/T324144 (gmodena) According to [the doc]( https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/table/concepts/time_attributes/) we can...
[19:35:21] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Flink Tables should have a default ROWTIME column. - https://phabricator.wikimedia.org/T324144 (Ottomata) > For this reason I went with option 1 and propose a change to the Catalog's getTable() method to add watermark metadata at run tim...
[19:40:36] Data-Engineering-Kanban, Product-Analytics, Wmfdata-Python, GitLab (Project Migration): Move Wmfdata-Python from Github to Gitlab - https://phabricator.wikimedia.org/T304544 (nshahquinn-wmf)
[19:48:46] Analytics-Radar, Product-Analytics: Investigate running Stan models on GPU - https://phabricator.wikimedia.org/T286493 (mpopov) Open→Declined No real need for this or bandwidth in the foreseeable future to make progress on this.
[20:19:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:25:39] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Create a shared flink docker image - https://phabricator.wikimedia.org/T316519 (Ottomata) Status update! [[ https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/858356 | flink and flink-kub...
[20:30:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:38:25] Data-Engineering-Planning, Event-Platform Value Stream, Shared-Data-Infrastructure: [SPIKE] Deploy event driven stateless Flink service to DSE cluster - https://phabricator.wikimedia.org/T320812 (Ottomata)
[21:38:29] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Create a shared flink docker image - https://phabricator.wikimedia.org/T316519 (Ottomata)
[22:46:23] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) a:BTullis
[23:08:14] Data-Engineering-Planning, Observability-Alerting, Shared-Data-Infrastructure, Traffic: Reduce/eliminate false positives for VarnishKafkaNoMessages alert - https://phabricator.wikimedia.org/T324522 (BTullis)