[00:48:53] (SystemdUnitFailed) firing: monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:04:23] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:18:53] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:24:45] 10Data-Platform-SRE: Project future physical host usage for Search Platform-owned services - https://phabricator.wikimedia.org/T350885 (10wiki_willy) Thanks for working on this @bking. I'm mainly looking to see how much future growth you're looking at (a rough estimate is fine), if you have any requests for the... 
[01:46:10] 10Data-Engineering, 10CommonsMetadata, 10DiscussionTools, 10MediaWiki-extensions-Scribunto, and 6 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (10matmarex) [01:48:53] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:03:53] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:16:14] 10Data-Engineering, 10CommonsMetadata, 10DiscussionTools, 10MediaWiki-extensions-Scribunto, and 6 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (10matmarex) [03:58:53] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.325% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [04:20:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:23:53] (SystemdUnitFailed) resolved: monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:26:15] PROBLEM - Check systemd state on an-launcher1002 
is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:28:53] (SystemdUnitFailed) firing: monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:58:53] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.325% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:09:15] (EventgateValidationErrors) firing: ... [08:09:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:28:53] (SystemdUnitFailed) firing: (2) prometheus-node-textfile-prometheus-check-certificate-expiry.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:49:21] 10Data-Engineering, 10CommonsMetadata, 10DiscussionTools, 10MediaWiki-extensions-Scribunto, and 7 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (10ItamarWMDE) [09:46:22] (03CR) 10Phuedx: "This is great." 
[schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/968714 (https://phabricator.wikimedia.org/T349729) (owner: 10Kimberly Sarabia) [09:56:29] 10Data-Platform-SRE, 10Discovery-Search (Current work): Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (10Gehel) 05Open→03Resolved [09:56:33] 10Data-Platform-SRE, 10Discovery-Search (Current work): Migrate search-loader into Kubernetes - https://phabricator.wikimedia.org/T346189 (10Gehel) [09:56:37] 10Data-Platform-SRE, 10Discovery-Search (Current work): Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10Gehel) [11:43:30] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (10BTullis) Confirmed, both servers can see the full 256 GB of RAM. Thanks again @VRiley-WMF. [11:43:48] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (10BTullis) [11:58:53] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.325% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [12:09:30] (EventgateValidationErrors) firing: ... [12:09:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [12:13:19] (03CR) 10WMDE-Fisch: "This change is ready for review." 
[analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/973318 (https://phabricator.wikimedia.org/T350411) (owner: 10WMDE-Fisch) [12:14:33] (03PS2) 10WMDE-Fisch: Remove deprecated tech wish scripts [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/973318 (https://phabricator.wikimedia.org/T350411) [12:19:16] (EventgateValidationErrors) resolved: ... [12:19:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [12:28:53] (SystemdUnitFailed) firing: (2) prometheus-node-textfile-prometheus-check-certificate-expiry.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:30:45] (EventgateValidationErrors) firing: ... 
[12:30:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [13:04:46] stevemunene: I was working on https://gerrit.wikimedia.org/r/c/operations/puppet/+/973308 this morning, and found a subtle typo in netboot.cfg related to druid hosts: https://github.com/wikimedia/operations-puppet/blob/production/modules/install_server/files/autoinstall/netboot.cfg#L150 [13:05:01] we're missing an `echo` there [13:08:48] Sharp eyes. Thanks brouberol. [13:09:56] great catch brouberol [13:14:00] that's the beauty of autogenerating it: we notice the existing issues [13:21:51] 10Data-Engineering (Sprint 5), 10Data-Platform, 10Movement-Insights: Iceberg unique devices table reporting incorrect numbers for 2023-10-01 - https://phabricator.wikimedia.org/T350920 (10lbowmaker) [13:31:05] 10Data-Engineering, 10Data Engineering and Event Platform Team: [Iceberg Migration] Migrate pageview tables to Iceberg - https://phabricator.wikimedia.org/T347690 (10lbowmaker) [13:31:07] 10Data-Engineering, 10Data Engineering and Event Platform Team: [Iceberg Migration] P.O.C. on Iceberg sensor using Snapshot metadata to keep status of updates - https://phabricator.wikimedia.org/T340471 (10lbowmaker) [13:31:09] 10Data-Engineering, 10Data Engineering and Event Platform Team: [Iceberg Migration] P.O.C. on Iceberg sensor using Iceberg table to keep status of updates - https://phabricator.wikimedia.org/T340463 (10lbowmaker) [13:31:11] 10Data-Engineering, 10Data Engineering and Event Platform Team: [Iceberg Migration] P.O.C. 
on Iceberg sensor using Postgres table to keep status of updates - https://phabricator.wikimedia.org/T340466 (10lbowmaker) [13:31:14] 10Data-Engineering, 10Data Engineering and Event Platform Team: [Iceberg Migration] Implement mechanism for automatic Iceberg data deletion and optimization - https://phabricator.wikimedia.org/T338065 (10lbowmaker) [13:31:16] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Epic: [Iceberg Migration] Apache Iceberg Migration - https://phabricator.wikimedia.org/T333013 (10lbowmaker) [13:34:17] 10Data-Engineering: [Iceberg Migration] Migrate pageview tables to Iceberg - https://phabricator.wikimedia.org/T347690 (10lbowmaker) [13:34:19] 10Data-Engineering: [Iceberg Migration] P.O.C. on Iceberg sensor using Snapshot metadata to keep status of updates - https://phabricator.wikimedia.org/T340471 (10lbowmaker) [13:34:21] 10Data-Engineering: [Iceberg Migration] P.O.C. on Iceberg sensor using Postgres table to keep status of updates - https://phabricator.wikimedia.org/T340466 (10lbowmaker) [13:34:23] 10Data-Engineering: [Iceberg Migration] P.O.C. 
on Iceberg sensor using Iceberg table to keep status of updates - https://phabricator.wikimedia.org/T340463 (10lbowmaker) [13:34:25] 10Data-Engineering: [Iceberg Migration] Implement mechanism for automatic Iceberg data deletion and optimization - https://phabricator.wikimedia.org/T338065 (10lbowmaker) [13:34:27] 10Data-Engineering, 10Epic: [Iceberg Migration] Apache Iceberg Migration - https://phabricator.wikimedia.org/T333013 (10lbowmaker) [13:36:02] 10Data-Engineering: [Data Quality] Implement monitoring and alerting for XYZ - https://phabricator.wikimedia.org/T349457 (10lbowmaker) [13:36:04] 10Data-Engineering: [Data Quality] [NEEDS GROOMING][SPIKE] Define how we can validate that mw.page_content_change is complete - https://phabricator.wikimedia.org/T345917 (10lbowmaker) [13:36:07] 10Data-Engineering, 10Epic: [Data Quality] SDS3.3 - Logging, Monitoring and Alerting Improvements for Data Quality Incidents - https://phabricator.wikimedia.org/T345912 (10lbowmaker) [13:40:33] 10Data-Engineering, 10Event-Platform: [Event Platform] Can we import metrics from logstash to promethues? - https://phabricator.wikimedia.org/T347484 (10lbowmaker) [13:40:39] 10Data-Engineering, 10Event-Platform: [NEEDS GROOMING] schema services should be moved to k8s - https://phabricator.wikimedia.org/T347421 (10lbowmaker) [13:40:41] 10Data-Engineering, 10Event-Platform, 10Wikimedia-production-error: [Event Platform] Error: Call to a member function exists() on null (via EventBus PageChangeEventSerializer) - https://phabricator.wikimedia.org/T346355 (10lbowmaker) [13:40:44] 10Data-Engineering, 10Event-Platform: [SPIKE] Use Flink for batch backfilling - https://phabricator.wikimedia.org/T324108 (10lbowmaker) [13:40:46] 10Data-Engineering, 10Event-Platform: [SPIKE] Should we introduce static typing to Event Platform nodejs codebases? 
- https://phabricator.wikimedia.org/T345389 (10lbowmaker) [13:40:48] 10Data-Engineering, 10EventStreams, 10Event-Platform: [Event Platform] Event streams don't respect milliseconds UTC unix epoch timestamp in since parameter - https://phabricator.wikimedia.org/T345606 (10lbowmaker) [13:40:50] 10Data-Engineering, 10Epic, 10Event-Platform: [Event Platform] Flink Operations - https://phabricator.wikimedia.org/T328561 (10lbowmaker) [13:40:52] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: [SPIKE] Investigate what happens to deployed Flink clusters if the k8s operator goes down? - https://phabricator.wikimedia.org/T346231 (10lbowmaker) [13:40:54] 10Data-Engineering, 10Discovery-Search, 10serviceops-radar, 10Event-Platform: [Event Platform] [NEEDS GROOMING] Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (10lbowmaker) [13:40:56] 10Data-Engineering, 10Epic, 10Event-Platform: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (10lbowmaker) [13:40:58] 10Data-Engineering, 10Event-Platform: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956 (10lbowmaker) [13:41:00] 10Analytics, 10Data-Engineering, 10Event-Platform: [Event Platform] Add expiry info to mediawiki.page-restrictions-change stream - https://phabricator.wikimedia.org/T282057 (10lbowmaker) [13:47:40] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4): [Data Quality] List out data path options for Prometheus vs. 
Hive as a metrics backend - https://phabricator.wikimedia.org/T349744 (10lbowmaker) 05Open→03Resolved [13:48:02] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] Enable snappy compression for Flink Kafka producers - https://phabricator.wikimedia.org/T345805 (10lbowmaker) 05Open→03Resolved [13:48:04] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: [SPIKE] Should we enable compression on kafka jumbo? - https://phabricator.wikimedia.org/T345657 (10lbowmaker) [13:48:25] 10Data-Engineering, 10Epic: [Data Quality] SDS3.3 - Logging, Monitoring and Alerting Improvements for Data Quality Incidents - https://phabricator.wikimedia.org/T345912 (10lbowmaker) [13:48:40] 10Data-Engineering, 10Epic: [Data Quality] SDS3.3 - Logging, Monitoring and Alerting Improvements for Data Quality Incidents - https://phabricator.wikimedia.org/T345912 (10lbowmaker) [13:48:53] (SystemdUnitFailed) firing: (3) prometheus-node-textfile-prometheus-check-certificate-expiry.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:49:08] 10Data-Engineering, 10EventStreams, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10lbowmaker) 05Open→03Resolved [13:49:21] 10Data-Engineering, 10CX-cxserver, 10Citoid, 10Content-Transform-Team-WIP, and 9 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10lbowmaker) [13:49:37] 10Data-Engineering, 10serviceops, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 
(10lbowmaker) 05Open→03Resolved [13:49:56] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config fetch errors - https://phabricator.wikimedia.org/T326002 (10lbowmaker) 05Open→03Reso... [13:50:22] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10lbowmaker) 05Open→03Resolved [13:50:34] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10lbowmaker) 05Open→03Resolved [13:52:15] 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: [Event Platform] Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854 (10lbowmaker) [13:53:14] 10Data-Engineering, 10serviceops, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (10Ottomata) Okay, it turns out I set the wrong value for... [13:53:36] 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: [Event Platform] Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854 (10lbowmaker) [13:55:14] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config fetch errors - https://phabricator.wikimedia.org/T326002 (10Ottomata) I wonder if we sh... 
[13:56:27] 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: [Event Platform] Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854 (10lbowmaker) [14:03:53] (SystemdUnitFailed) firing: (3) prometheus-node-textfile-prometheus-check-certificate-expiry.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:04:54] ^ brouberol and/or stevemunene - this systemd failure might interest you. I can check it out if you prefer. [14:05:04] 10Data-Engineering (Sprint 5), 10Event-Platform, 10Patch-For-Review: [Event Platform] Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854 (10lbowmaker) [14:05:31] I'll have a look, I wrote it. stevemunene, do you want to pair? [14:05:37] I was actually looking at it, about to ping brouberol; it's about skein [14:05:43] haha [14:05:44] `Nov 10 00:00:08 an-airflow1007 prometheus-check-certificate-expiry[23555]: FileNotFoundError: [Errno 2] No such file or directory: '/srv/airflow-wmde/.skein/skein.crt'` [14:06:11] hmm, that's the new airflow wmde, right? [14:06:16] 10Data-Engineering (Sprint 5): [Data Platform] Document proposal for data-product configuration store - https://phabricator.wikimedia.org/T349746 (10lbowmaker) [14:06:19] yes it is [14:06:27] Interesting, so probably skein hasn't ever run on this host yet, without any DAGs. [14:06:44] ok, so, I think you need to run the `regenerate-skein-certificate` systemd service [14:06:53] 10Data-Engineering (Sprint 5), 10Event-Platform: [Data Quality] [SPIKE] Can we identify indicators to inform an SLO for event emission and intake? 
- https://phabricator.wikimedia.org/T345195 (10lbowmaker) [14:07:08] that will create the certificate, which will then allow the prometheus exporter scheduled service to run [14:07:37] 10Data-Engineering (Sprint 5), 10Event-Platform, 10Patch-For-Review: [Event Platform] mw-page-content-change-enrich should (re)produce kafka keys - https://phabricator.wikimedia.org/T338231 (10lbowmaker) [14:07:57] ack brouberol running it rn [14:09:33] speaking of btullis: we have this renew-skein-certificate systemd service that .. well. renews the skein certificate, that is scheduled to run on Dec 1st next time, on all hosts at the same time. I suggest we test it out on an active instance first, as the worst case scenario is, we crash all airflow instances at the same time [14:09:39] 10Data-Engineering (Sprint 5), 10Observability-Metrics: [Data Quality] Sending Apache Spark metrics to PushGateway - https://phabricator.wikimedia.org/T297231 (10lbowmaker) [14:10:08] 10Data-Engineering (Sprint 5), 10Event-Platform: [Event Platform] mediawiki.page_content_change.v1 topic should be partitioned. 
- https://phabricator.wikimedia.org/T345806 (10lbowmaker) [14:11:04] 10Analytics, 10Data-Engineering (Sprint 5), 10Event-Platform, 10Patch-For-Review, 10User-notice: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10lbowmaker) [14:13:16] 10Data-Engineering (Sprint 5): [Data Quality] [Needs Grooming] Collect requirements to define prioritized data pipeline and data metrics - https://phabricator.wikimedia.org/T350409 (10lbowmaker) [14:13:24] 10Data-Engineering (Sprint 5): [Data Quality] Define persona and user stories for system and data monitoring and alerting - https://phabricator.wikimedia.org/T349454 (10lbowmaker) [14:13:27] 10Data-Engineering (Sprint 5): [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (10lbowmaker) [14:13:34] 10Data-Engineering (Sprint 5): [Data Quality] Visualize platform and system alerts on a dashboard - https://phabricator.wikimedia.org/T349765 (10lbowmaker) [14:13:37] 10Data-Engineering (Sprint 5), 10serviceops, 10Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (10lbowmaker) [14:13:39] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Observability-Metrics, 10Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10lbowmaker) [14:14:02] brouberol: Sounds like a very good idea. Will it run at midnight UTC? I wonder if it's an idea to make it run on a Weekday? 
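The failure quoted at 14:05:44 is a bare `FileNotFoundError` raised because the skein certificate had never been generated on the new host. As a rough illustration only, a check could test for the file first and report a readable status instead of a traceback. The sketch below is not the actual `prometheus-check-certificate-expiry` script; the function names `certificate_present` and `describe` are hypothetical, and only the certificate path is taken from the log.

```python
from pathlib import Path

# Path copied from the error message in the log; everything else is illustrative.
SKEIN_CERT = Path("/srv/airflow-wmde/.skein/skein.crt")


def certificate_present(cert_path: Path) -> bool:
    """Return True if the certificate file exists and is not empty."""
    return cert_path.is_file() and cert_path.stat().st_size > 0


def describe(cert_path: Path) -> str:
    """Return a one-line, human-readable status for the certificate file."""
    if certificate_present(cert_path):
        return f"{cert_path}: present"
    # Hypothetical wording; the fix applied in the log was to run the
    # certificate-generating systemd service on the host.
    return f"{cert_path}: missing, generate the certificate before checking expiry"


if __name__ == "__main__":
    print(describe(SKEIN_CERT))
```

A guard like this would let a freshly provisioned host without any DAGs report "missing" rather than fail the systemd unit with an unhandled exception.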
[14:14:30] 10Data-Engineering (Sprint 5), 10Patch-For-Review: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532 (10lbowmaker) [14:14:50] I'll confess that I was extremely lazy and used `@monthly` [14:15:18] meaning it'll run on the 1st of the month, at 00:00 UTC, which, in retrospect, is a terrible idea [14:15:29] 10Data-Engineering (Sprint 5): [Maintenance] Understand and inventory change-propagation use cases, deployments, and custom business logic - https://phabricator.wikimedia.org/T350156 (10lbowmaker) [14:15:31] 10Data-Engineering (Sprint 5): [Data Quality] Log Spark metrics and visualize on dashboard - https://phabricator.wikimedia.org/T349764 (10lbowmaker) [14:15:33] 10Data-Engineering (Sprint 5): [Data Quality] Calculate and log post processing record counts metrics for unique devices - https://phabricator.wikimedia.org/T349455 (10lbowmaker) [14:15:35] 10Data-Engineering (Sprint 5): [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables - https://phabricator.wikimedia.org/T347879 (10lbowmaker) [14:15:37] 10Data-Engineering (Sprint 5): [Maintenance] Reduce number of HDFS files - https://phabricator.wikimedia.org/T347975 (10lbowmaker) [14:15:40] 10Data-Engineering (Sprint 5): [Maintenance] Delete sanitized events removed from sanitization list - https://phabricator.wikimedia.org/T347586 (10lbowmaker) [14:15:42] 10Data-Engineering (Sprint 5), 10Event-Platform: [Event Platform] eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10lbowmaker) [14:15:45] (EventgateValidationErrors) resolved: ... 
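The concern with `@monthly` above is that the 1st of the month can fall on a weekend, when nobody is around if every Airflow instance breaks at once. A minimal sketch of the "first weekday of the month" alternative follows, assuming that "business day" simply means Monday to Friday and ignoring public holidays. The function name `first_business_day` is hypothetical and this is not the change made for the actual timer.

```python
from datetime import date, timedelta


def first_business_day(year: int, month: int) -> date:
    """Return the first Monday-to-Friday date of the given month.

    Public holidays are deliberately ignored in this sketch.
    """
    day = date(year, month, 1)
    # date.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    while day.weekday() >= 5:
        day += timedelta(days=1)
    return day


if __name__ == "__main__":
    # The next scheduled run mentioned in the log is 1 December 2023.
    print(first_business_day(2023, 12))
```

For December 2023 the 1st is already a weekday, so the result is unchanged; for a month such as October 2023, which starts on a Sunday, the function moves the date to Monday the 2nd.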
[14:15:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [14:15:46] I'm sure I +1d it. I could have thought of this myself back then too :-) [14:17:19] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: [Data Platform] Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 (10lbowmaker) [14:17:26] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10SRE Observability: [Data Platform] Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430 (10lbowmaker) [14:17:30] 10Data-Engineering (Sprint 5), 10Section-Level-Image-Suggestions, 10Structured-Data-Backlog (Current Work): [S] Coalesce section alignment image suggestions output - https://phabricator.wikimedia.org/T347558 (10lbowmaker) [14:17:45] 10Data-Engineering (Sprint 5), 10Data Products, 10Structured-Data-Backlog: [Maintenance] Set up deletion jobs for Structured Data's data pipelines - https://phabricator.wikimedia.org/T347561 (10lbowmaker) [14:17:47] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10lbowmaker) [14:17:53] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10lbowmaker) [14:18:00] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Event-Platform: [Event Platform] Define Flink k8s operator SLO - https://phabricator.wikimedia.org/T345914 (10lbowmaker) [14:18:17] 10Data-Engineering (Sprint 5), 
10Machine-Learning-Team, 10Wikimedia Enterprise, 10Epic, 10Event-Platform: [Event Platform] Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10lbowmaker) [14:19:43] 10Data-Engineering: [Maintenance] Add a deletion job for `hdfs_usage` data - https://phabricator.wikimedia.org/T348774 (10lbowmaker) [14:19:45] 10Data-Engineering, 10Data Pipelines (Sprint 14): [Maintenance] Define Migration/Deprecation Plan for Hue - https://phabricator.wikimedia.org/T333011 (10lbowmaker) [14:19:47] 10Data-Engineering: Team Interface page - https://phabricator.wikimedia.org/T348909 (10lbowmaker) [14:20:22] 10Data-Engineering, 10Data Pipelines: [Airflow Migration] Migrate 1+ reportupdater jobs - https://phabricator.wikimedia.org/T307540 (10lbowmaker) [14:20:31] 10Data-Engineering, 10Data Pipelines: [Airflow Migration] Update Airflow Documentation - https://phabricator.wikimedia.org/T340673 (10lbowmaker) [14:21:48] RECOVERY - Check systemd state on an-airflow1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:08] stevemunene: how goes? 
[14:22:20] ah, the icinga bot beat me to it [14:22:26] all good now brouberol [14:22:28] hehe [14:22:53] 10Data-Engineering, 10Dumps-Generation: Get Data Engineering folks access to hosts and systems needed for maintenance of the existing dumps system - https://phabricator.wikimedia.org/T341045 (10lbowmaker) [14:22:55] 10Data-Engineering, 10Tech-Docs-Team, 10Goal: Redesign Data Platform docs on Wikitech - https://phabricator.wikimedia.org/T350911 (10lbowmaker) [14:23:01] 10Data-Engineering, 10Product-Analytics: Creating a Spark session causes a torrent of log spam - https://phabricator.wikimedia.org/T315024 (10lbowmaker) [14:23:05] 10Data-Engineering, 10Data-Platform-SRE: Airflow scheduler and webserver logs should be readable by airflow instance admins - https://phabricator.wikimedia.org/T304615 (10lbowmaker) [14:23:07] 10Data-Engineering: Reduce the number of files generated by geoeditors airflor jobs - https://phabricator.wikimedia.org/T304852 (10lbowmaker) [14:23:13] 10Data-Engineering: Drop event.changeslistfiltergrouping table - https://phabricator.wikimedia.org/T317942 (10lbowmaker) [14:23:15] 10Data-Engineering, 10Product-Analytics: Creation of canonical pageview dumps for users to download - https://phabricator.wikimedia.org/T251777 (10lbowmaker) [14:23:18] 10Data-Engineering, 10Data-Platform-SRE: SPIKE - Will Hadoop 3 container support help us for Airflow deployment pipelines? 
- https://phabricator.wikimedia.org/T288247 (10lbowmaker) [14:23:20] 10Data-Engineering, 10Technical-Debt: Drop event.flowreplies table - https://phabricator.wikimedia.org/T315857 (10lbowmaker) [14:23:22] 10Data-Engineering: Drop various event.contenttranslation tables - https://phabricator.wikimedia.org/T317943 (10lbowmaker) [14:23:26] 10Data-Engineering: Change the way Refine handles its status (currently flags in partitions) - https://phabricator.wikimedia.org/T312785 (10lbowmaker) [14:23:28] 10Data-Engineering: Provide aggregated user device data per-country - https://phabricator.wikimedia.org/T325306 (10lbowmaker) [14:23:32] 10Data-Engineering, 10Product-Analytics: Present "Notebooks in Airflow" solution to PA and discuss ownership of different steps - https://phabricator.wikimedia.org/T325181 (10lbowmaker) [14:23:35] 10Data-Engineering, 10Cassandra: Audit and update AQS Cassandra roles & grants - https://phabricator.wikimedia.org/T313877 (10lbowmaker) [14:23:37] 10Data-Engineering, 10Foundational Technology Requests: Enable the Marketing Campaigns Reporting plugin for matomo - https://phabricator.wikimedia.org/T319013 (10lbowmaker) [14:23:39] 10Data-Engineering, 10IP Masking, 10Movement-Insights: Clarify analytics and metrics definitions around anonymous and temporary editors - https://phabricator.wikimedia.org/T332205 (10lbowmaker) [14:23:41] 10Data-Engineering, 10WMDE-TechWish-Maintenance: Deprecate WMDE Technical Wishes reportupdater jobs - https://phabricator.wikimedia.org/T333537 (10lbowmaker) [14:23:53] (SystemdUnitFailed) firing: (2) prometheus-node-textfile-prometheus-check-certificate-expiry.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:23:53] 10Data-Engineering, 10Data Pipelines, 10Epic: Post Oozie -> Airflow migration refactorings - 
https://phabricator.wikimedia.org/T336739 (10lbowmaker) [14:23:56] 10Data-Engineering, 10SRE, 10Traffic: Add a rolled-up cache_status field to druid webrequest_sampled_128 - https://phabricator.wikimedia.org/T319344 (10lbowmaker) [14:24:02] 10Data-Engineering, 10Event-Platform, 10User-Elukey: Create EventStream's equivalent to irc.wikimedia.org's #central channel - https://phabricator.wikimedia.org/T240182 (10lbowmaker) [14:24:09] 10Analytics-Radar, 10Data-Engineering, 10Growth-Team, 10Growth-Team-Filtering, and 2 others: Edits to Flow pages result in a page-links-change event with no performer - https://phabricator.wikimedia.org/T216726 (10lbowmaker) [14:24:13] 10Data-Engineering, 10Data Pipelines, 10Epic: Notebook Scheduler for Product Analytics - https://phabricator.wikimedia.org/T322532 (10lbowmaker) [14:24:17] 10Data-Engineering, 10Data Pipelines, 10Documentation, 10Epic: [Airflow] Kick off documentation in wikitech - https://phabricator.wikimedia.org/T302400 (10lbowmaker) [14:24:21] 10Data-Engineering, 10Data Pipelines, 10Spike: Refine Investigation - https://phabricator.wikimedia.org/T296529 (10lbowmaker) [14:24:25] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data Pipelines, 10Epic: [Airflow] User manual and documentation - https://phabricator.wikimedia.org/T295199 (10lbowmaker) [14:24:29] 10Data-Engineering: [SPIKE] Replicating EventGate Validation Errors in Local Environment - https://phabricator.wikimedia.org/T349018 (10lbowmaker) [14:24:33] 10Data-Engineering: Update AQS API automatically with new content data beginning of each month - https://phabricator.wikimedia.org/T348792 (10lbowmaker) [14:24:37] 10Data-Engineering, 10Data Pipelines: Refactor our existing Airflow dags to use EasyDAG & DagProperties - https://phabricator.wikimedia.org/T336738 (10lbowmaker) [14:24:41] 10Data-Engineering, 10CirrusSearch, 10Discovery-Search (Current work): [Search Update Pipeline] Source streams for private wikis - https://phabricator.wikimedia.org/T346046 
(10lbowmaker) [14:24:45] 10Data-Engineering, 10Data Pipelines: [Iceberg] Migrate event_sanitized_iceberg to event_sanitized - https://phabricator.wikimedia.org/T311737 (10lbowmaker) [14:24:49] 10Data-Engineering, 10Data Products: Investigate why we consume empty partitions from webrequests - https://phabricator.wikimedia.org/T343238 (10lbowmaker) [14:24:53] 10Analytics, 10Data-Engineering, 10EventStreams, 10Wikidata, and 2 others: Expose rdf-streaming-updater.mutation content through EventStreams - https://phabricator.wikimedia.org/T294133 (10lbowmaker) [14:24:58] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Observability-Metrics, 10Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10BTullis) > In other words prometheus analytics will be configured with multiple jobs, for each airflow instance. T... [14:25:34] 10Data-Platform-SRE: Regenerate the skein certificates during the first buisiness day of the month - https://phabricator.wikimedia.org/T350945 (10brouberol) a:03brouberol [14:29:11] 10Data-Engineering, 10Event-Platform: [NEEDS GROOMING] stream processing: we should have automated integration tests on staging - https://phabricator.wikimedia.org/T347472 (10lbowmaker) [14:29:18] 10Data-Engineering, 10EventStreams, 10Event-Platform: eventgate: eventstreams: services should use common logging schema - https://phabricator.wikimedia.org/T347498 (10lbowmaker) [14:29:21] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Goal: Event Platform: Stream Connectors - https://phabricator.wikimedia.org/T214430 (10lbowmaker) [14:29:23] 10Data-Engineering, 10Event-Platform: flink-app: swift bucket and zookeeper paths should be templated. 
- https://phabricator.wikimedia.org/T336901 (10lbowmaker) [14:29:25] 10Data-Engineering, 10EventStreams, 10Shared-Data-Infrastructure, 10Event-Platform: Implement server side filtering for EventStreams (if we should) - https://phabricator.wikimedia.org/T152731 (10lbowmaker) [14:29:29] 10Data-Engineering, 10Event-Platform: [TEMPLATE] Onboard request for APPLICATION NAME to Event Platform - https://phabricator.wikimedia.org/T346207 (10lbowmaker) [14:29:31] 10Analytics, 10Data-Engineering, 10DBA, 10Event-Platform: Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (10lbowmaker) [14:29:33] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10Event-Platform: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 (10lbowmaker) [14:29:35] 10Data-Engineering, 10EventStreams, 10Event-Platform: EventStreams (via KafkaSSE) does not consume from newly added partitions in topic - https://phabricator.wikimedia.org/T173006 (10lbowmaker) [14:29:39] 10Analytics-Kanban, 10Data-Engineering, 10Fundraising-Backlog, 10MediaWiki-extensions-EventLogging, and 2 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (10lbowmaker) [14:29:44] 10Data-Engineering, 10Internet-Archive, 10The-Wikipedia-Library, 10Event-Platform, 10Patch-For-Review: page-links-change stream is assigning template propagation events to the wrong edits - https://phabricator.wikimedia.org/T216504 (10lbowmaker) [14:29:47] 10Analytics-Radar, 10Data-Engineering, 10Internet-Archive, 10The-Wikipedia-Library, 10Event-Platform: Store page-links-change data in a database table and make available through a Special page - https://phabricator.wikimedia.org/T221397 (10lbowmaker) [14:29:52] 10Analytics, 10Data-Engineering, 10Event-Platform, 10MW-1.41-notes (1.41.0-wmf.10; 2023-05-23), 10User-Elukey: Port 
architecture of irc-recentchanges to Kafka - https://phabricator.wikimedia.org/T234234 (10lbowmaker) [14:29:57] 10Data-Engineering, 10Data-Platform-SRE, 10SRE, 10serviceops, 10Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10lbowmaker) [14:30:03] 10Data-Engineering, 10Event-Platform: Refine drops $schema field values - https://phabricator.wikimedia.org/T255818 (10lbowmaker) [14:30:05] 10Analytics-Radar, 10Data-Engineering, 10ChangeProp, 10WMF-JobQueue, and 3 others: Run EventBus tests in MediaWiki core CI - https://phabricator.wikimedia.org/T257583 (10lbowmaker) [14:30:10] 10Data-Engineering, 10Product-Analytics, 10Event-Platform: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (10lbowmaker) [14:30:13] 10Data-Engineering, 10serviceops, 10Discovery-Search (Current work), 10Event-Platform, 10Patch-For-Review: Improve the flink-app chart to provide more useful defaults - https://phabricator.wikimedia.org/T346315 (10lbowmaker) [14:30:16] 10Analytics-Radar, 10Data-Engineering, 10Event-Platform: Duplicated revision_create events - https://phabricator.wikimedia.org/T262203 (10lbowmaker) [14:30:19] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Wikimedia-Performance-recommendation: Avoid extra HTTPS connections for most Event Platform beacons - https://phabricator.wikimedia.org/T263049 (10lbowmaker) [14:30:21] 10Data-Engineering, 10Data-Engineering-Radar, 10Internet-Archive, 10The-Wikipedia-Library, 10Event-Platform: Page-links-change stream doesn't capture duplicated links - https://phabricator.wikimedia.org/T216492 (10lbowmaker) [14:30:23] 10Analytics-Radar, 10Data-Engineering, 10Event-Platform: mw.user.generateRandomSessionId should return a UUID - https://phabricator.wikimedia.org/T266813 (10lbowmaker) [14:30:25] 10Analytics, 10Data-Engineering, 10Metrics Platform Backlog, 10Event-Platform: Client-side error logging should use Elastic Common Schema (ECS) fields 
when possible - https://phabricator.wikimedia.org/T267602 (10lbowmaker) [14:30:27] 10Analytics, 10Data-Engineering, 10Event-Platform: Automate EventGate validation error reporting - https://phabricator.wikimedia.org/T268027 (10lbowmaker) [14:30:29] 10Analytics-Radar, 10Data-Engineering, 10Data-Engineering-Radar, 10MediaWiki-Recent-changes, and 2 others: Remove deprecated RCFeedEngine support - https://phabricator.wikimedia.org/T250628 (10lbowmaker) [14:30:37] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Release-Engineering-Team (Radar): Stop using puppet + git pull for auto deployment of schema repos - https://phabricator.wikimedia.org/T274901 (10lbowmaker) [14:30:41] 10Analytics, 10Data-Engineering, 10Event-Platform, 10MW-1.41-notes (1.41.0-wmf.15; 2023-06-27), 10Patch-For-Review: Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (10lbowmaker) [14:30:45] 10Data-Engineering, 10tech-decision-forum, 10Event-Platform: MediaWiki Event Carried State Transfer - Problem Statement - https://phabricator.wikimedia.org/T291120 (10lbowmaker) [14:30:49] 10Analytics, 10Data-Engineering, 10Event-Platform: Deploy schema repos to analytics cluster and use local uris for analytics jobs - https://phabricator.wikimedia.org/T280017 (10lbowmaker) [14:30:53] 10Analytics, 10Data-Engineering, 10Product-Analytics, 10Event-Platform: Develop comprehensive process, guidelines, and roles for Event Platform stream sanitization - https://phabricator.wikimedia.org/T276955 (10lbowmaker) [14:30:57] 10Analytics-Radar, 10Data-Engineering, 10Data-Platform-SRE, 10SRE, and 2 others: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088 (10lbowmaker) [14:31:03] 10Analytics, 10Data-Engineering, 10Event-Platform: mediawiki/page/properties-change schema should use map type for added and removed page properties - https://phabricator.wikimedia.org/T281483 (10lbowmaker) [14:31:07] 
10Analytics, 10Data-Engineering, 10Metrics Platform Backlog, 10Event-Platform: Source geolocation directly rather than using IP in schema - https://phabricator.wikimedia.org/T290014 (10lbowmaker) [14:31:11] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (10lbowmaker) [14:31:15] 10Analytics, 10Data-Engineering, 10Event-Platform: EventStreams sending same data over and over (page links change) - https://phabricator.wikimedia.org/T290211 (10lbowmaker) [14:31:19] 10Analytics-Radar, 10Data-Engineering, 10Event-Platform: Introduce EventBusSendUpdate - https://phabricator.wikimedia.org/T292123 (10lbowmaker) [14:31:23] 10Data-Engineering, 10Event-Platform: Document and Promote Image Suggestions Feedback > Cassandra Flink Job - https://phabricator.wikimedia.org/T316112 (10lbowmaker) [14:31:27] 10Analytics, 10Data-Engineering, 10Observability-Logging, 10SRE, and 2 others: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645 (10lbowmaker) [14:31:33] 10Data-Engineering: Drop GuidedTour* tables - https://phabricator.wikimedia.org/T317460 (10lbowmaker) [14:31:37] 10Data-Engineering, 10Event-Platform: Add $comment and $performer to ArticleRevisionVisibilitySet params - https://phabricator.wikimedia.org/T321411 (10lbowmaker) [14:31:41] 10Data-Engineering, 10Event-Platform, 10MW-1.40-notes (1.40.0-wmf.8; 2022-10-31): EventBus' stream config destination_event_service setting should move into producers.mediawikI_eventbus specific settings. 
- https://phabricator.wikimedia.org/T321557 (10lbowmaker) [14:31:45] 10Data-Engineering, 10Event-Platform: Add schema diffing support to jsonschema-tools and run diff in CI - https://phabricator.wikimedia.org/T321850 (10lbowmaker) [14:31:50] 10Data-Engineering, 10Data-Platform-SRE, 10SRE, 10observability, and 3 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (10lbowmaker) [14:32:00] 10Data-Engineering, 10Beta-Cluster-Infrastructure, 10Event-Platform, 10MW-1.41-notes (1.41.0-wmf.12; 2023-06-06): cirrusSearchCheckerJob JobQueueErrors (Could not enqueue jobs) on Beta Cluster - https://phabricator.wikimedia.org/T322491 (10lbowmaker) [14:32:04] 10Data-Engineering, 10Event-Platform: Spark Streaming Dumps POC: Backfill content table - https://phabricator.wikimedia.org/T323641 (10lbowmaker) [14:32:08] 10Data-Engineering, 10Event-Platform: Spark Streaming Dumps POC: Update iceberg tables - https://phabricator.wikimedia.org/T323645 (10lbowmaker) [14:32:12] 10Data-Engineering, 10SRE-OnFire, 10serviceops, 10Event-Platform: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10lbowmaker) [14:32:20] 10Data-Engineering, 10Data-Platform-SRE, 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, and 3 others: Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (10lbowmaker) [14:32:30] 10Data-Engineering, 10Machine-Learning-Team, 10Research, 10Event-Platform: Proposal: Create a stream end point for Revision Risk Model - https://phabricator.wikimedia.org/T326179 (10lbowmaker) [14:32:34] 10Data-Engineering, 10Event-Platform: Support topics without a schema in Flink Catalog - https://phabricator.wikimedia.org/T328232 (10lbowmaker) [14:32:38] 10Data-Engineering, 10Event-Platform: Support NULL values in RowData in eventutilities - https://phabricator.wikimedia.org/T328211 (10lbowmaker) [14:32:42] 10Data-Engineering, 10Event-Platform: [Flink Operations] 
Automate Replay of Failed Events - https://phabricator.wikimedia.org/T328565 (10lbowmaker) [14:32:46] 10Data-Engineering, 10Event-Platform: Automated event stream throughput alerting for important state change streams - https://phabricator.wikimedia.org/T329070 (10lbowmaker) [14:32:50] 10Data-Engineering, 10Event-Platform: Refactor Image Suggestions Feedback > Cassandra Flink Job and Deploy to DSE k8s - https://phabricator.wikimedia.org/T329524 (10lbowmaker) [14:32:56] 10Data-Engineering, 10Machine-Learning-Team, 10Event-Platform: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399 (10lbowmaker) [14:33:00] 10Data-Engineering, 10Browser-Support-Microsoft-Edge, 10Event-Platform, 10Wikimedia-Performance-recommendation: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (10lbowmaker) [14:33:04] 10Data-Engineering, 10EventStreams, 10Event-Platform: Include image/file changes in page-links-change - https://phabricator.wikimedia.org/T333497 (10lbowmaker) [14:33:12] 10Data-Engineering, 10Edit-Review-Improvements-Integrated-Filters, 10Growth-Team, 10Machine-Learning-Team, and 2 others: Integration of Revert Risk Scores to Recent Changes as a filter - https://phabricator.wikimedia.org/T329071 (10lbowmaker) [14:33:16] 10Data-Engineering, 10Event-Platform: Move eventutiltities-python repo into main wikimedia-eventutilities repository - https://phabricator.wikimedia.org/T337491 (10lbowmaker) [14:33:20] 10Data-Engineering, 10Event-Platform: Move wikimedia-event-utilities to gitlab - https://phabricator.wikimedia.org/T337477 (10lbowmaker) [14:33:24] 10Data-Engineering, 10Event-Platform: mediawiki-event-enrichment: changes to test image seem to be ignored in CI - https://phabricator.wikimedia.org/T340195 (10lbowmaker) [14:33:28] 10Data-Engineering, 10Event-Platform: Make meta.dt required on all schemas that declare it - 
https://phabricator.wikimedia.org/T340044 (10lbowmaker) [14:33:32] 10Data-Engineering, 10Event-Platform: eventutilities-python: http event process function should report latency. - https://phabricator.wikimedia.org/T338380 (10lbowmaker) [14:33:36] 10Data-Engineering, 10Data-Persistence, 10IP Masking, 10Event-Platform: MediaWiki user types - https://phabricator.wikimedia.org/T336176 (10lbowmaker) [14:33:42] 10Data-Engineering, 10Data-Platform-SRE, 10Discovery-Search, 10Epic, 10Event-Platform: [Epic] Set up multi DC Kafka stretch cluster - https://phabricator.wikimedia.org/T340492 (10lbowmaker) [14:33:52] 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10mforns) Oh! The datasets.yaml file of the wmde/config folder does not specify any dataset yet. That's why the loading of the DatasetRegistry is failing, it expects at least 1 datase... [14:34:10] 10Data-Engineering, 10Data-Platform-SRE: Check home/HDFS leftovers of aranyap - https://phabricator.wikimedia.org/T340945 (10Gehel) [14:38:33] 10Data-Platform-SRE: Check home/HDFS leftovers of ntsako - https://phabricator.wikimedia.org/T343189 (10lbowmaker) [14:38:35] 10Data-Platform-SRE: Check home/HDFS leftovers of aranyap - https://phabricator.wikimedia.org/T340945 (10lbowmaker) [14:38:37] 10Data-Platform-SRE: Check home/HDFS leftovers of paramd - https://phabricator.wikimedia.org/T340427 (10lbowmaker) [14:38:40] 10Data-Platform-SRE: Check home/HDFS leftovers of andyrussg - https://phabricator.wikimedia.org/T338234 (10lbowmaker) [14:38:42] 10Data-Platform-SRE: Check home/HDFS leftovers of ryanmax - https://phabricator.wikimedia.org/T325527 (10lbowmaker) [14:38:44] 10Data-Platform-SRE, 10Observability-Alerting: Migrate zookeeper prometheus checks from Icinga to Alertmanager - https://phabricator.wikimedia.org/T309012 (10lbowmaker) [14:38:46] 10Data-Platform-SRE: Use inclusive language in code for private analytics infrastructure - 
https://phabricator.wikimedia.org/T280268 (10lbowmaker) [14:38:48] 10Data-Platform-SRE: Review druid deep-storage making sure that old segments having been reindexed are deleted - https://phabricator.wikimedia.org/T296207 (10lbowmaker) [14:38:50] 10Data-Platform-SRE: Define a list of exactly which alerts should page the Analytics team in VictorOps - https://phabricator.wikimedia.org/T296552 (10lbowmaker) [14:38:52] 10Data-Platform-SRE: Druid loading of navigationtiming gets stuck - https://phabricator.wikimedia.org/T273216 (10lbowmaker) [14:38:55] 10Data-Platform-SRE: Set yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds - https://phabricator.wikimedia.org/T269616 (10lbowmaker) [14:38:57] 10Data-Platform-SRE: Archive /home/ezachte data on stat1007 - https://phabricator.wikimedia.org/T238243 (10lbowmaker) [14:43:17] 10Data-Platform-SRE: Regenerate the skein certificates during the first buisiness day of the month - https://phabricator.wikimedia.org/T350945 (10brouberol) It seems that this is non-trivial to do with systemd: ` ~/wmf/puppet renew-skein-…business-day *3 !1 ❯ systemd-analyze calendar 'Mon..Fri *-*-01 10:00' 'Mo... [14:43:19] 10Data-Engineering, 10Product-Analytics: Suspicious user pageview activity in India during June from Android mobile web browsers - https://phabricator.wikimedia.org/T315267 (10Anoop) >>! In T315267#8162605, @Mayakp.wiki wrote: > I also checked if there were any translation events happening on kn.wikipedia and... [14:45:08] 10Data-Platform-SRE: Regenerate the skein certificates during the first buisiness day of the month - https://phabricator.wikimedia.org/T350945 (10brouberol) However, we could simplify things by, for example, running the service every Monday.
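Note on the T350945 thread above: a calendar expression such as 'Mon..Fri *-*-01 10:00' only matches when the 1st itself falls on a weekday, so months that start on a weekend would be skipped. A minimal sketch of the "first business day" rule being discussed, assuming a business day simply means Monday to Friday and ignoring public holidays. The function names `first_business_day` and `is_first_business_day` are hypothetical and are not part of any existing puppet or skein tooling.

```python
from datetime import date, timedelta


def first_business_day(year: int, month: int) -> date:
    """Return the first Monday-to-Friday date of the given month.

    Assumption of this sketch: public holidays are not considered.
    """
    day = date(year, month, 1)
    # date.weekday(): Monday == 0 ... Sunday == 6, so 5 and 6 are the weekend.
    while day.weekday() >= 5:
        day += timedelta(days=1)
    return day


def is_first_business_day(today: date) -> bool:
    """True if `today` is the first Monday-to-Friday date of its month."""
    return today == first_business_day(today.year, today.month)
```

One possible use would be to run a job every weekday and have it exit early unless `is_first_business_day(date.today())` is true. The alternative suggested in the thread, running every Monday, avoids this check entirely at the cost of renewing more often.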
[14:46:26] 10Data-Engineering, 10Data-Platform, 10Movement-Insights: Iceberg unique devices table reporting incorrect numbers for 2023-10-01 - https://phabricator.wikimedia.org/T350920 (10lbowmaker) [14:46:46] 10Data-Engineering (Sprint 5), 10Data-Platform, 10Movement-Insights: Iceberg unique devices table reporting incorrect numbers for 2023-10-01 - https://phabricator.wikimedia.org/T350920 (10lbowmaker) [14:48:07] 10Data-Engineering (Sprint 5), 10Data-Platform, 10Movement-Insights: Iceberg unique devices table reporting incorrect numbers for 2023-10-01 - https://phabricator.wikimedia.org/T350920 (10xcollazo) From [[ https://wikimedia.slack.com/archives/CLKDS4MG9/p1699565490518549?thread_ts=1699558128.604389&cid=CLKDS4... [14:48:50] 10Data-Engineering: Use hive metastore when registering views - https://phabricator.wikimedia.org/T350617 (10lbowmaker) [14:48:53] 10Data-Engineering: Airflow DAG mediawiki_history_denormalize failed with NPE - https://phabricator.wikimedia.org/T350489 (10lbowmaker) [14:48:56] 10Data-Engineering, 10ChangeProp, 10observability, 10service-runner, 10Event-Platform: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics - https://phabricator.wikimedia.org/T350180 (10lbowmaker) [14:48:58] 10Data-Engineering, 10Event-Platform: Modify mediawiki.revision.visibility-change to include unsuppressed data - https://phabricator.wikimedia.org/T349845 (10lbowmaker) [14:49:00] 10Data-Engineering, 10Data-Platform-SRE, 10Privacy Engineering, 10Patch-For-Review, 10SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (10lbowmaker) [14:49:03] 10Data-Engineering: ProduceCanaryEvents job should be scheduled by Airflow - https://phabricator.wikimedia.org/T341229 (10lbowmaker) [14:49:05] 10Data-Engineering, 10Data Products, 10Dumps 2.0: Epic: Quality of new Dumps 2.0 output - https://phabricator.wikimedia.org/T345385 (10lbowmaker) [14:49:07] 10Data-Engineering: Allow retry on Airflow 
druid_load_webrequest_sampled_128_daily.remove_temporary_directory - https://phabricator.wikimedia.org/T345232 (10lbowmaker) [14:49:09] 10Data-Engineering, 10EventStreams, 10stewardbots, 10Event-Platform: Frequent `429 Client Error: Too Many Requests for url: https://stream.wikimedia.org/v2/stream/recentchange` errors in SULWatcher - https://phabricator.wikimedia.org/T329327 (10lbowmaker) [14:49:12] 10Data-Engineering, 10Data Products (Sprint 01): [Spike] Identify and mitigate risks associated with MediaWiki History pipeline - https://phabricator.wikimedia.org/T345208 (10lbowmaker) [14:49:14] 10Data-Engineering, 10Data Pipelines, 10Privacy Engineering: Add cswiki to clickstream - https://phabricator.wikimedia.org/T339805 (10lbowmaker) [14:49:17] 10Data-Engineering, 10EventStreams, 10Pywikibot, 10Event-Platform: Error 429: too many requests for stream.wikimedia.org - https://phabricator.wikimedia.org/T308931 (10lbowmaker) [14:49:22] 10Data-Engineering: [opsweek] Airflow DAGs with Spark jobs should always include Spark tuning variables - https://phabricator.wikimedia.org/T343154 (10lbowmaker) [14:49:26] 10Data-Engineering, 10Data Pipelines: [Airflow] Simplify application and java_class parameters in SparkSqlOperator - https://phabricator.wikimedia.org/T338036 (10lbowmaker) [14:49:30] 10Data-Engineering, 10Data Products, 10serviceops-radar: Use config-master.wikimedia.org/mediawiki.yaml to automatically switch code that depends on active datacenter - https://phabricator.wikimedia.org/T338796 (10lbowmaker) [14:49:34] 10Data-Engineering, 10Data Pipelines: Benchmark Iceberg tables with SNAPPY vs GZIP vs ZSTD - https://phabricator.wikimedia.org/T338050 (10lbowmaker) [14:49:42] 10Data-Engineering, 10Data Pipelines: [Iceberg] performance test to align row group to HDFS block size - https://phabricator.wikimedia.org/T337416 (10lbowmaker) [14:49:46] 10Data-Engineering, 10Data Pipelines: Make sure all partitions sensors are using the Dataset helpers - 
https://phabricator.wikimedia.org/T336741 (10lbowmaker) [14:49:50] 10Data-Engineering, 10Data Pipelines: HDFS utils on Airflow to handle actions on hdfs files - https://phabricator.wikimedia.org/T336771 (10lbowmaker) [14:49:54] 10Data-Engineering, 10Data-Platform-SRE, 10Data Pipelines: Add support for Iceberg to the Spark Docker Image - https://phabricator.wikimedia.org/T336012 (10lbowmaker) [14:49:58] 10Data-Engineering, 10Data-Platform-SRE, 10Data Pipelines, 10Data-Platform: Figure out a way to automatize deployment of the spark assembly file - https://phabricator.wikimedia.org/T336513 (10lbowmaker) [14:50:02] 10Data-Engineering, 10Data Pipelines, 10Datasets-General-or-Unknown: Missing NS0 dumps in 20230420 and 20230501 and 20230520 - https://phabricator.wikimedia.org/T335887 (10lbowmaker) [14:50:06] 10Data-Engineering, 10Data Pipelines, 10Section-Topics: Auto clean old section topics data - https://phabricator.wikimedia.org/T335778 (10lbowmaker) [14:50:10] 10Data-Engineering, 10Data Pipelines: [datahub] Implement automatic deletion of datasets with deleted data sources - https://phabricator.wikimedia.org/T335528 (10lbowmaker) [14:50:14] 10Data-Engineering, 10Data Pipelines: Wrong file names for 2 month files in pageview_complete/monthly - https://phabricator.wikimedia.org/T335685 (10lbowmaker) [14:50:18] 10Data-Engineering, 10Event-Platform: [Event Platform] eventutilities-python should convert pyflink Instants to python DateTimes - https://phabricator.wikimedia.org/T349640 (10lbowmaker) [14:50:22] 10Data-Engineering, 10Data Pipelines: webrequest / webrequest raw quality check - https://phabricator.wikimedia.org/T334678 (10lbowmaker) [14:50:26] 10Data-Engineering, 10Data Pipelines: [NEEDS GROOMING] Support migration of simple (Hive > Hive) jobs - https://phabricator.wikimedia.org/T333006 (10lbowmaker) [14:50:30] 10Data-Engineering, 10Data Pipelines, 10Product-Analytics: Delete the leading question mark from uri_query in the webrequest table - 
https://phabricator.wikimedia.org/T334495 (10lbowmaker) [14:50:34] 10Data-Engineering, 10Data Pipelines: Investigate datahub stack trace on an-airflow1004.eqiad.wmnet - https://phabricator.wikimedia.org/T332822 (10lbowmaker) [14:50:38] 10Data-Engineering, 10Data Pipelines: Investigate dangling tables after Airflow 2.5.1 upgrade of an-airflow1004.eqiad.wmnet - https://phabricator.wikimedia.org/T332820 (10lbowmaker) [14:50:42] 10Data-Engineering, 10Data Pipelines, 10Product-Analytics: Add log_search to monthly sqoop list - https://phabricator.wikimedia.org/T332621 (10lbowmaker) [14:50:46] 10Data-Engineering, 10Data Pipelines: Airflow skein hook shouldn't fail when not managing to gather yarn logs - https://phabricator.wikimedia.org/T332215 (10lbowmaker) [14:50:50] 10Data-Engineering, 10Data Pipelines: Update HiveToDruid job -to not use intermediary data - https://phabricator.wikimedia.org/T329257 (10lbowmaker) [14:50:54] 10Data-Engineering, 10Data Pipelines: Use RDD checkpointing in Mediawiki-History spark job - https://phabricator.wikimedia.org/T331003 (10lbowmaker) [14:50:58] 10Data-Engineering, 10Data Pipelines: Load wmf.unique_editors_by_country_monthly into Druid for access in Turnilo & Superset - https://phabricator.wikimedia.org/T330436 (10lbowmaker) [14:51:02] 10Data-Engineering, 10Data Pipelines: [Airflow] Gather dataset information from DataHub - https://phabricator.wikimedia.org/T327816 (10lbowmaker) [14:51:06] 10Data-Engineering, 10Data Pipelines: SPIKE: Adapt our pipelines to codfw switch - https://phabricator.wikimedia.org/T328365 (10lbowmaker) [14:51:10] 10Data-Engineering, 10Data Pipelines: Refactor analytics_test/dags/custom_operators_tryout_dag.py - https://phabricator.wikimedia.org/T327869 (10lbowmaker) [14:51:14] 10Data-Engineering, 10Data Pipelines: Airflow operator to manage old data deletion - https://phabricator.wikimedia.org/T326826 (10lbowmaker) [14:51:18] 10Data-Engineering, 10Data Pipelines: Migrate custom gitlab runner that runs 
Dockerfiles to releng's new production infra - https://phabricator.wikimedia.org/T326570 (10lbowmaker) [14:51:22] 10Data-Engineering, 10Data Pipelines: Use uap-core browser-family for bot detection - https://phabricator.wikimedia.org/T326339 (10lbowmaker) [14:51:26] 10Data-Engineering, 10Data Pipelines, 10Product-Analytics: Add TikTok's in-app browser to ua-parser library - https://phabricator.wikimedia.org/T325611 (10lbowmaker) [14:51:30] 10Data-Engineering, 10Data Pipelines: Set up a repository to generate packaged conda environments via CI for Jupyter notebooks - https://phabricator.wikimedia.org/T325195 (10lbowmaker) [14:51:34] 10Data-Engineering, 10Data Pipelines: Update refinery-source PageviewDefinition to better handle `Special:` pages - https://phabricator.wikimedia.org/T325544 (10lbowmaker) [14:51:38] 10Data-Engineering, 10Data Pipelines: Increase mypy coverage in airflow-dags - https://phabricator.wikimedia.org/T325213 (10lbowmaker) [14:51:42] 10Data-Engineering, 10Data Pipelines: [Airflow] Implement a NotebookOperator - https://phabricator.wikimedia.org/T325185 (10lbowmaker) [14:51:46] 10Data-Engineering, 10Data Pipelines: Prune raw HDFS FSImages stored on HDFS - https://phabricator.wikimedia.org/T325103 (10lbowmaker) [14:51:50] 10Data-Engineering, 10Data Pipelines: Make defaults immutable for Airflow confs - https://phabricator.wikimedia.org/T325014 (10lbowmaker) [14:51:54] 10Data-Engineering, 10Data Pipelines, 10Documentation: Document the new Airflow backend: PostgreSQL - https://phabricator.wikimedia.org/T325138 (10lbowmaker) [14:51:58] 10Data-Engineering, 10Data Pipelines: When moving oozie webrequest-load to airflow/spark avoid the error-check corner case - https://phabricator.wikimedia.org/T324757 (10lbowmaker) [14:52:02] 10Data-Engineering, 10Data Pipelines, 10Spike: [SPIKE] Webrequest migration - https://phabricator.wikimedia.org/T324488 (10lbowmaker) [14:52:06] 10Data-Engineering, 10Data Pipelines: [Migrate] Oozie jobs migration for Webrequest 
- https://phabricator.wikimedia.org/T324484 (10lbowmaker) [14:52:10] 10Data-Engineering, 10Data Pipelines: [Migration] migrate simple oozie jobs - https://phabricator.wikimedia.org/T324486 (10lbowmaker) [14:52:14] 10Data-Engineering, 10Data Pipelines: Reimage an-test-client1001.eqiad.wmnet - https://phabricator.wikimedia.org/T324127 (10lbowmaker) [14:52:18] 10Data-Engineering, 10Data Pipelines: Improve docs around JupyterLab and conda-analytics - https://phabricator.wikimedia.org/T324025 (10lbowmaker) [14:52:22] 10Data-Engineering, 10Data Pipelines: Data Warehouse Evaluation Spike. - https://phabricator.wikimedia.org/T323994 (10lbowmaker) [14:52:26] 10Data-Engineering, 10Data Pipelines: NEW FEATURE REQUEST: Dataset with active and non-active Wikis - https://phabricator.wikimedia.org/T323662 (10lbowmaker) [14:52:30] 10Data-Engineering, 10Cassandra, 10Data Pipelines: Create puppet defined type for adding/updating/deleting secrets or other small files on HDFS - https://phabricator.wikimedia.org/T323692 (10lbowmaker) [14:52:36] 10Data-Engineering, 10Data Pipelines: Spark Streaming Dumps POC: Backfill metadata table - https://phabricator.wikimedia.org/T323642 (10lbowmaker) [14:52:40] 10Data-Engineering, 10Data Pipelines: Improve job to prune old dataset partitions - https://phabricator.wikimedia.org/T322754 (10lbowmaker) [14:52:44] 10Data-Engineering, 10Data Pipelines: NEW FEATURE REQUEST: sqoop (all) user properties from mariadb to wmf_raw.mediawiki_user_properties - https://phabricator.wikimedia.org/T323456 (10lbowmaker) [14:52:48] 10Data-Engineering, 10Data Pipelines: Add support for repository artifacts in Airflow - https://phabricator.wikimedia.org/T322690 (10lbowmaker) [14:52:52] 10Data-Engineering, 10Data Pipelines: wmf.virtualpageview_hourly's language_variant field is corrupted - https://phabricator.wikimedia.org/T322545 (10lbowmaker) [14:52:56] 10Data-Engineering, 10Data Pipelines: MVP for Notebook Scheduler - https://phabricator.wikimedia.org/T322533 
(10lbowmaker) [14:53:04] 10Data-Engineering, 10Data Pipelines: Puppetize custom gitlab runner that can launch Docker containers - https://phabricator.wikimedia.org/T322251 (10lbowmaker) [14:53:08] 10Data-Engineering, 10Data Pipelines: Implement periodical cleaning of Airflow databases - https://phabricator.wikimedia.org/T322036 (10lbowmaker) [14:53:12] 10Data-Engineering, 10Data Pipelines: Back-fill Wikidata reliability Graphite metrics - https://phabricator.wikimedia.org/T321838 (10lbowmaker) [14:53:16] 10Data-Engineering, 10Data-Platform-SRE, 10Data Pipelines: Install jupyterhub separately from conda-analytics - https://phabricator.wikimedia.org/T321512 (10lbowmaker) [14:53:20] 10Data-Engineering, 10Data Pipelines: Make mediawiki-history page and user sorting complete for denormalization - https://phabricator.wikimedia.org/T321493 (10lbowmaker) [14:53:24] 10Data-Engineering, 10Data Pipelines: Fix mediawiki-history page computation for deleted pages having the same title - https://phabricator.wikimedia.org/T320860 (10lbowmaker) [14:53:28] 10Data-Engineering, 10Data Pipelines: Make launcher an explicit parameter on SparkSubmitOperator() - https://phabricator.wikimedia.org/T319688 (10lbowmaker) [14:53:32] 10Data-Engineering, 10Data Pipelines, 10Patch-For-Review: Prototype Spark Streaming Job for Content Dumps - https://phabricator.wikimedia.org/T322326 (10lbowmaker) [14:53:36] 10Data-Engineering, 10Data Pipelines, 10Product-Analytics: Improved edit summary data in mediawiki_history - https://phabricator.wikimedia.org/T318010 (10lbowmaker) [14:53:40] 10Data-Engineering, 10Data Pipelines: Finding root cause of a second spike of text requests on Sept 8th - https://phabricator.wikimedia.org/T317396 (10lbowmaker) [14:53:44] 10Data-Engineering, 10Data Pipelines: geoeditors public version is not available for non-Wikipedia projects - https://phabricator.wikimedia.org/T317040 (10lbowmaker) [14:53:48] 10Data-Engineering, 10Data Pipelines, 10Product-Analytics: Review why 
total_edits on Mediawiki_History differs from the total_edits on Editors_Daily - https://phabricator.wikimedia.org/T316896 (10lbowmaker) [14:53:52] 10Data-Engineering, 10Data Pipelines: [Airflow] Add log rotation to scheduler logs - https://phabricator.wikimedia.org/T315326 (10lbowmaker) [14:53:56] 10Data-Engineering, 10Data Pipelines: Convert to pure Docker the gitlab CI pipeline to build debianized conda - https://phabricator.wikimedia.org/T315475 (10lbowmaker) [14:54:00] 10Data-Engineering, 10Data Pipelines: Airflow does not send SLA emails nor update SLA misses in the db - https://phabricator.wikimedia.org/T314181 (10lbowmaker) [14:54:04] 10Data-Engineering, 10Data Pipelines: Sanitize network_flows_internal dataset - https://phabricator.wikimedia.org/T312915 (10lbowmaker) [14:54:08] 10Data-Engineering, 10Data Pipelines: Bug: Deleted pages are accidentally excluded from mediawiki_history_reduced - https://phabricator.wikimedia.org/T313955 (10lbowmaker) [14:54:16] 10Data-Engineering, 10Data Pipelines: Review iceberg settings and document choices - https://phabricator.wikimedia.org/T312151 (10lbowmaker) [14:54:20] 10Data-Engineering, 10Data Pipelines: [Iceberg] Migrate event_santised to Iceberg - https://phabricator.wikimedia.org/T311765 (10lbowmaker) [14:54:24] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data Pipelines: Investigate why airflow sensor tasks fail without sending errors - https://phabricator.wikimedia.org/T311976 (10lbowmaker) [14:54:28] 10Data-Engineering, 10Data Pipelines: [Iceberg] Modify superset dashboards to use timestamp based (instead of partition based) queries - https://phabricator.wikimedia.org/T311764 (10lbowmaker) [14:54:32] 10Data-Engineering, 10Data Pipelines, 10Epic: [Iceberg] Epic: Icebergify event_sanitized database - https://phabricator.wikimedia.org/T311743 (10lbowmaker) [14:54:36] 10Data-Engineering, 10Data Pipelines, 10Pageviews-Anomaly, 10Product-Analytics, and 2 others: Analyze possible bot traffic for frwiki article 
Cookie (informatique) - https://phabricator.wikimedia.org/T313114 (10lbowmaker) [14:54:42] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data Pipelines: Investigate Gobblin dataloss during namenode failure - https://phabricator.wikimedia.org/T311263 (10lbowmaker) [14:54:46] 10Data-Engineering, 10Data Pipelines: Drop MediaViewer and MultimediaViewer* tables - https://phabricator.wikimedia.org/T311229 (10lbowmaker) [14:54:50] 10Data-Engineering, 10Data Pipelines: Add Ukrainian Wikipedia to Clickstream dataset - https://phabricator.wikimedia.org/T310972 (10lbowmaker) [14:54:54] 10Data-Engineering, 10Data Pipelines: Generate data to count langswitches for every article - https://phabricator.wikimedia.org/T310975 (10lbowmaker) [14:54:59] 10Data-Engineering, 10Data Pipelines: Drop ArticleCreationWorkflow data - https://phabricator.wikimedia.org/T310863 (10lbowmaker) [14:55:03] 10Data-Engineering, 10Data Pipelines, 10Patch-For-Review: [Iceberg] Update Refine Sanitize to insert into Iceberg tables - https://phabricator.wikimedia.org/T311739 (10lbowmaker) [14:55:06] 10Data-Engineering, 10Data Pipelines: Automatically monitor schema changes that would break sqoop - https://phabricator.wikimedia.org/T310824 (10lbowmaker) [14:55:11] 10Data-Engineering, 10Cassandra, 10Data Pipelines: Encrypt Spark-Cassandra connection - https://phabricator.wikimedia.org/T310820 (10lbowmaker) [14:55:17] 10Data-Engineering, 10Data Pipelines: Add scheduler pid file - https://phabricator.wikimedia.org/T310042 (10lbowmaker) [14:55:25] 10Data-Engineering, 10Data Pipelines: [NEEDS GROOMING] We should improve and automate python linting - https://phabricator.wikimedia.org/T310541 (10lbowmaker) [14:55:29] 10Data-Engineering, 10Data Pipelines, 10Discovery-Search, 10Java-Scala-Standardization: We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, - https://phabricator.wikimedia.org/T309097 (10lbowmaker) [14:55:33] 10Data-Engineering, 10Data Pipelines: Dataset 
Tagging: Curated dataset tag - https://phabricator.wikimedia.org/T307706 (10lbowmaker) [14:55:37] 10Data-Engineering, 10Data Pipelines: Migrate 1+ Druid load jobs - https://phabricator.wikimedia.org/T307508 (10lbowmaker) [14:55:41] 10Data-Engineering, 10Data Pipelines: [Airflow] Refactor anomaly detection DAG factory into a TaskGroup factory - https://phabricator.wikimedia.org/T308011 (10lbowmaker) [14:55:45] 10Data-Engineering, 10Data Pipelines: Airflow: pin dependency versions to prevent long installs - https://phabricator.wikimedia.org/T309046 (10lbowmaker) [14:55:49] 10Data-Engineering, 10Data Pipelines: airflow-dags directory hierarchy refactor for CI testing - https://phabricator.wikimedia.org/T305955 (10lbowmaker) [14:55:53] 10Data-Engineering, 10Data Pipelines, 10Data-Catalog: Spike: Integrate Spark with DataHub - https://phabricator.wikimedia.org/T306896 (10lbowmaker) [14:55:57] 10Data-Engineering, 10Data Pipelines: Oozie Job Migration: browser/general - https://phabricator.wikimedia.org/T305379 (10lbowmaker) [14:56:01] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data Pipelines: Improvements of artifacts cache - https://phabricator.wikimedia.org/T307115 (10lbowmaker) [14:56:05] 10Data-Engineering, 10Data Pipelines: CI/CD Pipeline Implementation - https://phabricator.wikimedia.org/T304929 (10lbowmaker) [14:56:09] 10Data-Engineering, 10Data Pipelines: Refine jobs should be scheduled by Airflow - https://phabricator.wikimedia.org/T307505 (10lbowmaker) [14:56:13] 10Data-Engineering, 10Data Pipelines, 10Product-Analytics, 10Research: Update HDFS links tables as Mediawiki changes - https://phabricator.wikimedia.org/T304979 (10lbowmaker) [14:56:17] 10Data-Engineering, 10Data Pipelines, 10Product-Analytics: [REQUEST] Add new Fundraising dimensions to druid.pageviews_daily & druid.pageviews_hourly - https://phabricator.wikimedia.org/T304571 (10lbowmaker) [14:56:21] 10Data-Engineering, 10Data Pipelines: [NEEDS GROOMING] We should provide a migration path from 
client-mode spark jobs - https://phabricator.wikimedia.org/T303189 (10lbowmaker) [14:56:25] 10Data-Engineering, 10Data Pipelines: Data pipelines should support conda environment files and integrate conda dist - https://phabricator.wikimedia.org/T303839 (10lbowmaker) [14:56:29] 10Data-Engineering, 10Data Pipelines: [Airflow] Spike investigate of better ways to organize/access Airflow logs - https://phabricator.wikimedia.org/T302500 (10lbowmaker) [14:56:33] 10Data-Engineering, 10Data Pipelines: Variabilization of existing jobs - https://phabricator.wikimedia.org/T303473 (10lbowmaker) [14:56:37] 10Data-Engineering, 10Data Pipelines: [NEEDS GROOMING} Airflow development instances should be available on demand - https://phabricator.wikimedia.org/T295814 (10lbowmaker) [14:56:41] 10Data-Engineering, 10Data Pipelines, 10Product-Analytics: Add Product-Analytics Announcements to Airflow job for notifications - https://phabricator.wikimedia.org/T301281 (10lbowmaker) [14:56:45] 10Data-Engineering, 10Data Pipelines: [Anomaly detection] Allow for custom email alert content - https://phabricator.wikimedia.org/T301571 (10lbowmaker) [14:56:49] 10Data-Engineering, 10Data Pipelines: Production Airflow dags should be moved to the shared repo - https://phabricator.wikimedia.org/T295807 (10lbowmaker) [14:56:53] 10Data-Engineering, 10Data Pipelines: [NEEDS GROOMING] Data pipelines should be published to archiva - https://phabricator.wikimedia.org/T295812 (10lbowmaker) [14:56:57] 10Data-Engineering, 10Data Pipelines: [PLACEHOLDER] Pipelines Rep Structure Changes after RFC - https://phabricator.wikimedia.org/T295364 (10lbowmaker) [14:57:01] 10Data-Engineering, 10Data Pipelines: Implement Image Recommendations Algorithm Performance Metrics - https://phabricator.wikimedia.org/T294478 (10lbowmaker) [14:57:05] 10Data-Engineering, 10Data Pipelines: Implement Image Recommendations DAG Performance Metrics - https://phabricator.wikimedia.org/T294480 (10lbowmaker) [14:57:09] 10Data-Engineering, 
10Data Pipelines: [Airflow] Research, discuss and decide on DAG/task dependencies VS. success/failure files (Oozie style) - https://phabricator.wikimedia.org/T301568 (10lbowmaker) [14:57:13] 10Data-Engineering, 10Data Pipelines: Define and Create Logging Routines - Airflow UI - https://phabricator.wikimedia.org/T292747 (10lbowmaker) [14:57:17] 10Data-Engineering, 10Data Pipelines: [SPIKE][PLACEHOLDER] we need to estimate the effort required to migrate Similarusers' backend to Cassandra - https://phabricator.wikimedia.org/T287274 (10lbowmaker) [14:57:21] 10Data-Engineering, 10Data Pipelines, 10SRE, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10lbowmaker) [14:57:29] 10Analytics-Radar, 10Data-Engineering, 10Data Pipelines, 10Editing-team, and 4 others: WikiEditor records all edits as platform = desktop in EventLogging - https://phabricator.wikimedia.org/T249944 (10lbowmaker) [14:57:35] 10Data-Engineering, 10Data Pipelines, 10Patch-For-Review, 10Technical-Debt: Productionize HDFS fsimage data analysis job - https://phabricator.wikimedia.org/T261283 (10lbowmaker) [14:57:43] 10Data-Engineering, 10Data Pipelines: Refine: Use Spark SQL instead of Hive JDBC - https://phabricator.wikimedia.org/T209453 (10lbowmaker) [14:57:47] 10Data-Engineering, 10Data Pipelines: Gather all data-purge into a single job - https://phabricator.wikimedia.org/T262201 (10lbowmaker) [14:58:01] 10Data-Engineering: [Maintenance] Reduce number of HDFS files - https://phabricator.wikimedia.org/T347975 (10lbowmaker) [14:58:05] 10Data-Engineering: [Maintenance] Delete sanitized events removed from sanitization list - https://phabricator.wikimedia.org/T347586 (10lbowmaker) [15:09:24] btullis: turns out, running a service exactly once on the first business day of a month is remarkably complex in systemd. 
If we want to favor running the command on, say, the first Monday of the month, we could, however, do something like this: https://www.reddit.com/r/systemd/comments/jfayw1/calendar_expression_for_1st_and_3rd_wednesday_of/ [15:10:32] (Additional details in https://phabricator.wikimedia.org/T350945) [15:41:01] 10Data-Engineering, 10WMDE-FUN-Team, 10WMDE-Fundraising-Tech, 10Event-Platform, 10WMDE-FUN-Funban-2023: Validation Error for eventlogging_WMDEBannerSizeIssue - https://phabricator.wikimedia.org/T344027 (10gabriel-wmde) a:03gabriel-wmde [15:58:53] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.325% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [16:07:22] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Observability-Metrics, 10Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10BTullis) I made a mistake when attempting to split my patch in two, so I have abandoned https://gerrit.wikimedia.or... [16:22:51] 10Data-Platform-SRE: Regenerate the skein certificates during the first buisiness day of the month - https://phabricator.wikimedia.org/T350945 (10BTullis) Yeah, let's not over-complicate it then. How about every Tuesday? Fewer national holidays on Tuesdays. 
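For reference, the scheduling question discussed above (first business day of the month versus first Monday of the month) can be sketched in Python. This is only an illustration of the date logic, not what runs on the hosts: the function names `first_business_day` and `first_weekday_of_month` are hypothetical, and public holidays are deliberately ignored, which is part of why the Tuesday suggestion is simpler. As far as I know, a first-Monday schedule in systemd is usually written as `OnCalendar=Mon *-*-01..07`, which can be checked with `systemd-analyze calendar`.

```python
import calendar
from datetime import date, timedelta


def first_business_day(year: int, month: int) -> date:
    """Return the first Monday-to-Friday date of the month.

    Assumption: public holidays are ignored; only weekends are skipped.
    """
    day = date(year, month, 1)
    # weekday() is 0 for Monday; 5 and 6 are Saturday and Sunday.
    while day.weekday() >= calendar.SATURDAY:
        day += timedelta(days=1)
    return day


def first_weekday_of_month(year: int, month: int, weekday: int) -> date:
    """Return the first date in the month that falls on `weekday`.

    `weekday` uses the calendar module constants (calendar.MONDAY == 0).
    """
    first = date(year, month, 1)
    # Number of days to move forward from the 1st to reach the wanted weekday.
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset)


if __name__ == "__main__":
    print(first_business_day(2023, 12))
    print(first_weekday_of_month(2023, 12, calendar.MONDAY))
```

The first business day moves between the 1st, 2nd and 3rd depending on where the weekend falls, while the first Monday always lands somewhere in days 1 to 7, which is why the latter maps more directly onto a single calendar expression.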
[16:30:55] 10Data-Platform-SRE, 10superset.wikimedia.org, 10Wikimedia-production-error: Prod Superset down, showing HTTP 500 instead - https://phabricator.wikimedia.org/T350718 (10VirginiaPoundstone) [16:31:33] 10Data-Platform-SRE, 10superset.wikimedia.org, 10Wikimedia-production-error: Prod Superset down, showing HTTP 500 instead - https://phabricator.wikimedia.org/T350718 (10BTullis) 05Open→03Resolved [18:08:24] 10Data-Engineering, 10All-and-every-Wikisource, 10ArticlePlaceholder, 10BetaFeatures, and 54 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (10Daimona) [18:08:45] 10Data-Engineering, 10All-and-every-Wikisource, 10ArticlePlaceholder, 10BetaFeatures, and 54 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (10Daimona) [18:23:53] (SystemdUnitFailed) firing: monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:02:16] (EventgateValidationErrors) firing: ... [19:02:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [19:07:16] (EventgateValidationErrors) resolved: ... 
[19:07:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [19:58:53] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.325% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:02:16] (EventgateValidationErrors) firing: ... [20:02:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [20:32:16] (EventgateValidationErrors) resolved: ... [20:32:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [20:41:16] (EventgateValidationErrors) firing: ... 
[20:41:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [20:46:15] (EventgateValidationErrors) resolved: ... [20:46:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [20:47:15] (EventgateValidationErrors) firing: ... [20:47:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [20:57:16] (EventgateValidationErrors) resolved: ... 
[20:57:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [21:02:06] 10Data-Engineering, 10WMDE-FUN-Team, 10WMDE-Fundraising-Tech, 10Event-Platform, 10WMDE-FUN-Funban-2023: Validation Error for eventlogging_WMDEBannerSizeIssue - https://phabricator.wikimedia.org/T344027 (10gabriel-wmde) a:05gabriel-wmde→03None [21:03:13] 10Data-Engineering, 10WMDE-FUN-Team, 10WMDE-Fundraising-Tech, 10Event-Platform, 10WMDE-FUN-Funban-2023: Validation Error for eventlogging_WMDEBannerSizeIssue - https://phabricator.wikimedia.org/T344027 (10gabriel-wmde) PR: https://github.com/wmde/fundraising-banners/pull/273 When merged, the changes need... [22:23:53] (SystemdUnitFailed) firing: monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:58:53] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.325% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace