[00:16:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:55:16] Data-Engineering, Data-Services, Growth-Team, PageTriage, cloud-services-team: Clean up pagetriage_log views - https://phabricator.wikimedia.org/T331844 (Marostegui)
[08:31:00] o/ we are wondering if it's worth thinking about "testing" hql queries that go into an airflow dag, do you run any kind of automated tests for the queries written in refinery:/hql ?
[08:32:13] Data-Engineering, Data-Services, Wikimedia Enterprise: Data request: make rendered HTML page dumps available on stats machines or labs - https://phabricator.wikimedia.org/T331018 (awight) Open→Resolved a:awight >>! In T331018#8684712, @Legoktm wrote: > @awight it's already on Toolforge an...
[08:36:47] (CR) Phedenskog: [C: +2] "Looks good!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/896370 (https://phabricator.wikimedia.org/T264032) (owner: Barakat Ajadi)
[08:37:20] (Merged) jenkins-bot: Navtiming: Add total longtask and total longtask duration [schemas/event/secondary] - https://gerrit.wikimedia.org/r/896370 (https://phabricator.wikimedia.org/T264032) (owner: Barakat Ajadi)
[10:46:44] Hi dcausse - we currently don't have automated tests for HQL queries in refinery - this is something I'd love us to do though
[10:52:31] joal: hi, thanks! fyi there're some discussions at https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/266#note_20357 if you are interested/have some ideas (we used to run simple hql syntax checks on some queries).
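To illustrate the kind of "simple hql syntax check" mentioned above, here is a minimal sketch in Python. It is not the check that was actually run: the function names, the `${name}` placeholder convention, and the two rules (balanced parentheses, no undeclared placeholders) are assumptions chosen for illustration, and it does not parse HQL.

```python
import re

# Assumption: query parameters appear as ${name} placeholders in the HQL text.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")
# Simplified string literals: single- or double-quoted, no escape handling.
STRING_LITERAL = re.compile(r"'[^']*'|\"[^\"]*\"")


def mask_strings_and_comments(hql):
    """Blank out quoted literals, then drop '--' line comments."""
    lines = []
    for line in hql.splitlines():
        line = STRING_LITERAL.sub("''", line)
        lines.append(line.split("--", 1)[0])
    return "\n".join(lines)


def lint_hql(hql, params):
    """Return a list of problems found by two cheap textual checks."""
    problems = []
    depth = 0
    for ch in mask_strings_and_comments(hql):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("unmatched ')'")
                break
    if depth > 0:
        problems.append("unclosed '('")
    # Placeholders are searched in the raw text, so ones inside
    # comments or string literals are reported too.
    for name in sorted(set(PLACEHOLDER.findall(hql)) - set(params)):
        problems.append(f"undefined parameter: {name}")
    return problems


if __name__ == "__main__":
    query = "SELECT count(1 FROM ${source_table} WHERE dt = '${dt}'"
    for problem in lint_hql(query, {"source_table"}):
        print(problem)
```

A check like this only catches textual slips. Validating a query against the real grammar would need the engine's own parser, which this sketch does not attempt.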
[10:52:57] Thanks for the pointer dcausse :)
[11:04:12] (PS2) Jennifer Ebe: T330206 - Create Mediacounts Load Hourly HQL [analytics/refinery] - https://gerrit.wikimedia.org/r/896334
[11:06:43] (PS3) Jennifer Ebe: T330206 - Create Mediacounts Load Hourly HQL [analytics/refinery] - https://gerrit.wikimedia.org/r/896334
[11:19:46] aqu, regarding the airflow package on the stat hosts. You were right I believe, just pushing out the upgrade across stat100[4-8] will be best. It doesn't need to be pinned in puppet.
[11:21:10] aqu: Do you know if we can do it at any time? Will we want to announce it anywhere in particular?
[11:57:35] Yes, I think we should make an announcement in the slack channel data-engineering. Maybe, we could provide some days of notice. And provide the workaround till then (create a conda-env to clone from). Would you like me to do it?
[11:59:28] aqu: Yes please. I'm happy to push out the upgrade whenever you like and I'm around pretty much all week. However, I think you'll probably describe the workaround better than I would.
[12:16:19] PROBLEM - Host an-worker1140 is DOWN: PING CRITICAL - Packet loss = 100%
[12:25:50] Data-Engineering, CheckUser, MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 4 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (Marostegui)
[12:36:29] (CR) Joal: "Minor nits" [analytics/refinery] - https://gerrit.wikimedia.org/r/896334 (owner: Jennifer Ebe)
[12:37:27] (CR) Joal: "Another one I had forgotten" [analytics/refinery] - https://gerrit.wikimedia.org/r/896334 (owner: Jennifer Ebe)
[13:21:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:24:25] I've checked an-worker1004 and the CPU appears to be soft locking.
[13:24:38] !log restarting an-worker1140
[13:24:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:29:41] RECOVERY - Host an-worker1140 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[13:29:51] RECOVERY - Check systemd state on an-worker1140 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:30:25] RECOVERY - SSH on an-worker1140 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:31:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:38] (PS3) Aqu: Migrate refine webrequest to Airflow [analytics/refinery] - https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073)
[13:34:22] (CR) Aqu: "One interrogation about where to put the coalesce hint." [analytics/refinery] - https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073) (owner: Aqu)
[13:39:55] (CR) Joal: Migrate refine webrequest to Airflow (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073) (owner: Aqu)
[13:51:12] hi milimetric - seeing you're going for the pageview deploy now - let me know if you need any help
[13:51:44] thanks...
I think I have it sorted out, it's not too bad, just needed _SUCCESS flags unexpectedly
[13:52:08] Yeah I read that
[13:53:14] !log killing pageview-monthly_dump-coord, pageview-daily_dump-coord, and pageview-hourly-coord oozie jobs to migrate to airflow
[13:53:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:55:43] Data-Engineering, Data Pipelines: Airflow Hackathon (May 2022) - https://phabricator.wikimedia.org/T307500 (JArguello-WMF)
[13:55:45] Data-Engineering-Planning, Data Pipelines (Sprint 09), Patch-For-Review, SecTeam-Processed, Vuln-VulnComponent: Upgrade Puppet code to make Airflow configuration files compatible with version 2.5.0 - https://phabricator.wikimedia.org/T315580 (JArguello-WMF) Open→Resolved
[13:56:54] Data-Engineering, Data Pipelines (sprint 10): Differential privacy airflow-dags merge request - https://phabricator.wikimedia.org/T330234 (JArguello-WMF)
[14:12:17] Data-Engineering, Event-Platform Value Stream (Sprint 09): Upgrade flink-kubernetes-operator to 1.4.0 - https://phabricator.wikimedia.org/T331282 (JArguello-WMF) Open→Resolved
[14:12:22] Data-Engineering-Planning, Event-Platform Value Stream, Epic, Patch-For-Review: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (JArguello-WMF)
[14:12:27] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): Add event dt field to error event schema - https://phabricator.wikimedia.org/T330918 (JArguello-WMF) Open→Resolved
[14:12:32] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, Platform Team Workboards (Clinic Duty Team): Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (JArguello-WMF)
[14:12:37] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): [Flink Operation] How to handle app upgrades
- https://phabricator.wikimedia.org/T328569 (JArguello-WMF) Open→Resolved
[14:12:40] Data-Engineering-Planning, Event-Platform Value Stream, Epic: Flink Operations - https://phabricator.wikimedia.org/T328561 (JArguello-WMF)
[14:12:44] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): [Flink Operations] How to handle restarting a Flink application - https://phabricator.wikimedia.org/T328563 (JArguello-WMF) Open→Resolved
[14:12:47] Data-Engineering-Planning, Event-Platform Value Stream, Epic: Flink Operations - https://phabricator.wikimedia.org/T328561 (JArguello-WMF)
[14:12:53] Analytics-Radar, Data-Engineering-Icebox, Cite, Reference Previews, and 2 others: Remove or simplify tracking metrics - https://phabricator.wikimedia.org/T242127 (thiemowmde)
[14:12:58] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): Flink + Event Platform integration for writing into streams via Table API - https://phabricator.wikimedia.org/T324114 (JArguello-WMF) Open→Resolved
[14:13:00] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): Flink SQL queries should access Kafka topics from a Catalog - https://phabricator.wikimedia.org/T322022 (JArguello-WMF)
[14:13:03] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): Flink SQL queries should access Kafka topics from a Catalog - https://phabricator.wikimedia.org/T322022 (JArguello-WMF) Open→Resolved
[14:15:00] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10), Patch-For-Review: mediawiki-event-enrichment should support the latest eventutilities-python changes - https://phabricator.wikimedia.org/T330994 (JArguello-WMF)
[14:17:21] Data-Engineering, Event-Platform Value Stream (Sprint 10): eventutilities-python should support using Kafka TLS ports - https://phabricator.wikimedia.org/T331526 (JArguello-WMF)
[14:17:23] Data-Engineering-Planning, Event-Platform Value Stream
(Sprint 10): Document Flink job deployment to k8s - https://phabricator.wikimedia.org/T329629 (JArguello-WMF)
[14:17:25] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10): Flink Enrichment monitoring - https://phabricator.wikimedia.org/T328925 (JArguello-WMF)
[14:17:27] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10): Spark Streaming Dumps POC: Backfill metadata table - https://phabricator.wikimedia.org/T323642 (JArguello-WMF)
[14:17:33] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (JArguello-WMF)
[14:27:47] joal: maybe I do need help with something that seems stuck...
[14:28:08] I have this plan to drop the hour column from the unexpected values table:
[14:28:18] https://www.irccloud.com/pastebin/nUPhFYxA/
[14:28:58] and it's just hanging at `drop table pageview_unexpected_values_old;`
[14:34:59] hm... maybe it's ok to just leave that data behind?
[14:49:57] oh ok... it just has TONS of partitions... so I'm slowly going to drop them, drop the table, and probably just insert with spark - easier :)
[15:06:47] Data-Engineering-Planning, Data Pipelines (Sprint 09), SecTeam-Processed, Vuln-VulnComponent: Upgrade Puppet code to make Airflow configuration files compatible with version 2.5.0 - https://phabricator.wikimedia.org/T315580 (sbassett)
[15:10:39] ok, pageview_unexpected_values has been re-created, with daily partitions, and the old data has been saved in _old. I'll wait for standup to see if people want to migrate that or not.
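The "slowly drop the partitions, then drop the table" approach described above could be scripted roughly as follows. This is a sketch under stated assumptions, not what was actually run: the `year`/`month`/`day` partition columns and the date range are hypothetical, and the helper name is made up. It relies on Hive's `ALTER TABLE ... DROP IF EXISTS PARTITION (...)` statement, where a partial partition spec drops every sub-partition beneath it.

```python
from datetime import date, timedelta


def drop_partition_statements(table, start, end):
    """Yield one ALTER TABLE statement per day in [start, end).

    Assumes the table is partitioned by year/month/day (plus, possibly,
    finer-grained columns that the partial spec covers implicitly).
    """
    day = start
    while day < end:
        yield (
            f"ALTER TABLE {table} DROP IF EXISTS PARTITION "
            f"(year={day.year}, month={day.month}, day={day.day});"
        )
        day += timedelta(days=1)


if __name__ == "__main__":
    # Hypothetical date range; the table name is the one from the log.
    for stmt in drop_partition_statements(
        "pageview_unexpected_values_old", date(2023, 2, 27), date(2023, 3, 2)
    ):
        print(stmt)
```

Running the statements one day at a time keeps each metastore call small, so the final `DROP TABLE` has few or no partitions left to clean up.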
[15:16:07] Data-Engineering, Data-Persistence, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (ayounsi)
[15:17:11] Data-Engineering, DBA, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Marostegui)
[15:20:27] Data-Engineering, DBA, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Marostegui)
[15:21:50] Data-Engineering, DBA, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Marostegui) Moved dbproxy1018 as it belongs to #cloud-services-team
[15:22:10] Data-Engineering, DBA, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Marostegui)
[15:22:24] hey milimetric - sorry I missed your ping
[15:22:28] Data-Engineering, DBA, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Marostegui)
[15:22:53] Data-Engineering, DBA, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (MoritzMuehlenhoff)
[15:23:41] np joal, I'm going into a meeting in 7 min but I'll see you at standup.
Nothing urgent, normal deploy stuff
[15:23:55] ack milimetric - thanks a lot
[15:24:25] Data-Engineering, DBA, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Marostegui)
[15:41:56] Data-Engineering, Event-Platform Value Stream: eventutilities-python should support using Kafka TLS ports - https://phabricator.wikimedia.org/T331526 (JArguello-WMF)
[15:45:37] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10): EventStreamCatalog removes 'topic' table option if connector = upsert-kafka - https://phabricator.wikimedia.org/T330769 (JArguello-WMF)
[15:47:12] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10): Improve mediawiki-event-enrichment test suite - https://phabricator.wikimedia.org/T328013 (JArguello-WMF)
[16:07:37] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (EBernhardson) confirmed the instance seems to be working, remaining updates are to be made in the data-engineeri...
[16:20:44] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate DB connection issues faced from airflow on an-launcher1002 - https://phabricator.wikimedia.org/T331265 (nfraison)
[16:26:06] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (MPhamWMF)
[16:39:10] Data-Engineering-Planning, Product-Analytics, Data Pipelines (sprint 10): 13 new wikis missing from mediawiki_history - https://phabricator.wikimedia.org/T329119 (JArguello-WMF)
[16:39:51] (PS1) Milimetric: Add new wikis to mw history pipeline [analytics/refinery] - https://gerrit.wikimedia.org/r/897947 (https://phabricator.wikimedia.org/T329119)
[16:40:55] Data-Engineering-Planning, Product-Analytics, Data Pipelines (sprint 10), Patch-For-Review: 13 new wikis missing from mediawiki_history - https://phabricator.wikimedia.org/T329119 (Milimetric)
[16:41:04] (CR) Joal: [C: +1] "LGTM" [analytics/refinery] - https://gerrit.wikimedia.org/r/897947 (https://phabricator.wikimedia.org/T329119) (owner: Milimetric)
[16:41:51] Data-Engineering-Planning, Product-Analytics, Data Pipelines (sprint 10), Patch-For-Review: 13 new wikis missing from mediawiki_history - https://phabricator.wikimedia.org/T329119 (Milimetric) a:Milimetric
[16:43:15] (CR) Milimetric: [V: +2 C: +2] Add new wikis to mw history pipeline [analytics/refinery] - https://gerrit.wikimedia.org/r/897947 (https://phabricator.wikimedia.org/T329119) (owner: Milimetric)
[16:44:54] (PS4) Aqu: Migrate refine webrequest to Airflow [analytics/refinery] - https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073)
[17:07:10] milimetric: do you have a minute before next call?
[17:08:24] !log restart jobhistory in test cluster to take in account https://gerrit.wikimedia.org/r/c/operations/puppet/+/896305
[17:08:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:09:21] joal: yes, omw cave?
[17:09:26] OMW to
[17:09:30] +o
[17:13:03] Data-Engineering, Machine-Learning-Team, Research, Event-Platform Value Stream (Sprint 10): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (achou) > +1, but, do you think we would want some kind of common naming convention for '...
[17:14:03] !log restart jobhistory in prod cluster to take in account https://gerrit.wikimedia.org/r/c/operations/puppet/+/896305
[17:14:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:37:49] Data-Engineering, Machine-Learning-Team, Research, Event-Platform Value Stream (Sprint 10): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (achou) > We should probably make a mediawiki/page/prediction_change schema that can and...
[18:00:07] Data-Engineering, Machine-Learning-Team, Research, Event-Platform Value Stream (Sprint 10): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (Isaac) > Good point. Starting with predicted_ might be a good idea, so there are predict...
[18:06:35] Data-Engineering, CirrusSearch, Discovery-Search (Current work): Fix permissions in hdfs://analytics-hadoop/wmf/data/discovery - https://phabricator.wikimedia.org/T331580 (EBernhardson) >>! In T331580#8683625, @nfraison wrote: > Command to change right is runnning > > To prevent that we can remove w...
[19:24:11] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (colewhite)
[21:35:08] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (JArguello-WMF) @EBernhardson can I close the ticket?