[00:16:21] Data-Engineering, CirrusSearch, Discovery-Search (Current work): Fix permissions in hdfs://analytics-hadoop/wmf/data/discovery - https://phabricator.wikimedia.org/T331580 (EBernhardson)
[04:30:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:38:08] Data-Engineering, CheckUser, MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 4 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (Marostegui)
[10:18:33] (CR) Ottomata: [C: +2] Remove development/ mediawiki page change schemas [schemas/event/primary] - https://gerrit.wikimedia.org/r/885886 (https://phabricator.wikimedia.org/T308017) (owner: Ottomata)
[10:19:13] (Merged) jenkins-bot: Remove development/ mediawiki page change schemas [schemas/event/primary] - https://gerrit.wikimedia.org/r/885886 (https://phabricator.wikimedia.org/T308017) (owner: Ottomata)
[10:38:47] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09), Patch-For-Review: Streaming services errors should be routed to an error event topic. - https://phabricator.wikimedia.org/T326536 (Ottomata) So: since we decided to use plain on `datastream.map` when `error_destination = False` (inst...
[10:39:12] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09), Patch-For-Review: Streaming services errors should be routed to an error event topic. - https://phabricator.wikimedia.org/T326536 (Ottomata)
[10:51:39] Data-Engineering-Planning, MediaWiki-extensions-EventLogging, Metrics-Platform-Planning, Wikimedia-Hackathon-2023: Allow JavaScript errors to fail CI builds - https://phabricator.wikimedia.org/T318902 (kostajh)
[11:15:01] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09), Patch-For-Review: Streaming services errors should be routed to an error event topic. - https://phabricator.wikimedia.org/T326536 (gmodena) > So: since we decided to use plain on datastream.map when error_destination = False (instead...
[11:21:14] Data-Engineering, serviceops-radar, Event-Platform Value Stream (Sprint 10): Store Flink HA metadata in Zookeeper for mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T331283 (akosiaris) I am guessing flink supports natively writing to zookeeper it's state ? And that this...
[11:25:23] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115 (BTullis) >>! In T331115#8668478, @BTullis wrote: > Then, to my mind, next step should be to...
[11:48:13] Data-Engineering, serviceops-radar, Event-Platform Value Stream (Sprint 10): Store Flink HA metadata in Zookeeper for mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T331283 (Ottomata) > I am guessing flink supports natively writing to zookeeper it's state ? Yes, but more...
[12:24:17] (PS1) Nmaphophe: GDI Equity Landscape Tables [analytics/refinery] - https://gerrit.wikimedia.org/r/895737
[12:43:55] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115 (BTullis) Having discussed this with @Vgutierrez in #wikimedia-traffic I think we have a plan...
[13:05:29] (PS2) Nmaphophe: GDI Equity Landscape Tables [analytics/refinery] - https://gerrit.wikimedia.org/r/895737
[13:06:30] !log upgrading analytics airflow to 2.5.1 on an-launcher1002
[13:06:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:13:53] Hello, we are upgrading our analytics airflow to version 2.5.1 expect some downtime for one hour or so. T326193
[13:13:53] T326193: Airflow upgrade (refactor deb creation + version bump + switch to PostgreSQL) - https://phabricator.wikimedia.org/T326193
[13:58:50] Ack steve_munene - It'd be great if you and others log your operation steps on the chan, for the sake of following and documenting
[13:58:57] (using the !log mechanism)
[14:00:09] !log running puppet on an-launcer1002 to pull the new package after https://gerrit.wikimedia.org/r/c/operations/puppet/+/896098 is merged.
[14:00:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:04:03] !log airflow services were started automatically. airflow db check was successful.
[14:04:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:09:03] thanks a lot btullis :)
[14:35:43] I think that the airflow upgrade on the analytics instance is now complete. Is that correct aqu? Have we deployed the updated airflow_dags and unpaused all of the jobs?
[14:37:25] Waiting for confirmation before sending kudos :)
[14:38:44] Yes, airflow-dags repo is deployed. All dags are unpaused. I'm checking dangling tables. But it shouldn't prevent you from using the brand new UI!
[14:39:49] Notice that we are deploying airflow-dags from the T326194_airflow_deb_creation_with_gitlab_ci branch till all instances are upgraded.
[14:48:21] Dangling tables checked and removed.
[15:00:26] Data-Engineering-Planning: Add optional TLS encrption to the druid-public-broker - https://phabricator.wikimedia.org/T331631 (BTullis)
[15:00:48] Data-Engineering-Planning: Add optional TLS encryption to the druid-public-broker - https://phabricator.wikimedia.org/T331631 (BTullis)
[15:02:10] Data-Engineering-Planning: Add optional TLS encryption to the druid-public-broker - https://phabricator.wikimedia.org/T331631 (BTullis) @lbowmaker I'd like to put this ticket forward as a candidate for the next #shared-data-infrastructure sprint please.
[15:04:09] Data-Engineering-Planning: Add optional TLS encryption to the druid-public-broker - https://phabricator.wikimedia.org/T331631 (BTullis)
[15:10:00] Data-Engineering-Planning: Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115 (BTullis) p:High→Medium Removing this unplanned work from the current sprint and lowering the pririty of the ticket. The immediate issue of puppet not running on the new...
[15:34:09] Data-Engineering-Planning: Data Engineering Pairing system - https://phabricator.wikimedia.org/T327790 (JArguello-WMF)
[16:09:45] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[16:16:38] (CR) Milimetric: GDI Equity Landscape Tables (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/895737 (owner: Nmaphophe)
[16:17:31] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): Flink Enrichment monitoring - https://phabricator.wikimedia.org/T328925 (Ottomata) Just did a little work on the [[ https://grafana.wikimedia.org/goto/ID-QOV-Vz?orgId=1 | Flink Cluster grafana dashboard ]].
[16:24:07] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[16:38:45] btullis, nfraison, steve_munene: We're seing a lot of NodeManager errors since 1/2 hour, with failing jobs and all - Would one of you be around to help troubleshooting?
[16:39:09] joal: Looking now.
[16:42:52] joal: I'm seeing a lot of emails, but on initial inspection they seem to be old emails, as if they have been queued since March 7th. Have you got evidence for the failing jobs as well?
[16:43:39] yes it looks to be old mail
[16:43:54] joal: I think it is an issue with mailman. There is mention of it in #wikimedia-sre
[16:44:27] the RM doesn't show much FAiled job
[16:44:27] +2 btullis it should be linked
[16:47:03] nfraison: Shall we deploy the updated spark-operator now?
[16:47:44] yes we can
[16:49:43] !log deploying updated spark-operator to dse-k8s cluster.
[16:49:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:54:39] ack btullis and nfraison - thanks for checking
[17:16:21] Data-Engineering-Operations, Data-Engineering-Planning, Mail, SRE: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (BTullis) Sorry, these two patches are unrelated to this patch. Added by mistake.
[17:46:02] Hey aqu - would you have a minute to talk about branches for airflow dag creation?
[17:46:57] !log deploying spark-operator once more
[17:46:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:49:09] spark-operator now has the correct cli parameters applied, thanks nfraison
[17:49:17] `exec /usr/bin/tini -s -- /usr/bin/spark-operator -v=2 -logtostderr -namespace=spark -enable-ui-service=false -controller-threads=10 -resync-interval=30 -enable-batch-scheduler=false -label-selector-filter=`
[17:55:44] !log Force kill druid indexing task to unlock druid_load_navigationtiming_daily__load_to_druid__20230228
[17:55:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:16:15] The new spark-operator is running. Work has now shifted to T318924 to make it work with a test application.
[18:16:16] T318924: Submit a spark job to the dse-k8s cluster - https://phabricator.wikimedia.org/T318924
[18:16:41] this is great btullis - kudos!
[18:17:22] Thanks joal. We'll get there. :-)
[18:17:41] for sure we will!
[18:29:15] mforns: before you leave - the druid navigation_timing is fixed - there a blocked task in druid preventing new ones to succeed - I killed the blocked tasks and the pilled ones, restarted the job - success
[18:29:35] ?????
[18:29:50] joal: the Druid ingestion was taking forever
[18:30:15] aaah, a blocked task in druid ingestion, I see
[18:30:30] but how come no other ingestion was being blocked? only that one?
[18:51:08] !log deployed airflow analytics (2.5) with the T326194_airflow_deb_creation_with_gitlab_ci branch
[18:51:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:34:22] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (bking) I believe this is complete, but tagging @EBernhardson to confirm.
[19:40:40] joal, would you like to talk about branches this evening or at morning?
[19:47:30] !log shutting down an-worker1078 for RAID BBU replacement T331544
[19:47:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:47:35] T331544: hw troubleshooting: RAID BBU for an-worker1078 - https://phabricator.wikimedia.org/T331544