[03:10:13] Quarry: Users get logged out from Quarry every day (or two) - https://phabricator.wikimedia.org/T362025#9699626 (GTrang)
[08:13:37] (CR) Gmodena: [V:+2 C:+2] "stream config is aligned and has been tested in dev (stat boxes). Merging." [analytics/refinery] - https://gerrit.wikimedia.org/r/983926 (https://phabricator.wikimedia.org/T314956) (owner: Ottomata)
[09:08:19] (CR) Cyndywikime: Add analytics for Impressions, Success and Abandonment of account creation (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[09:09:21] (PS49) Cyndywikime: Add analytics for Impressions, Success and Abandonment of account creation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273)
[09:23:55] (PS50) Cyndywikime: Add analytics for Impressions, Success and Abandonment of account creation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273)
[10:19:51] Data-Engineering, Dumps-Generation, SRE, Data-Platform-SRE (2024.03.25 - 2024.04.14): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9700186 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapsh...
[10:47:43] Data-Engineering, Dumps-Generation, SRE, Data-Platform-SRE (2024.03.25 - 2024.04.14): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9700307 (BTullis)
[10:51:00] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14): Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397#9700320 (BTullis)
[11:01:18] Data-Engineering, Dumps-Generation, SRE, Data-Platform-SRE (2024.03.25 - 2024.04.14): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9700341 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot10...
[11:01:43] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14): Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397#9700349 (BTullis) a:BTullis
[12:06:33] !log starting a refinery deployment for 2024-04-09
[12:06:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:13:26] (CR) Urbanecm: [C:+1] "schema lgtm, pending WikimediaEvents" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[12:34:48] btullis: sorry just seen the ping from yesterday! I hope that no cleanup is needed (Re: grafana and aqs/cassandra)
[12:35:01] (PS24) Joal: Extract RefineSingleApp code from Refine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363)
[12:35:11] IIUC the old rules that I added because of different cassandra versions (IIRC) is not needed anymore
[12:35:19] and it causes some labels to be dropped etc..
[12:40:36] Quarry: Error in web instances. - https://phabricator.wikimedia.org/T362157 (rook) NEW
[12:54:19] btullis: the dashboards seem to work now :)
[12:55:38] elukey: awesome. Thanks for that, both past and present Luca :-)
[12:55:52] ahahahh np :D
[12:55:53] <3
[12:56:14] I am doing the roll restart of aqs cassandra instances in eqiad for the new truststore
[12:56:20] !log successfully deployed refinery to hadoop and hadoop-test
[12:56:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:56:28] a few things to fix and in theory we'll be able to move to pki
[12:57:57] gmodena: There may be a hiccup in the `refinery-deploy-to-hdfs` script. If it fails, let us know.
[12:58:18] btullis no failures, as long as I can tell
[12:59:31] gmodena: OK, great.
[13:01:01] (CR) Joal: Extract RefineSingleApp code from Refine (5 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363) (owner: Joal)
[13:01:24] (PS25) Joal: Extract RefineSingleApp code from Refine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363)
[13:20:44] !log shut down stat1010 to have the GPU power connected for T336040
[13:20:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:20:48] T336040: Bring stat1010 into service with GPU from stat1005 - https://phabricator.wikimedia.org/T336040
[13:53:11] Data-Engineering (Q4 2024 April 1st - June 30th), Data-Platform, Patch-For-Review: Unique devices tables have missing or incorrect data for January and February 2024 - https://phabricator.wikimedia.org/T361242#9700964 (lbowmaker) a:JAllemandou
[13:58:14] joal: o/
[13:58:18] bonsoir!
[13:58:28] (better bonjour, too early I know)
[13:59:22] when you have a moment - after a chat with Eric IIUC I'd need to find the spark loader workflow that pushes data to AQS' Cassandra
[13:59:41] to update the client's TLS settings (if needed) to allow PKI
[13:59:46] do you know where I should look?
[14:01:12] (if I have to guess I'd say WmfCassandraAuthConfFactory.scala in refinery source)
[14:01:32] Quarry: Error in web instances. - https://phabricator.wikimedia.org/T362157#9701028 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/quarry/pull/37
[14:20:07] elukey: that looks pretty new: https://phabricator.wikimedia.org/T356400#9506834 but I searched everywhere and it doesn't look like this is deployed: https://codesearch.wmcloud.org/search/?q=cassandra.auth.conf.factory&files=&excludeFiles=&repos=analytics%2Frefinery%2Canalytics%2Frefinery%2Fsource%2Crepos%2Fdata-engineering%2Fairflow-dags
[14:20:25] nevertheless, I think it would be the right place to change, and then it's up to us to make it so on all the airflow jobs
[14:31:08] milimetric: o/ thanksss - I don't recall 100% what we do but IIRC we load some dataset from hadoop to cassandra/aqs, or am I wrong?
[14:31:16] if so even the code related to that needs to be fixed
[14:31:29] basically I am trying to not break your workflows upgrading TLS certs :D
[14:31:54] yes, we load using airflow, all the config is handled in a central default place, so should be easy to fix to add this conf factory
[14:32:13] yeah, I think until we make that change you're right, that would break it
[14:32:42] ahh okok, can you point me to the airflow config? Is it in puppet?
[14:33:02] Quarry: Error in web instances. - https://phabricator.wikimedia.org/T362157#9701132 (rook) Seems like this error is coming from the sqlite file not existing, all the requests seem to be for csv or json files. It is not clear who is calling for these files. This error should probably be caught as a file not f...
[14:33:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[15:04:27] mmm so I am checking the airflow configs on an-launcher1002
[15:04:32] and I don't see traces of AQS
[15:04:53] I see hive and jumbo
[15:09:48] (PS5) Aqu: Add CLI to create or update Iceberg tables [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1016808 (https://phabricator.wikimedia.org/T356762)
[15:10:33] elukey: Does this help? https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/tree/main/analytics/dags/cassandra_load?ref_type=heads
[15:13:11] btullis: yes definitely! I was checking the dag logs, didn't know it about the repo.. are the settings about how to connect to cassandra stored in there too?
[15:13:17] or elsewhere?
[15:14:25] from a quick glance it doesn't seem to me that there are any specific TLS settings (like specifying the CA bundle to use to verify the cassandra instances certs etc..)
[15:14:49] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/config/dag_default_args.py#L182-190
[15:16:47] I was wondering if there was something in the `/srv/airflow-analytics/connections.yaml` file for airflow, which is generated from puppet, but it looks like there isn't.
[15:19:21] elukey: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-ssl-connection-options
[15:22:35] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397#9701497 (BTullis)
[15:25:07] btullis: thanksss! Am I reading it right that we don't use TLS now?
[15:27:55] ahh yes afaics we allow both type of conns
[15:28:55] so in theory we should be good in moving the first node to PKI
[15:29:08] Yes, I believe you're right.
[15:29:15] we can force TLS in airflow when we are on PKI
[15:29:29] the cacerts etc.. handling will be way easier
[15:29:32] thanks a ton btullis
[15:33:21] A pleasure.
[15:36:38] Data-Engineering, Cassandra, Data-Persistence: Encrypt Airflow connections to AQS Cassandra - https://phabricator.wikimedia.org/T362181 (elukey) NEW
[15:36:38] opened a task :)
[16:33:10] (PS6) Mforns: WIP: Clean up and parameterize SQL code for Common Impact Metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/1016796 (https://phabricator.wikimedia.org/T358681) (owner: Xcollazo)
[18:33:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[18:41:37] deploying airflow dag to fix mediawiki_history_metrics_monthly dag
[18:58:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[19:12:57] (PS1) Aleksandar Mastilovic: WMCS unpivoted data HQL script [analytics/refinery] - https://gerrit.wikimedia.org/r/1018331
[19:32:33] (PS7) Xcollazo: WIP: Clean up and parameterize SQL code for Common Impact Metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/1016796 (https://phabricator.wikimedia.org/T358681)
[19:35:44] (CR) Xcollazo: "Done" [analytics/refinery] - https://gerrit.wikimedia.org/r/1016796 (https://phabricator.wikimedia.org/T358681) (owner: Xcollazo)