[00:12:39] Data-Engineering, Generated Data Platform: Set up regular-repairs for AQS cassandra cluster tables - https://phabricator.wikimedia.org/T297944 (JJMC89) duplicate→Open
[01:02:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[01:12:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[05:39:39] Data-Engineering: Hive query failure in Jupyter notebook on stat1005 - https://phabricator.wikimedia.org/T297734 (odimitrijevic) p:Triage→High
[05:40:03] Data-Engineering, Data-Engineering-Kanban: Hive query failure in Jupyter notebook on stat1005 - https://phabricator.wikimedia.org/T297734 (odimitrijevic)
[11:15:44] Data-Engineering, Data-Engineering-Kanban: Hive query failure in Jupyter notebook on stat1005 - https://phabricator.wikimedia.org/T297734 (BTullis) a:BTullis
[11:35:08] (PS1) GoranSMilovanovic: filter our CurrentEvents [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/749159
[11:35:32] (CR) GoranSMilovanovic: [V: +2 C: +2] filter our CurrentEvents [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/749159 (owner: GoranSMilovanovic)
[11:37:12] Data-Engineering, Data-Engineering-Kanban: Hive query failure in Jupyter notebook on stat1005 - https://phabricator.wikimedia.org/T297734 (BTullis) I have managed to replicate this issue using the hive CLI, although interestingly it skipped `/tmp/parquet-0.log` and tried to open `/tmp/parquet-1.log` for...
[11:44:30] Data-Engineering, Infrastructure-Foundations, netops: Do we still need the Analytics vlans exception? - https://phabricator.wikimedia.org/T298087 (ayounsi)
[13:19:49] (PS1) GoranSMilovanovic: T296926 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/749172
[13:20:23] (CR) GoranSMilovanovic: [V: +2 C: +2] T296926 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/749172 (owner: GoranSMilovanovic)
[13:31:14] Analytics-Radar, Infrastructure-Foundations, SRE, netops: Review the Analytics Firewall rules on cr1/cr2 - https://phabricator.wikimedia.org/T157806 (elukey)
[13:42:21] (PS1) GoranSMilovanovic: T297354 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/749175
[13:42:36] (CR) GoranSMilovanovic: [V: +2 C: +2] T297354 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/749175 (owner: GoranSMilovanovic)
[13:56:59] o/
[14:03:03] Data-Engineering, Data-Engineering-Kanban: Hive query failure in Jupyter notebook on stat1005 - https://phabricator.wikimedia.org/T297734 (Ottomata) Right, and it isn't a race condition, since the existent parquet-1.log was created Nov 2 or before. As a workaround, perhaps we could modify the FileHandl...
[15:08:27] mforns: o/
[15:08:34] heya ottomata :]
[15:08:48] am thinking the best way to add wmf_airflow_common to pythonpath
[15:08:54] aha
[15:08:57] is going to be to symlink it from analytics/dags
[15:09:12] rather than doing some stuff to airflow processes
[15:09:25] but i am having trouble testing i think
[15:09:49] wanna bc?
[15:10:02] k
[15:44:53] mforns: i think i need the change in airflow-dags before i can merge the puppet stuff
[15:45:40] I'm on it, do you want to create an origin branch with your needed changes and I can push on top of that, or you can wait?
[15:46:14] no my stuff is in puppet
[15:46:34] for this mforns i'd say just push directly to airflow-dags, no need for a merge request
[15:46:50] ok
[15:48:54] this is much better, analytics-test can now define its own artifacts too
[15:51:38] aha!
[15:52:12] ottomata: I just pushed analytics-test to main
[15:52:20] It should be ready to go
[15:52:38] great
[15:52:43] ottomata: oh, the symlink is missing
[15:53:23] k add it! it'll take me 2 mins
[15:53:23] ...
[15:53:31] to get this new deployment ready
[15:53:33] k
[15:56:10] ottomata: done
[16:25:51] ok mforns also done.
[16:26:00] updated https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow#analytics-test
[16:26:03] it is its own deploy now
[16:26:09] so no more scap deploy -e hadoop test
[16:51:52] ottomata: ok
[16:52:28] will redeploy to test it, and check that the job is running there correctly
[16:52:44] ottomata: after that, can I deploy it to the prod cluster?
[17:03:00] mforns: sure thang! oh wait.
[17:03:06] no i need to update the airflow env there
[17:03:09] will do right after standup
[17:03:16] ok, no prob!
[18:55:36] ok mforns analytics instance ready to go
[18:55:40] deploy away
[19:20:11] ok ottomata! I'm still waiting for the test-cluster DAG to execute a couple more dates, and send me an email, but will do!
[19:34:19] okay!
[20:03:12] ottomata: I finished testing the thing in the test cluster, and copied the job for prod, with the modified properties.
[20:03:18] awesome
[20:03:25] feel free to merge away
[20:03:26] ottomata: then, deployed to prod, and ran the dag
[20:03:30] ok good
[20:03:42] it failed, says analytics-hive does not exist?
[20:06:09] ottomata: ^
[20:06:33] checking mforns
[20:06:43] it certainly does not!
[20:06:45] making it!
[20:08:34] k :]
[20:14:56] mforns: try now
[20:15:02] k
[20:17:20] ottomata: working!
[20:18:26] yeehaw!
[20:24:03] ottomata: hm, there's a weird problem with tasks executing before they should...
[20:24:58] oh?
[20:25:03] mforns: ?
[20:25:15] I think it's related to task concurrency
[20:25:33] let me check that the related properties are set correctly
[20:37:22] k
[20:46:59] ottomata: it's looking good now
[20:47:06] gr8!
[20:48:21] ottomata: I think the dev environment has concurrency=1 by default, and the test cluster hid the issue because data was already present
[20:48:56] just added a concurrency property to the anomaly detection dag factory
[20:48:57] hmmm
[20:49:18] doesn't sound right, they aren't configured differently afaik
[20:49:49] mforns: what setting controls default concurrency?
[20:50:05] btw on both machines
[20:50:09] in this case: max_dag_runs
[20:50:09] you can access the airflow CLI
[20:50:10] like
[20:50:12] sudo -u analytics airflow-analytics-test
[20:50:14] or sudo -u analytics airflow-analytics
[20:51:49] mforns: max_active_runs ?
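(Editor's note: a minimal sketch of the symlink approach discussed above at 15:08-15:09. Airflow adds the DAGs folder itself to sys.path while parsing DAG files, so a package symlinked inside that folder becomes importable without touching the Airflow service environment. The paths below are assumptions for illustration, not the actual airflow-dags repository layout.)

    # Hypothetical sketch: make wmf_airflow_common importable from DAG files by
    # symlinking it into the DAGs folder (paths are placeholders).
    import os

    dags_dir = "/srv/deployment/airflow-dags/analytics/dags"        # assumed layout
    common_pkg = "/srv/deployment/airflow-dags/wmf_airflow_common"   # assumed layout
    link = os.path.join(dags_dir, "wmf_airflow_common")

    if not os.path.islink(link):
        os.symlink(common_pkg, link)

    # DAG files living in dags_dir can then simply do:
    #   from wmf_airflow_common import some_module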
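(Editor's note: the failure at 20:03 above, "analytics-hive does not exist", refers to a missing Airflow connection that was then created for the prod instance. Below is a minimal sketch of registering such a connection programmatically; the conn_type, host, and port are placeholder assumptions, not the actual production settings, and how the connection was really provisioned is not shown in the log.)

    # Hypothetical sketch: register an "analytics-hive" Airflow connection.
    # conn_type/host/port are placeholder assumptions, not the real WMF values.
    from airflow import settings
    from airflow.models import Connection

    session = settings.Session()
    if not session.query(Connection).filter_by(conn_id="analytics-hive").first():
        session.add(
            Connection(
                conn_id="analytics-hive",
                conn_type="hiveserver2",            # assumed; could also be hive_cli
                host="analytics-hive.example.net",  # placeholder hostname
                port=10000,                         # default HiveServer2 port
            )
        )
        session.commit()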
[20:52:07] max_active_runs_per_dag
[20:52:21] same on both
[20:52:22] max_active_runs_per_dag = 16
[20:52:29] it is then optionally overridden at a DAG level by max_active_runs
[20:52:32] 20:51:09 [@an-test-client1001:/srv/airflow-analytics-test] $ sudo -u analytics airflow-analytics-test config list | grep max_active
[20:52:32] max_active_runs_per_dag = 16
[20:52:37] 20:51:22 [@an-launcher1002:/srv/airflow-analytics] $ sudo -u analytics airflow-analytics config list | grep max_active
[20:52:37] max_active_runs_per_dag = 16
[20:52:50] yes, that is the default for the test cluster as well as the prod cluster
[20:53:01] however it's not the case for the dev instance that the script launches
[20:53:05] OH, dev instances
[20:53:07] sorry
[20:53:09] that makes sense
[20:53:23] because probably the dev instances use ummm, what is it? sequential executor?
[20:53:37] yes
[20:53:42] https://airflow.apache.org/docs/apache-airflow/stable/executor/sequential.html
[20:53:43] in my tests in the test-cluster the issue was hidden by the fact that data was already there for the anomaly_detection table
[20:53:49] yea
[20:53:49] aye
[20:55:57] ottomata: so, I guess I will leave it running for tonight, and if all goes well, tomorrow I will add the 2 remaining anomaly detection jobs
[20:56:18] ok great!
[20:56:32] mforns: i'm off tomorrow until pretty much in 2022 sometime
[20:57:13] start of the year, we can improve the jobs by using proper dependency management, and also jos-eph's cluster mode SparkSQL
[20:57:24] cool!
[20:57:39] yeahhh!
[20:57:54] thanks ottomata for the huge help :D
[20:59:11] <3
[21:55:44] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (Cmjohnson) @BTullis Have you had a chance to look and see if we're using the correct partman recipe?
[23:36:20] Data-Engineering-Kanban, Airflow: Tooling for Deploying Conda Environments - https://phabricator.wikimedia.org/T296543 (Ottomata) Update. Stuff not quite ready for review, but I've done a lot of bootstrapping work in https://gitlab.wikimedia.org/otto/workflow_utils/. - conda.py module with a conda-dist...
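(Editor's note: a minimal sketch of the per-DAG override discussed above at 20:52. max_active_runs_per_dag is the instance-wide default, 16 on both the test and prod deployments, while max_active_runs on the DAG object caps how many runs of that single DAG can be active at once. The dag_id, schedule, and task below are hypothetical, not the real anomaly detection job.)

    # Hypothetical sketch of a DAG-level concurrency cap (Airflow 2.x).
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="anomaly_detection_example",   # placeholder dag_id
        start_date=datetime(2021, 12, 1),
        schedule_interval="@daily",
        catchup=True,
        max_active_runs=1,  # overrides the instance-wide max_active_runs_per_dag (16)
    ) as dag:
        BashOperator(
            task_id="detect_anomalies",
            bash_command="echo 'run anomaly detection here'",
        )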