[00:10:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org [02:01:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org [02:11:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org [02:36:01] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org [06:36:01] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org [10:36:01] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org [11:59:55] I have created a CR to increase the threshold for this EventgateLoggingExternalLatency alarm: https://gerrit.wikimedia.org/r/c/operations/alerts/+/748704 [12:00:13] I have also silenced it for 3 weeks on alertmanager. [12:01:24] I will create another CR to stop the Java heap alerts from firing on an-test-coord1001 - Apologies for that noise. [12:04:13] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, and 2 others: Apparent latency warning in 90th centile of eventgate-logging-external - https://phabricator.wikimedia.org/T294911 (10BTullis) Thanks for the feedback @Joe - I'm sure you're right. I'll keep searching for the cause... [14:19:35] o/ [14:33:45] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, and 2 others: Apparent latency warning in 90th centile of eventgate-logging-external - https://phabricator.wikimedia.org/T294911 (10Ottomata) > As we know, the primary use case for eventgate-logging-external is that of Network Er... [14:35:57] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, and 2 others: Apparent latency warning in 90th centile of eventgate-logging-external - https://phabricator.wikimedia.org/T294911 (10Ottomata) > BadRequestError: request aborted at IncomingMessage.onAborted https://github.com/expr... [14:39:59] (03CR) 10Ottomata: MEP schema for IOS Notification Interaction (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/747967 (https://phabricator.wikimedia.org/T290920) (owner: 10Sharvaniharan) [14:40:57] (03CR) 10Ottomata: [C: 03+1] Add network_internal_flows to gobblin netflow job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/748099 (https://phabricator.wikimedia.org/T263277) (owner: 10Joal) [14:41:17] ottomata: hi! [14:41:23] mforns: helloOO [14:41:40] I've moved the airflow code from workflow_utils to airflow-dags [14:42:03] and tested an'all, works well, so I created a couple merge requests for both repos [14:42:07] kay! [14:42:10] can you please have a look? [14:42:13] yes! [14:42:16] https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/3 [14:42:20] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/6 [14:42:57] The only thing remaining before we can merge and run jobs in prod is to symlink the common directory in airflow-dags no? [14:43:23] The imports in the dag code look like: from common import blah. Is that OK? Or should it be just: import blah? [14:43:54] from common fine, but I wonder if we should namespace this a little better, [14:44:02] maybe [14:44:05] ok! [14:44:12] what do you have in mind? [14:44:25] wmf_airflow_dags/common [14:44:25] ? [14:44:26] or [14:44:32] airflow_dags/common [14:44:33] or [14:44:34] ...if we do that [14:44:37] maybe common isn't needed? [14:44:40] what do you think? [14:45:09] you mean creating an 'airflow_dags' directory within the airflow-dags repository? [14:45:14] yes [14:45:25] i think the python imports should make sense [14:45:29] 'common' is not very obvious [14:45:36] yea, you're right [14:46:09] oh also, myabe add __init__.py files where appropriate in these dirs? [14:46:20] I would avoid using 'dags', because it will not contain DAGs (even if it contains some DAG templates) [14:46:24] okay [14:46:38] perhaps we shouldn't call this repo 'airflow-dags' :p [14:46:45] I removed the __init__.py files, because I read that after Python3.4 they are not needed [14:46:49] oh! [14:46:49] really! [14:46:52] i did not know that [14:46:53] but file to readd them [14:46:54] reading... [14:46:56] fine [14:47:12] mforns: how about just wmf_common [14:47:12] then? [14:47:16] kiss we can rename later [14:47:25] they say you should/can only use them if you want some custom initialization code [14:47:40] wmf_airflow? [14:47:44] wmf_airflow_utils? [14:47:55] airflow_utils? [14:49:05] wmf_airflow_common [14:49:05] ? [14:50:53] ok! [14:51:08] changing [14:51:17] ottomata: are you OK then with no __init__.py files? [14:51:20] https://stackoverflow.com/a/48804718 [14:51:22] and [14:51:31] https://stackoverflow.com/a/56277323 [14:51:42] so, maybe we should have __init__.py files? [14:51:50] I see [14:51:54] OK, changing both [15:04:41] mforns: not to do now, but i wonder if we should have a think about how to namespace wmf python modules we make! [15:04:54] yesss [15:04:55] after reading about namespace packages, might be have a consistent namespacing [15:04:59] even in different repos [15:05:02] its easy for java [15:05:04] aha [15:05:06] org.wikimedia [15:05:30] not to do now though...just lets keep it in mind [15:05:39] ok, pushing code then [15:05:58] ottomata: done: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/6 [15:06:53] mforns: did you mean to have a /home/mforns path in that code? [15:07:23] ottomata: yes, it was on purpose, given that next step will be to use dependency management for that file [15:07:38] ottomata: but I can change that, and look for a better location for that file [15:07:40] ah okay, hmmm. [15:07:53] yeah i guess tha'ts fine for now then, i think i thought you were just going to commit that to the repo with this for now [15:07:55] but yeah. [15:07:58] we can commit it to refinery repo? [15:07:58] this is fine too [15:08:01] both are temproary [15:08:14] hmm mforns is it possible to put it in hdfs? [15:08:27] yes, I guess it should! [15:08:29] then it'll jsut work from both test and prod instances [15:08:39] aha, that's good [15:09:07] do you think we can deploy that via refinery repo? [15:10:02] ok, testing it with hdfs, we can always deploy it via refinery [15:11:34] oh duh wait thats not true; it has to be in the test hdfs cluster [15:11:42] mforns: deploying it with refinery makes sense [15:11:46] or i mean... [15:11:49] just commit it to this repo for now [15:11:51] then we only need one deploy? [15:13:06] hm [15:13:30] it should be quick to deploy to refinery no? [15:17:20] ottomata: where do you thing airflow queries should go within refinery repo? [15:17:33] airflow directory or hive directory or a new hql directory? [15:18:04] queries [15:18:06] directory? [15:18:12] or maybe [15:18:13] jobs [15:18:13] ? [15:18:21] jobs//... [15:18:25] like we do for oozie [15:18:29] but now just 'jobs' to be generic? [15:18:41] I see [15:18:47] just a suggestion, [15:18:51] what you think? [15:18:52] i dunno! [15:18:52] yea yea [15:18:53] :) [15:19:08] :D [15:19:14] yes, I like it [15:19:44] I was just wondering if some of those queries can also be used outside of the job context [15:19:55] but they could be nevertheless... [15:20:26] the good thing about jobs is that we could have other static files there [15:20:32] not just queries [15:21:41] kay [17:44:31] (03PS3) 10Sharvaniharan: MEP schema for IOS Notification Interaction [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/747967 (https://phabricator.wikimedia.org/T290920) [17:48:10] (03CR) 10Sharvaniharan: MEP schema for IOS Notification Interaction (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/747967 (https://phabricator.wikimedia.org/T290920) (owner: 10Sharvaniharan) [17:51:11] (03PS1) 10Mforns: Add anomaly detection hql for Airflow jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/748767 (https://phabricator.wikimedia.org/T295201) [17:54:08] ottomata: if you give the OK to that, I will deploy refinery, test the airflow job again and push the final code to airflow-dags, to be tested in the test cluster [18:09:26] (03CR) 10Ottomata: [C: 03+2] Add anomaly detection hql for Airflow jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/748767 (https://phabricator.wikimedia.org/T295201) (owner: 10Mforns) [18:09:30] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add anomaly detection hql for Airflow jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/748767 (https://phabricator.wikimedia.org/T295201) (owner: 10Mforns) [18:09:48] mforns: +1 and feel free to merge in airlfow-dags too [18:09:49] and proceed [18:17:11] ottomata: thanks! [18:39:58] !log started to deploy refinery, adding anomaly detection hql for airflow job [18:40:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:59:04] !log finished deployment of refinery, adding anomaly detection hql for airflow job [18:59:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:16:43] ottomata: I deployed refinery, adapted the airflow-dags change to point to the query within refinery in HDFS, and merged that [19:17:04] okay [19:17:08] ottomata: however, the refinery deployment to the test cluster failed, it actually has been failing for weeks to me [19:17:15] oh [19:17:19] for why? [19:17:20] :) [19:17:37] let me get the error message [19:17:44] argh I closed the tab [19:17:58] I'll try again one sec [19:18:59] ottomata: 19:18:34 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'analytics/refinery', '-g', 'default', 'fetch', '--refresh-config'] (ran as analytics-deploy@an-test-client1001.eqiad.wmnet) returned [255]: Host key verification failed. [19:19:48] key out of date? [19:21:56] hm [19:22:04] that's strange [19:22:12] maybe puppet is paused there.. [19:22:17] oh [19:22:22] it is [19:22:24] and by me. :o [19:22:37] one sec [19:22:41] :] [19:24:10] Hm, now I'm thinking that the refinery deploy "version", which I hardcoded in the DAG code, to point to the versioned code, will not work in the test cluster, since the timestamp is different [19:24:20] like: 2021-12-20T18.52.37+00.00--scap_sync_2021-12-20_0001-dirty [19:24:48] mforns try now [19:24:52] ok [19:24:56] oh yeah [19:25:03] that's true [19:25:04] ottomata: same [19:25:08] mforns: i mean... you can use local path after all for now. [19:25:19] same? [19:25:21] checking [19:25:23] yes [19:26:03] oh a second puppet run changed more ssh host stuff [19:26:04] try again. [19:26:10] ok [19:26:20] same :S [19:30:37] ok i try [19:32:39] hm, puppet needed run on deployment host too. just got it to work [19:32:43] i thkn [19:34:03] aha [19:34:09] its deploying... [19:34:13] yay [19:34:59] ottomata: I imagine I have to ssh into the test cluster coord and execute refinery-deploy-to-hdfs? [19:36:42] mforns: yes if you want to use an hdfs path [19:37:06] hangon its still deploying....syncing fat jars [19:37:11] oh ok [19:38:03] mforns: ok done deploying [19:38:17] k, deploying to nodes [19:40:52] started... [19:53:21] finished [20:00:23] ok, ottomata I think it's ready to deploy to test cluster [20:00:34] okay! [20:00:46] mforns: [20:00:46] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow#Deployment_of_airflow-dags [20:01:04] ok, doing! [20:01:14] oh and we need wmf_airflow_common in pythonpath [20:01:16] on it... [20:02:43] oh [20:02:44] no we don't [20:02:51] or wait... [20:02:55] ok will figure it out... [20:02:58] k [20:04:07] deploying [20:04:15] k [20:05:05] ottomata: gave the following error: [20:05:11] 20:04:08 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'airflow-dags/analytics', '-g', 'default', 'promote', '--refresh-config'] (ran as analytics@an-test-client1001.eqiad.wmnet) returned [1]: Could not chdi [20:05:11] r to home directory /nonexistent: No such file or directory [20:05:36] hmmm mmMMMM [20:05:59] going to try too [20:06:15] k [20:09:25] mforns: do you also get the other error about artifacts and kerberos? [20:09:56] ottomata: yes [20:10:06] ok yeah i see why [20:10:16] lets see [20:11:56] ok mforns , fixed, you deploy too just to make sure it works for you too [20:12:07] ok doing [20:12:11] FYI this was the fix: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-analytics/-/commit/40b49fe93417617268f49d5a6dd22096b0a5a5b9 [20:12:27] ottomata: yesss it vvvorked [20:12:29] nice [20:12:37] okay, working on wmf_airflow_common in python path [20:12:56] aha, makes sense [20:12:59] ok [20:13:46] ottomata: I will edit dag_config.py to change hadoop_name_node to test and also change the .hql file path, is that ok? [20:14:29] k [20:15:20] mforns: [20:15:26] you probably don't need hadoop_name_node [20:15:32] you could just set hdfs:/// [20:15:39] and let the default one be picked up [20:15:53] sure, but then I need to change the paths to the jars as well [20:15:58] I can do that too [20:16:11] mforns: path to jars? [20:16:14] this way I just edit 1 file [20:16:23] yes, the spark job and hive udfs ones [20:16:35] oh [20:16:39] that's the same on all the clusters, right? [20:16:43] on both clusters. [20:16:47] ? [20:17:11] yes, but they have the analytics-hadoop prefix [20:18:00] {dag_config.artifacts_directory}/org/wikimedia/analytics/refinery/refinery-job-0.1.23-shaded.jar [20:18:47] in this case dag_config.artifacts directory is: hdfs://analytics-hadoop/wmf/refinery/current/artifacts [20:22:44] mforns: [20:22:50] just change hadoop_name_node to hdfs:/// [20:22:56] and all will just work, right? [20:22:59] they all reference that var? [20:23:11] Yesss [20:23:28] ottomata: but the deployment timestamp for refinery differs [20:23:37] that's why I also need to change the hql path explicitly [20:23:56] yes, in hdfs [20:24:03] mforns: maybe just use local path for now to the hql file? [20:24:08] oh but for the jars [20:24:27] it's fine, I'll change the dag file with the correct paths [20:24:31] i see i see... [20:24:38] mforns: how do you vary between the different instances [20:24:40] test vs prod? [20:24:58] ideally this job would work without manual intervention when deployed to both, right? [20:25:05] I don't undertand... :] [20:25:10] i guess, when we sync artifacts to cache it won't matter [20:25:11] mforns: [20:25:14] I see... [20:25:15] there are two hadoop clusters [20:25:21] yes yes [20:25:23] and we have 2 airflow instances [20:25:31] airflow-analytics-test [20:25:34] uses the hadoop-test cluster [20:25:37] when we do it with oozie, we have to change the hadoop name node property too [20:25:45] right, but we do it when we submit the jbo manually, right? [20:25:49] yes [20:25:51] are we submitting manually with airflow? [20:25:54] no [20:26:08] we can not change that as a parameter... hmmmmmm [20:26:39] guess we need environment specific config? [20:26:48] yes [20:26:59] hmm, you know the AIRFLOW_HOME [20:27:05] that will be different on the different instances :/ [20:27:22] why? [20:28:19] they are different instances [20:28:25] with different instance names [20:28:33] like hadoop cluster has different names [20:28:48] i suppose we could name them the same... [20:28:56] but, i was suggesting you could use that to vary the config [20:29:07] if AIRFLOW_HOME == '/srv/airflow-analytics-test', etc. [20:29:13] kinda hacky [20:29:26] ah! [20:29:28] it sounds like a better thing to do would be to introduce some kind of airflow environment name concept [20:29:52] yes [20:31:24] ottomata: maybe it could be specified in the airflow.cnf file? [20:31:29] we could do that via puppet [20:31:37] and you can read that from airflow code [20:32:22] mforns: oh ? in what var? [20:32:34] we can make one up [20:32:37] we can?! [20:32:46] I guesssss...??!?!? [20:34:15] maybe not :( [20:35:13] hmm, mforns i wonder if it might be better to symlink in wmf_airflow_common into analytics [20:35:21] oh but dags... [20:35:45] hm, yeah it'd have to be symlink in analytics/dags/ [20:36:09] aha [20:36:16] hm [20:38:59] ok mforns I haven't done it in puppet [20:39:02] but manually on an-test-client1001 for now [20:39:08] ok [20:39:23] sudo -u analytics airflow-analytics-test info | grep python_path [20:39:36] OHHH [20:39:40] that' not quite right [20:39:50] because we want to do from wmf_airflow_common import... [20:39:51] one sec [20:39:58] yes [20:40:26] ok [20:40:55] ottomata: have you switched on the dAG? [20:41:09] it's running [20:42:27] will stop it, it might be a remains of the last test DAG... [20:44:15] ottomata: I think dependency management, as we imagined it, will help with the test environment, because everything will be an outside source, and will be pulled to the correct place [20:44:27] i did not [20:44:33] indeed [20:45:00] ok, should I switch it on? [20:45:10] the dag? sure! try it out [20:45:15] i'm working on some puppet stuff to make this better [20:45:15] k [20:45:25] actually, i could set an environment variable here. [20:45:34] we could make one up that you could look up via os.environ [20:45:53] AIRFLOW_ENVIRONMENT ? [20:45:58] AIRFLOW_ENVIRONMENT_NAME [20:46:56] aha, that makes sense, I like both [20:47:19] ottomata: the hive metastore connection doesn't match the name I gave it in the code [20:47:53] was the connection created by puppet or manually?> [20:48:12] mforns: i think the connection is not configured [20:48:16] it should be created by puppet [20:48:24] oh [20:48:28] probably when i reenabled puppet [20:48:32] it wiped some manual stuff we made before [20:48:38] what are the deets? i'll add to puppet now [20:49:15] ummmmm [20:49:20] ottomata: the connection is there! [20:49:32] the thing is the name is: analytics-test-metastore [20:49:42] and this doesn't match the name I gave it in the code [20:49:43] its probably some manual entry done in the db [20:49:47] we should do it in puppett [20:49:51] right [20:49:55] what name? [20:50:14] should it change name depending on environment? [20:50:22] hm i'd call it analytics-test-hive [20:50:24] and yes it should [20:50:30] they are different hive metastores [20:51:04] hmmm [20:51:15] maybe analytics-test-metastore is a good name! [20:51:22] erg. [20:51:22] ok [20:51:24] lets see.. [20:51:25] then will change the code [20:52:04] we call the name of the hive catalog in pfersto 'analytics-hive' [20:52:14] analytics-test-hive is probably consistent then [20:52:18] ok [20:52:22] e.g. analytics-hadoop, vs analytics-test-hadoop [20:52:27] ok [20:52:33] analytics-test-hive then [20:53:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org [20:53:39] hm! [20:56:14] ooops [21:03:03] ok so mforns the question is [21:03:03] https://puppet-compiler.wmflabs.org/pcc-worker1001/33058/an-test-client1001.eqiad.wmnet/index.html [21:03:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org [21:03:35] why do these connections not show up in airflow connection list! [21:03:37] brb... [21:03:50] atually merging real quick, then back in 30 [21:04:45] is extra_dejson the expected name? [21:05:04] I'll also pause for a bit [21:05:14] no, that is how you pass the extra conig [21:05:15] config [21:05:24] OH, but i might need to have it there as a string. [21:05:53] ah no [21:05:57] extra_dejson is good [21:05:57] https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/secrets-backend/local-filesystem-secrets-backend.html#storing-and-retrieving-connections [21:06:49] ok done. but we need to figure out why we only see the UI ones? maybe we need a way to disable the UI entered connections. [21:06:52] ok back in 30 [21:12:27] k [21:12:31] me too [21:15:14] oh mforns [21:15:15] https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/secrets-backend/local-filesystem-secrets-backend.html#storing-and-retrieving-variables [21:15:18] for config ^ [21:15:40] that might be right answer to all your config questions [21:15:48] https://airflow.apache.org/docs/apache-airflow/stable/howto/variable.html [21:17:25] hm, but airflow docs make clear that variables should not be (ab)used at DAG interpretation time, because each variable lookup is a database query [21:18:10] if we have 100 dags, and a DAG interpretation interval of 10 seconds.. then we might add some overhead. But maybe it's good [21:18:26] maybe it's fine, if we have all config go into the same variable, as a json blob [21:39:36] not if they in a file? [21:39:38] or, maybe not. [21:42:24] back [21:43:12] hmm, yeah we don't need these variables in tasks, so yeah. maybe not. [21:55:35] ok mforns the local connections file does work [21:55:51] heya [21:55:56] you just can't see them in the ui [21:55:59] or via connections list [21:56:08] sudo -u analytics airflow-analytics-test connections get analytics-test-hive [21:56:24] you should be able to refer to the connection by name [21:56:29] in your dag though [21:57:32] ok [21:58:41] https://github.com/apache/airflow/issues/17852 [21:58:41] and [21:58:42] ottomata: can you see the UI now though, it says the scheduler does not appear to be running [21:58:44] https://github.com/apache/airflow/issues/10867 [21:58:47] oh [21:59:12]