[02:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[04:04:51] (03CR) 10Sharvaniharan: "Hi Otto and Jason.. please review this new schema we are creating to track toolbar customization. This is not a migration." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/747226 (owner: 10Sharvaniharan)
[06:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[09:18:13] (03CR) 10Joal: Update refine netflow_augment transform function (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/747561 (https://phabricator.wikimedia.org/T263277) (owner: 10Joal)
[09:18:45] (03PS2) 10Joal: Update refine netflow_augment transform function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/747561 (https://phabricator.wikimedia.org/T263277)
[10:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[11:01:19] !log upgrading druid on the test cluster with new packages to test log4j changes.
[11:01:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:01:32] !log btullis@an-test-druid1001:~$ sudo apt-get install druid-broker druid-common druid-coordinator druid-historical druid-middlemanager druid-overlord
[11:01:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:01:49] o/
[11:02:24] out of curiosity was there a change in the kafka-main cluster in codfw yesterday?
[11:03:12] seeing some unusual failures in flink running in codfw ("Timeout expired after 60000milliseconds while awaiting InitProducerId")
[11:03:24] dcausse: o/ I took down kafka-main2003 for firmware upgrades, and I am upgrading the os now :)
[11:04:37] elukey: thanks! seems unlikely that one host down could cause this, will investigate a bit more
[11:05:56] dcausse: I can give you more precise timings, but if the timeouts happened today too it may be the root cause
[11:06:33] checking logstash for the first flink failure
[11:18:50] first occurrence of the error is yesterday 2021-12-15T16:17:22 and it has caused 6 failures since (without user-facing consequences, just an alert flapping)
[11:21:20] dcausse: https://sal.toolforge.org/log/rrPdvn0B8Fs0LHO5syZL
[11:21:49] hm seems close
[11:22:13] I have executed the command a little later
[11:22:17] so it matches in my opinion
[11:22:30] what kind of failures are we talking about?
[11:22:51] I need to reimage 2 more main nodes in codfw, and 3 in eqiad (probably next year)
[11:22:59] and also upgrade their firmware etc..
[11:23:14] if you want I can ping you when the work happens
[11:24:19] I'm still trying to understand why that could happen
[11:24:33] perhaps it's just a matter of increasing some timeouts
[11:25:04] or to shorten them, the above seems 60s right?
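A note on the 60s figure discussed here: Flink's exactly-once Kafka sink uses a transactional producer, and the InitProducerId request it sends during startup must be answered by the transaction coordinator broker, so a single broker going down for maintenance can stall initialization until the timeout fires. A minimal sketch of that handshake, using the confluent-kafka Python client rather than Flink's Java producer; the broker host and transactional id are invented for illustration:

```python
# Sketch only: shows where a 60s InitProducerId timeout can come from.
# Flink itself uses the Java Kafka client; this reproduces the same
# handshake with confluent-kafka. Host and id are hypothetical.
from confluent_kafka import Producer, KafkaException

producer = Producer({
    'bootstrap.servers': 'kafka-main2001.codfw.wmnet:9092',  # assumed host
    'transactional.id': 'flink-example-sink',
    'transaction.timeout.ms': 60000,  # matches the 60s seen in the error
})

try:
    # Sends the InitProducerId request; blocks until the transaction
    # coordinator answers or the timeout (seconds here) expires.
    producer.init_transactions(60.0)
except KafkaException as exc:
    print(f'InitProducerId failed; coordinator likely unreachable: {exc}')
```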
[11:25:13] yes
[11:25:16] it smells like dangling tcp connection
[11:25:23] ah perhaps
[11:29:01] I'll open a task to keep track of things but does not seem like a very urgent thing to investigate
[11:30:50] ack!
[13:28:56] (03PS3) 10Joal: Update structured_data dumps parsing job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/747508 (https://phabricator.wikimedia.org/T258834)
[13:42:16] (03CR) 10Joal: [V: 03+1] "Successfully tested on both jobs!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/747508 (https://phabricator.wikimedia.org/T258834) (owner: 10Joal)
[13:55:43] 10Data-Engineering, 10Data-Engineering-Kanban, 10User-razzi: Apache atlas build fails due to expired certificate (https://maven.restlet.com) - https://phabricator.wikimedia.org/T297841 (10Ottomata) You might be able to workaround this by manually downloading and then uploading this dependency to our archiva,...
[14:09:31] 10Data-Engineering, 10Data-Engineering-Kanban, 10User-razzi: Apache atlas build fails due to expired certificate (https://maven.restlet.com) - https://phabricator.wikimedia.org/T297841 (10elukey) @razzi I suggest to file a pull request (if you have a fix) and ask a feedback in their dev@ mailing list, it is...
[14:16:39] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Product-Analytics, and 5 others: Create a Commons equivalent of the wikidata_entity table in the Data Lake - https://phabricator.wikimedia.org/T258834 (10JAllemandou) Code is ready: - Import `commons-mediainfo` json dumps to HDFS (https://gerri...
[14:18:33] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (10JAllemandou)
[14:19:03] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (10JAllemandou) QA done on samples of multiple days, no missing data.
[14:25:44] ottomata: o/ yt - I'm having some issues with this Druid rebuild. I got some debs built and deployed this morning to test, but in the end they still used log4j 2.8.
[14:25:55] oh
[14:26:00] that's weird
[14:26:21] Have you got time to look at it with me, by any chance?
[14:26:26] yeah am looking in your home on codfw now
[14:26:56] btullis: in /home/btullis/wmf/druid/lib i see log4j-core-2.8.2.jar
[14:27:00] I've just done a `git reset --hard origin/debian` in wmf/druid to get back.
[14:27:08] oh
[14:29:00] ok in the /var/cache/pbuilder/result/buster-amd64/druid-common_0.19.0-2_all.deb i do see ./usr/share/druid/lib/log4j-core-2.8.2.jar
[14:29:02] Originally I did a merge from master to debian branch, then a `dch -i` to update the changelog, then updated the debian/source/include-binaries, then committed, then gbp buildpackage
[14:29:35] https://www.irccloud.com/pastebin/3bt0Ya31/
[14:30:56] This is what I'm seeing when I build. The log4j2-2.8.2 jars are still in the druid_0.19.0.orig.tar.gz file. At build time they are extracted and end up in the build environment, but they are no longer mentioned in `debian/source/include-binaries` so it fails.
[14:31:09] hmm btullis i think gbp is annoyed with your manual changes to the master git history errmmm yeah
[14:31:43] btullis: is the tag for 0.19.0 still the same on the master branch?
[14:32:02] it needs to be bumped to the last commit in theory
[14:32:37] ok yeah
[14:32:44] this is a source change, not just a debian packaging change
[14:32:48] so you need to bump the source version
[14:32:53] maybe 0.19.0-wmf0
[14:32:54] or something
[14:33:07] i think you'll want to make that a tag
[14:33:15] from your current master with your changes
[14:33:30] that way the orig.tar.gz file will be created from your tag with your changes
[14:34:50] * btullis not ... quite ... following ... sorry.
[14:34:55] joal: not sure you saw, but 0.1.23 is up, but I forgot to sync it to hdfs, do you need that?
[14:35:10] https://www.irccloud.com/pastebin/UrCiuf03/
[14:35:42] milimetric: heya - I think it'll be needed yes - I can take care of it if you wish
[14:35:46] joal: I got it
[14:35:48] This is what the current master looks like. I can hard reset it to origin/master if that makes it easier. I'm not sure where those commits came from.
[14:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[14:36:34] > bump the source version
[14:36:34] What does this bit mean?
[14:38:41] btullis: that one is the master branch right?
[14:38:44] err the debian one
[14:39:20] It was the master branch I just pasted above. I've just reset --hard to origin/master
[14:39:54] btullis: bump the source version meaning in changelog
[14:39:58] via dch if you like
[14:40:07] gbp.conf is telling git-buildpackage what to do
[14:40:17] upstream-tree=tag
[14:40:17] upstream-branch=master
[14:40:29] and it uses the debian version to figure out the tag to use
[14:40:30] yeah but I don't get why there is "Merge branch 'master' into debian" in the master branch
[14:41:14] ok I get it
[14:41:14] https://gerrit.wikimedia.org/r/c/operations/debs/druid/+/747499
[14:41:23] this one should have been filed for the "debian" branch
[14:41:29] so, you want master a tag that is the version that is specified in changelog
[14:41:33] sorry
[14:41:37] so, you want a tag that is the version that is specified in changelog
[14:41:48] orig.tar.gz will be built from the tag
[14:42:03] (not sure if it matters where master is at, but to be safe probably you want it to be the same as the tag)
[14:42:10] OHHHH
[14:42:17] that is a gotcha for sure!
[14:42:29] btullis: not sure if you use git review to send patches to gerrit
[14:42:42] but, if you do, you need to tell it which branch to use if it is not master
[14:42:43] Yes I do.
[14:42:44] so instead of just git review
[14:42:46] git review debian
[14:43:13] mforns: btw i can work on stuff with ya if you want to!
[14:43:14] Ah, right. I see.
[14:43:36] btullis: I think that the master branch needs to be reset to 0303bec31206e08f9525b08d606c38e52ccc83a0, and then force-pushed (to delete the last commit)
[14:43:57] or something similar
[14:44:02] then we can add the new tag
[14:44:10] and finally merge to debian etc..
[14:44:28] in theory with a brutal git push -f it should be easily fixable :D
[14:45:25] elukey: Great. I will reset that now. I always tend to use `--force-with-lease` for safety, if that's possible.
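To unpack the gbp.conf lines quoted above: with upstream-tree=tag, git-buildpackage derives the expected upstream tag name from the version in debian/changelog, so the changelog and the tag have to agree. A toy sketch of that derivation (illustrative only; real gbp supports configurable tag templates, and the changelog line here is made up):

```python
# Toy illustration of what upstream-tree=tag implies: gbp reads the
# version from debian/changelog and looks for a matching upstream tag;
# orig.tar.gz is then generated from that tag.
import re

changelog_first_line = 'druid (0.19.wmf0-1) buster-wikimedia; urgency=medium'

version = re.search(r'\(([^)]+)\)', changelog_first_line).group(1)
upstream_version = version.rsplit('-', 1)[0]  # drop the debian revision
print(f'expects tag: upstream/{upstream_version}')
# -> expects tag: upstream/0.19.wmf0
# A stale or mis-named tag therefore silently ships the old source tree
# (log4j 2.8.2 included), which is exactly the failure seen in this log.
```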
[14:45:39] ahhh yes yes anything that you prefer
[14:47:15] heya ottomata :] I'm also here
[14:49:49] (03PS4) 10Joal: Update structured_data dumps parsing job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/747508 (https://phabricator.wikimedia.org/T258834)
[14:51:34] joal / mforns: the artifacts were synced to hdfs
[14:51:49] thank you milimetric :]
[14:52:05] ottomata: I pushed the code for workflow_utils: https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/2
[14:52:40] thanks milimetric :)
[14:53:21] nice
[14:53:23] looking
[15:03:00] OK, operations/debs/druid master branch has been force-pushed back to 0303bec31206e08f9525b08d606c38e52ccc83a0
[15:03:06] mforns: reviewed, only one change for now about making an airflow submodule
[15:03:39] oh ottomata I forgot to implement the changes we spoke about in the airflow sync meeting: passing the refinery_* jar paths as parameters
[15:03:59] ottomata: yea, airflow submodule makes sense
[15:04:13] k, changing
[15:04:33] yeah, also probably shouldn't use 'current'
[15:04:37] well no
[15:04:38] current is ok
[15:04:41] but not versionless jar
[15:07:24] Here is the newly merged debian branch - pushed using `git review debian` this time: https://gerrit.wikimedia.org/r/c/operations/debs/druid/+/747863
[15:11:01] 10Analytics-Clusters, 10Data-Engineering, 10Data-Engineering-Kanban, 10Cassandra, and 2 others: Switch over the Cassandra AQS cluster to the new hosts - https://phabricator.wikimedia.org/T297803 (10BTullis) With the above change the server shows up in `confctl` and is available to be pooled. ` btullis@pupp...
[15:15:33] mforns: feel free to merge whenever and we can work on getting it deployed to the test cluster
[15:15:44] ottomata: cool
[15:49:20] I think that I've sorted out the git branches, but I'm still getting some errors when building on deneb.
[15:49:30] https://www.irccloud.com/pastebin/0gbgYmqc/
[15:50:14] Do I need to remove these files from the source tarball manually?
[15:50:38] btullis: your changelog version is still 0.19.0
[15:50:47] which matches the tag upstream/0.19.0
[15:51:04] so i think it is using that tag to create the orig.tar.gz file
[15:51:11] and then comparing that to what is in your debian
[15:51:21] so
[15:51:27] change your changelog version to a new one
[15:51:30] since this is basically a fork
[15:51:35] 0.19.0-wmf0-1 maybe
[15:51:37] then
[15:51:39] try
[15:52:13] hmm
[15:52:28] My orig.tar.gz is just a symlink to the downloaded file.
[15:52:30] not sure if the tag needs upstream/ in front
[15:53:01] hm pretty sure gbp will create one if it doesn't exist
[15:53:05] and if you change the version number in changelog
[15:53:09] it will think it doesn't exist
[15:53:14] I deleted and recreated the upstream/0.19.0 tag.
[15:53:17] https://www.irccloud.com/pastebin/RM2ZnJFC/
[15:53:23] hmm
[15:53:25] ottomata: I updated the directory structure, and took care of all the comments I think
[15:53:26] btullis: better to make a new version
[15:53:34] OK, will do. Thanks.
[15:53:47] whenever you make a source change, a new version is kind of required
[15:54:09] if you are just changing debian packaging bits, a debian version bump is fine, e.g. the -2 bit
[15:55:20] mforns: merging
[15:55:20] OK, got it.
[15:55:27] btullis: ya?
[15:55:52] I reckon.
[15:55:53] oh got it == understand, not fixed :)
[15:56:07] Yes, I reckon I understand.
[15:56:16] i reckon you do!
[15:56:17] :)
[15:56:19] I could use some help with jupyter / pyhive / file paths. Trying to do something simple and getting lost
[15:56:54] if I do "LOAD DATA LOCAL INPATH "file:///tmp/dan.csv" OVERWRITE INTO TABLE milimetric.test_load_csv" in hive, it works fine
[15:57:21] but if I run a jupyter notebook on the same server, and issue that command through pyhive, from the notebook, it says it's an invalid path
[15:57:47] OperationalError: TExecuteStatementResp(status=TStatus(statusCode=3, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error while compiling statement: FAILED: SemanticException Line 1:23 Invalid path \'"file:///tmp/dan.csv"\': No files matching path file:/tmp/dan.csv:28:27', 'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:380'
[15:58:09] hmm
[15:58:33] dan just to try something, what if you put that file in hdfs and try
[15:58:40] e.g. hdfs:///user/dan/dan.csv
[15:58:40] ?
[15:58:52] trying (doesn't it need the namenode and other stuff there?)
[15:58:54] or ha hdfs:///user/milimetric/dan.csv
[15:58:56] milimetric: no
[15:59:05] it picks up default from hdfs-site.xml if not provided
[15:59:16] ah cool
[16:01:11] ok ottomata yep, that works. So it's just the LOCAL part... hm...
[16:01:29] milimetric: which stat box you running on?
[16:01:35] I thought "local" in this case meant the server you started the jupyter notebook on, is it my local machine for some reason?
[16:01:38] stat1008?
[16:01:40] yea
[16:02:18] ottomata: the companion dag change is: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/5
[16:02:19] it's got +r for everyone, and it's in /tmp/dan.csv
[16:02:37] milimetric: maybe
[16:02:37] https://community.cloudera.com/t5/Support-Questions/Load-data-local-inpath-says-quot-invalid-path-quot-when-my/td-p/8704
[16:02:42] When using the JDBC driver, the command executes on the HiveServer2 side. The file is evaluated to locally exist on the server, which is not true in your case (it exists on the local client program machine).
[16:03:18] mforns: do you need both refinery-hive and refinery-job jars?
[16:03:26] maybe not?
[16:03:27] yes
[16:03:29] refinery-job is no longer shaded
[16:03:33] refinery-job-shaded is
[16:03:40] which includes refinery-hive
[16:03:41] buuuut
[16:03:45] your way is probably better
[16:03:50] if it works
[16:03:51] 2 smaller jars
[16:03:55] rather than one big huge jar with everything
[16:04:02] they might be different versions too
[16:04:19] mforns:
[16:04:22] it is analytics dags
[16:04:26] not data_engineering/dags
[16:04:30] these are analytics airflow instances
[16:04:39] ??
[16:04:53] the instances are named 'analytics'
[16:04:55] we said we'd give team names to folders no?
[16:04:57] they are for doing analytics
[16:05:07] yes... but we've had analytics since june?
[16:05:36] hm, wish I could do like "LOCAL LS"
[16:05:40] other teams will also do analytics, but won't use our folder no?
[16:05:55] they might? not sure, do we want product-analytics to manage their own airflow instance?
[16:06:01] but, you are probably right
[16:06:07] however, the username of the instance is 'analytics'
[16:06:24] mforns: we can revisit this later, but for now it won't work if you rename the directory
[16:06:27] we have to repuppetize
[16:06:28] I think it would be cool that PA has their own instance
[16:06:34] I see
[16:07:47] aside tangent: i know team names are convenient and we don't have a better way of differentiating this, but they are artificial and change! I really wish we could do this by some kind of functional grouping, rather than human grouping
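Given the diagnosis above (HiveServer2 resolves LOCAL paths on its own host, while HDFS paths work from any client), a minimal sketch of the notebook-side workaround: copy the file to HDFS, then issue a non-LOCAL LOAD DATA through pyhive. The HiveServer2 hostname and Kerberos setup are assumptions here, and note that LOAD DATA INPATH moves the file into the table's storage location:

```python
# Minimal sketch, assuming an authenticated session and an assumed
# HiveServer2 host. The HDFS copy uses pyarrow via fsspec; pyarrow reads
# fs.defaultFS from the Hadoop config, so no namenode is spelled out.
import fsspec
from pyhive import hive

# Copy the local CSV into HDFS, where HiveServer2 can see it.
fs = fsspec.filesystem('hdfs')
fs.put('/tmp/dan.csv', '/tmp/dan.csv')  # local path -> HDFS path

conn = hive.connect(host='an-coord-example.eqiad.wmnet', port=10000)
cursor = conn.cursor()
# No LOCAL keyword: the path is now resolved against HDFS, which both
# the notebook host and HiveServer2 share.
cursor.execute(
    'LOAD DATA INPATH "hdfs:///tmp/dan.csv" '
    'OVERWRITE INTO TABLE milimetric.test_load_csv'
)
```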
[16:10:25] ottomata: agree that team names are not ideal
[16:11:34] ottomata: the directory split should match the instance split
[16:11:39] do you agree?
[16:14:40] mforns: ?
[16:15:16] you mean a dir in airflow-dags for each airflow instance?
[16:15:50] ya
[16:15:51] i agree!
[16:19:23] ottomata: is it nasty to subprocess -> hdfs dfs -put {path} /tmp/{path} and then change the local load into an HDFS load?
[16:19:47] (this would be a utility function in wmfdata to load csv files into hive tables)
[16:20:34] milimetric: no that sounds fine
[16:20:39] but maybe
[16:20:43] instead of subprocess
[16:22:02] you could use pyarrow via fsspec (or just pyarrow)
[16:22:11] so for just pyarrow https://arrow.apache.org/docs/python/filesystems.html#hadoop-distributed-file-system-hdfs
[16:22:12] but
[16:22:19] fsspec has a nicer file-like interface
[16:22:34] https://filesystem-spec.readthedocs.io/en/latest/usage.html
[16:22:39] so you could do
[16:24:49] with open('/tmp/path') as input:
[16:24:50] with fsspec.open('hdfs:///tmp/path') as output:
[16:24:50] output.write(input)
[16:24:54] or something like that
[16:24:59] (or just shell out, whatever :))
[16:28:45] ottomata: do you think all our jobs would fit into 'analytics'?
[16:28:55] 10Analytics-Radar, 10wmfdata-python, 10Product-Analytics (Kanban): Create a wmfdata-python test script - https://phabricator.wikimedia.org/T247261 (10nshahquinn-wmf) 05Open→03Resolved Okay, @Milimetric tested and approved, so I've gone ahead and merged it. This is done!
[16:29:03] mforns: if they don't we could make another instance?
[16:29:06] but
[16:29:19] i'm not opposed to making a new one named after our team
[16:29:26] but it will take some work
[16:34:19] ottomata: there's also data access, no? Teams who use a given instance will generate datasets under a given user. And they will be able to read the data, because they belong to a certain team no?
[16:34:48] ottomata: I'm also not opposed to use analytics, just trying to gather all the implications
[16:36:00] mforns: yeah, it's a problem
[16:36:03] ideally it wouldn't be
[16:36:04] but for now it is
[16:36:10] my aside tangent was an aside tangent
[16:36:18] for right now, our team uses an 'analytics' user
[16:36:21] hehe, no but it's true
[16:37:03] ok, that is consistent, user name = folder name = instance name
[16:37:07] hm, and for tangent: i suppose there isn't anything stopping multiple teams from being able to use the same system user
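The inline fsspec sketch above is close but would not run as written: both ends need binary mode, and the bytes have to be read explicitly before writing. A corrected version, keeping the same paths from the log:

```python
# Corrected version of the sketch above: open both files in binary mode
# and copy bytes explicitly (a bare `output.write(input)` would try to
# write the file object itself, which fails).
import fsspec

with open('/tmp/path', 'rb') as input:
    with fsspec.open('hdfs:///tmp/path', 'wb') as output:
        output.write(input.read())
```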
[16:37:15] Gah! I'm still struggling with this. Latest error is:
[16:37:19] https://www.irccloud.com/pastebin/o1kw75Oi/
[16:37:43] ok ottomata changing to analytics
[16:38:36] These are the files present in the directory above;
[16:38:40] https://www.irccloud.com/pastebin/L0ld0wQg/
[16:39:41] I've tried renaming the downloaded file to apache-druid-0.19.wmf0-bin.tar.gz
[16:39:46] btullis: i think your chosen debian version name might not be a valid version
[16:39:49] i think you want something like
[16:40:52] ah maybe that is my fault
[16:40:52] ok
[16:40:53] yeah
[16:40:55] something like
[16:40:59] 10Analytics-Radar, 10Anti-Harassment, 10CheckUser, 10Privacy Engineering, and 4 others: Deal with Google Chrome User-Agent deprecation - https://phabricator.wikimedia.org/T242825 (10DAbad)
[16:41:42] hmm actually not sure
[16:41:42] Oh I see, you suggested `0.19.0-wmf0-1` - I tried `0.19.wmf0-1`
[16:41:55] reading
[16:41:55] https://serverfault.com/questions/604541/debian-packages-version-convention
[16:42:04] you want to change the upstream version
[16:42:04] The upstream_version may contain only alphanumerics[36] and the characters "." (full stop), "+" (plus), "-" (hyphen), ":" (colon), "~" (tilde) and should start with a digit. If there is no debian_revision then hyphens are not allowed; if there is no epoch then colons are not allowed.
[16:42:38] ok
[16:42:45] you want the tag to ONLY match the upstream version
[16:42:47] in t
[16:42:52] in changelog
[16:42:56] that is the source's version
[16:43:14] i think your .wmf0 would have worked
[16:43:19] but your tag is upstream/0.19.wmf0-1
[16:43:22] you want your tag to be
[16:43:23] upstream/0.19.wmf0
[16:43:26] without the debian revision
[16:43:36] which, it looks like is being done wrong too
[16:43:37] so
[16:43:40] make your tag
[16:43:52] 0.19.0.wmf0 (i think the . is fine)
[16:43:57] and your changelog version
[16:44:10] 0.19.wmf0~1
[16:44:13] maybe?
[16:44:37] oh no
[16:44:41] [-debian_revision]
[16:44:42] so
[16:44:44] yeah
[16:44:48] tag 0.19.0.wmf0
[16:44:56] and changelog
[16:44:57] 0.19.wmf0-1
[16:45:03] (which is what you have in your changelog)
[16:45:19] yeah btullis i think if you just make your tag be upstream/0.19.wmf0
[16:45:21] it'll work
[16:45:42] OK, trying that now. Moritz was suggesting that in future we don't bother with the dual branches for WMF packages. (He spotted the old log4j jars on the test druid server :-) )
[16:49:50] Looking hopeful. `gbp:info: Creating druid_0.19.wmf0.orig.tar.gz from 'upstream/0.19.wmf0'`
[16:50:32] dual branches?
[16:50:58] A master and a debian branch.
[16:51:16] Failed again.
[16:51:20] https://www.irccloud.com/pastebin/F0wK6l7N/
[16:52:24] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10Cmjohnson) a:05Cmjohnson→03Ottomata @Ottomata Can you verify that this is using the correct partman recipe, the installer fails during the install at th...
[16:53:10] btullis: remove all your orig.tar.gz files
[16:53:15] let gbp create them
[16:53:41] or, did gbp create druid_0.19.wmf0.orig.tar.gz ?
[16:53:55] if so, it still has 2.8.2 in it!
[16:55:00] It did create that file. Previously there was only a symlink to the downloaded file. Trying again.
[16:56:32] Same result.
[16:57:34] hmmm
[16:57:39] so why is it creating that...
[16:58:30] btullis:
[16:58:36] i just checked out your upstream/0.19.wmf0 tag
[16:58:39] it has log4j-core-2.8.2.jar
[17:00:02] Gah! Deleting it and trying again. How the hell did it end up pointing to that commit. Sorry for wasting your time.
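The version-policy text quoted above is the crux of these failures: the last hyphen splits upstream_version from debian_revision, the upstream part must start with a digit, and the upstream tag must carry only that part. A rough self-contained check (simplified, ignores epochs; not dpkg's actual parser):

```python
# Rough check of the Debian version rule quoted above: digits first,
# limited character set, and hyphens allowed in upstream_version only
# when a debian_revision is present. Simplified; ignores epochs.
import re

def valid_debian_version(version):
    if '-' in version:
        upstream, revision = version.rsplit('-', 1)
        return (re.fullmatch(r'[0-9][A-Za-z0-9.+~-]*', upstream) is not None
                and re.fullmatch(r'[A-Za-z0-9+.~]+', revision) is not None)
    return re.fullmatch(r'[0-9][A-Za-z0-9.+~]*', version) is not None

for v in ('0.19.0', '0.19.wmf0-1', '0.19.wmf0~1', 'wmf-0.19.0'):
    print(v, valid_debian_version(v))
# 'wmf-0.19.0' fails: its upstream part 'wmf' doesn't start with a digit,
# the sort of mismatch behind the "not a valid version" error above.
```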
[17:01:28] 10Analytics, 10Analytics-Kanban, 10Data-Engineering-Kanban, 10wmfdata-python, 10Product-Analytics (Kanban): wmfdata-python's Hive query output includes logspam - https://phabricator.wikimedia.org/T275233 (10Milimetric) a:05nshahquinn-wmf→03Milimetric
[17:01:48] np man!
[17:01:53] this stuff sucks
[17:02:02] really, if i were re-doing this i'd just forget gbp
[17:02:08] and just manually create the debian build tree from the dist
[17:02:10] in a script
[17:02:15] a-team: planning time
[17:04:31] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10Ottomata) a:05Ottomata→03BTullis I'm not familiar with what is going on with this node atm, pinging @btullis!
[17:09:33] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10BTullis) I'm happy to look at it. It's likely that I've set the wrong partman recipe, so sincere apologies if I've wasted your time. I'll look at it asap.
[17:14:18] (03CR) 10Joal: [C: 04-1] "This patch makes gobblin fail due to:" [analytics/gobblin-wmf] - 10https://gerrit.wikimedia.org/r/730282 (owner: 10ODimitrijevic)
[17:20:41] (03PS3) 10Sharvaniharan: Android MEP schema for customizing toolbar [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/747226 (https://phabricator.wikimedia.org/T297818)
[18:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[19:55:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[20:15:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[20:27:01] (03CR) 10DLynch: Add new EditAttemptStep integrations for mobile apps (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/747205 (owner: 10DLynch)
[20:51:43] heya ottomata :] I've tested the airflow job with the development instance using the latest code, and putting the workflow_utils lib in the PYTHONPATH. It worked! However I had to put the config.py file in the PYTHONPATH manually as well. Where do you think that file can go, in order for it to be included in the path by airflow?
[20:52:31] Also, I had to use the shaded refinery-job jar instead of the regular one, since the latter does not contain LogHelper (which is in core)
[20:52:51] (my previous tests were with the non-rebased old patch)
[21:03:30] ah
[21:03:38] makes sense, you don't need refinery-hive then either
[21:03:42] hmm config.py
[21:04:14] mforns: import config?
[21:04:17] how does that work?
[21:07:31] hey team! we're going thru onboarding steps for brian king (our new senior search SRE). in typical fashion this has made me realize some permissions i'm missing, so I made a ticket to request `analytics-privatedata-users` access w/ kerberos for myself: https://phabricator.wikimedia.org/T297908
[21:07:47] could i get a sign-off from someone with the power to rubber-stamp my request? :D (no rush, just wanted to mention here)
[21:08:15] done ryankemper, you just also need approval from your manager, and then if you like (since you are SRE) you can add yourself!
[21:08:19] we are happy to help of course
[21:08:53] ottomata: thanks <3...and then by extension, if you could sign off on the analytics side of brian's request here too: https://phabricator.wikimedia.org/T297910
[21:10:57] also done! :)
[21:11:15] and cool, will add myself once I get approval from g.ehel and will come ask questions here if I can't figure something out
[21:12:03] ottomata: yes, the code says import config.py
[21:12:13] but the place where the file is now is not in the PYTHONPATH
[21:12:45] is there any directory that is put in the pythonpath by airflow config?
[21:13:00] i guess the dags directory is?
[21:13:09] lemme see...
[21:13:15] ottomata: no it's not
[21:13:38] i should make a helper wrapper script to launch ipython in the airflow env :)
[21:13:40] then we could easily find out
[21:13:49] yea
[21:15:30] sorry ottomata, yes the dags directory is in the pythonpath
[21:16:17] oh it is?
[21:16:38] I think by default airflow puts the dags directory and the plugins directory in the pythonpath
[21:16:47] no?
[21:16:52] testing right now
[21:16:54] oh mforns
[21:16:55] The config folder: It is configured by setting AIRFLOW_HOME variable ({AIRFLOW_HOME}/config) by default.
[21:16:57] https://airflow.apache.org/docs/apache-airflow/stable/modules_management.html
[21:17:19] so
[21:17:28] we can make a config dir in the instance folder
[21:17:39] and then symlink that from airflow home
[21:17:45] that seems to be the right thing to do
[21:17:52] let me make that happen in puppet real quick
[21:18:14] isn't it 'conf' right now?
[21:18:42] yeah, lets change it
[21:18:45] since airflow uses 'config'
[21:18:47] k
[21:20:04] ottomata: will change airflow-dags to rename conf to config, and add the config.py file there, maybe change its name.. and also use the shaded jar
[21:20:51] hmm, mforns q
[21:20:55] yep
[21:20:56] is config.py specifically dag config?
[21:21:03] yes
[21:21:36] lets put it in dags
[21:21:44] ok
[21:21:54] maybe call it dag_config.py then
[21:22:10] ok
[21:28:40] ottomata: I think this should be ready: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/5
[21:30:44] (03CR) 10Sharvaniharan: [C: 03+1] "Looks good to me @DLynch :)" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/747205 (owner: 10DLynch)
[21:39:03] mforns: do you need refinery-hive jar?
[21:39:13] oh you have that in the parent DAG so you do?
[21:39:36] oh i see
[21:39:37] TODO
[21:39:38] ottomata: the hql query needs refinery-hive jar for a UDF
[21:39:39] okay :)
[21:39:45] # TODO: Use dependency management for jar paths.
[21:39:57] merged mforns
[21:40:12] I used the same shaded jar for both the RSVDAnomalyDetection spark app, plus the UDF
[21:40:12] okay so you need to update workflow_lib on test-client, ya?
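Per the Airflow modules-management doc linked above, the dags folder (plus {AIRFLOW_HOME}/config and the plugins folder) is placed on sys.path, which is what makes the dags/dag_config.py placement work. A hypothetical sketch of a DAG file picking it up; every name here (dag id, config attribute, jar path) is invented for illustration:

```python
# Hypothetical DAG file living in the instance's dags/ folder. Because
# Airflow adds the dags folder to sys.path, a sibling dag_config.py can
# be imported directly. All names below are invented.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

import dag_config  # resolves to dags/dag_config.py via sys.path

with DAG(
    dag_id='example_anomaly_detection',
    start_date=datetime(2021, 12, 1),
    schedule_interval='@daily',
) as dag:
    BashOperator(
        task_id='spark_submit',
        # e.g. dag_config.refinery_job_shaded_jar would hold the jar path
        # instead of hardcoding it in every DAG.
        bash_command=f'echo {dag_config.refinery_job_shaded_jar}',
    )
```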
[21:40:17] yesss
[21:40:26] annoyingly i have to make a deb
[21:40:32] hopefully this stuff won't have to change often
[21:40:46] oh, now that I think, the configs are pointing to prod hadoop
[21:41:02] ottomata: I think it will, at least our team will touch that repo a lot :(
[21:42:17] yeah
[21:42:24] i mean...maybe dags shouldn't go in workflow_utils?
[21:42:51] why not in airflow-dags/common
[21:43:03] and then symlink airflow-dags/analytics/common -> airflow-dags/common
[21:43:13] dags_common maybe
[21:57:04] but there's also custom operators
[21:57:08] and utils
[21:57:21] maybe we want to add plugins in the future?
[21:58:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[22:00:14] (03CR) 10Sharvaniharan: [C: 03+2] Add new EditAttemptStep integrations for mobile apps [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/747205 (owner: 10DLynch)
[22:01:11] (03Merged) 10jenkins-bot: Add new EditAttemptStep integrations for mobile apps [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/747205 (owner: 10DLynch)
[22:13:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[22:20:02] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:22:44] (03PS1) 10Sharvaniharan: MEP schema for IOS Notification Interaction [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/747967
[22:30:13] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[22:52:27] (03PS2) 10Sharvaniharan: MEP schema for IOS Notification Interaction [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/747967 (https://phabricator.wikimedia.org/T290920)
[23:07:43] mforns: sorry! hmm
[23:08:02] ok then not dags_common :)
[23:08:12] just common then :)
[23:08:50] ottomata: ok, will work on that tomorrow :]
[23:09:01] k :)
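For reference, the layout this last exchange converges on might look roughly like the sketch below. This is one reading of the proposal, not the merged structure, and all paths are illustrative:

```python
# Sketch of the proposed airflow-dags layout (assumption, not the final
# merged structure): shared code lives in airflow-dags/common and each
# instance folder symlinks to it.
from pathlib import Path

repo = Path('airflow-dags')
#   airflow-dags/common/            <- shared operators, utils, dag helpers
#   airflow-dags/analytics/dags/    <- per-instance dags (dag_config.py here)
#   airflow-dags/analytics/common   -> ../common (symlink)
link = repo / 'analytics' / 'common'
if not link.exists():
    link.symlink_to('../common')
```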