[10:56:37] Lunch
[11:28:40] lunch
[11:51:35] o/ interesting: duplicate-finder-maven-plugin lists dependencies that are not part of the project according to dependency:tree
[11:52:24] That's weird!
[11:53:09] Different scopes?
[11:55:11] I exported the full dependency:tree with -DoutputFile (without a scope filter -> all scopes); then I started looking for the dependencies listed by the duplicate-finder
[11:55:47] duplicate-finder lists compile-scope duplicates and test-scope duplicates separately.
[12:09:47] On the rdf project?
[12:10:04] I'll have a look later
[13:14:39] Any ideas why our elasticsearch node is taking >5 mins between loading the "wikimedia extra" plugin and starting discovery? Is this a common thing?
[13:15:44] some log lines where, from our limited point of observability, nothing seems to happen for ages: https://www.irccloud.com/pastebin/bDmbayNL/
[13:15:52] tarrow: the context is an elasticsearch instance restarting, and in the logs you can see it load the wikimedia extra plugin, and then you have to wait 5 minutes to see the next line?
[13:16:05] And that paste answers my questions.
[13:16:55] yes!
[13:17:59] Not sure if this is super important to us, but since the cluster is yellow the whole time this is happening, and it then rotates through all the nodes, it gives us a long wait whenever we slightly perturb the system and want to see a node restarted
[13:18:25] I really don't know and I find this suspicious. You might want to take a few thread dumps to see what's going on at that time, it might give a clue
[13:19:29] example of how to take thread dumps: https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Further_analysis (that's for Blazegraph, but should be easy to adapt to Elasticsearch)
[13:20:38] Ah, perfect; I was just about to ask how to do that :)
[13:21:54] if my image doesn't have `jcmd`, is `kill -3` also suitable? (nicked from https://www.baeldung.com/java-thread-dump#1-kill--3-command-linuxunix)
[13:22:47] might work, I'm always using jstack or jcmd
[13:24:15] I can of course bake some diagnostics into the image if it doesn't work; I'd just rather keep the images as simple and light as I can
[13:38:02] tarrow: increasing log verbosity might help; the time between these two steps is ~5 sec on our machines
[13:38:21] ok! wow! that is mega different
[13:38:27] thanks for the info
[13:38:54] I suspect network-related things but can't be sure
[13:42:29] it's also looking up data paths right after loading the plugins, so perhaps disk related too...
[14:18:02] o/
[14:48:31] quick errand
[15:12:39] wdqs1010 failed its reload again, this time due to a network partition
[15:12:51] starting over with --reuse-munge
[15:13:58] the connection between the cumin host and wdqs was broken, apparently
[15:14:11] sigh... does this kill the process?
[15:14:41] we can try to continue by hand
[15:52:12] \o
[15:52:51] sounds like a bad match if the network connection has to stay open... maybe we could do something janky with a screen/tmux window to guarantee it stays open, but dunno if it's worth the effort
[15:56:43] o/ ebernhardson: Thanks for your input; once I sorted the person_info column, the test runs again. It still feels slightly wrong to change a test in that way, but let's hope that we don't rely on DFs being sort-agnostic.
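The fix mentioned at 15:56:43 boils down to making the test independent of whatever element order Spark happens to produce. As a rough illustration only (the real test lives in the rdf repo and may look nothing like this; the person_info name is taken from the chat, everything else is invented), here is a PySpark sketch where the array built by collect_list is sorted before asserting on it:

```python
# Hedged sketch (hypothetical data and names): collect_list gives no ordering
# guarantee, so a test comparing such an array column should sort it first.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()

df = spark.createDataFrame(
    [("p1", "born 1970"), ("p1", "author"), ("p2", "painter")],
    ["id", "person_info"],
)

# sort_array makes the collected array deterministic regardless of task scheduling
aggregated = df.groupBy("id").agg(
    F.sort_array(F.collect_list("person_info")).alias("person_info")
)

rows = {r["id"]: r["person_info"] for r in aggregated.collect()}
assert rows["p1"] == ["author", "born 1970"]  # stable across executor counts / input splits
```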
[15:58:24] gehel: https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/882691 - tests work now, but duplicate-finder is on the rampage: mvn -pl rdf-spark-tools duplicate-finder:check@duplicate-classes-check
[15:59:41] pfischer: i'm not familiar with the code that uses those bits, but we certainly shouldn't be depending on the order of collected arrays in spark anywhere; the exact ordering is arbitrary and probably varies depending on things like the number of executors, the size of the input data, etc.
[16:00:20] i imagine anywhere that uses those arrays, if it wants an ordering, would have to sort it itself; it's a shame the assertion library we use doesn't handle that
[16:24:55] hmm, is search a data service i guess?
[16:28:02] I think so?
[16:28:29] i guess i never thought of it as a data service, but why not :)
[16:33:19] Search does not fit well in any of those buckets. Data service makes some sense
[17:00:44] workout, back in ~40
[17:03:25] going to branch the rdf repo for testing flink 1.16; I'll need images, and I'm not sure I want to build docker images out of snapshot jars...
[17:06:28] Do you need to branch? Or just create a release?
[17:07:30] I want a branch because I don't want to update the master branch with 1.16 while we still run 1.12 in prod
[17:07:45] so branch+release
[17:08:44] unless there's another option?
[17:13:19] ideally I think we should completely decorrelate the release & deploy process of the various artifacts in the rdf repo...
[17:14:47] network is terrible this morning... checked a speedtest since the meeting was occasionally stalling. 3Mbit :S
[17:36:07] back
[17:36:31] was some wifi wonkiness; turning it off and back on restored 150Mbit :S
[17:41:26] it's a cliche because it works ;)
[18:05:40] dinner
[18:19:00] lunch, back in time for pairing
[18:34:57] hmm, been pondering if there is a way to fit our AutoSizeSpark handling for mjolnir (which scales the spark tasks based on the size of input, so that we don't have to scale everything to enwiki size) in a different way, but not coming up with anything. I guess I will make some sort of AutoSizingSparkSubmitOperator
[19:13:35] dcausse: no, creating a branch makes sense. The other option is to stay on the main branch and create a branch for 1.12 if we ever need to
[19:13:50] delay the cost of branching (not that it is a major cost)
[19:23:48] back
[19:29:11] * inflatador really needs to get some matte displays or filters. Why is everything so reflective?
[19:58:41] i get the same problem... this time of year i mostly keep my curtains drawn because the sun comes in too early. By summer, though, it works well, as the sun streaming in the window at 4:30 or 5 reminds me to finish up
[20:19:20] ebernhardson do you have any preferences as far as disk size for the new airflow VM? I have 8 GB vRAM/4 vCPU, let us know if that's OK
[20:19:30] current airflow VMs have 100 GB disk, ~11G used
[20:20:35] inflatador: hmm, more disk would be nice. the current instance often runs out of disk
[20:20:52] inflatador: i only see 42G available on the instance though
[20:21:08] ebernhardson I'm looking at an-airflow1004
[20:21:31] inflatador: ahh, we have an-airflow1001 which has 42G; if the new instances have 100G, that makes sense for us too
[20:22:19] ACK, will link you the VM request phab ticket when it's ready; you can always edit there if more is needed
[20:22:43] otherwise 8g mem/4 cores seems normal and fine
[20:25:58] ebernhardson one more question, does 'internal IP - analytics vlan' sound correct for the network connection?
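Picking up the AutoSizingSparkSubmitOperator idea from 18:34:57: the gist is choosing Spark resources at run time from the size of the input instead of sizing every job for the largest wiki. The sketch below is a loose illustration, not the actual mjolnir code; every name is hypothetical, and the size lookup and the spark-submit call are left as injected callables rather than tied to any real Airflow provider API.

```python
# Hedged sketch of an auto-sizing Spark submit operator (all names hypothetical).
from airflow.models import BaseOperator


def executors_for(input_bytes: int, bytes_per_executor: int = 8 * 1024**3,
                  min_executors: int = 2, max_executors: int = 64) -> int:
    """Scale executor count linearly with input size, clamped to sane bounds."""
    wanted = -(-input_bytes // bytes_per_executor)  # ceiling division
    return max(min_executors, min(max_executors, wanted))


class AutoSizingSparkSubmitOperator(BaseOperator):
    """Measures the input path when the task runs, then submits the Spark job
    with a num_executors derived from that size (submission left abstract here)."""

    def __init__(self, *, application: str, input_path: str, size_fn, submit_fn, **kwargs):
        super().__init__(**kwargs)
        self.application = application
        self.input_path = input_path
        self.size_fn = size_fn      # e.g. a callable that sums HDFS file sizes
        self.submit_fn = submit_fn  # e.g. a callable wrapping spark-submit

    def execute(self, context):
        input_bytes = self.size_fn(self.input_path)
        num_executors = executors_for(input_bytes)
        self.log.info("Sizing %s to %d executors for %d input bytes",
                      self.application, num_executors, input_bytes)
        self.submit_fn(self.application, num_executors=num_executors)
```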
[20:26:08] ref https://phabricator.wikimedia.org/T314319
[20:26:35] inflatador: yup, it will primarily be talking to things in the analytics vlan.
[20:26:48] that's also, in theory, what the an- prefix in the hostname means
[20:27:32] probably best to stop right there when it comes to shortening 'analytics' ;P
[20:27:52] an - Analytics Network
[20:28:16] but yeah, any shorter would be hard to interpret :)
[20:35:35] quick break, back in ~15-20
[21:18:13] sorry, been back
[21:54:55] hmm, what would you call accessing a nested dictionary with a dotted str? like 'a.b.c' resolves into data['a']['b']['c']. I was calling it a path, but that kinda conflicts with the literal file paths being passed in the same constructor
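For the naming question in the last message, "key path" or "dotted key" would avoid colliding with filesystem paths in the same constructor. A small self-contained sketch of the mechanism being described (the function name and separator argument are made up for illustration):

```python
# Hedged sketch: resolve a dotted key path like 'a.b.c' into data['a']['b']['c'].
from typing import Any, Mapping


def resolve_key_path(data: Mapping[str, Any], key_path: str, sep: str = ".") -> Any:
    """Walk nested mappings following a dotted key path, e.g. 'a.b.c'."""
    current: Any = data
    for key in key_path.split(sep):
        current = current[key]  # raises KeyError/TypeError on a missing segment
    return current


data = {"a": {"b": {"c": 42}}}
assert resolve_key_path(data, "a.b.c") == 42
```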