[07:56:30] Need to grab a prescription tomorrow morning, so will miss about the first half of retro if there isn't too much of a line
[09:55:49] I created a ticket to investigate what's up with wdqs1013 - T301953
[09:55:50] T301953: Investigate wdqs1013 stability issues - https://phabricator.wikimedia.org/T301953
[09:55:57] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=36&orgId=1&refresh=1m&var-cluster_name=wdqs
[09:56:24] based on this, wdqs1012 and wdqs1013 are polar opposites when it comes to stability, with wdqs1012 being the best performing host
[09:56:40] (in the active DC, that is)
[09:58:44] good morning all
[09:59:46] dcausse: when would you like us to continue?
[10:00:09] o/
[10:06:32] ejoseph: sure
[10:10:05] https://www.irccloud.com/pastebin/770cuTaw/
[11:19:47] ryankemper, inflatador: I've prepared a few patches for the elasticsearch 6.8 upgrade: https://gerrit.wikimedia.org/r/q/topic:%2522elastic68%2522+owner:glederrey%2540wikimedia.org
[11:40:14] lunch
[11:58:11] break
[14:11:19] greetings
[14:11:27] inflatador: o/
[14:11:34] :eyes on PRs
[14:12:09] inflatador: fairly trivial PRs, I was just checking that everything is ready for our pairing session tomorrow (I sent you an invite)
[14:12:34] I did create a few phab tasks as well, to track how we upgrade each cluster
[14:12:50] for tomorrow: T301954
[14:12:51] T301954: Upgrade deployment-prep to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301954
[14:15:40] o/
[14:23:24] reminder for everyone who forgot like me: the Open Hangout is now open: https://meet.google.com/ugw-nsih-qyw
[14:25:59] o/
[15:48:02] \o
[15:50:15] * ebernhardson chuckles at the sla mail. Always funny that airflow doesn't special case that
[16:00:21] yes sorry :/
[16:00:33] will be late for the retro
[16:03:36] ryankemper, ejoseph, dcausse: retro: https://meet.google.com/ssh-zegc-cyw
[16:10:45] I am finally able to register for Elastic training
[16:11:24] I had to join the Chicago timezone though, as mine is fully booked already
[16:52:05] ejoseph: glad you got registered! bummer about the timezone though
[17:58:30] mpham: do you have a notes doc for the WDQS Scaling meeting?
[17:58:50] Yes, I see https://etherpad.wikimedia.org/p/R5n382Ld0Vvykc7Ak3iH
[17:59:09] yeah, just created that
[18:47:09] * ebernhardson opens https://integration.wikimedia.org/zuul/ and then understands why CI isn't working
[19:31:08] * ebernhardson now remembers that for some reason i can't ssh over the wifi bridge
[19:33:31] just had to kill ipv6 support
[19:36:39] backfilling an hourly job from november 2021 is not going to work... will run this "backfill" by hand setting the partition to year=2021 instead...
[19:45:58] dcausse: curious, which part isn't working?
[19:46:07] you can answer tomorrow as well, it's late :)
[19:46:24] it's working but it's going to take weeks
[19:46:28] ahh
[19:48:02] the job is super fast but scheduling in yarn can add a couple minutes of delay
[19:48:37] fast is ~30sec
[19:49:30] dcausse: huh, in that case could we just increase airflow concurrency, let it schedule 10 in parallel?
[19:50:19] i feel like it shouldn't be too bad, i backfilled hourly query_clicks a week or two ago with a concurrency of maybe 5
[19:50:23] I've made this job "depends_past=true" on purpose but I guess in this backfill situation it's not a great choice :/
[19:50:29] oh, well then yes :)
[19:51:13] going to disable the sla, let it run for wcqs which has "only" 30 days to backfill and will run wdqs manually
[19:52:12] sounds reasonable
[19:52:38] sigh. i turned off ipv6 and now ssh works, but google.com can't be found :P
[19:52:51] using spark here was perhaps not a great idea either, it's 10 to 20 events per hour...
[19:52:59] ddg, bing same. looks like my browser didn't like turning off ipv6 :P
[19:53:40] dcausse: hmm, yea spark is a bunch of overhead for that. Perhaps we could think about skein more, but i don't know how much that really helps.
[19:54:09] yes, and how to access hive without spark from python
[19:54:50] hql -> tsv perhaps, but that does not sound great...
[19:54:54] reading parquet files from an exact hdfs path is straightforward with pyarrow, i believe research is doing things that way.
[19:55:02] ah
[19:55:48] my understanding is research is finding skein + pyarrow much more useful for their needs than spark
[19:56:03] interesting
[19:56:18] I guess you get a single worker but that's exactly what I need
[19:56:41] skein can actually do distributed compute, and unlike spark you can have as many worker types with different resources/setup/etc as you want
[19:57:20] i guess distributed compute is a bit much, it provides distributed primitives, you have to implement compute :)
[19:58:12] i suppose the thing is though, skein is still yarn, still going to have the spin-up delay
[20:01:10] yes... well I guess it won't be a problem once it has caught up, but it feels a bit strange to use these technologies with so little data :)
[20:01:41] yes, it's a total mismatch, using it because it's there not because it's the right tool
[20:02:02] yes completely
[20:02:04] but that's life, we make tradeoffs :) It's not clear we have a better way for that today
[20:03:18] yes, could have been worse, like a mw maint script :)
[20:04:13] lol, indeed
[20:05:45] * dcausse wonders what the "delete" button on the DAG in airflow does
[20:06:26] will it delete the source file? or just all the data associated with its runs
[20:12:25] well I'll figure that out tomorrow, going offline
[20:17:53] delete will drop it from the SQL database, the next run of the scheduler will bring it back as a "new" dag
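Editor's note: the pyarrow route mentioned at 19:54:54 (reading parquet straight from an exact HDFS path, no Spark) could look roughly like the sketch below. This is a minimal illustration, not the team's actual code: the table path, partition layout, and HDFS connection string are assumptions, and it presumes the Hadoop client libraries (libhdfs) and CLASSPATH are available on the host.

```python
# Minimal sketch: read one hour's partition directly from HDFS with pyarrow,
# avoiding the Spark/yarn spin-up overhead discussed above.
import pyarrow.fs as pafs
import pyarrow.parquet as pq

# Connect to the cluster's default namenode (resolved from the local Hadoop config).
hdfs = pafs.HadoopFileSystem("default")

# Hypothetical partition path for a single hour of data (illustration only).
path = "/wmf/data/discovery/some_table/year=2021/month=11/day=1/hour=0"

table = pq.read_table(path, filesystem=hdfs)
print(table.num_rows)

# At 10-20 events per hour everything fits comfortably in memory,
# so further processing can just happen in pandas.
df = table.to_pandas()
```

As noted in the log, this fits the "single worker is exactly what I need" case; skein would only come into play if the work had to be scheduled on yarn rather than run locally.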