[00:49:57] g.ehel i.nflatador d.causse i dropped https://phabricator.wikimedia.org/T347647 into Incoming assigned to me, hope this is okay. I didn't want to mess with WIP limit columns so was hazarding a guess on where to put it (sorry if I messed up). I can't usually make all of backlog grooming, but will be interested to learn the process.
[07:48:01] pfischer: o/ when you have a couple sec: https://gitlab.wikimedia.org/repos/search-platform/flink-rdf-streaming-updater/-/merge_requests/10
[07:48:38] 👀
[07:51:29] dcausse: +2, shall I merge?
[07:51:41] pfischer: sure! :)
[07:53:33] thanks! :)
[08:49:00] Errand, back in a few
[09:03:42] pfischer: another quick one if you have a minute: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/961985
[09:59:54] lunch
[10:00:15] lunch 2
[13:02:23] weekly update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2023-09-29
[13:20:13] o/
[13:34:23] gehel WMCS team wants to move the cloudvirt hosts. I told them it was OK but tagged you just in case: https://phabricator.wikimedia.org/T346948
[13:38:16] dr0ptp4kt re: T347647 that's totally fine. I won't be much help with the data validation issues but I can give you an update on the JNL file stuff
[13:38:17] T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'` - https://phabricator.wikimedia.org/T347647
[13:44:17] JNL file finished; took ~8h to download. I compressed it with zstd which I didn't time, but it took at least 4h and reduced the size to ~300 GB. Currently uploading to my Cloudflare acct which should be done in ~13h
[13:44:49] Probably should've downloaded to an NVMe-backed cloud server with better throughput, but oh well
[13:57:16] dcausse for the new rdf-streaming-updater image, should we build with flink 1.17 or stay on 1.16?
[13:58:01] inflatador: for the rdf-streaming-updater I think it's fine to stick to 1.16 for now, not sure we have the bandwidth to upgrade it now
[13:58:32] but for the SUP we'd like to have a 1.17.1 base image (https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/961877)
[13:59:39] dcausse ACK, will build/deploy flink 1.17.1 img shortly
[13:59:40] inflatador: relatedly I wanted to deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/961985 (re-enable the rdf-streaming-updater in dse-k8s)
[14:00:11] do you mind having a quick look, I'll try to deploy it
[14:00:16] dcausse does that depend on the new rdf-streaming-updater img?
[14:00:55] I made a new image with https://gitlab.wikimedia.org/repos/search-platform/flink-rdf-streaming-updater/-/merge_requests/10 this morning
[14:01:36] oh cool, I thought we had to go thru the docker-img publishing process. Anyway, I merged your patch
[14:02:03] only the base image requires that
[14:02:14] thanks
[14:02:33] thx inflatador. for kicks i installed axel and am running a download of the big one - that works much better. i posted a note about axel with props to you on wordpress. will be interested to get a copy of that compressed one, too, to see how fast it will deflate and that it loads nicely - what'll be the date on that one? the new .ttl munged cleanly, finishing this morning on my desktop.
[14:03:49] one thing's for sure - loadData.sh on a 6 gb ram debian 11 image would take forever based on the first of the munge files
[14:04:39] gonna head out onto the road in a bit, we're driving out of town to cowork with one of our colleagues!
[14:05:01] dr0ptp4kt the compressed version is exactly the same JNL as add-shores...I can get a newer JNL up straight from wdqs1016 though if you need one, would be ready Monday
[14:05:15] no worries, have a good wkend
[14:07:37] yeah, if it isn't too much pain and suffering that'd be great. could i do a ride along to see how you do it? is your thought to initiate that on monday in fact? if you'd be going about commands i could hop on a meet today afternoon or, if better, whenever there's mutual availability monday
[14:08:52] I was planning on starting today, but I can wait. I'll just be using the same rclone cmd as add-shore did in https://addshore.com/2023/08/wikidata-query-service-blazegraph-jnl-file-on-cloudflare-r2-and-internet-archive/
[14:09:27] so not much to learn...I'm going to use zstd compression though as it's supposed to be much faster at compress/decompress
[14:17:35] oh okay - don't wait on me if it'll be this morning already. otherwise i'm around this afternoon. okay, heading out the door for real now :)
[14:23:29] new flink img is building
[14:26:05] sigh... app failed with Cannot map checkpoint/savepoint state for operator 1569c05272b3ada0036c5425ae58508a, not something I was expecting. I removed 3 non-transactional kafka producers; I hope it's because of that, but I'm surprised they have a state...
[14:42:22] \o
[14:43:09] dcausse flink 1.17.1 img is ready
[14:43:27] inflatador: thanks! :)
[14:43:29] o/
[14:44:23] np! Quick workout, back in ~40
[14:44:49] flink's truly annoying, even if you set human-readable strings for operator uids it'll hash those and use the hash in error messages...
[14:45:52] lol
[14:46:10] yea i've been getting the feeling with plenty of software that it wasn't developed with developer experience in mind...
[14:52:58] truly not...
[14:56:13] and 1569c05272b3ada0036c5425ae58508a is "failed-events-output" according to Hashing.murmur3_128(0).newHasher().putString("failed-events-output", StandardCharsets.UTF_8).hash().toString
[14:56:16] so it's expected
[15:17:40] ebernhardson: if you have a sec https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/962051
[15:18:02] this is related to the "Cannot map checkpoint/savepoint state for operator 1569c05272b3ada0036c5425ae58508a" error mentioned above
[15:20:00] looking
[15:22:11] dcausse: looks like the right flag
[15:22:59] thanks!
[15:28:53] back
[15:31:47] oh well... another error now
[15:32:17] oof, looks like mw-page-content-change-enrich fell over again
[15:32:43] :(
[15:32:58] tempted to set env.getConfig().disableAutoGeneratedUIDs() for the SUP
[15:34:53] would that basically mean we have to manually give everything ids?
[15:35:00] it's not a particularly big transformation, probably possible
[15:37:59] in theory only things that have a state
[15:38:39] but knowing what is declaring a state does not seem obvious, for instance I thought that a non-transactional kafka producer would not create a state but it does
[15:39:10] seems like the same initialize code path for both transactional and non-transactional use-cases :/
[15:39:34] might not be true for KafkaSource but without looking into the code it's hard to know...
[15:39:55] huh, that is curious
[15:41:08] it seems like turning it off might be best though, as you said before we'd prefer to name everything that has a state. It seems like the only way to be sure we got them all would be to turn off auto-ids
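A minimal sketch of the two things discussed above, not the actual SUP or rdf-streaming-updater code: reproducing Flink's operator hash from a uid string with the same Guava murmur3 call quoted at 14:56, and calling disableAutoGeneratedUIDs() so that any operator missing an explicit .uid(...) fails fast at job-graph construction instead of breaking savepoint state mapping later. The uid names and the toy pipeline are made up for illustration; assumes Guava and the Flink Scala API on the classpath.

```scala
import java.nio.charset.StandardCharsets

import com.google.common.hash.Hashing
import org.apache.flink.streaming.api.scala._

object OperatorUidSketch {
  // Flink hashes a user-specified operator uid with 128-bit murmur3 (seed 0),
  // so hashing our own uid strings tells us which operator an error such as
  // "Cannot map checkpoint/savepoint state for operator <hash>" refers to.
  def flinkOperatorHash(uid: String): String =
    Hashing.murmur3_128(0).newHasher()
      .putString(uid, StandardCharsets.UTF_8)
      .hash()
      .toString

  def main(args: Array[String]): Unit = {
    // prints 1569c05272b3ada0036c5425ae58508a, the hash from the error above
    println(flinkOperatorHash("failed-events-output"))

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Refuse auto-generated uids: job submission fails unless every operator
    // (sources, sinks, and anything stateful in between) has an explicit uid,
    // so state can still be mapped when the shape of the pipeline changes.
    env.getConfig.disableAutoGeneratedUIDs()

    env.fromElements("a", "b", "c")
      .uid("example-source")      // uid names here are hypothetical
      .map(_.toUpperCase)
      .uid("example-map")
      .print()
      .uid("example-sink")

    env.execute("uid-sketch")
  }
}
```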
[15:43:01] yes might be annoying but might save some painful debugging whenever the shape of the pipeline changes
[15:44:09] yea...i was thinking it will almost certainly make the code messier :S
[15:46:59] no objections to be intentional and select only sources/sinks/asyncio & windows
[15:48:02] we survived like that for a couple years with the wdqs updater, it's just annoying now that I have to change the shape of this job (with kafka connector changes)
[15:57:25] dcausse or anyone else, I'm presenting RDF streaming updater to the DPE SREs in a few wks. I'm starting at https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater but if you have any other links/suggestions for a presentation on this topic LMK
[15:58:25] inflatador: sure I'll send you a few links
[16:02:44] Thanks, will def want to debut it w/you before I give it to the group if that is OK
[16:03:14] ebernhardson I added T347075 as the bug for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/951960, if there is a more appropriate task LMK
[16:03:14] T347075: Deploy test instance of cirrus updater in k8s - https://phabricator.wikimedia.org/T347075
[16:03:32] seems reasonable
[17:03:10] * ebernhardson wonders why more things don't have negative filters. In airflow i can see all failed tasks, or all queued tasks. But what i really want to see is every task that is not success
[17:30:51] sigh, some sort of skew in the sparql query metrics. A typical partition has 500kb and 6k records. It's currently 500MB (42M records) into the final partition...and while it hasn't failed yet, the reason i'm looking into it is that it failed in airflow so...sigh
[17:33:30] the problem is spark doesn't make it easy to understand what code backs a particular stage...i always had a pretty good idea with the ones i wrote, but here it's a bit mysterious :P
[17:38:40] get swiftk8s_op_test_dse/wikidata_test/checkpoints/fb15d3590dbcc2855d9bdddb26fc2120/chk-32421/_metadata
[17:39:10] swift download /wikidata_test/checkpoints/fb15d3590dbcc2855d9bdddb26fc2120/chk-32421/_metadata
[17:39:15] view k8s_op_test_dse/wikidata_test/checkpoints/fb15d3590dbcc2855d9bdddb26fc2120/chk-32421/_metadata
[17:39:40] yea that's the stage that's failing...gc time has climbed 5 minutes over the last 7 for that instance...time to dig into whatever subgraph mapping does :)
[17:40:45] jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjkkkkkkkkkkkkkkkkkkkkkkkkkk:q
[17:44:05] <3 chatgpt: The code example I provided in my previous response was a hypothetical example for illustrative purposes, and it does not represent actual functionality available
[17:47:28] * ebernhardson is getting into an odd habit of not finding desired functionality, asking chatgpt if there is something i missed, and then having it just invent what it could look like if that did exist :P
[18:01:55] lunch, back in time for training
[18:13:42] wow just realized my terminal was setup to broadcast all :/
[18:18:52] turned on adaptive query execution in the subgraph mapping...the auto-coalesce after shuffle actually made things a bit faster. But the skew join optimization did nothing :P
[18:39:05] back
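For reference on the adaptive query execution mention above, a minimal sketch of the Spark 3.x AQE settings involved. The property names are the standard Spark ones; the app name and threshold values are illustrative only, not what the subgraph-mapping job actually uses. Skew-join splitting only applies to shuffle sort-merge joins, which is one reason it may do nothing for a given skew pattern.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative AQE configuration; values are examples, not the job's real settings.
val spark = SparkSession.builder()
  .appName("subgraph-mapping-aqe-sketch")
  // adaptive query execution: re-plan stages using runtime shuffle statistics
  .config("spark.sql.adaptive.enabled", "true")
  // the "auto-coalesce after shuffle" mentioned above: merge small post-shuffle partitions
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  // split skewed partitions during sort-merge joins (does not help every join shape)
  .config("spark.sql.adaptive.skewJoin.enabled", "true")
  .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
  .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
  .getOrCreate()
```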
[21:17:52] * ebernhardson now realizes spark auto-skew can't fix this because in a right join, the typical salt key approach only works if the right side is skewed. But here the left side is skewed...still solvable but not the generic easy way
[21:18:09] only realized after half implementing and inspecting test failures :P
[21:19:01] instead we get the fun answer of filtering the set of wikidata triples based on a modulus of the Q item, and then unioning together the result :P
[21:20:18] doh, that also might not be 100% right, since right join will keep rows in the right that don't match the left...sigh...more thinking required :P
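A rough sketch of the modulus-and-union idea floated at 21:19, not the final solution: the column name "entity" for the Q item join key, the bucket count, and the use of Spark's hash() instead of parsing the numeric part of the Q id are all assumptions for illustration. Bucketing both sides of the join by the same function of the key keeps every key in exactly one bucket, which is one way around the unmatched-right-rows worry raised at 21:20; note it breaks one huge join into N smaller ones rather than splitting a single hot key.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

// Split a right outer join into `buckets` smaller joins and union the results.
// Both sides are filtered by the same bucket expression over the join key, so
// every key (and every unmatched right-side row) lands in exactly one bucket.
def bucketedRightOuterJoin(left: DataFrame, right: DataFrame,
                           key: String, buckets: Int): DataFrame = {
  // pmod keeps the bucket non-negative even when hash() returns a negative int
  val bucketOf = pmod(hash(col(key)), lit(buckets))
  (0 until buckets)
    .map { b =>
      left.filter(bucketOf === b)
        .join(right.filter(bucketOf === b), Seq(key), "right_outer")
    }
    .reduce(_ union _)
}
```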