[10:14:31] lunch
[13:53:57] Dr appt this morning so likely won’t be able to make the triage meeting
[16:30:27] hmm, are cloudelastic ports actually mixed up? 'cloudelastic-omega' uses localhost:6106, which reports as omega cluster.
[16:30:53] (although the comment says cloudelastic.wikimedia.org:9643, which is wrong)
[16:37:56] ebernhardson: isn't envoy in play here as well? I think that's where the comments come from, saying where it ultimately ends up
[16:38:00] (I could be totally off base here)
[16:39:10] ryankemper: yea, i'm reading the envoy ports our of mediawikis ProductionServices.php and checking those. This is the result of checking just now: https://phabricator.wikimedia.org/T262630#8322132
[16:40:08] s/our of/out of/
[16:40:45] interesting...yeah that seems like there isn't actually a mismatch
[16:41:06] i suppose could check index names between the clusters as well as a secondary check
[16:42:47] oh, yeah we should do that as a final validation
[16:44:49] if it does end up that everything is correct, then the only small nit I'd have is to swap these two blocks https://github.com/wikimedia/operations-mediawiki-config/blob/e40e1ba9b83c0a49ab63c12a95540b29c3571d23/wmf-config/ProductionServices.php#L91-L104 (and ofc remove the comment about them being mixed up)
[16:52:01] yea it looks fine, naively comparing the first 5 sorted indices by name between the different clusters it all lines up. 8 point task fixed by changing some comments :P
[17:00:44] I think that means the next 1 point task we pick up is going to inexplicably turn into an 8 :P
[17:11:09] dinner
[17:45:13] heh, so what happened is https://gerrit.wikimedia.org/r/632683, which added the note about mixed up ports, also didn't manage to maintain the mixup and instead when migrating to envoy sent everything to the correctly named clusters. In T279009 a cloudelastic user reported we had indices in psi and omega, and we cleaned up the ones in the wrong places
[17:45:14] T279009: Cleanup duplicate indices in cloudelastic - https://phabricator.wikimedia.org/T279009
[17:50:02] is it appropriate to change the story points? It would have been an 8, except we already did it accidentally
[17:52:19] * ebernhardson goes back to fighting jvm direct memory. While limiting it worked for the read side of the cirrus dumps, on the write side it appears numerous people have reported bad interactions with parquet + multi-mb fields + snappy compression
[17:53:07] the tl/dr is that large buffers get allocated, are freed when the GC decides to clean up that area, but since the buffers are off-heap they don't induce any memory pressure and don't trigger a GC
[17:53:49] was intending to test avro for this use case anyways
[18:54:24] to read cirrus dumps in parquet I had to set spark.executor.memoryOverhead to 8g (the options you gave me limiting mapped mem did not quite work)
[18:57:27] to be precise: https://phabricator.wikimedia.org/P35545
[18:58:28] I'm sure I don't 8g on the driver but I tested so many variations... :/
[18:58:44] s/dont't/don't need/
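For reference, a minimal PySpark sketch of the kind of setup being described here; the app name, path, and exact sizes are illustrative assumptions, not the actual settings from P35545. Parquet + snappy decoding allocates large off-heap buffers that don't count against the JVM heap, so the extra room has to come from executor memory overhead rather than executor memory itself:

    from pyspark.sql import SparkSession

    # Sketch only: sizes and paths are assumptions, not the real job config.
    spark = (
        SparkSession.builder
        .appName("cirrus-dump-read")                    # hypothetical app name
        .config("spark.executor.memory", "8g")
        .config("spark.executor.memoryOverhead", "8g")  # headroom for off-heap parquet/snappy buffers
        .getOrCreate()
    )

    # hypothetical path to the parquet conversion of the cirrus dump
    docs = spark.read.parquet("hdfs:///path/to/cirrus_dump.parquet")
    print(docs.count())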
[19:03:13] hmm, i suppose i only tried a basic read and not actually using the data...maybe avro will be better who knows
[19:03:24] the code in case this helps: https://people.wikimedia.org/~dcausse/cirrus_doc_sizes.html
[19:03:32] yes I hope avro makes this a bit easier
[19:04:29] I mean having to tune the mem settings to read the dataset makes it hard to use
[19:04:53] so far avro is more painful, it didn't like the null values, and now it's saying i can't put an array in a string column :P I have to clean up the input data a bit more, or maybe a generic type coercion step
[19:05:22] not sure why parquet was ok with the bad data...it should have complained about the array in a string too
[19:05:47] why not plain json lines?
[19:06:24] i guess we could, seems more difficult to use for later steps
[19:06:29] that makes the dataset not usable directly tho
[19:06:31] yes
[19:06:48] you'd always need to transform it
[19:07:14] i suppose another downside of avro, you have to load all the data to count the links, when you only need two or three columns
[19:07:59] maybe it's smart enough to not decompress and deserialize the other fields, unsure, but it would have to pull the full row data
[19:08:37] perhaps we need several datasets based on these dumps?
[19:09:00] hmm, perhaps bulk text in a second table?
[19:09:07] text / source_text, maybe opening_text too?
[19:09:16] one more focused on the content aspects (avro) and another one on the metadata (parquet)
[19:09:18] yes
[19:09:36] opening_text is quite big actually
[19:10:00] huh, wasn't expecting that
[19:10:07] on the pl books it's 2.5M (we should do something about that I think)
[19:10:25] https://docs.google.com/spreadsheets/d/1NTOhfw5pRPZBxZ017G-SblfSvvJ7hLIpSNtA_IUO2iQ/edit?usp=sharing
[19:10:44] oh wow, yea we should do something about that :) 2.5MB is more than was intended there :)
[19:10:54] yes...
[19:11:30] 125k statement_keywords ? sounds suspicious as well :)
[19:11:46] probably from the image stuff
[19:11:48] haven't looked at those yet :P
[19:50:45] ryankemper & inflatador: can I do some reindexing, or is anything going on that I should wait to finish up?
[20:04:00] curious, apparently in some places file_text is boolean false (https://commons.wikimedia.org/wiki/?curid=9280008&action=cirrusdump). Sometimes file_text is an empty array (https://meta.wikimedia.org/wiki/?action=cirrusdump&curid=752655).
[20:38:55] Trey314159: go ahead
[20:39:09] cool! thanks!
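One possible shape for the "generic type coercion step" mentioned above, normalizing the inconsistent file_text values (boolean false, empty array) seen in the cirrusdump examples. A sketch under assumptions: `docs` is the dataframe from the earlier sketch, read with file_text as a plain string column, and the cleanup rules are guesses rather than the actual fix:

    from pyspark.sql import functions as F

    # Sketch only: map the odd file_text encodings ("false", "[]", "") to null so the
    # column becomes a clean nullable string that an avro schema will accept.
    cleaned = docs.withColumn(
        "file_text",
        F.when(F.col("file_text").isin("false", "[]", ""), F.lit(None).cast("string"))
         .otherwise(F.col("file_text")),
    )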
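And a rough sketch of the metadata/content split discussed above; the column lists, output paths, and the reuse of `cleaned` from the previous sketch are all assumptions:

    # Metadata table in parquet (columnar, so counting links only reads a few columns);
    # bulk text table in avro (row-oriented, avoids the huge off-heap parquet buffers).
    content_cols = ["text", "source_text", "opening_text", "file_text"]
    key_cols = ["page_id", "wiki"]  # assumed identifier columns

    cleaned.drop(*content_cols).write.parquet("hdfs:///path/to/cirrus_meta.parquet")

    (cleaned.select(*(key_cols + content_cols))
            .write.format("avro")  # requires the spark-avro package on the classpath
            .save("hdfs:///path/to/cirrus_content.avro"))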