[01:42:33] dcausse: i noticed some duplicates resulting from some confined cross-join like behavior within inner joins, so added .distinct() for simplicity for each graph. run time is now 2h30m. i think i can get that lower, although it's working. new patch posted - you can find the data in dr0ptp4kt.wikibase_rdf_scholarly_split_refactor_w_cache_n_distinct
[01:44:59] after talking some more optimization stuff with Joseph, i added in some .cache() calls to verify that they don't harm, and possibly help, run time. i am interested to try it again without the .cache() altogether and just see if there's much of a difference. tomorrow, perhaps
[10:01:25] dcausse: I'll be 2' late
[10:01:35] np!
[10:34:48] dr0ptp4kt: thanks! running tests on this new table then
[10:34:54] errand+lunch
[14:15:23] o/
[14:34:44] o/
[14:37:26] I wrote a bash script that compares the contents of the SUP artefacts (*.class in …-jar-with-dependencies.jar) against the contents of all JARs coming with flink 1.17.1 (lib/*.jar). If that script works as expected, there would be 8k duplicate classes.
[14:38:57] ouch
[14:39:07] I wonder if the maven-assembly-plugin gives us what we need.
[14:39:46] Flink docs recommend the shade plugin. Was there a reason for using assembly (instead)?
[14:40:55] the shade plugin provides more flexibility but tends to not encourage you to deal with jar hell but rather hide it
[14:41:46] if we have to stop shipping some classes as part of the fat jar we should treat the problem with dependency scope and/or exclusions
[14:43:12] some duplications are probably harmless, I guess we need to look at them closely?
[14:43:14] So far this didn’t cause any problems, so maybe my script or assumptions are faulty. flink remains vague about what exactly is part of the runtime they provide. Anything in flink-dist.jar and logging (log4j2 + slf4j 1.x)
[14:44:49] pfischer: can you extract this list somewhere so we can check if there are any big/concerning offenders?
[14:44:50] https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/configuration/overview/#running-and-packaging
[14:45:27] dcausse: sure, one moment
[14:50:35] Script: https://phabricator.wikimedia.org/P53422; Duplicates: https://phabricator.wikimedia.org/P53424
[14:51:02] thanks!
[14:52:22] hm... the flink/api should not be there
[14:52:46] nor kryo stuff, flink metrics, shaded guava & jackson
[14:52:56] org/apache/flink/shaded yes
[14:53:01] seems suspicious
[14:54:35] I remember Erik fixing the build at some point, decreasing its size from 70+Mb to something a bit more reasonable
[14:56:50] According to the docs, job and task manager would log their class path on startup. Would be interesting to see what’s in it. According to the docs, no connector is provided; for some reason the table connector shows up in the classes in flink/lib/*.jar
[15:01:07] hm.. can't find the commit where Erik told maven to stop shipping flink in the fatjar, might not have been merged in the end (was at the time we had this weird serialization issue)
[15:01:40] for the connector yes... it's weird, we can try as provided if we want
[15:57:43] \o
[15:58:20] hmm, i think i changed the flink deps in pom.xml to provided which shrank the consumer jar in testing, but never actually tried running it
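As a side note, a rough Python sketch of the kind of overlap check pfischer's bash script (P53422) performs — the jar paths below are placeholders, not the real script; the idea is just that any .class entry present both in the SUP fat jar and in flink's lib/ is a candidate for provided scope or an assembly exclusion:

  # sketch only: paths are placeholders, see P53422 for the actual bash script
  import zipfile
  from pathlib import Path

  def class_entries(jar):
      """Return the set of .class entry names inside a jar (a jar is a zip)."""
      with zipfile.ZipFile(jar) as z:
          return {name for name in z.namelist() if name.endswith(".class")}

  fat_jar = class_entries("consumer-jar-with-dependencies.jar")
  flink_lib = set()
  for jar in Path("flink-1.17.1/lib").glob("*.jar"):
      flink_lib |= class_entries(jar)

  duplicates = sorted(fat_jar & flink_lib)
  print(f"{len(duplicates)} classes in the fat jar are already on flink's lib/ classpath")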
[15:59:26] o/
[15:59:35] dcausse are checkpoints/savepoints useful after the kafka retention expires? I'm looking into setting TTLs for the objects a la https://docs.openstack.org/swift/latest/overview_expiring_objects.html
[16:03:04] ebernhardson: I added a dedicated assembly descriptor to exclude dependencies that are already part of the runtime and also exclude classes that do not get caught on a per-dependency basis (other uber jars, shaded classes etc.). That shrinks the producer from 55mb to 35mb
[16:03:48] inflatador: we do use incremental checkpoints so some files might be unchanged for a period greater than kafka retention
[16:04:20] pfischer: nice!
[16:04:36] e.g. swift list -l rdf-streaming-updater-eqiad -p wikidata/checkpoints/224443f0230a92da72ed2685c72ae31b/shared/ has files touched in october
[16:05:27] perhaps if we'd force restart with savepoint every once in a while we could use an expiry technique?
[16:10:03] (other than for debugging purposes) savepoints are unusable after the kafka retention of the topics the job is consuming from
[16:47:27] hmm, enwiki reindex in cloudelastic failed again :S
[16:48:08] something about completing it must be odd, it got to 5395415 / 5395768 and then had runtime errors
[16:50:40] "runtime errors" you mean at the end of the elastic background reindex Task?
[16:51:23] dcausse: https://phabricator.wikimedia.org/P53439
[16:52:32] the exception is a bit odd, stack trace starts at a place that doesn't throw exceptions, so i'm not completely sure
[16:52:59] oh nevermind, i just can't read :P
[16:53:36] it means elasticsearch responded with an 'error' key
[16:55:46] oh, at around that time elasticsearch reported: ScriptException[runtime error]; nested: NumberFormatException[For input string: "_analyze"];
[16:55:59] workout, back in ~40
[16:56:03] :/
[16:56:14] some poisoned doc
[16:56:48] Looks like the script was: if (ctx._source.page_id == null) {ctx._source.page_id = Long.parseLong(ctx._id.substring(0));}
[16:57:01] so, some doc with a non-numeric id
[16:57:21] well, findable at least :)
[16:59:00] * ebernhardson starts it over and hopes that was the only bad doc
[16:59:04] :)
[17:06:28] why not...running a quick loop from python to scroll through the index and verify all the ids come back as numeric
[17:17:28] * ebernhardson for some reason thought that would take less than 10 minutes :P
[17:20:52] Hm, still running?
[17:21:59] BTW: Producer is down to 26mb from 52mb, and Consumer went from 74mb to 54mb https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/61
[17:22:25] a scan over enwiki_content and enwiki_general in cloudelastic to verify that all the doc ids are numeric (since enwiki_content failed to reindex due to a non-numeric doc id)
[17:22:36] it's finished enwiki_content, but enwiki_general is taking its time :)
[17:24:08] So at least it’s “narrowed down” to enwiki_general…
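For the record, a minimal sketch of what that quick python scroll loop might look like using the elasticsearch client's scan helper — the endpoint below is a placeholder, and this is not necessarily what ebernhardson actually ran:

  # sketch: flag any doc whose _id is not purely numeric; host is a placeholder
  from elasticsearch import Elasticsearch
  from elasticsearch.helpers import scan

  es = Elasticsearch("https://cloudelastic.example.org:9243")  # placeholder endpoint
  bad_ids = [
      hit["_id"]
      for hit in scan(es, index="enwiki_general", _source=False)
      if not hit["_id"].isdigit()
  ]
  print(bad_ids or "all doc ids are numeric")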
[17:48:03] does anyone know what version of Blazegraph we're on, or the best place to check? cc: brouberol
[17:48:51] inflatador: it's a custom version https://gerrit.wikimedia.org/r/admin/repos/wikidata/query/blazegraph,general
[17:49:08] i guess we call it 2.1.6-wmf.2
[17:49:56] actually wmf.2 is the in-dev version, last release was wmf.1
[17:50:46] ebernhardson interesting, looking at GH and wondering if we ever pulled in https://github.com/blazegraph/database/pull/237
[17:51:28] inflatador: last patch to our repo is 2020, so probably not :)
[17:52:03] doesn't look particularly concerning in our context
[17:52:37] yes it's a test
[17:53:34] i wonder how that managed to get a 'high' severity, i guess i don't know much about how all that is officially done but i would put this somewhere near lowest
[17:54:38] yes, and surprising to see this one being merged while other more important ones did not
[17:54:45] https://github.com/blazegraph/database/pull/145
[17:54:45] no worries, just happened to see it when I was looking at GH
[17:55:21] hmm, yea that looks more like the security concerns i'd be more worried about :)
[18:00:41] dinner
[18:17:50] lunch, back in time for pairing
[19:06:18] curious, manually trying to get testwiki to emit a rerender event (from shell.php) gives: "Events of type '1' are not enqueueable"
[19:12:24] back
[19:18:01] and the answer is.... "." and "_" are not the same character :P
[19:23:12] also running a test snapshot/restore from eqiad to relforge on frwiki, looks like it should all be set. Once we have the rerenders going we should be able to start running the updater on a mid-size wiki in relforge
[19:23:30] will have to do another snapshot/restore cycle of course, just verifying it works now
[19:29:40] ebernhardson cool, if eqiad snaps are working I think I'll close https://phabricator.wikimedia.org/T348686
[19:30:34] inflatador: nice! I haven't directly tested the small clusters, but i do see the config in place
[19:31:02] ebernhardson ACK, I'll test the small ones then
[19:31:17] prod -> relforge that is...LMK if that will mess up anything you're doing
[19:32:08] inflatador: it should be fine. for the config i used, it's in mwmaint2002:~ebernhardson/{snapshot,restore}.json
[19:32:30] Ah nice, will take a look
[19:35:36] hmm, maybe it doesn't work :P it looks like it finished the restore but the index is red. Will look into it more
[20:00:24] answer was, i wrote the json to update index settings, but didn't actually provide it to curl, so the index restored with the default max_shards_per_node
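A minimal sketch of a restore call that does carry the index settings — the endpoint, repository, snapshot and the exact setting to relax are assumptions (the log only says "max_shards_per_node", so index.routing.allocation.total_shards_per_node is used as a stand-in), not the contents of mwmaint2002:~ebernhardson/restore.json:

  # illustration only: endpoint, repository, snapshot and settings are placeholders
  import requests

  body = {
      "indices": "frwiki_content",
      "index_settings": {
          # relforge has fewer nodes than eqiad, so relax per-node shard limits
          "index.routing.allocation.total_shards_per_node": -1,
          "index.number_of_replicas": 0,
      },
  }
  resp = requests.post(
      "https://relforge.example.org:9243/_snapshot/eqiad_repo/frwiki_snapshot/_restore",
      json=body,
      timeout=60,
  )
  resp.raise_for_status()
  print(resp.json())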
[21:54:17] getting rerender events out of testwiki now, although not the test edit i made :) Maybe it will take some time to work through
[21:55:07] oh, there it is, mine did come through. Curiously though we got a rerender event for both the template and the page using the template, after editing the template. I would have expected only the page using the template
[21:56:58] the consumer fell over though :P page id mismatch during fetch
[22:00:49] Oh, so that actually happens... Can you tell which id resulted in the mismatch?
[22:03:13] pfischer: yea it's in the log, and it's actually not to do with the rerenders
[22:03:30] pfischer: it turns out someone deleted a page, created a new page in its place with the same name, and now our cirrusbuilddoc for the old rev id is giving the new page id
[22:04:41] not sure yet how painful that will be to fix, mediawiki doesn't make it all that easy to deal with archived/deleted pages
[22:06:10] oh curious, the revision table also has the new page_id...maybe something awkward happened with the other end of the event
[22:14:42] something very wonky, we have events in codfw.mediawiki.page_change.v1 that don't really make sense. we have 4 separate events all for rev_id 582824, half with page_id 153494 and half with page_id 153495
[22:21:29] Ouch. I'll look into the cirrus extension code tomorrow. That's somewhat frustrating...
[22:22:24] should be fun :)
[22:30:33] dcausse: those .distinct() calls on the two graphs are probably not necessary. the cross-join artifact-looking stuff was i think only in the automated tests, because we duplicate things in the .ttl files. i just double checked the counts on the actual production data, and either it's a very lucky result or...we don't have a duplication problem in the real wikibase_rdf table.
[22:30:35] select count(1) from wikibase_rdf_scholarly_split_refactor_w_cache_n_distinct where snapshot = '20231106' and wiki = 'wikidata' and scope = 'scholarly_articles';
[22:30:36] 7654774881
[22:30:36] select count(1) from wikibase_rdf_scholarly_split_refactor_w_cache_n_distinct where snapshot = '20231106' and wiki = 'wikidata' and scope = 'wikidata_main';
[22:30:36] 7713573819
[22:30:36] select count(1) from wikibase_rdf_scholarly_split_refactor where snapshot = '20231106' and wiki = 'wikidata' and scope = 'scholarly_articles';
[22:30:36] 7654774881
[22:30:36] select count(1) from wikibase_rdf_scholarly_split_refactor where snapshot = '20231106' and wiki = 'wikidata' and scope = 'wikidata_main';
[22:30:37] 7713573819
[23:33:04] low priority CR for monitoring the wikidata LDF endpoint: https://gerrit.wikimedia.org/r/c/operations/puppet/+/974281/
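For completeness, a rough PySpark version of the 22:30 duplicate check, in case it needs re-running against a future snapshot — same table and filters as the SQL above, but comparing count() against distinct().count() on the non-deduplicated table rather than comparing the two tables; equal numbers mean the .distinct() calls buy nothing:

  # sketch: re-checks for duplicate rows in the non-deduplicated table
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  df = spark.table("wikibase_rdf_scholarly_split_refactor").where(
      "snapshot = '20231106' AND wiki = 'wikidata' AND scope = 'scholarly_articles'"
  )
  print(df.count(), df.distinct().count())  # equal counts => no duplicates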