[03:11:11] wrt rolling restarts, eqiad and cloudelastic are done, codfw is not (it's stuck in yellow cluster status, there's something wonky with one of the network switches or something)
[08:20:06] hare: we don't officially support Streaming Updater for the public yet, but that's on our board
[08:20:56] I'm wondering how difficult it would be to be able to do something about that now - we only support Kafka as a source right now, but I imagine it wouldn't be impossible to wrap EventStream API as a source
[08:21:23] or push it to Kafka and use Streaming Updater as-is?
[08:37:10] dcausse: https://www.irccloud.com/pastebin/qJ9m4LCZ/
[08:37:17] upper one is from the job
[08:37:35] something failed during execution, and I'm wondering if it didn't duplicate anything
[08:38:14] I'm going to process it a bit further, but I'm guessing it's probably in the right ballpark
[08:40:33] interesting, there might be dups in the rdf dumps
[08:40:59] ah, ok, so I'll sort and uniq that thing, we'll see how it compares then
[08:41:02] actually I don't know which one is what
[08:41:35] new_map.csv is from state extraction, rev_map.csv is extracted from dumps
[08:42:22] hm... then that would not explain it
[08:42:25] nope
[08:42:33] strange
[08:42:48] the process might've restarted during the night, though
[08:43:00] that could cause duplicates?
[08:43:37] can it restart on its own?
[08:43:54] I would have assumed a failure
[08:44:03] ah, probably not
[08:44:39] let me check... ok, no duplicates, actually
[08:45:18] that's unexpected :D
[08:45:36] let's see what diff tells us
[08:49:35] diff tells us that revision ids are different
[08:50:08] weird
[08:52:29] ahh
[08:52:37] ok, PEBKAC
[08:53:17] wrong rev map file
[08:57:39] I rest my case https://www.irccloud.com/pastebin/KtcJrEOw/
[09:01:29] huh, but that means that the state is actually larger for SDC than it is for WD
[09:01:53] strange, considering that the entity count is currently at about 82% of the volume
[09:04:54] M-id strings are longer than Q-ids?
[09:05:27] is the csv file larger by the ~same ratio?
[09:07:44] I assume they are compressed?
[09:08:00] but even if they are, maybe that's the case?
[09:08:09] the size difference is huge, though
[09:15:24] in any case, I could make a histogram of id lengths
[09:15:48] but I'm not really feeling the need - I verified that the state is a correct representation of the dump
[09:47:00] meal break
[10:46:28] lunch
[10:54:08] enjoy your meal!
[10:54:10] lunch 2
[12:48:07] lunch
[15:50:49] maybe worth reading: https://www.stardog.com/news/independent-benchmark-demonstrates-first-trillion-edge-knowledge-graph-stardog-recognized-for-providing-sub-second-query-times-on-hybrid-multicloud-data/
[15:51:29] sounds a bit too much like marketing speak :/
[15:55:07] \o
[15:55:45] o/
[15:55:51] o/
[15:56:14] from the benchmark: "Of course, there are no actual graph nodes or edges in the relational data sources, but if we transformed the relational data into the graph representation, we would get the number of nodes and edges as shown in the table."
[15:57:16] yeah, they are virtualizing other data sources to expose them via their triple store. The approach is interesting, but not really compatible with what we do.
[15:57:54] they probably also support transparent federation of SPARQL sources, which would be more interesting in our context.
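A minimal sketch of what that kind of federation looks like from a SPARQL client, assuming a hypothetical local Blazegraph endpoint: the SERVICE clause hands part of the pattern to a remote endpoint (here the public WDQS one) at query time.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch only: SPARQL 1.1 federation via the SERVICE keyword.
# The local endpoint URL is an assumption (a stock Blazegraph instance);
# the remote endpoint is the public Wikidata Query Service.
local = SPARQLWrapper("http://localhost:9999/bigdata/namespace/wdq/sparql")
local.setQuery("""
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?s ?p ?o WHERE {
  ?s ?p ?o .                                      # local triples
  SERVICE <https://query.wikidata.org/sparql> {   # evaluated remotely, results joined here
    ?s wdt:P31 wd:Q5 .
  }
}
LIMIT 10
""")
local.setReturnFormat(JSON)
for row in local.query().convert()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```

The join across the SERVICE boundary is where the performance cost tends to show up, which is the concern raised next.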
[15:58:08] But I suspect that the performance penalty still isn't transparent
[16:03:07] IIUC the actual graph data (stored in Stardog I presume) is still on a single machine (976GiB RAM)
[16:25:42] my reading was the data stays at rest in the external cloud systems, it seemed like Stardog was only acting as a query planner and executor
[16:26:03] randomly guessing, they need a ton of memory because it has to bring lots of data from those engines into itself? dunno
[16:27:36] early weekend for me
[16:27:38] enjoy!
[16:59:44] week-end time
[17:30:27] (if you haven't started your weekend yet) if I want to reload WCQS at some point in the future, is it adequate to run the wcqs update script, or do I have to purge the old data from Blazegraph first?
[20:31:19] hare: for the wcqs-data-reload.sh script, one of its problems (i don't fully understand) is it can't really delete the data. Instead we have to take the service down and delete the data files once in a while, so
[20:31:43] it was acceptable for beta, but prod won't have that problem with streaming-updater
[20:31:51] for you, hmm
[21:07:01] If I've reached a dead end that's fine; I was mostly seeing how far I could go with this, and you all have been more than helpful!
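Re the 08:20 exchange about sources other than Kafka: a rough sketch of what reading the public EventStream API (server-sent events over HTTP) looks like. The endpoint below is the public recentchange stream; the commonswiki filter is purely illustrative, and this is not how the Streaming Updater actually ingests changes (it reads from Kafka).

```python
import json
import requests

# Public Wikimedia EventStreams endpoint (SSE). Illustration only: an
# EventStreams-based source would have to wrap something like this.
URL = "https://stream.wikimedia.org/v2/stream/recentchange"

def events(url):
    """Yield parsed JSON events from a server-sent-events stream."""
    with requests.get(url, stream=True, headers={"Accept": "text/event-stream"}) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # SSE payload lines start with "data: "; event ids, comments
            # and keep-alive blanks are skipped.
            if line and line.startswith("data: "):
                yield json.loads(line[len("data: "):])

if __name__ == "__main__":
    for ev in events(URL):
        if ev.get("wiki") == "commonswiki":  # stand-in for the SDC use case
            print(ev.get("title"), ev.get("revision", {}).get("new"))
```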