[00:46:22] ^ Total size `27200978944` (human readable: `26G`)
[07:39:18] inflatador: the fix you mentioned at https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/1070942 should be deployed with https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/1075019, but my understanding is that these fixes are not strictly required (they are only if you call these scripts directly)
[07:39:43] reloadCategories.sh (from puppet) should initialize these env vars via cronUtils.sh
[07:40:27] this set of bash scripts is just a pure mess...
[07:45:33] * hashar hints at python
[07:45:39] :P
[07:46:53] CI has a shellcheck job iirc
[07:47:24] then I imagine there is a maven plugin or similar to run shellcheck on a repo
[07:48:37] could help a bit, but the real annoyance here is the cross dependency between puppet, the rdf codebase and scap
[07:50:34] not sure I understand why, but we use the templating feature of scap to rewrite a yaml file managed by puppet into a shell script that initializes some env vars... why not just have puppet write this env file directly?
[07:51:11] * dcausse should stop ranting
[09:27:33] errand+lunch
[09:52:46] honestly, we should completely get rid of the run* scripts. It's about 100 lines of shell code to generate the arguments passed to start a program. Not counting the surrounding code to generate the arguments passed to that shell script
[12:28:38] btw, I had a previous attempt at that simplification, which did not get merged and is by now probably outdated enough that it should be dropped and restarted if needed: https://gerrit.wikimedia.org/r/c/operations/puppet/+/956432
[12:28:52] And related phab task: T342361
[12:28:52] T342361: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361
[12:46:08] +1 to rewriting most of this, it's the kind of task that's hard to start because it's all over the place, but in reality I wonder if the complexity is that high if you take a simpler approach
[12:48:05] hashar: https://github.com/giraciopide/shellcheck-maven-plugin
[13:20:22] re: categories, based on our tests yesterday, a newly-created categories journal file is 26 GB. And I haven't looked closely at the scripts... that is the GENERATED size, what is the size of the files we pull across the network to create the journal?
[13:24:50] o/
[13:27:28] inflatador: seems to be 1.5G weekly + ~10M for dailies (https://dumps.wikimedia.org/other/categoriesrdf)
[13:30:40] dcausse ACK, thanks... that's much better than ~26 G ;P
[13:32:27] inflatador: indeed, a bit surprised that it inflates that much when importing into blazegraph
[13:42:23] for wikidata_main it's ~3x: 200G of n3 files -> 666G journal
[13:57:26] and it took ~90m to generate the journal file. If it's all single-threaded, maybe we get similar generation speed on ganeti and/or wikikube. But it seems unlikely
[13:59:11] what's blazegraph's serving model? does it preemptively read all of this data into RAM, or does it access it from disk in the user query path?
[14:01:03] cdanis: it may hit the disk when answering queries; it mostly relies on memory-mapped files & the OS file cache iirc
[14:04:00] ack
[14:04:43] something like Ceph might not give you the performance you expect, then
[14:05:24] cdanis: ok, I was worried about that too, but thanks for confirming
[14:06:02] my guess is, probably possible but it will require nontrivial tuning effort
[14:07:03] I have a different perspective from years of supporting cinder (block storage as a service) in a large public cloud, working with SAN etc. My opinion is that ceph definitely **can** handle categories, but to cdanis' point that will take some engineering
[14:10:00] sure, I absolutely agree that there exist Ceph installations (or installations of other distributed FS/block store systems) that can handle this workload :)
[14:14:23] These are problems we'll have to solve if we want to use PVCs at scale. T330153 is a step along the right path
[14:14:24] T330153: Configure Anycast load-balancing ceph radosgw services on the data-engineering cluster - https://phabricator.wikimedia.org/T330153
[14:14:39] But I dunno if/when the time is right for all that work
[14:22:30] yeah, at the very least we should be finding many excuses to do that kind of scaling work
[14:58:24] but for now, yeah... we can table the PVC stuff as that will be a much larger discussion between your team, ServiceOps, etc.
[15:04:16] please do keep gathering use cases and interest, I don't mean to dissuade you
[17:18:28] far fewer people at opensearch con than i had expected, i suppose today is a different era but there were probably 5-10x more people last time i went to elasticon
[17:19:16] :/
[17:21:45] also mildly annoyed at their schedule, i haven't found anywhere that posts the schedule in time order so it's a bit of a guessing game :)
[17:22:46] schedules without times... that does sound "fun" ;P
[17:23:18] they have times, it's just in an order designed to make you interested in attending, most interesting to least, instead of in time order. Currently transcribing into time order while sitting in a dashboarding talk
[17:26:22] :)
[17:43:34] dcausse I folded your k8s process comments into the main body of the article, just don't want you to think I deleted them ;)
[17:43:44] re: categories, that is
[17:44:41] inflatador: sure, no worries! :)
[17:44:51] lunch
[17:48:47] dinner
[18:19:28] back
[18:20:17] About to drive back from outing with the dogs, 10-15 mins late to pairing
[18:29:14] 🦮
[18:30:24] ryankemper ebernhardson is at opensearch con... what are your thoughts about moving pairing back to 2 PM PDT?
[18:30:58] inflatador: SGTM!
[18:32:59] cool, done. I should probably delete/recreate that event so it's not in g-ehel timezone anymore ;P
[19:28:44] inflatador: I changed the ownership of that event. It is now all yours!
[19:30:11] But I'm not sure if I changed ownership for the recurring event, or for just today's event. You could also create a new one and I can drop the current one
[19:34:14] gehel thanks, I think I cleaned it up
[19:34:23] * inflatador never can tell when it comes to me and scheduling
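Following up on the idea above of dropping the run* wrapper scripts (and hashar's hint at python): below is a minimal sketch of what such a launcher could look like, assuming puppet writes a single YAML config file directly instead of going through the scap templating step. Every path, config key, and JVM flag here is an illustrative assumption rather than the real WDQS/puppet setup.

```python
#!/usr/bin/env python3
"""Minimal sketch of a Python launcher replacing the run*.sh wrappers:
read one puppet-managed YAML file and exec the JVM with the derived
arguments. All names below are hypothetical, not the actual WDQS ones."""

import os
import sys

import yaml  # PyYAML; assumed to be available on the host


def build_command(cfg: dict) -> list[str]:
    """Turn the config mapping into a java argument vector."""
    jvm_args = [
        f"-Xmx{cfg.get('heap_size', '16g')}",
        f"-Dlogback.configurationFile={cfg['logback_config']}",
        # Journal location; the exact Blazegraph property name used in
        # production is an assumption here, shown only for illustration.
        f"-Dcom.bigdata.journal.AbstractJournal.file={cfg['journal_file']}",
    ]
    jvm_args.extend(cfg.get("extra_jvm_args", []))
    return [
        "java", *jvm_args,
        "-jar", cfg["service_jar"],
        "--port", str(cfg.get("port", 9999)),
    ]


def main() -> None:
    # Hypothetical config path; in practice puppet would write this file.
    config_path = sys.argv[1] if len(sys.argv) > 1 else "/etc/wdqs/wdqs.yaml"
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    cmd = build_command(cfg)
    # Replace the current process, as the shell wrappers do with `exec`.
    os.execvp(cmd[0], cmd)


if __name__ == "__main__":
    main()
```

The point of the sketch is only that reading one puppet-managed config and exec'ing the JVM would remove the scap-templated env-var indirection the thread complains about; the real argument list would need to come from the existing scripts.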