[00:05:26] ebernhardson: what's the context around these wikibase indices?
[07:37:34] Latest update on the imports by RWTH Aachen i5. There are now 5 imports running in parallel on 3 different machines. One of the imports is doomed to be slow since it runs on a rotating disk. Another import had to be moved to another machine since it ran in a VMware environment and the SSD access was too slow. 3 of the potentially successful imports
[07:37:35] are for Blazegraph, one is for QLever. Only the QLever import might be finished in time for the Hackathon later this week. To document the imports the triple count needs to be assessed - see https://github.com/ad-freiburg/qlever/issues/982 - would you please comment on that issue ...
[09:08:29] I've added a comment on that issue. Maybe dcausse has additional context
[09:19:40] yes, the difference between the QLever import and the number of triples served by WDQS is explained by the munger, which strips some redundant data exported in the Wikibase RDF dumps
[09:20:17] ideally the munger should be called when importing to QLever too
[09:21:34] e.g. labels are exported 3 times, with rdfs:label, schema:name and skos:prefLabel, which is far from ideal
[10:00:47] lunch
[10:34:54] Do we have a (read-only) replica of the enwiki MySQL DB? I'd like to run some queries to find out how many multi-hop redirects we have at the moment.
[12:30:43] pfischer: you could use Quarry https://quarry.wmcloud.org/
[12:53:00] or Spark with Hive if the table you need is there, e.g. the page table is at wmf_raw.mediawiki_page (cf. https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,wmf_raw.mediawiki_page,PROD)/Schema?is_lineage_mode=false&schemaFilter=)
[13:03:06] o/
[13:18:34] getting a few alerts from the CODFW streaming updater... pretty sure they are related to the switch maintenance though
[13:22:08] yes, seems like it, alerts are resolving
[14:38:40] errand
[14:46:00] \o
[14:47:23] ryankemper: it's regarding how much Wikibase indexes slow down an Elastic cluster: first measuring our existing enwiki and wikidata index mappings/settings, then re-measuring with pruned-down versions to see what's worth implementing
[14:51:00] dcausse: I have a meeting in 10, but FYI the deployment failed because it couldn't pull the image
[15:09:22] ouch, so with 100 enwiki_content-style indices, a restart that picks them all up takes ~30s. With 100 wikidatawiki_content-style indices it takes 30 minutes
[15:12:01] dcausse: thanks!
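(Editor's note on the redundant labels mentioned at [09:21:34]: a minimal sketch, assuming a gzipped N-Triples dump, of how one could estimate how many label triples each predicate contributes before deciding whether the munger should also run for the QLever import. The dump file name is a placeholder, not taken from the discussion.)

```python
# Count label triples per predicate in a (gzipped) N-Triples Wikibase dump.
# The dump path is a placeholder; adjust to the actual dump being imported.
import gzip
from collections import Counter

LABEL_PREDICATES = {
    "<http://www.w3.org/2000/01/rdf-schema#label>",       # rdfs:label
    "<http://schema.org/name>",                            # schema:name
    "<http://www.w3.org/2004/02/skos/core#prefLabel>",     # skos:prefLabel
}

def count_label_triples(dump_path: str) -> Counter:
    counts = Counter()
    with gzip.open(dump_path, "rt", encoding="utf-8") as dump:
        for line in dump:
            parts = line.split(" ", 2)  # subject, predicate, rest of the triple
            if len(parts) == 3 and parts[1] in LABEL_PREDICATES:
                counts[parts[1]] += 1
    return counts

if __name__ == "__main__":
    for predicate, n in count_label_triples("wikidata-dump.nt.gz").items():
        print(f"{predicate}\t{n}")
```

(Editor's note on the multi-hop redirect question at [10:34:54]: a sketch of the Spark-with-Hive route suggested at [12:53:00], counting redirects whose target page is itself a redirect. Only wmf_raw.mediawiki_page is named in the log; the wmf_raw.mediawiki_redirect table name and the snapshot/wiki_db partition values are assumptions.)

```python
# Sketch: count multi-hop redirects (redirects pointing at pages that are
# themselves redirects) via Spark SQL over the sqooped MediaWiki tables in Hive.
# Assumption: wmf_raw.mediawiki_redirect exists alongside wmf_raw.mediawiki_page
# and both are partitioned by snapshot/wiki_db.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("multi-hop-redirects")
         .enableHiveSupport()
         .getOrCreate())

SNAPSHOT = "2022-04"  # placeholder partition
WIKI_DB = "enwiki"

multi_hop = spark.sql(f"""
    SELECT COUNT(*) AS multi_hop_redirects
    FROM wmf_raw.mediawiki_redirect r
    JOIN wmf_raw.mediawiki_page target
      ON target.page_namespace = r.rd_namespace
     AND target.page_title = r.rd_title
     AND target.snapshot = r.snapshot
     AND target.wiki_db = r.wiki_db
    WHERE r.snapshot = '{SNAPSHOT}'
      AND r.wiki_db = '{WIKI_DB}'
      AND target.page_is_redirect = 1
""")
multi_hop.show()
```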
[15:51:28] I wonder how many fields they have on the logstash cluster and whether they suffer such slowdowns as well; if not, that might indicate the analysis config is to blame
[15:54:55] interestingly, field count looks unrelated
[15:55:09] running the deduplication over it, but keeping the full mapping, the final restart with 100 indices was 67s
[15:55:45] makes me almost want to run a binary search through creating indices with subsets of the mapping to see if there are specific culprits
[15:56:53] it brings the .index.settings json down from 190kb to 30kb, so quite a bit gets dedup'd
[16:00:20] or maybe there is a good way to attach a profiler and record something when creating an index to get an idea
[16:05:30] workout, back in ~40
[16:09:19] index.settings might get special treatment regarding deprecated settings, but I doubt that explains everything
[16:10:00] trying to get YourKit attached to Elastic in a Docker container, will see if anything obvious comes up
[16:46:32] back
[17:07:19] with GitLab, when you force push after a rebase you get the confusing "18bbaaa7...7642b622 - 22 commits from branch main"
[17:09:27] well... perhaps it's not confusing, just me not getting the point of this message
[17:17:43] sigh, I think I messed up the airflow-dags build
[17:18:38] ah, the conda tgz file has the version in its name and might require a rebuild of all the fixtures
[17:18:46] lunch, back in ~1h
[17:26:28] profiling is... curious. The master thread spent 55 out of 90s in TreeMap.putAll
[17:42:31] I dunno, probably not worth investigating. I was initially suspicious that some sort of analysis component had high startup costs or some such, but brief profiling doesn't turn up anything to confirm that
[17:48:08] I too was expecting some non-negligible time to be spent creating these components; there are a few that have a dictionary to load, but they're probably already packaged in a way that makes them easy to load into memory...
[17:48:11] dinner
[18:29:27] couple mins late for pairing
[18:32:37] back
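(Editor's note on the bisection idea at [15:55:45]: a rough sketch of halving the analysis config of a reference index and timing throwaway index creation for each half, to narrow down whether specific analyzers are expensive. It bisects only the analyzer section so that filter/tokenizer references stay valid, talks to Elasticsearch over the plain REST API, and times index creation rather than the full-restart scenario measured above; the localhost URL, input file name and test index names are all placeholders.)

```python
# Sketch: bisect the "analyzer" section of a reference index's analysis config
# and time index creation for each half, to look for expensive components.
# Note this measures index creation, not the cluster-restart recovery timed above.
import json
import time
import requests

ES = "http://localhost:9200"  # placeholder cluster URL

def create_time(name: str, analysis_subset: dict) -> float:
    """Create a throwaway index with the given analysis config and time it."""
    body = {"settings": {"index": {"analysis": analysis_subset}}}
    start = time.monotonic()
    requests.put(f"{ES}/{name}", json=body).raise_for_status()
    elapsed = time.monotonic() - start
    requests.delete(f"{ES}/{name}")  # clean up the test index
    return elapsed

def bisect_analyzers(analysis: dict) -> None:
    """Split the analyzer section in half, keeping all filters/tokenizers intact."""
    names = sorted(analysis.get("analyzer", {}))
    half = len(names) // 2
    for label, subset_names in (("first", names[:half]), ("second", names[half:])):
        subset = dict(analysis)
        subset["analyzer"] = {n: analysis["analyzer"][n] for n in subset_names}
        print(label, "half:", round(create_time(f"bisect_test_{label}", subset), 3), "s")

if __name__ == "__main__":
    # analysis config dumped from the reference index, e.g. the "analysis"
    # object from GET /wikidatawiki_content/_settings (file name is a placeholder)
    with open("wikidatawiki_content_analysis.json") as f:
        bisect_analyzers(json.load(f))
```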