[08:09:11] Good morning everyone
[08:09:18] o/
[08:31:00] o/
[08:37:28] ebernhardson: indeed the revision map will be required, hopefully it just works but we'll have to double check
[09:42:24] dcausse: I'm reviewing https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/736541 and am a bit confused by "inline-able" by blazegraph - what does it mean?
[09:43:07] zpapierski: this is https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/736541/2/blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/WikibaseInlineUriFactory.java
[09:43:21] it's a way to optimize the storage space
[09:43:38] it's 100% blazegraph specific
[09:44:06] so that's why I renamed the function, since it wasn't consistent and is blazegraph specific
[09:44:42] ah, ok, I get it
[09:44:43] thx
[09:52:36] meal break
[10:53:31] dcausse: I wonder about the M entity prefix - it isn't included in the initials in UrisConstants and I'm not sure I understand the consequences of that, can you shed some light?
[10:55:08] zpapierski: this is a bit of a mess and it's what I attempted to improve with the patch
[10:55:31] there are several things that rely on the entity initials
[10:55:57] 1/ blazegraph inlining: so that we can optimize the storage
[10:57:05] 2/ to do federation by answering the question "which UrisScheme should I use for an entity starting with 'M'?"
[10:58:24] 1 relies on the renamed UrisScheme.inlinableEntityInitials()
[10:59:11] I get that, but I'm wondering why M isn't inlined
[11:00:14] it should be in a separate UrisScheme (looking)
[11:00:43] zpapierski: https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/736541/2/common/src/test/java/org/wikidata/query/rdf/common/uri/UrisSchemeFactoryUnitTest.java#52
[11:01:09] ah, I got confused a bit by the naming
[11:01:35] but the naming's ok, the confusion is mine, thanks
[11:02:04] I think the mistake was to attempt to generalize too many concepts behind the initials
[11:02:56] in my case I got confused because this
[11:02:59] public static final List<String> WIKIBASE_INITIALS = ImmutableList.of("P", "Q");
[11:03:20] is part of UriConstants, so I assumed it's the same place that M should be in
[11:04:01] yes... this codebase is a mess for this reason... wikidata & commons are hard-coded but we still try to have generic concepts
[11:04:31] getting federation fully "generalized" is not going to be that simple
[11:04:48] and we might actually need it...
[11:08:22] there are api consumers interested in improving this as well: https://phabricator.wikimedia.org/T261042 / https://hackmd.io/ZYWPoLrZSUSE9paRnXe7hg?view
[11:09:26] nice, that would be a cleaner solution by far
[11:10:37] definitely, but not trivial, as I'm sure many components have made the same assumptions we did by hard-coding all that in the codebase
[11:10:48] oh yeah, definitely
[11:11:01] but if we introduce new federations, they won't have them hardcoded yet
[11:11:18] yes
[11:12:35] lunch+errand
[11:13:27] gehel: Nov 11 might not be the best day for a recruitment meeting - Dacid and I are out, and the US has Veterans Day, I think?
[11:13:36] s/Dacid/David
[11:14:39] (which is written Veterans Day, not Veterans' Day as I'd long assumed)
[11:46:44] I'm going to try that commons bootstrapping
[12:09:38] lunch break
[14:05:29] zpapierski: trying to reschedule to Wednesday
[14:31:08] dcausse: I can't find a doc on running the streaming updater rev map - do you have a command somewhere?
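(An aside on the initials discussion above: a minimal, hypothetical sketch of picking a scheme for an entity ID based on its initial, assuming a simple initials-per-scheme map. It is not the real UrisScheme/UrisSchemeFactory API from wikidata/query/rdf; the class, method, and scheme names below are made up for illustration.)

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: pick a "scheme" (here just a label) based on the first
// letter of an entity id, the way the chat describes federation relying on
// entity initials. Not the real UrisScheme/UrisSchemeFactory API.
public final class SchemeRouter {
    // Wikidata items/properties use P and Q; Commons mediainfo entities use M.
    private static final Map<String, List<String>> INITIALS_BY_SCHEME = Map.of(
            "wikidata", List.of("P", "Q"),
            "commons", List.of("M"));

    public static Optional<String> schemeFor(String entityId) {
        if (entityId == null || entityId.isEmpty()) {
            return Optional.empty();
        }
        String initial = entityId.substring(0, 1);
        return INITIALS_BY_SCHEME.entrySet().stream()
                .filter(e -> e.getValue().contains(initial))
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        System.out.println(schemeFor("Q42"));     // Optional[wikidata]
        System.out.println(schemeFor("M12345"));  // Optional[commons]
        System.out.println(schemeFor("X1"));      // Optional.empty
    }
}
```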
[14:49:02] zpapierski: not really a doc but an example in airflow: https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/discovery/analytics/+/refs/heads/master/airflow/tests/fixtures/spark_submit_operator/import_wikidata_ttl_gen_rev_map.expected
[14:49:19] this is for wikidata but the same data is available for commons too
[14:49:26] good enough, thx
[14:51:47] huh, I thought this one was made from a dump
[14:52:54] dumps are imported into hive, then from that table we build the revision map
[14:53:12] I see, I vaguely remember something like that happening
[14:53:28] hmm, but is that for a particular dump? or is commons already there?
[14:53:53] commons is already imported weekly into the partition wiki=commons, I suppose
[14:54:00] awesome, thx
[14:54:27] the date I don't know, might have to list the partitions or check the date of the dump file on dumps.wikimedia.org
[14:54:34] will do
[14:54:47] btw - will the job create subdirs for the output path?
[14:55:10] just to confirm: can T274982 be closed now that we have the streaming updater in production?
[14:55:11] T274982: Disable fetching constraints from the wdqs updater - https://phabricator.wikimedia.org/T274982
[14:55:43] yes, the folders I created
[14:56:08] gehel: true, constraints are no longer fetched, except for wdqs1010 which still runs the old updater
[14:56:58] I'm closing it.
[14:57:00] thanks!
[14:58:45] same question on T261119? Since this is live, it seems unlikely that we'll do another round of architecture review...
[14:58:46] T261119: Architecture review of Flink based WDQS Streaming Updater - https://phabricator.wikimedia.org/T261119
[15:02:16] I guess we can close it, IIRC it remained open because we still have 1h30 of consultancy time available, but not sure what we'll use it for
[15:05:52] I've got an instance of a "job failed successfully" :)
[15:06:06] it finished with SUCCESS, but left behind an empty file :D
[15:06:24] maybe I got the date wrongs
[15:06:32] s/wrongs/wrong
[15:07:17] hm.. maybe if the data is not found it simply leaves an empty file
[15:08:08] you can run a quick select * from table_name where wiki='commons' AND date='date' LIMIT 1 with hive/beeline
[15:09:20] I don't now, but the date was wrong
[15:09:25] s/now/know
[15:09:53] why do I have such an issue with heterographs in written English...
[15:10:26] it's doing way more now, so that was probably it
[15:16:04] English is flawed, that's why... (https://en.wikipedia.org/wiki/Ghoti)
[15:22:18] that's absolutely true
[15:22:52] but still, I'd expect that to happen the other way around, yet my brain keeps trying to make me type different words than the ones I need
[15:23:27] like it's a struggle to type "once" instead of "ones"
[15:25:07] https://www.irccloud.com/pastebin/DvKj01Vd/
[15:25:12] looks perfectly fine
[15:25:39] yes, sounds good to me
[15:26:59] random checks show correct revisions as well
[15:27:35] I suppose the streaming updater bootstrap should work too, but best to test that as well
[15:28:33] yes, I don't remember anything specific to wikidata, but better to double check indeed
[15:54:35] ebernhardson: how is the wcqs data reload going? still issues with the instances?
[16:01:55] dcausse: triage is starting: https://meet.google.com/qho-jyqp-qos
[16:02:00] oops
[16:45:38] zpapierski: reload finished, 4 of 6 look happy, the other 2 claim the namespace they tried to promote doesn't exist
[16:45:49] that's weird
[16:46:28] are you sure they aren't complaining that they couldn't delete the old one?
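(A minimal sketch of the quick partition check mentioned above, run through the Spark Java API instead of hive/beeline; the database/table name and the dump date are placeholders, assuming a Hive table partitioned by wiki and date as described in the chat.)

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical sketch of the "quick select ... LIMIT 1" check from the chat.
// The table name and the date value are placeholders, not the real ones.
public final class CheckDumpPartition {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("check-dump-partition")
                .enableHiveSupport()
                .getOrCreate();

        // If this returns no rows, the rev-map job has nothing to read and may
        // exit "successfully" while leaving an empty output file behind.
        Dataset<Row> sample = spark.sql(
                "SELECT * FROM some_db.some_dump_table "
                + "WHERE wiki = 'commons' AND `date` = '<dump-date>' LIMIT 1");
        System.out.println("rows found: " + sample.count());

        spark.stop();
    }
}
```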
[16:53:33] mpham: I added wcqs to the SLO dashboard, don't be worried about it showing no data - we still need to work on monitoring for wcqs. All you need to do is select "wcqs" for Cluster name
[17:01:36] zpapierski: it finishes with: 2021-11-06T09:24:55+00:00 Data reload complete, switching active namespace
[17:01:39] Namespace wcqs20211101 is in the alias file, but does not appear to be present in Blazegraph.
[17:01:41] 2021-11-06T09:24:55+00:00 All done
[17:01:43] @s
[17:01:51] huh, weird
[17:02:03] could be from a previous attempt, not sure
[17:02:13] blazegraph doesn't have simple /_cat APIs like elastic :P
[17:25:40] it has some frontend that displays namespaces
[17:26:43] (and probably that takes the data from some API)
[17:29:16] ryankemper: I might be late for our meeting. I'll ping you when I know.
[17:29:30] gehel: understood
[17:29:31] * gehel has too many meetings or too many kids for this evening
[17:29:43] * gehel is not thinking of reducing the number of kids
[17:29:54] dinner
[17:31:44] ebernhardson: weird, but we really don't care all that much, it's not the process we'll follow
[17:32:00] unless it's a problem with blazegraph or the instance itself
[17:32:32] zpapierski: ok, should we start prepping to sync the streaming updater with a new data load then, or what's next?
[17:32:32] in any case - I tested that revision map creation for commons, works as expected
[17:33:00] we still need improvements on the streaming updater side to accommodate the MCR data
[17:33:20] https://phabricator.wikimedia.org/T293195
[17:34:21] zpapierski: so I guess I don't understand what any of that does or what to do next for it. My only observability is that the yarn-deployed updater generates updates. What changes?
[17:35:04] to put it simply - each revision for wikidata means a change to its triples
[17:35:22] right, but currently it generates triples. How would I look at the current stream and know it's wrong?
[17:35:24] but the same isn't always true for media items
[17:35:58] so, some sort of propagation?
[17:37:10] wdym?
[17:37:42] I meant that, for media items, we only care about mediainfo slots
[17:37:57] zpapierski: I mean, it currently generates updates in the output topic. But since the updater still needs work, those aren't the right updates
[17:38:01] how would I know that the updates are wrong?
[17:38:06] ah
[17:38:11] no, they are correct
[17:38:22] they just might not be enough
[17:38:27] ahh, ok
[17:39:13] there are probably a lot more failures (404s) than there should be, captured in the fetch-failure topic
[17:39:34] we need to retry on a 404 if we get it for a new revision, but we don't know rn if we get a 404 because the mediainfo revision isn't there yet (eventual consistency) or because the mediainfo slot simply doesn't exist (not every media item has it)
[17:40:34] Should we be importing slave lag metrics and figuring out timing from there?
[17:40:44] or just pretend 5 minutes is "good enough"?
[17:40:53] (maybe that leads to way too many in-flight operations in parallel)
[17:41:32] in T279698 I found that you get almost all of them if you retry those that are < 10s
[17:41:33] T279698: WDQS should retry when getting 404s - https://phabricator.wikimedia.org/T279698
[17:42:21] but this becomes untrue if we get a 404 because of missing mediainfo data (which is why the MCR data is important)
[17:42:52] hmm, I suppose it's impossible here, but I'd guess a better response would be a revision id with the 404
[17:43:56] We asked to add a bit more of a hint to the 404 response, to better understand the cause of the 404
[17:44:45] as annoying as the traditional query api would be, if this were a page property in the query api this would be trivial :P
[17:45:06] actually, this might make it even simpler
[17:45:07] :)
[17:45:22] because you can filter out irrelevant events at the start of the pipeline
[17:46:20] zpapierski: I don't follow
[17:47:04] we can filter on the slot name in the source, so only events that are about mediainfo will go ahead
[17:47:31] or did I misunderstand?
[17:47:43] MCR allows that, yes
[17:47:57] it prevents "false"/irrelevant 404s
[17:48:08] ok
[17:49:23] after that you can do a retry for any you'll get (not sure how revision suppression shows up, but that's rare)
[17:49:53] in any case, adapting the streaming updater for the MCR info in events is the first step
[17:50:13] (and those 404 retries as well)
[17:50:41] then we can do the K8S deployment - the fun thing is we actually have a staging k8s env
[17:51:54] this - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - explains it all
[17:51:58] (I hope)
[17:52:50] just remember to have a different job name, as mentioned there
[17:52:56] ok
[17:55:19] apart from the updater stream, we still need to verify the monitoring
[17:55:53] probably simple (I think some were dependent on the updater), but not all are showing up yet - https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wcqs
[17:56:33] mostly means reviewing the generated prometheus config and figuring out why it's not polling. Which prom reporter do these missing metrics come from?
[17:57:38] looks like we have prometheus-blazegraph-exporter, and some sort of jmx magic (does the blazegraph-exporter talk jmx?)
[17:57:44] I wish I knew :), dcausse?
[17:58:05] can't help you here, I never touched them :(
[17:58:31] I can find out, just have to align all the names with what the things on the host report
[17:59:51] I'll take care of that bootstrap test tomorrow
[17:59:55] kk
[18:00:15] for now, I'm leaving the scene, have fun
[18:09:41] I think the metrics that do not work are the ones that query blazegraph (prometheus-blazegraph-exporter.py I think)
[18:10:05] the metrics exported from the jvm itself appear to be working
[18:39:01] Here's something I am wondering about. At one point I read that WDQS ran on servers provisioned with 128 GB of memory. Does the query service run on bare metal, and utilize all 128 GB? Or are there VMs on the machine? Based on my own local copy of the WDQS I could get away with using a VM with 64 GB of RAM. Does this track with your experience?
[18:40:38] btop tells me that the process running the query service is currently using 31 GB of RAM, and the updater is using 5 GB (I've seen it spike up to 9 GB).
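(Going back to the slot-filtering idea discussed above: a minimal sketch of dropping events whose changed slot is not mediainfo at the very start of a Flink pipeline, so they never reach the fetch step and cannot produce "false" 404s. The event class and its field names are hypothetical; the real streaming updater's event model and job wiring differ.)

```java
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical sketch: keep only revision events that touch the mediainfo slot.
public final class MediainfoSlotFilterJob {

    /** Minimal stand-in for a revision-create event carrying MCR slot info. */
    public static class RevisionEvent {
        public String entityId;
        public long revision;
        public String changedSlot;

        public RevisionEvent() { }

        public RevisionEvent(String entityId, long revision, String changedSlot) {
            this.entityId = entityId;
            this.revision = revision;
            this.changedSlot = changedSlot;
        }

        @Override
        public String toString() {
            return entityId + "@" + revision + " (" + changedSlot + ")";
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<RevisionEvent> events = env.fromElements(
                new RevisionEvent("M111", 10, "mediainfo"),
                new RevisionEvent("M222", 11, "main"),       // no mediainfo change: dropped
                new RevisionEvent("M333", 12, "mediainfo"));

        // Filter on the slot name at the source, before any fetch is attempted.
        DataStream<RevisionEvent> mediainfoOnly =
                events.filter((FilterFunction<RevisionEvent>) e -> "mediainfo".equals(e.changedSlot));

        mediainfoOnly.print();
        env.execute("mediainfo-slot-filter-sketch");
    }
}
```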
[18:41:22] I am testing all this on a machine with 256 GB of RAM, but I am not inclined to get servers with that much RAM if I don't have to
[18:42:47] hare: the query service is provisioned on bare metal; the memory outside the application but available to the OS is *critical* for proper operation
[18:43:12] Because of the caching I assume?
[18:43:12] hare: like most databases, blazegraph doesn't keep all the data it knows about in memory; instead it depends on the kernel disk cache to keep the hot data available in memory at any given moment
[18:43:38] Interesting, so it's kind of like all this work happens outside of the process
[18:44:09] hare: we don't have great metrics for this, but a high-level view of how well it's working is to look at the typical disk read rate. If it's in the low 10s of MB it's probably fine, but when that grows to 50 or 100 it will quickly grow beyond what the disks can provide (and you need more memory)
[18:45:23] I suppose that's also assuming you are running a user-visible realtime service; constraints can be different depending on the use case
[18:46:18] hare: I suppose an interesting aside: the reason varnish (which we use for front-end servers) became so popular over the competition of the time is that it was the first web caching software built with the same principle of letting the kernel disk cache figure out what data is hot and manage memory
[19:07:12] This is good insight, thank you
[19:08:13] And, I am trying to build for different access patterns. The current one I am testing for is "a few clients making heavy demands," so bots and research projects, as opposed to WDQS which is a bunch of queries from a bunch of places. WDQS manages that use case so I am not doing it.
[19:21:58] pondering how that might affect things, not really sure :) The size of the hot dataset in that case is going to be very query dependent; heavy reads might be fully expected depending on the queries and what they are trying to access, which might just make it a weaker predictor
[19:25:09] ryankemper: I should make it on time!
[19:28:30] Actually not, Oscar wants me for the injection, I'll be 4 minutes late
[19:31:18] gehel: no worries, I'm running ~4 mins late anyway
[19:34:23] and I'm there! That 4-minute estimate was really good!
[19:54:55] zpapierski: just saw your comment about adding WCQS to the SLO dashboard. thanks. I can move T293027 to Needs Reporting
[19:54:56] T293027: Create metrics for measuring WDQS/WCQS update lag - https://phabricator.wikimedia.org/T293027
[22:40:04] How much work would it be for us to load refreshed Image Suggestion data into the search index? (My gut tells me this is a small task, but I could be wrong) T295316
[22:40:05] T295316: Add an image: pre-deployment model refresh - https://phabricator.wikimedia.org/T295316
[23:30:41] mpham: shouldn't be too bad, the open question is around what exactly gets updated. When updating search, it only updates the referenced pages; unreferenced pages retain their old value. I suspect we have to bridge that gap, perhaps using a new name for the import and clearing the old values or some such
[23:37:19] I'm not sure I understand what un/referenced pages and values are in this context (this project predates me a little, so I'm not entirely up to speed on its architecture). Do you mean that we need to decide whether old image suggestions are cleared or retained when we import the refreshed suggestions?
[23:39:44] If Wikidata's namespace is wdq, what do I want for the Commons dataset?
[23:50:53] hare: should be wcq
[23:51:11] mpham: I mean that by default the old suggestions will only be cleared on pages referenced in the new data load
[23:51:21] and I'm assuming the runUpdate command is slightly different
[23:53:00] hare: hmm, that one might be tricky. We aren't intending to use that updater for production anymore, but I don't know what the state will be in the repo
[23:53:37] there is some plan to expose change streams from the new updater publicly, but not sure how far along that is. Best to talk to dcausse or zpapierski
[23:53:56] (they would also know if the old runUpdate is usable)
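(Following up on the earlier "Namespace wcqs20211101 is in the alias file, but does not appear to be present in Blazegraph" message: a minimal sketch for cross-checking the alias file against the namespaces Blazegraph actually has loaded, assuming the NanoSparqlServer namespace listing endpoint; the host, port, and path prefix are assumptions and need adjusting per deployment.)

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch: list the namespaces Blazegraph knows about, so the
// result can be compared with the namespace named in the alias file.
// Host, port and path prefix below are assumptions.
public final class ListBlazegraphNamespaces {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://localhost:9999/bigdata/namespace"))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // The response describes each namespace; search it for the namespace
        // you expect (e.g. wcqs20211101 or wcq).
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```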