[08:25:47] Awesome, thank you ebernhardson: I’ll have a look. [09:27:02] hi folks, I just opened https://phabricator.wikimedia.org/T348222 though I'm not sure about the tags, please let me know how I can help [09:37:39] godog: thanks for the heads up, should be related to T346039 [09:37:39] T346039: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 [09:37:50] inflatador: ^ [09:38:12] errand+lunch [09:39:29] dcausse: thank you! agreed that indeed looks related [10:15:48] dcausse: what was the keyboard you talked about during the unmeeting? [10:19:46] lunch [12:12:28] gehel: https://www.zsa.io/voyager [12:25:40] Interesting! And it is using QMK! [12:52:25] changing the release name from wdqs to wikidata makes the rdf-streaming-updater@dse-k8s a completely new resource [13:02:16] we might have to delete it manually [13:16:56] tried kubectl patch flinkdeployment flink-app-wdqs --patch 'spec:\n job:\n state: suspend' [13:18:22] but don't have the perms [13:18:23] o/ [13:21:11] dcausse can help w/the flink deployment...I'm up at https://meet.google.com/xfs-pkcw-mbu if you want to pair [13:21:22] sure [13:43:11] I've started a draft of a communication around our WDQS work: https://docs.google.com/document/d/1m3n88FFzzHh9sDWDtXr5_CixbwE6TfUPzRnUw7BWRTw/edit [13:43:25] thanks! [13:43:30] It's a mess, but at least it is a written and documented mess! [13:43:48] I'll need your help (dcausse, Trey314159, others?) to clean it up and add whatever I'm missing. [13:56:36] hey! Queryservice updater stuff - does anyone have any links to past documentation about why we *aren't* doing it using deferred updates/MediaWiki jobs rather than the whole flink thing (or even the old school RC polling stuff)? [14:09:15] tarrow: not sure I understand your question? 
AFAIK the wdqs update process never used the MW deferred update process [14:09:50] dcausse: yeah, you're right. I was wondering if there was any documentation of why not [14:10:21] at first glance it seems like it might be a natural thing to do (just like we do with ES) [14:11:09] and I'm assuming there was a good reason not to do it there [14:16:15] hm I don't think we have, probably because we had no PHP sparql client at the time and also because blazegraph does not have replication built in, which means MW would have had to write to all blazegraph servers individually [14:23:03] dcausse power failure...reconnecting now [14:23:14] np! [14:53:15] gehel, I've made a pass over the doc. It was not a mess! Let me know if there are a lot of other edits and you want me to make another pass over the non-mess later. [15:00:17] Trey314159: thanks! [15:06:45] \o [15:08:37] \o [15:13:22] dcausse I resolved T342149 but wasn't sure if we needed to check T328565 for our application...it did not seem relevant to me, but LMK if we should check it [15:14:20] T328565: [Flink Operations] Automate Replay of Failed Events - https://phabricator.wikimedia.org/T328565 [15:14:21] T342149: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 [15:15:22] dcausse: right! Would you see a good reason not to reconsider it for 3rd party wikibase (e.g. not commons/wikidata)? [15:16:39] given that the RC polling updater is basically now unmaintained, right? [15:20:35] tarrow: the good reason to not reconsider it is that it might be quite some work to do it right (rewrite in php some code written in the rdf repo) and find ways to deal with the lack of replication within blazegraph [15:21:10] implementing replication from php wasn't the worst thing in the world for cirrus, but certainly plenty of extra complications [15:21:45] gotcha, I'm not aware of anyone running blazegraph with replicas outside of WMF. Are you? 
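To illustrate the "lack of replication" point above, here is a rough sketch (all names hypothetical, not WMF code): with no replication built into blazegraph, a MediaWiki job would have to fan the same write out to every server itself and track per-server failures for retry, which is the extra bookkeeping cirrus had to implement in php for its clusters.

```python
# Hypothetical sketch: client-side fan-out replication, as a MW job
# would have to do it when the backend has no server-side replication.

class SparqlEndpoint:
    """Stand-in for a single blazegraph server; records applied updates."""
    def __init__(self, name):
        self.name = name
        self.applied = []

    def update(self, triples):
        self.applied.append(triples)


def replicate_update(endpoints, triples):
    """Push one update to every endpoint; return the names that failed so
    the caller can re-queue them -- the bookkeeping that server-side
    replication would otherwise do for free."""
    failed = []
    for ep in endpoints:
        try:
            ep.update(triples)
        except Exception:
            failed.append(ep.name)
    return failed


servers = [SparqlEndpoint(f"wdqs100{i}") for i in range(1, 4)]
failed = replicate_update(servers, '<Q42> rdfs:label "Douglas Adams"@en .')
```

The retry/failure handling is the hard part in practice; this sketch only shows why each new replica multiplies the writer's work.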
[15:22:20] i don't know what other people are running, but that sounds dangerous :P [15:22:20] no clue :/ [15:22:45] i dunno, something about having a single server serving a user-facing anything seems scary [15:22:52] well, for the average small scale 3rd party everything is probably running on one VM anyway :D [15:23:28] lol, i suppose [15:23:55] just like I imagine most mediawikis are on shared webservers even [15:24:00] why not start maintaining the RC poller? seems to be less work than rewriting another updater? [15:24:42] right now we (wikibase.cloud) use (or abuse) the RC poller in non-RC mode [15:25:06] just looking at our options really though [15:27:47] to me it seems like adding new features there might be harder than adding them in Wikibase; e.g. entity diffing is non-existent whereas in Wikibase we do it all the time [15:28:11] I suspect the rc-poller or flink updater are the only great options, particularly when it comes to the specialty handling that has to be done like munging, doing that in php will be a big time sink [15:29:19] still coming at this from a naive perspective but I wonder if the munging is necessary at all. We could just change the ttl Wikibase generates to already be in the form that WDQS needs [15:30:02] sure but you'll have two RDF outputs to maintain [15:30:44] mmm... true; I'd be keen to get some data but I suspect there are almost no other users of it though [15:30:51] perhaps we can sunset the old one [15:32:34] making changes to the RDF format is a long process, having two formats is probably the most practical solution [15:34:11] I think here you ponder the difficulty of setting up an extra component in your infra (e.g. 
rcpoller) vs writing new code in existing places MW/JobRunner [15:34:36] knowing the difficulty of writing/maintaining an updater I'd go with an existing one [15:35:34] gotcha; of course we are also considering moving to flink [15:36:22] the main diff is that you run multi-instance which is not something we do [15:36:38] do you have one blazegraph per wikibase instance? [15:36:42] yeah, we naturally can't run one RC poller per wiki; that's crazy [15:36:55] nah, shared updater and shared blazegraph [15:37:07] currently all tenants in one [15:37:18] but we could consider sharding in the future [15:37:40] we use the blazegraph "namespace" concept to segregate data [15:38:05] oh I see, so these are virtually multiple graphs [15:38:17] yep [15:38:35] no edges between wikis [15:38:40] so all the updaters we have might require some adaptation [15:38:57] well, we already have an adapted RC poller one [15:39:02] iirc adam made a few tweaks to the rc poller [15:39:04] yes [15:39:17] (you can check it out at https://github.com/wbstack/queryservice-updater) [15:39:47] it's coupled to our "platform API" which keeps track of which entities need updating [15:41:20] most magic (or duct tape?) is in https://github.com/wbstack/queryservice-updater/blob/main/src/main/java/org/wikidata/query/rdf/tool/WbStackUpdate.java [15:42:04] nice [15:42:21] to really try and add context we're facing an issue where now (under higher edit load conditions) we're dropping many edits [15:42:32] :) [15:43:32] there's a million things to fix which not unreasonably lead to questions from the team: "why isn't this just a MW job?" [15:43:36] yes I remember the rc poller being prone to this kind of issue [15:43:56] with a MW job you might also drop edits if done wrong :) [15:45:16] IIRC one issue with the rcpoller is that recentchange entries might arrive late even though they have a timestamp older than the last entry you've read [15:45:24] oh yeah, for sure! 
We'd just be trading out pain but for us it would remove the "platform api" coupling [15:45:57] so you have to re-read past entries all the time to catch up on late entries [15:45:58] gotcha; we were suspicious of basically sql replication lag [15:46:17] and repl lag is poorly handled too :( [15:46:29] if a 404 is returned it triggers a delete IIRC [15:46:46] however we don't actually use RC; our "platform api" basically maintains an RC clone [15:46:58] interesting [15:47:11] it then dedupes entities [15:47:14] is this a queue? [15:47:25] sort of [15:47:28] I mean append only queue [15:48:09] we have an append only queue of "editevents" and then build another of "QSUpdate batches" [15:48:49] if the queue is append only the rcpoller should be happy, the only thing to fix is repl lag when fetching the RDF content [15:49:07] and we have a custom Hook being fired from MW edit events to add stuff to the queue [15:49:09] perhaps an arbitrary delay might help a bit [15:49:16] mm.... [15:49:35] good idea *writes it down* [15:49:55] or forcing Special:EntityData to access the master db if you can handle it [15:49:58] random fact, apparently vespa stood for Vertical Search Platform [15:50:34] sadly(?) we see very similar edit patterns to wikidata: people regularly make followup edits dependent on results from the queryservice [15:50:43] so keeping it as fresh as possible is important [15:50:48] yes [15:51:14] TBH I really want us to offer a "queryservice lite" alternative for this use case [15:51:32] which has greater guarantees about freshness but isn't full blown sparql [15:52:44] frankly a lot of people are using full sparql queries when probably they would be fine with effectively an index on statement values [15:53:00] yes... 
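The late-entry workaround discussed above can be sketched as follows (names invented, not the actual rc poller code): instead of resuming strictly after the newest timestamp seen, re-read a fixed overlap window and dedupe by change id, so recentchange rows that land late with older timestamps are still picked up on a later pass.

```python
from datetime import datetime, timedelta

OVERLAP = timedelta(minutes=2)  # the "arbitrary delay" knob from the discussion

class RcPoller:
    """Hypothetical poller that re-reads an overlap window each pass."""
    def __init__(self):
        self.high_water = datetime(1970, 1, 1)
        self.seen = set()

    def poll(self, fetch):
        """fetch(since) returns (rc_id, timestamp) rows with timestamp >= since."""
        since = self.high_water - OVERLAP
        fresh = []
        for rc_id, ts in fetch(since):
            if rc_id in self.seen:
                continue  # already handled on a previous overlapping pass
            self.seen.add(rc_id)
            self.high_water = max(self.high_water, ts)
            fresh.append(rc_id)
        return fresh

rows = [(1, datetime(2023, 10, 6, 12, 0)), (2, datetime(2023, 10, 6, 12, 1))]
poller = RcPoller()
poller.poll(lambda since: [r for r in rows if r[1] >= since])

# a row committed late, with a timestamp older than what we already read
rows.append((3, datetime(2023, 10, 6, 12, 0, 30)))
late = poller.poll(lambda since: [r for r in rows if r[1] >= since])
```

The overlap trades extra re-reads (and the dedupe set) for not losing late rows; it does not fix the repl-lag-on-fetch problem, which the "read from the master db" suggestion addresses separately.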
[15:53:20] * dcausse reads about vertical search [15:56:43] dcausse: i think by vertical, they mean that it handles the use case top to bottom, embedding the search application inside the server [15:58:20] was reading "vertical search" in the sense that it is for searching a very specific domain as opposed to web search [15:58:52] from what i've read, the original problem solved with vespa was image ranking for flickr [15:59:03] in mid 2000's [16:14:17] doing some cleanup in rdf-streaming-updater swift buckets [16:25:20] gehel or anyone else, https://gerrit.wikimedia.org/r/c/operations/puppet/+/963404/ is ready for another look [16:30:26] inflatador: I'm alone with the kids for dinner. I'll have a look in ~2h if no one has reviewed before [16:30:35] gehel ACK, no hurry [16:32:53] inflatador: quick look from my phone: we need to unmask the service when the updater is enabled to allow for the transition between the two states [16:42:46] ACK, fixing now [16:46:30] we also want to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/963404 before we merge this one so PCC is more accurate. [16:48:43] could use some gerrit help getting those 2 patches associated if anyone has time [16:56:43] CI is unhappy with this patch, I guess I need to update the tests but not sure what exactly it wants [16:59:01] inflatador: it's basically a typing issue, you are passing a value that doesn't match the systemd name regex [16:59:31] i'm not sure what value is being passed yet though...hmm [17:01:06] it seems to suggest it's the name, it needs to be suffixed with either .service or .timer perhaps? [17:02:09] inflatador: yea, looking at other use cases and the error here, i would change the systemd::unmask service you are defining to have a .service suffix [17:03:19] pfischer, ebernhardson any objections to merging https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/30 ? 
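The systemd name regex fix discussed above comes down to the unit suffix; as a sketch (the resource name here is assumed, the actual patch uses whatever unit the wdqs updater defines):

```puppet
# Hypothetical resource name: the systemd unit type in puppet expects an
# explicit suffix, so unmask 'wdqs-updater.service', not bare 'wdqs-updater'.
systemd::unmask { 'wdqs-updater.service': }
```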
[17:03:58] dcausse: looks reasonable [17:04:12] i have to go grab my phone for gitlab 2-factor to give it an approval though...sec [17:04:41] thanks! [17:06:07] I think we should be ready to use the flink-app chart in prod for the rdf-streaming-updater [17:07:40] there's the multi-release deploy that we haven't tested tho but should be tested very soon with the cirrus-streaming-updater [17:25:02] ebernhardson looks like it worked! thanks [17:28:07] oops, this is the puppet patch we want to merge before the one above https://gerrit.wikimedia.org/r/c/operations/puppet/+/963777 so we can use PCC properly [17:29:20] inflatador: if you want to stack the patches in gerrit you can: git review -d ; git review -x [17:29:38] inflatador: then git review and they will be stacked. -d downloads and checks out the first, -x cherry-picks the second on top of the first [17:37:23] thanks, will hit that after I eat lunch [17:47:00] * ebernhardson finds it, yet again, unfortunate that swift is a programming language and an object storage system, and the programming language is a lot more popular :P [17:59:51] inflatador: something you could maybe help with, need to locate the swift secret (probably puppet secrets in profile::thanos::swift::account_keys::search_update_pipeline) and figure out how that makes it into k8s/helm secrets [18:01:01] guessing someone knows, as the rdf updater gets its secret somehow :) [18:08:39] ebernhardson thanks for the gerrit advice, it worked [18:08:53] ebernhardson taking a look at the secrets thing now [18:09:19] i suspect, although could be mistaken, getting the swift bucket created and secret in place is the last step before merge and fail-to-deploy the cirrus updater [18:09:21] to staging [18:10:47] that reminds me, we did have to create the bucket manually when we updated the chart to use your changes this morning, re https://phabricator.wikimedia.org/T342149#9228659 [18:13:26] this task might have some hints too 
https://phabricator.wikimedia.org/T345765 [18:13:33] sounds right, i had separately poked the swift docs and afaict buckets there don't have any auto-create functionality [18:16:18] yeah, wouldn't expect flink to do that even if it has perms [18:16:41] inflatador: poking at puppet, it seems the key needs to be in profile::kubernetes::deployment_server_secrets::services_secrets [18:17:07] inflatador: my random guess would be the secret key is duplicated into the swift profile mentioned earlier and the kubernetes profile within the puppet private repo [18:17:42] still verifying, but it looks like secrets go into `/etc/helmfile-defaults/private` on the deploy servers. Not sure if that answers your question though [18:18:15] inflatador: yea, profile::kubernetes::deployment_server::helmfile writes out to /etc/helmfile-defaults/private using the data in the deployment_server_secrets path of hiera [18:18:53] OK, so it sounds like you already figured that part out [18:19:04] poking things at the same time you are :) [18:19:27] good, 'cause I have to leave in ~10m ;P [18:21:32] ryankemper gehel can't make pairing today, but feel free to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/963404 some more if you don't have anything else to cover [21:04:10] back
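Piecing together the secret flow discussed above, the private-repo hiera would look roughly like this (the profile names come from the log, but the nesting and key names below the profile level are guesses, and values are of course redacted): the same swift key appears twice, once for thanos-swift itself and once under deployment_server_secrets so the deployment server can render it into /etc/helmfile-defaults/private for helmfile.

```yaml
# Sketch only -- service and key names below the profiles are hypothetical.
profile::thanos::swift::account_keys::search_update_pipeline: 'REDACTED'

profile::kubernetes::deployment_server_secrets::services_secrets:
  cirrus-streaming-updater:
    swift_api_key: 'REDACTED'  # same value, duplicated for k8s/helm
```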