[07:59:15] o/ [08:01:58] o/ [10:06:05] lunch [12:35:21] Am I correct in thinking that the public eventstream rdf data that is published could be used by folks maintaining a triple store outside of Wikimedia to keep it up to date with wikidata changes? [12:44:11] addshore: Yes, that should be possible. Others are following this approach too (which revealed an issue with duplicate events, but that’s under investigation already). dcausse may add his thoughts too once he’s back. [12:46:34] addshore: yes that's the main purpose, for context Hannah Bast and her team at qlever are already trying to use it, but as Peter mentioned we still have a small issue there (T396564) [12:46:35] T396564: EventStreams: duplicate events from double compute (wdqs/rdf) streams - https://phabricator.wikimedia.org/T396564 [12:48:31] we also wanted to provide better tooling for this (T374939) but we probably won't have time to do anything about it near term [12:48:31] T374939: Write a client that consumes the RDF update stream from https://stream.wikimedia.org/ and update a triple store - https://phabricator.wikimedia.org/T374939 [12:51:35] dcausse: aaah perfect, someone mentioned that "this didn't exist" on the wikidata project chat and said "Wikimedia aren't interested / don't want to do it" etc today, and I responded with "yes it does exist" etc :) [12:51:52] Thanks for the links, I'll continue to follow up on project chat [12:51:53] ah thanks! :) [12:53:31] in their defense it's not something we have communicated much about tbh [12:55:07] hehe, yeah, I imagined as such, they raised many other things too, all valid questions, and fortunately my response to most of the points was this exists / is in progress etc [12:59:10] thanks! [13:10:28] pfischer: oops, just saw you scheduled two meetings today, joining [13:10:37] o/ [13:10:42] o/ [13:10:52] we missed ya dcausse! Welcome back [13:10:59] thanks :) [13:11:01] dcausse: No worries, Marco is not around yet. I forwarded another invite for a meeting to sync with AI product management (who would like to learn about Search) to you. I would appreciate it if you could join. [13:23:10] \o [13:25:14] inflatador: fyi just got pinged by Tiago on https://wikitech.wikimedia.org/wiki/Talk:Wikidata_Query_Service/Runbook, for context Tiago is working on improving wikidata docs related to the graph split but was not aware that wikitech was part of the scope he worked on [13:25:28] o/ [13:25:54] hmm, orchestrator gave up waiting for the reindex script while it appears to still be running again :( Might need to re-work things somehow [13:26:09] not sure what though...turned on the --verbose logging and poking it over [13:26:16] :/ [13:27:37] i'm thinking with it running them in k8s...maybe what needs to happen is the pod gets logged into the state, and then we keep checking the pod's progress independently? [13:27:52] rather than i think right now we wait for an mwscript-k8s invocation to finish [13:27:59] (i think, looking) [13:28:24] dcausse Thanks. looks like ryankemper is working on this. Sadly it looks like this ticket is nearly a year old ;( [13:29:02] np, perhaps it's just a matter of updating the list of clusters? [13:29:24] the logs are basically 22:02:56: tail_pod_logs, then next is 02:03 list_pods. I'm guessing the tail_pod_logs must have exited before it finished...
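Back on the EventStreams question from 12:35–12:48, a minimal sketch of the kind of external consumer discussed there: follow the public EventStreams SSE feed and hand each RDF mutation event to a triple-store updater. The stream name and the payload fields used here are assumptions for illustration; check the EventStreams documentation and T374939 for the real stream and schema, and note that T396564 means duplicates must be tolerated.

```python
# Sketch only: the stream name and event fields below are assumed, not verified.
import json
import requests

STREAM_URL = "https://stream.wikimedia.org/v2/stream/rdf-streaming-updater.mutation"


def apply_to_triple_store(event):
    """Placeholder: translate the RDF mutation event into SPARQL UPDATE calls
    against your own endpoint (QLever, Blazegraph, ...); should be idempotent
    since the feed can deliver duplicates."""
    print(event.get("meta", {}).get("dt"), event.get("meta", {}).get("id"))


def consume(since=None):
    # EventStreams lets a consumer resume from a timestamp with ?since=,
    # so a restart does not have to miss events.
    params = {"since": since} if since else {}
    with requests.get(STREAM_URL, params=params, stream=True, timeout=90) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # Server-sent events: the JSON payload arrives on "data:" lines.
            if line and line.startswith("data:"):
                apply_to_triple_store(json.loads(line[len("data:"):]))


if __name__ == "__main__":
    consume()
```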
[13:30:18] ebernhardson: if the orchestrator state could keep the mwscript-k8s state that'd be ideal indeed [13:31:39] I might not have tested that a lot but thought I added more guardrails with other k8s api calls rather than just waiting for tail_pod_logs [13:31:56] dcausse: maybe, i haven't read the code too closely yet. Will look closer [13:41:46] the annoying part is this seems to require 4 hours between invocation and causing the error :S [13:47:22] hmm, probably "Note: All connections to the kubelet server have a maximum duration of 4 hours." [13:48:23] but it's not really idle... [13:50:14] oh, the timeout idle doesn't count the actual streaming response...it's something about the api server [13:50:32] maybe :P lots of guessing [14:07:09] looks like wdqs1022's data reload is finished. Is there anything special I need to do to validate the data? T386098 and T384344 mention some possible approaches [14:07:10] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [14:07:10] T384344: Wikibase/Wikidata and WDQS disagree about statement, reference and value namespace prefixes - https://phabricator.wikimedia.org/T384344 [14:09:43] inflatador: looking [14:11:28] thanks, I just pinged Lucas as well [14:16:24] oh fun, the 4 hour streaming timeout is also something we can't change on our side, it's a config option in the k8s api server...not sure how to test without waiting 4 hours :P [14:22:17] :/ [14:38:33] inflatador: do you remember what hdfs path you used to reload? [14:38:59] nvm it's in the ticket sorry [14:39:49] going to assume it's hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250714 [14:54:47] dcausse confirmed, that is the path. LMK if that's not the path we should use [14:55:31] inflatador: thanks! should be correct, just doing a quick check on counts in hdfs vs blazegraph [14:57:09] I'm working on the OpenSearch doc rewrite at https://wikitech.wikimedia.org/wiki/Search/OpenSearch/Administration#Why_Multiple_Clusters? . AFAIK we split out the clusters because we didn't want to have > 5000 shards in a single cluster, is that right? [14:57:50] oh i should have realized...so kubectl works well because it uses the golang client library which is hand-written. The python client library is just auto-generated from a spec [14:58:04] or mostly auto-generated at least [15:00:25] the short answer seems to be write it in go...but i'd rather not :P [15:01:54] was a bit annoyed by the python client, I remember it has a "well-defined" data-object of every response type but does not have any type hints, which makes your ide a bit clueless :/ [15:03:06] maybe https://github.com/kr8s-org/kr8s [15:04:36] yea that sounds like an auto-generated library. Well defined but missing the things humans expect [15:11:17] yeah, the stock python k8s client isn't the greatest [15:11:47] I should have been more opinionated on this lib probably and searched for something better... or simply wrapped kubectl? [15:12:55] but well... it's the "official" one and installed out of the box on the deployment machines [15:18:16] yeah, maybe if we're really lucky it's fixed in a future version of the python client and we can backport it from trixie or something [15:18:43] Trixie "should" be released on the 9th BTW [15:19:12] sadly being auto-generated...there is probably no fixing the base python library. Google/Amazon/Microsoft would have to decide that with the hundreds of billions in revenue they generate via kubernetes they can afford 3 engineers to maintain it :P
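A minimal sketch of the "handle the timeout and retry" approach that comes up below, assuming the official `kubernetes` python client installed on the deployment hosts; `pod_finished`, the overlap window and the exception list are illustrative, not the orchestrator's actual code.

```python
# Sketch: re-attach to the pod log stream when the ~4h kubelet/api-server
# streaming cap (or any other connection drop) kills it, instead of treating
# the end of one stream as the end of the script.
import time

import urllib3
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()


def pod_finished(name, namespace):
    phase = v1.read_namespaced_pod(name, namespace).status.phase
    return phase in ("Succeeded", "Failed")


def tail_until_done(name, namespace):
    since = None  # full log on the first attach, a small overlap afterwards
    while True:
        try:
            kwargs = {"since_seconds": since} if since else {}
            for line in watch.Watch().stream(
                v1.read_namespaced_pod_log, name=name, namespace=namespace, **kwargs
            ):
                print(line)
        except (urllib3.exceptions.ProtocolError, client.exceptions.ApiException):
            pass  # stream dropped; fall through and decide whether to re-attach
        if pod_finished(name, namespace):
            return
        since = 30  # re-attach with a little overlap; duplicate lines are possible
        time.sleep(5)
```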
[15:20:24] :) [15:23:31] Based on the client's github issues page, it doesn't seem like much of a priority ;( [15:30:21] yes, issues seem to have more comments from k8s-triage-robot than humans :/ [15:41:55] do we have an easy way to check where the DYM comes from? [15:42:26] stumbled on T99813 while searching phab and it appears to be fixed, wondering if it's thanks to glent [15:42:27] T99813: Searching the English Wikipedia for "Charly Chaplin" doesn't produce a "Did you mean" suggestion - https://phabricator.wikimedia.org/T99813 [15:43:23] dcausse: there is a javascript variable that should say, goes into SearchSatisfaction. There is also cirrusDumpResult. sec [15:44:13] oh right, I see it from glent_production/_search [15:44:23] dcausse: in javascript check `mw.config.values.wgCirrusSearchFallback` [15:44:32] looking [15:45:09] i guess it's not too obvious, but i think that's saying glent-m01run provided the suggestion [15:46:39] i think autocomplete should also fix that up next week when the 2-char fuzziness becomes default [15:46:42] yes I see name":"glent-m01run","action":"suggestQuery [15:46:53] ie -> y [15:47:19] nice [15:47:41] i'm also curious if 2-char fuzziness ever worked...i have to assume yes but been too lazy to spin up elastic 1.6 and verify [15:48:04] yes found this strange... [15:48:28] seems like an obvious bug that should have been reported already? [15:48:42] basically what happens is there is a generic fuzziness parser, and it has a mechanism where if it doesn't have the query length it says the length is 5 (commented as "average length of english word") and then resolves the fuzziness [15:48:59] meh [15:48:59] the fuzziness object gets serialized when sent to the query nodes, and that serialization resolves the fuzziness [15:49:32] not really AUTOmatic as the name suggests [15:49:45] it is elsewhere, if the query is provided. but yea [15:50:51] i suppose i should file an upstream task, haven't yet [15:50:57] sure [15:51:15] it's also not the end of the world to resolve this on our side [15:51:21] yea it's pretty trivial [15:53:52] We have a new flink operator image, once the chart update gets merged ( https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1173407 ) we'll probably want to redeploy it in prod [15:55:12] with glent-m01run we can't know if it's M1 or M0, just curious if it got fixed recently when enabling M1 [15:55:32] I can run the query tho, messing with the filter [15:57:34] sigh and of course we don't have the source :) [15:58:53] yea we need to change the index settings, we throw all that out [15:59:03] at least they get recreated regularly :) [16:01:43] so it's M0 and got fixed quite some time ago I bet [16:02:00] errand, back in ~45 [16:17:24] poked around with this a bit, but i suspect it's just too much to re-work this to move the k8s pods into the state, and have a separate step that monitors in-progress runs. It's totally doable, but it would require redoing how we control parallelization. Going to try and just handle the timeout and retry
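For reference, this is roughly what AUTO fuzziness resolves to when the real term length is available (the default breakpoints are 3 and 6); a sketch of the "resolve this on our side" idea above, not CirrusSearch's actual code.

```python
def resolve_auto_fuzziness(term, low=3, high=6):
    """Resolve AUTO[:low,high] fuzziness from the actual term length, instead
    of the parser's fallback "average length of english word" of 5, which pins
    the result at 1 edit no matter how long the query really is."""
    n = len(term)
    if n < low:
        return 0  # very short terms must match exactly
    if n < high:
        return 1  # one edit allowed
    return 2      # two edits allowed, the "2-char fuzziness" case


assert resolve_auto_fuzziness("cat") == 1
assert resolve_auto_fuzziness("chaplin") == 2  # long enough for two edits
```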
[16:22:28] ack [16:37:15] * ebernhardson is also just kinda waiting for the mwscripts to finish before doing anything on the prod clusters, so next time around it sees the new live indices and won't try again [16:39:56] oh nifty, you can `kubectl get pods -l script=UpdateOneSearchIndexConfig.php` [16:41:29] yes there are some handy labels, I think there's username too [16:47:05] dinner [16:51:55] sorry, been back [17:16:54] ryankemper I noticed that the `/srv/wdqs/data_loaded` file gets wiped out during a data transfer if you CTRL-C the cookbook. Not sure if that is intended behavior but we should maybe look at that during pairing [17:17:38] probably expected behavior no? [17:19:28] Could be...I'll take a closer look [17:21:51] Basically I had to manually recreate the file after I cancelled a cookbook run. I think we're using it to determine both graph type AND whether or not the data is loaded. So if the cookbook run fails, it doesn't create the file, and you have to create it manually before it will let you run a transfer, even with --force [17:22:26] Possibly we could use a cumin query or some other way of determining the graph type...not urgent but just thinking out load [17:22:28] data loaded should be absent on dest host if the transfer fails [17:22:28] or loud [17:22:35] if it removes it from source that's a problem [17:22:44] but there's a --no-check-graph-type flag that will do what you want [17:43:17] ryankemper I started the scholarly reload from your tmux session on cumin2002 just to keep everything in the same place [17:56:45] Looks like T193654 has the context on why we split the elastic clusters in the first place [17:56:46] T193654: [epic] Run multiple elasticsearch clusters on same hardware - https://phabricator.wikimedia.org/T193654 [18:18:38] it's basically that too many shards in one cluster slowed the masters down [18:23:57] looks like i did a not terrible job documenting. Might be curious to re-run the analysis that led to splitting the clusters against opensearch 1.x and 3.0 [18:56:45] back [18:58:06] Thanks for confirming. I'm rewriting the OpenSearch docs and wanted to make sure I wasn't completely talking out of my hat [19:20:08] wdqs1022 -> wdqs2007 data xfer is done, let's see if it looks sane [19:22:52] looks good...proceeding to next host [20:24:15] * ebernhardson tried that script that loads all the production indices state into a local cluster...if i give it 3 16g nodes it runs out of memory at around 1500 indices [20:24:29] and i don't have enough memory to give it more :P [20:29:00] ebernhardson the new relforge nodes have 256 GB memory ;) [20:30:11] hmm, would need docker and docker-compose. [20:31:20] could you just create an SSH tunnel and point the script to relforge-alpha? [20:31:48] hmm, indeed i suppose i could just point it at the live cluster without standing one up. hmm
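A rough sketch of what a script like the one being discussed could look like: pull index definitions out of a cluster's `/_cluster/state` and recreate them (settings and mappings only, no data) on another cluster. The endpoints are placeholders and the exact shape of the settings/mappings blobs varies by version, so the cleanup below is illustrative rather than exact.

```python
# Sketch, not the actual script: endpoints and the DROP list are assumptions.
import requests

SOURCE = "https://production-search.example:9243"  # placeholder
TARGET = "http://relforge-alpha.example:9200"      # placeholder
DROP = {"uuid", "version", "creation_date", "provided_name", "resize"}


def recreate_indices(limit=None):
    state = requests.get(f"{SOURCE}/_cluster/state/metadata", timeout=120).json()
    for i, (name, meta) in enumerate(state["metadata"]["indices"].items()):
        if limit is not None and i >= limit:
            break
        # Strip per-index settings that cannot be supplied at create time.
        index_settings = {k: v for k, v in meta["settings"]["index"].items()
                          if k not in DROP}
        mappings = meta.get("mappings", {})
        # Depending on the version the mappings may be wrapped in a type key.
        mappings = mappings.get("_doc", mappings)
        body = {"settings": {"index": index_settings}, "mappings": mappings}
        requests.put(f"{TARGET}/{name}", json=body, timeout=120).raise_for_status()
```

Note that custom analyzers in those settings still require the same analysis plugins to be installed on the target cluster.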
[20:33:32] it'd be more of a pain if you needed to change opensearch.yml, but if you need to one-off some stuff just on the alpha instance that'd be fine [20:34:15] it's running now, if relforge starts alerting it's my fault :P [20:34:25] {◕ ◡ ◕} [20:34:25] it's not super important, i suppose i'm just curious if it's any better now [20:34:50] this essentially takes the api response from _cluster/state in the other clusters and re-creates the same indices locally [20:35:04] yeah, I'm curious too...mainly I was rewriting that doc stuff, but I'd also like to know what's changed over the past few years, if anything [20:37:46] * ebernhardson separately notes that the `kubectl logs ...` command i started this morning doesn't seem to have timed out...even though the python variant of tailing logs seems to :( [20:38:19] was hoping -v=9 (verbose debug logging, prints all http requests and more) would show that it did timeout and just retried at some point [20:39:30] curious, already getting 503's from relforge with only 125 indices [20:39:43] master not discovered exception [20:44:03] interesting [20:44:58] hmm, it's failing due to NPE related to org.wikimedia.search.extra.analysis.ukrainian.UkrainianStopFilterFactory.getStopwords(UkrainianStopFilterFactory.java:31) [20:45:07] i feel like i've seen this before... [20:45:33] Oh yeah! we saw that during the incident with prod eqiad [20:45:51] I think just restarting the service "fixed" it? [20:45:59] hmm, will try [20:48:27] fails to come up because it doesn't like the logging configuration? [20:48:37] ERROR OpenSearchJsonLayout contains invalid attributes "compact", "complete" [20:48:49] ERROR Could not create plugin of type class org.opensearch.common.logging.OpenSearchJsonLayout for element OpenSearchJsonLayout: java.lang.IllegalArgumentException: layout parameter 'type_name' cannot be empty java.lang.IllegalArgumentException: layout parameter 'type_name' cannot be empty [20:50:08] I haven't seen that one before [22:19:08] https://wikitech.wikimedia.org/wiki/Logstash/Interface maybe useful? [22:25:44] logs are making it through to kafka now, can see them with `kafkacat -b kafka-logging1003.eqiad.wmnet -C -t logback-info -o end 2>/dev/null | grep relforge` [22:27:29] not sure if they are valid though... [22:30:50] searching the discover tab for `node_name:relforge*` does find logs now from relforge1010. But i suspect they should have a more "normal" set of fields [22:35:14] copied the changed files to relforge1010:/home/ebernhardson/relforge-eqiad, running out of time today but will try and make a puppet patch [22:36:50] re-enabled puppet on relforge1010 as well [22:40:52] back [22:44:34] Nice [22:45:02] I wonder if we need to set up https://wikitech.wikimedia.org/wiki/Logstash/Interface#Configuring_rsyslog_to_forward_your_logs [22:46:36] this ends up flowing through the existing logback config, it's the one listening on 11514 [22:46:54] it's called `50-udp-json-logback-compat.conf` [22:48:50] also cleaned up all the extra indices i created on relforge, should all be back to normal. Didn't finish patch though, tomorrow! [22:53:20] ryankemper the scholarly transfer finished, I just repooled eqiad wdqs-scholarly [22:53:37] inflatador: great [22:53:38] np, thanks for helping on the log pipeline stuff [22:56:33] ryankemper just started the xfer on wdqs2010.
I'm out for today but you can run more xfers if you want, just update https://etherpad.wikimedia.org/p/wdqs-reload-T386098 if you do [22:56:33] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098