[07:58:26] o/
[07:58:27] bd808: this seems unexpected! Something about how we route traffic inside wikikube?
[07:58:45] gmodena: o/
[08:02:47] I created T388855 to track that ip=127.0.0.1 issue
[08:02:49] T388855: Search Update Pipeline requests to Action API are logged as coming from 127.0.0.1 - https://phabricator.wikimedia.org/T388855
[08:08:53] bd808 gehel I'm not familiar with the cirrus-streaming-updater code path, but I remember a similar issue in T368495
[08:08:53] T368495: client_ip attribute reports only 127.0.0.1 in PHP/API context - https://phabricator.wikimedia.org/T368495
[08:09:52] might be a red herring, but we might need to check how the flink producer behaves (the task i linked is eventbus/eventgate specific)
[08:12:18] so this might need data engineering to have a look as well?
[08:20:17] I had a quick look and it seems that only cirrussearch is affected, other flink producers seem to report the correct ip
[08:21:06] o/
[08:21:13] so it might be on us. Maybe SUP experts have an idea of what could be going on. Otherwise, happy to take a look
[08:21:14] o/
[08:21:34] wondering if it's related to the ratelimit things happening in envoy
[08:21:49] I think cirrus is the only one using that feature
[08:22:53] mmm could very much be
[08:27:29] dcausse unrelated, but I wanted to confirm that the cirrus opensearch image ships with the vector search (opensearch-knn) plugins
[08:27:52] in 1.x knn is still a plugin, as of 2.x it's part of opensearch proper
[08:28:39] FWIW: i don't know implementation details yet, but the API (e.g. index creation) is slightly different between major versions
[08:28:48] ah we explicitly set X-Forwarded-For: 127.0.0.1 which I think was made to control envoy behaviors regarding retries and timeouts (x-envoy-upstream-rq-timeout-ms & x-envoy-max-retries) which are only honoured if XFF is localhost
[08:29:54] gmodena: thanks, by index creation you mean the analysis settings & mappings?
[08:30:38] dcausse yes
[08:31:17] gmodena: ack, I hope that an index created with 1.x can still be opened by 2.x tho
[08:31:27] fwiw this is how we handle request header forwarding in eventbus https://gerrit.wikimedia.org/r/plugins/gitiles/eventgate-wikimedia/+/refs/heads/master/eventgate-wikimedia.js#489
[09:17:52] dcausse: yes that’s correct, using 127.0.0.1 tricks envoy into considering this an internal request (which is allowed to override retry limits etc. per request). I think this is a workaround, let me see if I can find that ticket…
[09:17:57] wikibase quality constraints moved away from the full graph: https://grafana-rw.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wdqs-internal&var-graph_type=%289102%7C919%5B35%5D%29
[09:19:05] pfischer: thanks!
[09:19:33] there are still a few sparql queries sent there, looking
[09:20:33] \o/ cc:ryankemper ^
[09:20:37] https://phabricator.wikimedia.org/T354853
[09:21:24] pfischer: ah thanks! let's link that ticket with the other
[09:37:43] kartotherian/2.1.0 is using wdqs-internal
[09:40:04] kartotherian's queries are user generated. I'm not sure they should be sent to the internal endpoint. TBH, I'm not sure that Kartotherian should be sending SPARQL requests anywhere :(
[09:40:41] It seems unlikely that many of those requests would require the scholarly graph, so we can probably migrate it to the internal -main endpoint?
[09:41:19] yes but I'm not even seeing a config value in their deployment values.yml, must be hardcoded somewhere in the codebase?
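A minimal sketch of the envoy header workaround dcausse and pfischer describe above, assuming a MediaWiki Action API reached through a local envoy listener; the URL, port, and header values are illustrative, not the SUP's actual configuration. Per the discussion, envoy only honours the per-request x-envoy-* overrides when it considers the request internal (decided from X-Forwarded-For), which is also why MediaWiki ends up logging the client IP as 127.0.0.1 (T388855):

```python
import requests

# Hypothetical endpoint: a MediaWiki Action API behind a local envoy sidecar.
API_URL = "http://localhost:6500/w/api.php"  # port is an assumption

response = requests.post(
    API_URL,
    data={"action": "query", "format": "json", "meta": "siteinfo"},
    headers={
        # Marks the request as "internal" so envoy honours the overrides below;
        # side effect: MediaWiki sees 127.0.0.1 as the client IP.
        "X-Forwarded-For": "127.0.0.1",
        # Per-request envoy overrides, only trusted on internal requests.
        "x-envoy-upstream-rq-timeout-ms": "20000",
        "x-envoy-max-retries": "2",
    },
    timeout=25,
)
response.raise_for_status()
print(response.status_code)
```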
[09:43:02] qd,cè kp d' saoèl,oajcpo
[09:43:03] ah it's in the chart
[09:43:13] oops, wrong keyboard layout :)
[09:43:30] even in swiss german I'm sure that means nothing :P
[09:45:42] bépo vs swiss french
[09:46:23] and don't underestimate the power of confusion of swiss german!
[09:46:28] :)
[09:46:37] :-D
[09:47:15] status update: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2025-03-14
[09:52:26] errand
[10:16:59] a quick scan of access.log shows kartotherian and cirrus deepcat being the two remaining users of wdqs-internal, we should be good to remove this endpoint soon
[10:18:22] deepcat is using only the categories? We should really have a different DNS for that one...
[10:20:35] gehel: the cloudelastic1011 puppetzeroresources alert keeps firing hourly or so. up above in the backscroll a little bit it was thought to potentially be transient, but it's continued firing, so i'm thinking it should probably be addressed to de-noise the alerts. should i open a fresh message on the slack channel or just leave it here for the backscroll for when i.nflatador and r.yankemper are online?
[10:21:12] this https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3DPuppetZeroResources and subject:"PuppetZeroResources cloudelastic search-platform (cloudelastic1011:9100 node ops warning eqiad prometheus)"
[10:22:06] or maybe brouberol or stevemunene could already have a look
[10:24:14] looking at https://puppetboard.wikimedia.org/report/cloudelastic1011.eqiad.wmnet/51309e3e692323f487bc839edc9a5cab42ee1d77 it seems this host has no role
[10:25:13] it doesn't match any node block in site.pp AFAICT
[10:25:22] node /^cloudelastic101[02]\.eqiad\./
[10:25:27] lack of -
[10:25:30] is not [0-2=
[10:25:33] is not [0-2]
[10:25:36] that seems suspect :) Is it part of the cluster already? Or should it be in setup?
[10:25:37] yep, let me fix that, good spot
[10:26:17] it is... logging in there I see "cloudelastic1011 is a opensearch cloud elastic cirrus (cirrus::cloudelastic)"
[10:26:22] I see opensearch running on the host
[10:26:35] but also "The last Puppet run was at Thu Mar 13 19:23:56 UTC 2025 (901 minutes ago)."
[10:26:38] so what I think happened is that it was migrated to OS and then a subsequent change to site.pp broke its role allocation
[10:26:43] cf what volans suggested
[10:26:47] I'll send a patch
[10:26:52] thx!
[10:27:01] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1127607
[10:27:05] this is what broke it
[10:27:37] unlucky host that was in the middle between 0 and 2 :D
[10:27:54] :)
[10:28:02] so it might be half configured...
[10:28:19] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1127863
[10:28:29] depends when it was reimaged, if before that change it will be fully configured and just have puppet broken since yesterday
[10:28:46] that's my wild guess
[10:29:06] +1ed
[10:30:26] gehel: I think it should be fine. bking reimaged cloudelastic1012 yesterday while making sure the cluster was green
[10:30:44] _after_ cloudelastic1011 was worked on, I mean
[10:32:57] I ran puppet on 1011 and all I got was a lousy ~tshirt~ change to /etc/ssh/ssh_known_hosts
[10:33:08] all good
[10:33:30] thanks!
[10:35:15] errand+lunch
[10:51:09] lunch
[14:12:29] \o
[14:16:58] o/
[14:17:59] one thing i wasn't able to find yesterday, where does CirrusDocFetchException in SUP become retryable?
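To make the site.pp regex bug from the cloudelastic1011 discussion above concrete, here is a quick standalone check of the two patterns. site.pp itself is Puppet, so this Python snippet is only an illustration of the character-class difference:

```python
import re

# The node pattern before and after the fix: [02] matches only the characters
# '0' and '2', while [0-2] matches the range 0, 1, 2.
broken = re.compile(r"^cloudelastic101[02]\.eqiad\.")
fixed = re.compile(r"^cloudelastic101[0-2]\.eqiad\.")

for host in ("cloudelastic1010.eqiad.wmnet",
             "cloudelastic1011.eqiad.wmnet",
             "cloudelastic1012.eqiad.wmnet"):
    print(host, "broken:", bool(broken.match(host)), "fixed:", bool(fixed.match(host)))

# cloudelastic1011 matches only the fixed pattern, which is why the host lost its
# role assignment and Puppet began firing PuppetZeroResources.
```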
I was trying to add an UnrenderableDocException that extends from it, and an isRetryable() method that is normally true, but false in the unrenderable version, but i'm failing to understand how retry is decided
[14:18:48] the purpose is to throw when we ask cirrus to render a redirect which should never be rendered into a doc
[14:21:04] ebernhardson: I think it's not retried
[14:21:14] should be in LagAwareRetryStrategy I think
[14:21:16] dcausse: hmm, we have comments elsewhere that claim it is :)
[14:21:23] :/
[14:21:41] lemme find those...because it would make sense if it's not retryable
[14:22:02] reading the retry logic I don't think it is retried
[14:22:37] only some subclasses of this one are
[14:23:15] in RerenderCirrusDocEndpoint::extractAndAugment: NOTE: InvalidMWApiResponseException error is retryable, from our POV it is unclear if we need to retry or not so we assume that retrying might yield different results.
[14:23:30] i guess i was assuming by proxy since Invalid extends from CirrusDocFetchException
[14:24:03] but i see here in LagAware, that makes sense now
[14:25:08] retry is opt-in, with no generic exceptions retried except perhaps IOException, but that one makes sense to retry
[14:28:48] i'm mildly surprised we don't have more indexed redirects, turns out what happens is if a redirect exists and our existing page checks don't complain about anything, then we declare the redirect is an "oldVersionInIndex"
[14:29:08] which sup renders and injects. Which we then have to remove in a future loop
[14:29:38] if anything, i assume i'm missing something because there aren't as many indexed redirects as this implies :P
[14:32:25] o/
[14:34:06] thx brouberol volans gehel for the patch to put cloudelastic1011 into scope! no alert email ~0815-0820 my time
[14:38:51] anytime, I've done close to nothing :)
[15:41:53] for T388549 I think we'll need a way to map "search query" -> embedding. E.g. we need to somehow use the outlink model to encode the query into embeddings, and then query the index
[15:41:53] T388549: [NEEDS GROOMING] Vector Search PoC - https://phabricator.wikimedia.org/T388549
[15:42:26] for a PoC hopefully I can run the model in a container
[15:42:36] gmodena: indeed the question of how to embed two very different things into the same space is a significant question in how we actually use embeddings
[15:43:05] i suppose that's part of why santosh's thing works, it's embedding roughly the same thing
[15:43:17] ebernhardson and from what I can tell opensearch does not directly provide capabilities to map text -> embedding
[15:43:28] this is one of the open questions we had with dcausse
[15:44:00] looks like the way to go is a transformation step that we could handle in cirrus / some service before hitting the index
[15:44:14] gmodena: something like: https://opensearch.org/docs/latest/query-dsl/specialized/neural/ ?
[15:44:24] that at least implies it takes the query text and a model id
[15:44:24] maybe something like we do now with MLR would work?
[15:44:44] (maybe only 2.x, not sure)
[15:44:54] ebernhardson yes. Neural is not available on 1.x though
[15:45:15] mmm
[15:45:20] or maybe it's a plugin
[15:45:28] makes sense, i suppose i would plan to push forward into 2.x for any prod deployment, but for testing purposes we could embed externally?
[15:46:42] i'm also getting the feeling we will need mixed clusters for 2.x, with non-data nodes in k8s providing extra compute for things like embeddings
[15:48:23] ebernhardson our cirrussearch image does not bundle a neural plugin.
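For reference, a sketch of what the neural query linked above could look like when issued from Python. This assumes OpenSearch with the neural-search plugin (2.9+ per the discussion below) and an already-deployed text-embedding model, so it is not something the current 1.x image can run; index, field, and model id are placeholders:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# The neural query takes the raw query text plus a model id; the plugin encodes
# the text into a vector server-side and runs an approximate k-NN search.
body = {
    "size": 10,
    "query": {
        "neural": {
            "outlink_embedding": {  # hypothetical knn_vector field
                "query_text": "marine mammals of the north atlantic",
                "model_id": "REPLACE_WITH_DEPLOYED_MODEL_ID",
                "k": 10,
            }
        }
    },
}
results = client.search(index="enwiki_content", body=body)
for hit in results["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```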
This is uncharted territory for me :)
[15:48:43] what does mixed clusters mean?
[15:49:16] gmodena: it's pretty easy to add, you can clone the repo, put the plugin in the devel_plugins folder, and build a local image
[15:50:05] getting it in full prod means updating plugin_urls.lst in https://gerrit.wikimedia.org/r/c/operations/software/opensearch/plugins/+/1125533
[15:50:23] the image build is https://gitlab.wikimedia.org/repos/search-platform/cirrussearch-opensearch-image/
[15:51:24] ack
[15:51:39] I'm afraid the minimum version required for neural is 2.9 https://docs.aws.amazon.com/opensearch-service/latest/developerguide/supported-plugins.html
[15:52:35] sadly not too surprised, was kinda expecting we need to get to 2.x to get full vector support
[15:53:49] the knn plugin bundled with 1.x does support faiss/hnsw though, that's at least a starting point
[15:53:57] gmodena: for vectors and the outlink model I thought you could mimic a morelike query, step 1 fetch the vectors from opensearch for doc A then run knn on this vector excluding doc A
[15:54:00] yea seems enough to experiment with at least
[15:54:52] reminds me a little of https://dtunkelang.medium.com/bags-of-documents-and-the-cluster-hypothesis-7bd2ed9c4fa9
[15:54:57] dcausse yep. I'm on the same page here
[15:55:12] (daniel is a super knowledgeable person in the IR space)
[15:56:07] dcausse IIRC we also thought about comparing this type of similarity search on vector vs morelike
[15:57:00] i'll focus on that before fiddling with embedding generation :)
[15:57:27] gmodena: yes, I thought that the outlink model would allow us to quickly experiment with vectors without going down the rabbit hole of extracting embeddings out of the search query
[16:25:26] ebernhardson: this bag-of-documents thing reminds me of one of the steps in mjolnir's query grouping, IIRC there's one step that grabs the docs of the grouped queries (grouped after being "normalized") to double check that we don't group unrelated things
[16:26:18] yea that is in a similar vein
[16:43:49] ebernhardson thanks for the pointer!
[17:02:15] * ebernhardson is still failing to explain why we have so few redirectInIndex problems...there is clearly something i don't understand here :S
[17:25:20] gehel: thanks for opening a task to look into that 127.0.0.1 stuff.
[18:36:53] dr0ptp4kt de nada!
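A sketch of the "morelike via vectors" approach dcausse suggests above (fetch the stored vector for doc A, then run k-NN on that vector and exclude doc A), using only the k-NN plugin query available in 1.x. Index, field, and document ids are assumptions; dropping the source document client-side avoids depending on filtered k-NN support:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

INDEX = "enwiki_content"            # hypothetical index
VECTOR_FIELD = "outlink_embedding"  # hypothetical knn_vector field
SOURCE_ID = "12345"                 # page whose neighbours we want

# Step 1: fetch the stored embedding of document A.
doc = client.get(index=INDEX, id=SOURCE_ID, _source_includes=[VECTOR_FIELD])
vector = doc["_source"][VECTOR_FIELD]

# Step 2: approximate k-NN search on that vector; over-fetch by one so we can
# drop document A itself from the hit list.
body = {
    "size": 11,
    "query": {"knn": {VECTOR_FIELD: {"vector": vector, "k": 11}}},
}
hits = client.search(index=INDEX, body=body)["hits"]["hits"]
neighbours = [h for h in hits if h["_id"] != SOURCE_ID][:10]
for hit in neighbours:
    print(hit["_id"], hit["_score"])
```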