[08:24:04] o/ dcausse: Would you have some time today to go over T389053 with me? I hope I got all relevant bits but I’d appreciate a second pair of eyes.
[08:24:05] T389053: Rename weighted_tags referencing ores in their names - https://phabricator.wikimedia.org/T389053
[08:24:17] o/
[08:24:19] pfischer: sure
[08:25:17] o/
[08:25:23] o/
[08:26:15] gmodena: en, it and fr wikis are finally available on relforge, the index names start with "gmodena_"
[08:27:19] o/
[08:29:53] going to ship https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1124485 (make the SUP produce to "v1" streams)
[08:37:56] dcausse: do we stop the producer for a while, or how do we make sure the consumer consumes all rc0 events first?
[08:38:44] pfischer: there was an initial patch deployed that told the consumer to consume both rc0 and v1 streams
[08:39:11] dcausse ack! I'll take a look this morning. Thanks!
[08:39:20] it's a new config option I added in the SUP: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1124484
[08:39:27] dcausse: Oh, okay. I did not see that while skimming the commit logs.
[08:46:31] sigh... did not work... "Failed reading JSON/YAML data from /null/latest", seems like I got some schema/stream wrong, reverting
[08:51:02] dcausse: I’ll postpone our meeting
[08:54:31] dcausse: looks like the streams are all part of streamconfigs
[08:56:19] it's the consumer that failed and I only changed the fetch_error stream there...
[08:59:11] yes the stack mentions org.wikimedia.discovery.cirrus.updater.consumer.graph.ConsumerGraphFactory.createFetchFailureSink(ConsumerGraphFactory.java:311)
[09:01:31] sigh... fetch-error-stream: codfw.cirrussearch.update_pipeline.fetch_error.v1, the codfw prefix should not be there...
[09:02:39] that might explain why staging did not fail that way...
[09:04:09] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1130527/1..2
[09:04:33] pfischer: if you have a sec ^, I think that's what could explain the problem
[09:20:49] sure
[09:21:34] +2, sorry I overlooked this
[09:21:43] np!
[09:27:12] better this time
[09:28:25] doing eqiad next, that'll move the bulk of the update traffic to the new v1 stream
[09:36:40] ok, seems to have switched, we should see an alert complaining about the update rate
[10:01:51] I have not received a warning yet, but looking at the grafana kafka by topic dashboard, it looks like rc0 is phased out.
[10:13:29] dcausse is there any doc I could look at re: working with relforge? I've been digging through phabricator/wikitech, but I'm a bit lost :)
[10:25:39] gmodena: well... relforge is "just" an opensearch cluster, so you "just" need a place from where you can query the opensearch HTTP api
[10:25:54] but IIRC you can't ssh to relforge1003?
[10:26:33] checking from a stat machine
[10:26:43] gmodena@relforge1003.eqiad.wmnet: Permission denied (publickey).
[10:27:06] gmodena: so from a stat machine or hadoop you can: curl https://relforge1003.eqiad.wmnet:9243/_cat/indices
[10:27:29] gotcha
[10:27:51] so I suppose you could, from stat100x, use the script you wrote to import the articletopics vectors
[10:27:52] and I can update the indexes you provided, right?
[10:28:08] gmodena: yes, that would be great
[10:28:09] anything specific re: mappings?
[10:28:26] for mappings you should be able to add a new field to an existing index
[10:28:44] ack. I did try locally and it seems to work
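A rough sketch of what that workflow could look like from a stat machine with python requests; the index name, field name, vector dimension and CA bundle path are placeholders, not the values actually used.

    # Rough sketch of the relforge workflow described above, run from a stat machine.
    # Index name, field name, vector dimension and CA bundle path are placeholders.
    import requests

    RELFORGE = "https://relforge1003.eqiad.wmnet:9243"
    CA_BUNDLE = "/path/to/ca-bundle.pem"  # whatever CA the relforge cert chains to

    # Equivalent of `curl .../_cat/indices`, limited to the gmodena_* indices.
    resp = requests.get(f"{RELFORGE}/_cat/indices/gmodena_*", params={"format": "json"}, verify=CA_BUNDLE)
    resp.raise_for_status()
    for idx in resp.json():
        print(idx["index"], idx["docs.count"])

    # Add a new field to an existing index mapping (hypothetical field/dimension).
    resp = requests.put(
        f"{RELFORGE}/gmodena_enwiki_content/_mapping",
        json={"properties": {"embedding": {"type": "knn_vector", "dimension": 768}}},
        verify=CA_BUNDLE,
    )
    resp.raise_for_status()
    print(resp.json())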
[10:28:47] there might changes that requires (similarity config
[10:28:50] oops
[10:29:33] take 2: there might be changes that require a closed index to be applied (similarity config), feel free to close/re-open these indices at will
[10:30:59] ack
[10:31:05] this API, right? https://opensearch.org/docs/latest/api-reference/index-apis/close-index/
[10:31:33] gmodena: yes
[10:31:44] Bear with me, I'm out of my element :D
[10:31:51] gmodena: no worries! :)
[10:32:11] alright. I'll wrap up one local experiment and then move to relforge
[10:32:29] gmodena: and this cluster is mainly for testing so don't worry about breaking things there
[10:32:40] sounds good! Thanks
[10:32:41] s/mainly/solely/
[10:32:59] errand+lunch
[10:54:43] lunch
[13:14:57] o/
[13:17:28] o/
[13:17:55] inflatador dcausse I'm afraid we might be missing the knn/vector search plugin on relforge
[13:17:58] {"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"No handler for type [knn_vector] declared on field [embedding]"}],"type":"mapper_parsing_exception","reason":"No handler for type [knn_vector] declared on field [embedding]"},"status":400}
[13:19:32] gmodena interesting...will take a look after I look at this WDQS lag alert
[13:19:52] inflatador no worries!
[13:20:08] gmodena: you tested using the cirrus image and it was there?
[13:20:41] dcausse yep, I'm using the cirrus image in my local env
[13:21:10] weird... I thought that one would end up running the same opensearch as prod, looking...
[13:21:12] cirrussearch-opensearch-image:1.3.20
[13:21:33] thanks!
[13:22:10] we might be using a base image that has much more than the vanilla opensearch setup
[13:22:56] yes it has much more...
[13:23:07] opensearch-knn
[13:23:16] and many others
[13:23:20] that's the one we need for vector search
[13:23:46] sounds like we might need to build a deb pkg for https://github.com/opensearch-project/k-NN ?
[13:23:50] hm... wondering if we should change the base image of cirrussearch-opensearch-image and control all plugins from a single place
[13:24:57] inflatador: I hope it's just a matter of pulling this plugin into the existing deb packages
[13:25:35] IIRC opensearch has a plugin installer
[13:25:57] https://opensearch.org/docs/latest/install-and-configure/plugins/
[13:26:56] interesting! I wonder if Observability is using that
[13:27:04] gmodena: we generally use https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/opensearch/plugins/+/refs/heads/master/README.txt to install plugins
[13:27:44] it's a "meta" package that pulls a bunch of plugins sourced from https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/opensearch/plugins/+/refs/heads/master/debian/plugin_urls.lst
[13:28:16] dcausse ack
[13:28:28] thanks for the pointer
[13:28:30] here it would just be a matter of adding opensearch-knn, but if we do this we need to stop installing opensearch-knn as part of the cirrus base images
[13:29:36] If y'all want to get a ticket started for adding more plugins I'm happy to look at it as time allows
[13:37:55] thanks! going to check the base image, might be a bit error prone to pull more plugins from the cirrus dev image
[13:38:39] or we could just start running the prod service in containers ;P
[13:38:54] * inflatador is only half joking
[13:39:32] :)
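A dev-image vs prod mismatch like the one above can be spotted from the _cat/plugins API; a small sketch, with the same placeholder CA path as before.

    # Check which plugins the cluster actually has loaded, e.g. to confirm
    # opensearch-knn is present before relying on knn_vector mappings.
    import requests

    RELFORGE = "https://relforge1003.eqiad.wmnet:9243"
    resp = requests.get(f"{RELFORGE}/_cat/plugins", params={"format": "json"}, verify="/path/to/ca-bundle.pem")
    resp.raise_for_status()
    installed = {row["component"] for row in resp.json()}
    if "opensearch-knn" not in installed:
        print("opensearch-knn is not installed; knn_vector mappings will be rejected")
    print(sorted(installed))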
[13:41:36] Unrelated, but it looks like cloudelastic only does 60 RPS? https://grafana.wikimedia.org/goto/XZspqVTHR?orgId=1
[13:42:05] we might be just **a little** overprovisioned?
[13:44:38] inflatador: I don't think we provisioned cloudelastic to serve many qps, but rather to be able to host the indices and keep up with the update rate
[13:46:13] dcausse ACK. So if we reduced its specs, there's a chance it would not be able to use the SUP without falling over?
[13:47:39] inflatador: I suspect that we might be able to scale the CPU down? mem & disks I'm not so sure
[13:48:53] dcausse: FYI, I implemented term-query for legacy weighted tag (alongside the new one) https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1130523
[13:49:06] pfischer: thanks! looking
[13:49:44] ACK, the reason I ask is that I'd like to start experimenting with larger VMs or k8s at some point. I'm pretty sure we can collapse down cloudelastic + relforge from 9 physical hosts if we use Ceph storage. Needs a lot of testing but I think that could save WMF a lot of money
[13:53:42] dcausse: forgot to adapt a test… my bad
[13:53:53] np!
[14:00:39] going to depool wdqs1012, seems like it went down for >1 day
[14:02:44] hm... can't depool it, depool prints 3 empty lines
[14:03:28] brouberol: could you help? ^
[14:04:06] actually 3 wdqs nodes are in bad shape: wdqs1013, 1014 and 1018 (very high thread count)
[14:04:28] dcausse I've already depooled it
[14:04:53] Oh, I did not see that inflatador was already around. I'm mixing up my timezones
[14:05:09] something is going on with wdqs in eqiad for sure
[14:05:11] I'll let you figure it out, please scream if you need help!
[14:05:42] inflatador: thanks! seeing just now that the overall maxlag is resolved
[14:06:45] np, I think we should probably alert on thread count or something similar. I missed those other hosts since they dropped off the lag dashboard panel ;(
[14:07:32] the thread count issue is a known problem sadly... the host stops functioning and gets killed/restarted some time after, and then starts to catch up on lag while serving queries
[14:07:39] +1 for an alert
[14:08:30] shouldn't jvmquake kill the process in that case? And it should recover automatically?
[14:09:37] jvmquake only kills on gc activity, not thread count slowly increasing
[14:09:58] the host is mostly idle when deadlocked
[14:15:25] not finding an "official" opensearch image :/
[14:15:37] \o
[14:15:42] err I meant official "minimal" image
[14:15:45] o/
[14:16:36] found https://github.com/rpardini/opensearch-minimal-multiarch but not sure we want that
[14:16:55] gehel: sorry. I only saw your message now
[14:16:59] reading the backscroll
[14:18:37] ok, inflatador seems to have it handled
[14:21:07] yup, thanks for checking though!
[14:27:02] np!
[14:30:28] OK, I've restarted/depooled all the wdqs hosts mentioned above. Watching the lag and will repool once they get back down to reasonable levels
[14:33:16] thx!
[14:36:11] do we build our own opensearch deb?
[14:36:49] 1.3.20 is different between https://apt.wikimedia.org/wikimedia/pool/thirdparty/opensearch1/o/opensearch/ & https://artifacts.opensearch.org/releases/bundle/opensearch
[14:37:16] artifacts.opensearch.org has the additional plugins
[14:38:59] Friday and this morning I've been experimenting with using LLMs as judges for comparing morelike vs vector search. Here's an example (prompt + result) generated with chatgpt https://phabricator.wikimedia.org/P74345
[14:39:13] i'd be curious to hear what your take is
[14:39:29] i have a setup almost working locally with ollama (llama3/mistral)
[14:39:46] modulo some result truncation weirdness :|
[14:41:35] nice! :)
[14:44:29] not sure if useful though :)
[14:46:27] well it has opinions at least, not sure if we can trust it but it's definitely doing some judgement :)
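A minimal sketch of what that local ollama judging loop could look like; the prompt wording, model name and example inputs are made up for illustration, not the actual experiment.

    # LLM-as-judge call against a local ollama instance; prompt, model and example
    # inputs are placeholders.
    import requests

    def judge(seed_title: str, morelike: list[str], vector: list[str]) -> str:
        prompt = (
            f"You are judging two sets of related-article suggestions for {seed_title!r}.\n"
            f"Set A (morelike): {morelike}\n"
            f"Set B (vector search): {vector}\n"
            "Which set is more relevant to a reader of that article? "
            "Answer 'A' or 'B' with a one-sentence rationale."
        )
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3", "prompt": prompt, "stream": False},
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()["response"]

    print(judge("Apollo 11", ["Apollo 12", "Saturn V"], ["Neil Armstrong", "Moon landing"]))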
[14:46:45] dcausse we do build our own opensearch deb
[14:47:03] inflatador: ah that explains it, thanks!
[14:47:44] I think that started with 0lly, but we're the only ones still using Opensearch 1.x. we can add plugins to the current deb if needed
[14:51:37] I don't have strong opinions on this, but I like to have a minimal deb and use the existing wmf-opensearch-plugins deb to provide additional ones, this limits possible conflicting requirements
[14:53:18] yeah, that's probably for the best...once we're on opensearch2, we will be sharing the package with 0lly
[15:01:40] how do we silence the saneitizer fix rate alert? The annoyance is it uses a large window, so once it fires it will fire for a week or two
[15:07:48] I think you can do it in alertmanager's web UI
[15:08:20] ebernhardson: yes, from https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=team%3Dsearch-platform I think you can silence it for X days
[15:10:25] oh of course, i should know that. thanks!
[15:11:28] gentle reminder to get your elastic rename suggestions up in https://etherpad.wikimedia.org/p/elastic-rename-suggestions-T387028 , we will discuss/probably make a decision today
[15:11:28] T387028: Decide on a new name for Elastic hosts - https://phabricator.wikimedia.org/T387028
[15:11:41] i suppose it'll fire in another week when it fixes all those pages that got moved
[15:13:15] yes :/
[15:39:30] quick break, back in time for standup
[15:53:30] back
[15:58:30] * ebernhardson notices we still have a few servers named after elements, seaborgium.wikimedia.org
[16:00:25] * inflatador names his home computers after video game consoles ;P
[16:02:26] pfischer: we're in https://meet.google.com/eki-rafx-cxi
[16:44:33] gmodena: i haven't read it, but this is a related article from someone i respect in the industry: https://thesearchjuggler.com/relevance-judgements-are-boring-so-lets-get-ai-to-do-it/
[16:45:05] another good person: https://softwaredoug.com/blog/2025/01/21/llm-judge-decision-tree
[16:45:19] (these are from the search relevancy slack, run by our old collaborators at open search connections)
[16:45:42] they primarily do e-commerce though
[16:46:02] ebernhardson thanks for the pointers! I did a literature review, but somehow I missed this post
[17:04:49] nifty: starting with OpenSearch 2.19, the Learning to Rank (LTR) plugin is now included by default in the OpenSearch release artifact!
[17:05:59] {◕ ◡ ◕}
[17:06:39] lunch, back in ~40
[17:09:43] gmodena: i suppose a related thing, I talked to the author of https://arxiv.org/abs/1609.00464 at the latest opensearch conference i went to, he was surprised we could drive millions of clicks per day with more like this and said he has seen strictly better results from semantic knowledge graph. Caveat: I can't understand how to implement it (he provided it in Solr) :P
[17:26:04] * ebernhardson spins up the reindexer for cebwiki on opensearch...hoping it still works on opensearch
[17:26:50] oh actually i should review the open patch first
[17:51:56] hmm, it's not doing anything :S
[17:52:01] oh, it just took a minute
[17:57:04] ebernhardson how long do you think that cloudelastic reimage will take? Was gonna play around with the rolling-operation cookbook, but it can wait if necessary
[17:59:29] inflatador: s/reimage/reindex/. The speed varies, the current estimate is ~20:00 (2 hours from now). You can check in deployment.eqiad.wmnet:~ebernhardson/cirrus-reindex.20250324/cloudelastic/cebwiki_*.reindex.log
[18:00:46] cebwiki_general is throwing out all kinds of estimates though, the estimate printed in the logs basically takes how long the latest indexing operation took and guesses when it will end assuming the rest of reindexing proceeds at that rate (the `Complete: Mon, 24 Mar 2024 22:49:43 GMT` part)
[18:02:14] there is no rush on reindexing, in theory if i ctrl-c this process a couple times it will cancel the in-progress reindexes and clean up after itself. It can run at the end of the day
[18:02:46] ebernhardson that's OK, you can let it run
[18:04:09] the estimates are all over the place, from as soon as 18:43 to as late as 01:28 tomorrow
[18:07:51] i wonder if we should mark the port sudachi patch as done...it's deployed to cloudelastic, it will be deployed to prod clusters as they migrate to opensearch
[18:12:58] probably...I think I closed the deb package build one
[18:22:34] dinner
[19:25:18] reindex is ~2/3 done
[19:26:01] yup looks like it's getting there
[19:40:47] * ebernhardson realizes that the reindexer logs include things like `s3.secret-key`, does that matter?
[19:42:03] looks like the two secrets i can find in there are the mediawiki-auth-token and the s3.secret-key
[19:43:39] we could probably throw some filter in there for high-entropy strings. Not sure if worthwhile
[19:46:17] do they contain the secret values? If so, we should probably filter that out
[19:47:33] yes, both of those have the full values coming from helm, it basically prints the full release into the log
[19:47:58] and the mediawiki-auth-token and s3.secret-key are buried in there
[19:52:35] ah, just DM'd ya for more details...dunno how specific we wanna get in a public channel ;)
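A sketch of the "filter for high-entropy strings" idea mentioned above; the token pattern and thresholds are guesses that would need tuning against real log lines.

    # Redact secret-looking tokens from log lines based on length and Shannon
    # entropy. Thresholds and the token pattern are guesses needing tuning.
    import math
    import re

    def shannon_entropy(s: str) -> float:
        return -sum((s.count(c) / len(s)) * math.log2(s.count(c) / len(s)) for c in set(s))

    def redact_secrets(line: str, min_len: int = 20, min_entropy: float = 4.0) -> str:
        def maybe_redact(match: re.Match) -> str:
            token = match.group(0)
            if len(token) >= min_len and shannon_entropy(token) >= min_entropy:
                return "[REDACTED]"
            return token
        return re.sub(r"[A-Za-z0-9+/=_-]{16,}", maybe_redact, line)

    # made-up value, not a real key
    print(redact_secrets("s3.secret-key: aB3dE5fG7hJ9kL1mN3pQ5rS7tU9vW1xY"))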
[20:24:07] in case anyone is curious, this turned out to be a non-issue
[20:26:10] random pondering: Should the reindex orchestrator be a helm release? On the one hand, it seems like the general idea of how automation is "supposed" to be done in k8s. On the other hand, a helm release that deploys other helm releases sounds more like a k8s operator, and that's getting way too deep :P
[20:31:44] Could it be an mwmaint-type script?
[20:32:40] it invokes other mwmaint scripts, and typically mwmaint scripts operate in the context of a single wiki (also, writing this in php would be tedious)
[20:32:58] ACK, sounds like we're kind of in a grey area then
[20:33:33] It needs access to secrets that live on the deploy servers, right?
[20:33:41] it's orchestration of mwmaint scripts, and flink releases. certainly a gray area. Not really interested in reworking it, just random pondering :)
[20:33:59] yes, to do the helm releases for both mwmaint scripts and flink
[20:34:25] ACK, just playing along. And I don't have any decent answers either ;P
[20:45:49] physical therapy, back in ~1h
[20:55:25] meh...reindex orchestration worked for all of the normal bits. First backfill ran fine. Second backfill failed to fetch flink status 30 times in a row, then failed :P
[20:56:56] with an odd error message: stat ('python3', '-c', '[script removed for sanity]'): no such file or directory: unknown
[20:57:12] is it saying it didn't find python3?
[21:05:35] * ebernhardson notices the pod has `curl` now, it didn't used to. Might as well use it
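A sketch of what fetching job status over the Flink REST API could look like (whether done via curl or requests); the JobManager URL is a placeholder for however the orchestrator actually reaches it.

    # Fetch Flink job status over its REST API; the JobManager URL is a placeholder.
    import requests

    FLINK_REST = "http://flink-jobmanager.example.svc:8081"  # placeholder

    def job_states() -> dict[str, str]:
        resp = requests.get(f"{FLINK_REST}/jobs/overview", timeout=10)
        resp.raise_for_status()
        return {job["name"]: job["state"] for job in resp.json()["jobs"]}

    for name, state in job_states().items():
        print(f"{name}: {state}")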
[22:13:54] back
[22:55:35] WIP patch to alert on missing wdqs lag metrics per ryankemper suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1130730 . If anyone can figure out why it's not plumbing thru the instance label LMK, seems like I've run into this one before
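One way to debug a label that isn't coming through is to look at what labels the raw series actually carry; a sketch against the Prometheus HTTP API, with the host and metric name as placeholders.

    # Inspect which labels the raw series expose; host and metric name are placeholders.
    import requests

    PROM = "http://prometheus.example:9090"   # placeholder
    METRIC = "blazegraph_lastupdated"         # placeholder metric name

    resp = requests.get(f"{PROM}/api/v1/query", params={"query": METRIC}, timeout=10)
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        print(series["metric"])  # full label set; check whether `instance` is present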