[13:26:13] ryankemper ebernhardson once y'all get in, can you help catch me up on what happened with the cluster yesterday? Do we need to do an incident report or anything?
[13:32:04] also, it looks like all of WDQS main in CODFW is lagged by 1+ day?
[13:39:13] \o
[13:39:48] inflatador: no incident, we just moved traffic around to make sure it works with the migration to dns-discovery
[13:40:30] ebernhardson ACK, that's a relief
[13:41:01] Re: WDQS, looks like something's wrong with rdf-streaming-updater in codfw. not sure what yet https://grafana.wikimedia.org/goto/QSi9_aPHR?orgId=1
[13:41:24] :S will poke around
[13:41:54] looks like it's trying to recover from a really old savepoint
[13:42:08] Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: Cannot find checkpoint or savepoint file/directory 's3://rdf-streaming-updater-codfw/wikidata/2023-11-29T162302/savepoint-f62514-fc06085c8727'
[13:43:52] I **think** we can point it to the newest checkpoint and redeploy and it should be OK? I don't remember how long checkpoints are valid vs savepoints, but it looks like it crashed around 1100 UTC yesterday
[13:43:59] inflatador: seems plausible
[13:44:21] the 2023 savepoint is from the chart, looks like that is our "initialSavepointPath"
[13:44:25] not sure if it should be updated
[13:45:31] it def should, I just can't remember what failure mode makes it want to grab that initial savepoint
[13:52:36] I have to get ready for a meeting in ~10m but we should be able to get the latest checkpoint from https://logstash.wikimedia.org/goto/25309e0f51a046174e1adbf2c5e46ef4 , apply it to the helm values, and redeploy
[13:52:53] ok, i can try and work that out
[13:53:26] Thanks, I can pick it up in an hour or so if not. I've already depooled CODFW so there's no user impact ATM
[14:28:16] best guess is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1163380 is what we need, but not sure how to verify the swift URLs
[14:39:48] * cormacparle waves
[14:40:02] I've another wish around searching that I just want to check something about
[14:40:06] https://meta.wikimedia.org/wiki/Talk:Community_Wishlist/Wishes/Better_add-new-wikilink_searches
[14:40:30] I was going to respond with this, just hoping ebernhardson will correct me where I have things wrong
[14:40:45] https://www.irccloud.com/pastebin/ZLhbho8a/
[14:43:29] sure, looking
[14:46:06] cormacparle: still trying to understand, i can say this bit (from the user, not you) is wrong: In short: wikilink suggestions should search by all word stems, as they do for the last word stem now.
[14:46:32] there is no special thing that stems part of a string but not the rest
[14:47:03] yeah - the user misunderstands what's going on
[14:47:41] i don't know how i would explain it to a user though...the way it works is generating a graph like this: https://blog.mikemccandless.com/2013/06/build-your-own-finite-state-transducer.html
[14:47:57] and then building an automaton that can walk the graph
[14:48:19] hahaha I don't think we need that level of detail!
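(Aside: per the messages above, the "one character allowed to be wrong" prefixsearch behaviour comes from an FST plus an automaton that walks it. The sketch below is only a plain edit-distance stand-in for that behaviour, not the real code; the title and queries are taken from the wish discussion, and it just shows why dropping one "g" still matches under a 1-edit limit while dropping both needs the 2-edit limit mentioned later.)

```python
def min_edits_to_a_prefix(query: str, title: str) -> int:
    """Smallest edit distance between `query` and any prefix of `title`,
    a rough model of fuzzy prefixsearch against the start of a page title."""
    # prev[j] = edit distance between the query consumed so far and title[:j]
    prev = list(range(len(title) + 1))
    for i, qc in enumerate(query, start=1):
        cur = [i]
        for j, tc in enumerate(title, start=1):
            cur.append(min(
                prev[j] + 1,               # query has an extra character
                cur[j - 1] + 1,            # query is missing a title character
                prev[j - 1] + (qc != tc),  # substitution (free if equal)
            ))
        prev = cur
    return min(prev)

title = "Svetog Rimskog Carstva"
print(min_edits_to_a_prefix("Sveto Rimskog Carstva", title))  # 1: matches with 1 edit allowed
print(min_edits_to_a_prefix("Sveto Rimsko Carstva", title))   # 2: needs a 2-edit limit
```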
[14:48:39] what the user sees is they edit the search term so the first 2 terms match exactly, and then a prefixsearch happens allowing 1 character to be wrong and they get a match
[14:48:55] so they think the last term is stemmed and the others aren't
[14:56:54] actually perhaps this is a better answer for the user, having re-read their original wish
[14:56:56] hmm, not clear :S We call the analyzer 'plain' but it's not really the same as the plain analyzer we use elsewhere, i don't think it's stemming
[14:56:57] Ok here's what happens when you select "Svetog Rimskog Carstva" on the page you mentioned
[14:56:57] 1) we look for an exact title match (and find nothing)
[14:56:57] 2) as soon as you change any of the letters we do a "prefixsearch" which tries to match the search term to the start of a page title ... one character is allowed to be different, so you get a match when you remove the two "g"s
[14:56:57] If we tried to address 1) above and replaced the exact title match with a regular search we would find the page you want, because regular search has stemming ... '''but''' because there are other pages with closer matches in the title (e.g. https://hr.wikipedia.org/wiki/Lotar_I.,_car_Svetog_Rimskog_Carstva) the page you want does not come first in the list
[14:56:58] prefixsearch in 2) above does '''not''' do stemming, but it allows one character to be different, and that's why you're getting a match when you change the first two search terms. The underlying software allows 2 characters to be different, so we could investigate this - it wouldn't solve your problem but would improve it. What do you think?
[14:57:49] yea that seems reasonable
[14:59:07] 👍
[14:59:25] ebernhardson thanks for the patch, will review when I'm done w/my current mtg
[15:01:06] inflatador: i merged it....but didn't end up bold enough to deploy :P
[15:18:22] random idea: We have page creation timestamps in the search index, to support sorting by page creation time. That could also support a new keyword, for example `before:2015` could return only articles that were created before 2015. I have no clue if that's useful for anything :P
[15:18:46] * ebernhardson randomly saw that youtube supports a `before:nnnn` search keyword
[15:19:59] separately i was wondering if maybe we should consider a round of building out additional keywords for editors, but we would almost need to run some sort of consultation to survey editors and find out what they might need
[15:29:35] ebernhardson: how did we end up with the keywords we already support? Did they approach us or did we actively ask for needs?
[15:30:17] pfischer: varies, i think most came from our side. Some came from tickets being filed
[15:34:13] ebernhardson: would you already have a short list of candidates (besides `before:`) to present as survey options (followed by an ‘other’ option)? Do we already have tooling for such kind of community interaction (I hope we do)?
[15:35:09] pfischer: as candidates, not really. But i imagine if we sat down and pondered we could come up with a few. But i worry that we need to talk to editors because we will start from "what data do we have", where editors will start from "what task am i trying to complete"
[15:35:46] pfischer: as for tooling...i'm not sure :S Years ago we had community liaisons which we would work with to start threads on a few different wikis and talk to editors, encourage them to provide options, etc.
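(Aside on the `before:2015` idea above: a minimal sketch of what such a keyword could translate to, assuming the page-creation timestamp is stored in a date field. The field name `create_timestamp` and the query shape are assumptions for illustration, not the actual CirrusSearch keyword implementation.)

```python
import json

def before_filter(year: int, field: str = "create_timestamp") -> dict:
    """Elasticsearch-style range filter for `before:<year>`: only pages
    whose creation timestamp is earlier than Jan 1 of that year."""
    return {"range": {field: {"lt": f"{year}-01-01"}}}

# e.g. a query like `foo before:2015` -> normal text match AND the filter
query = {
    "bool": {
        "must": {"match": {"text": "foo"}},
        "filter": [before_filter(2015)],
    }
}
print(json.dumps(query, indent=2))
```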
[15:35:55] but that's a people process, not tooling
[15:36:35] plausible some come in the wishlist form as well, could review what's been going into there
[15:37:49] i could probably poke chris (our old team community liaison, still at wmf but in a different job) and he would at least have ideas
[16:07:23] * ebernhardson kinda wishes the unit and integration testing suites would ignore local wiki config....
[16:07:50] i guess i should just create a custom wiki for running tests where it doesn't have random config changes i make
[16:42:56] deploying rdf-streaming-updater...let's see what happens
[16:47:24] apply and destroy aren't doing a thing...hmm, maybe I have the selector wrong?
[16:47:35] `helmfile -e codfw --selector name=wikidata -i apply`
[16:49:16] ah yes, you need the `-deploy` user to destroy. Just destroyed/applied...
[16:50:10] still getting a file not found on the new URL, let me check that one with swiftly
[17:20:33] inflatador: any luck?
[17:26:44] ebernhardson I think so...I need to update docs but it should be checkpoint 3831011 per logstash
[17:39:44] OK, this should do the trick: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1163434 . Feel free to review, but I'm gonna self-merge as long as jenkins is happy
[17:45:27] OK, looks like the deploy worked this time. Let's see if the lag starts dropping
[17:45:33] \o/
[17:53:17] That touched off a ton of alerts for lag, so I guess that's a good thing
[17:54:48] yup, sounds like it started working again
[19:07:34] lag is back down to reasonable levels, so I just repooled CODFW
[19:07:52] we cleared ~24h of backlog in about an hour, that's pretty good
[20:39:03] hmm, so the problem with hewikisource memory explosion while building docs is we read 3k docs at a time, and on hewikisource that translates into ~170k completable strings (probably due to subphrase matching)
[20:39:06] well, probably
[20:39:45] we can cut the batch size...but i wonder if we actually want to be generating ~60 completions per title
[20:44:36] could also be something else going on...memory usage keeps building even in my test that builds docs and throws them away
[22:22:15] re: wdqs lag...if the flink-operator was the problem, could the cirrus streaming updater be affected as well?
[22:24:38] I do see a gap in some metrics starting around the time of yesterday's outage (1100 UTC) but the gap only lasts about 90m? https://grafana.wikimedia.org/goto/YLJ9UfPHR?orgId=1
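(Aside on the hewikisource batch-size discussion above: a rough sketch of why subphrase matching multiplies completion strings per title, and how the per-request doc count could be derived from a string budget instead of a fixed 3k docs. The suffix-per-word-boundary expansion and the budget number are assumptions for illustration, not the actual suggester builder.)

```python
def subphrase_variants(title: str) -> list[str]:
    """One completion string per word-boundary suffix of the title,
    a rough model of how subphrase matching expands a single title."""
    words = title.split()
    return [" ".join(words[i:]) for i in range(len(words))]

def batch_size_for_budget(titles: list[str], max_strings_per_batch: int = 50_000) -> int:
    """Pick how many docs to read per batch so the expanded completion
    strings stay under a budget, rather than always reading 3k docs."""
    avg = sum(len(subphrase_variants(t)) for t in titles) / max(len(titles), 1)
    return max(1, int(max_strings_per_batch / avg))

# long multi-part wikisource-style titles expand into many completions
sample = ["Some Work / Volume 1 / Part 2 / Chapter 3 / Section 4"]
print(len(subphrase_variants(sample[0])))  # completions generated for one title
print(batch_size_for_budget(sample))       # docs per batch that fit the budget
```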