[13:26:13] ryankemper ebernhardson once y'all get in, can you help catch me up on what happened with the cluster yesterday? Do we need to do an incident report or anything?
[13:32:04] also, it looks like all of WDQS main in CODFW is lagged by 1+ day?
[13:39:13] \o
[13:39:48] inflatador: no incident, we just moved traffic around to make sure it works with the migration to dns-discovery
[13:40:30] ebernhardson ACK, that's a relief
[13:41:01] Re: WDQS, looks like something's wrong with rdf-streaming-updater in codfw. not sure what yet https://grafana.wikimedia.org/goto/QSi9_aPHR?orgId=1
[13:41:24] :S will poke around
[13:41:54] looks like it's trying to recover from a really old savepoint
[13:42:08] Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: Cannot find checkpoint or savepoint file/directory 's3://rdf-streaming-updater-codfw/wikidata/2023-11-29T162302/savepoint-f62514-fc06085c8727'
[13:43:52] I **think** we can point it to the newest checkpoint and redeploy and it should be OK? I don't remember how long checkpoints are valid vs savepoints, but it looks like it crashed around 1100 UTC yesterday
[13:43:59] inflatador: seems plausible
[13:44:21] the 2023 savepoint is from the chart, looks like that is our "initialSavepointPath"
[13:44:25] not sure if it should be updated
[13:45:31] it def should, I just can't remember what failure mode makes it want to grab that initial savepoint
[13:52:36] I have to get ready for a meeting in ~10m but we should be able to get the latest checkpoint from https://logstash.wikimedia.org/goto/25309e0f51a046174e1adbf2c5e46ef4 , apply it to the helm values, and redeploy
[13:52:53] ok, i can try and work that out
[13:53:26] Thanks, I can pick it up in an hour or so if not. I've already depooled CODFW so there's no user impact ATM
[14:28:16] best guess is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1163380 is what we need, but not sure how to verify the swift URLs
[14:39:48] * cormacparle waves
[14:40:02] I've another wish around searching that I just want to check something about
[14:40:06] https://meta.wikimedia.org/wiki/Talk:Community_Wishlist/Wishes/Better_add-new-wikilink_searches
[14:40:30] I was going to respond with this, just hoping ebernhardson will correct me where I have things wrong
[14:40:45] https://www.irccloud.com/pastebin/ZLhbho8a/
[14:43:29] sure, looking
[14:46:06] cormacparle: still trying to understand, i can say this bit (from the user, not you) is wrong: In short: wikilink suggestions should search by all word stems, as they do for the last word stem now.
[14:46:32] there is no special thing that stems part of a string but not the rest
[14:47:03] yeah - the user misunderstands what's going on
[14:47:41] i don't know how i would explain it to a user though...the way it works is generating a graph like this: https://blog.mikemccandless.com/2013/06/build-your-own-finite-state-transducer.html
[14:47:57] and then building an automaton that can walk the graph
[14:48:19] hahaha I don't think we need that level of detail!
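(Aside: per the messages above, the "one character allowed to be wrong" prefixsearch behaviour comes from an FST plus an automaton that walks it. The sketch below is only a plain edit-distance stand-in for that behaviour, not the real code; the title and queries are taken from the wish discussion, and it just shows why dropping one "g" still matches under a 1-edit limit while dropping both needs the 2-edit limit mentioned later.)

```python
def min_edits_to_a_prefix(query: str, title: str) -> int:
    """Smallest edit distance between `query` and any prefix of `title`,
    a rough model of fuzzy prefixsearch against the start of a page title."""
    # prev[j] = edit distance between the query consumed so far and title[:j]
    prev = list(range(len(title) + 1))
    for i, qc in enumerate(query, start=1):
        cur = [i]
        for j, tc in enumerate(title, start=1):
            cur.append(min(
                prev[j] + 1,               # query has an extra character
                cur[j - 1] + 1,            # query is missing a title character
                prev[j - 1] + (qc != tc),  # substitution (free if equal)
            ))
        prev = cur
    return min(prev)

title = "Svetog Rimskog Carstva"
print(min_edits_to_a_prefix("Sveto Rimskog Carstva", title))  # 1: matches with 1 edit allowed
print(min_edits_to_a_prefix("Sveto Rimsko Carstva", title))   # 2: needs a 2-edit limit
```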
[14:48:39] what the user sees is they edit the search term so the first 2 terms match exactly, and then a prefixsearch happens allowing 1 character to be wrong and they get a match
[14:48:55] so they think the last term is stemmed and the others aren't
[14:56:54] actually perhaps this is a better answer for the user, having re-read their original wish
[14:56:56] hmm, not clear :S We call the analyzer 'plain' but it's not really the same as the plain analyzer we use elsewhere, i don't think it's stemming
[14:56:57] Ok here's what happens when you select "Svetog Rimskog Carstva" on the page you mentioned
[14:56:57] 1) we look for an exact title match (and find nothing)
[14:56:57] 2) as soon as you change any of the letters we do a "prefixsearch" which tries to match the search term to the start of a page title ... one character is allowed to be different, so you get a match when you remove the two "g"s
[14:56:57] If we tried to address 1) above and replaced the exact title match with a regular search we would find the page you want, because regular search has stemming ... '''but''' because there are other pages with closer matches in the title (e.g. https://hr.wikipedia.org/wiki/Lotar_I.,_car_Svetog_Rimskog_Carstva) the page you want does not come first in the list
[14:56:58] prefixsearch in 2) above does '''not''' do stemming, but it allows one character to be different, and that's why you're getting a match when you change the first two search terms. The underlying software allows 2 characters to be different, so we could investigate this - it wouldn't solve your problem but would improve it. What do you think?
[14:57:49] yea that seems reasonable
[14:59:07] 👍
[14:59:25] ebernhardson thanks for the patch, will review when I'm done w/my current mtg
[15:01:06] inflatador: i merged it....but didn't end up bold enough to deploy :P
[15:18:22] random idea: We have page creation timestamps in the search index, to support sorting by page creation time. That could also support a new keyword, for example `before:2015` could return only articles that were created before 2015. I have no clue if that's useful for anything :P
[15:18:46] * ebernhardson randomly saw that youtube supports a `before:nnnn` search keyword
[15:19:59] separately i was wondering if maybe we should consider a round of building out additional keywords for editors, but we would almost need to run some sort of consultation to survey editors and find out what they might need
[15:29:35] ebernhardson: how did we end up with the keywords we already support? Did they approach us or did we actively ask for needs?
[15:30:17] pfischer: varies, i think most came from our side. Some came from tickets being filed
[15:34:13] ebernhardson: would you already have a short list of candidates (besides `before:`) to present as survey options (followed by an ‘other’ option)? Do we already have tooling for such kind of community interaction (I hope we do)?
[15:35:09] pfischer: as candidates, not really. But i imagine if we sat down and pondered we could come up with a few. But i worry that we need to talk to editors because we will start from "what data do we have", where editors will start from "what task am i trying to complete"
[15:35:46] pfischer: as for tooling...i'm not sure :S Years ago we had community liaisons which we would work with to start threads on a few different wikis and talk to editors, encourage them to provide options, etc.
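(Aside on the `before:2015` idea above: a minimal sketch of what such a keyword could translate to, assuming the page-creation timestamp is stored in a date field. The field name `create_timestamp` and the query shape are assumptions for illustration, not the actual CirrusSearch keyword implementation.)

```python
import json

def before_filter(year: int, field: str = "create_timestamp") -> dict:
    """Elasticsearch-style range filter for `before:<year>`: only pages
    whose creation timestamp is earlier than Jan 1 of that year."""
    return {"range": {field: {"lt": f"{year}-01-01"}}}

# e.g. a query like `foo before:2015` -> normal text match AND the filter
query = {
    "bool": {
        "must": {"match": {"text": "foo"}},
        "filter": [before_filter(2015)],
    }
}
print(json.dumps(query, indent=2))
```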
[15:35:55] but that's a people process, not tooling
[15:36:35] plausible some come in the wishlist form as well, could review what's been going into there
[15:37:49] i could probably poke chris (our old team community liaison, still at wmf but in a different job) and he would at least have ideas
[16:07:23] * ebernhardson kinda wishes the unit and integration testing suites would ignore local wiki config....
[16:07:50] i guess i should just create a custom wiki for running tests where it doesn't have random config changes i make
[16:42:56] deploying rdf-streaming-updater...let's see what happens
[16:47:24] apply and destroy aren't doing a thing...hmm, maybe I have the selector wrong?
[16:47:35] `helmfile -e codfw --selector name=wikidata -i apply`
[16:49:16] ah yes, you need the `-deploy` user to destroy. Just destroyed/applied...
[16:50:10] still getting a file not found on the new URL, let me check that one with swiftly
[17:20:33] inflatador: any luck?
[17:26:44] ebernhardson I think so...I need to update docs but it should be checkpoint 3831011 per logstash
[17:39:44] OK, this should do the trick: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1163434 . Feel free to review, but I'm gonna self-merge as long as jenkins is happy
[17:45:27] OK, looks like the deploy worked this time. Let's see if the lag starts dropping
[17:45:33] \o/
[17:53:17] That touched off a ton of alerts for lag, so I guess that's a good thing
[17:54:48] yup, sounds like it started working again
[19:07:34] lag is back down to reasonable levels, so I just repooled CODFW
[19:07:52] we cleared ~24h of backlog in about an hour, that's pretty good
[20:39:03] hmm, so the problem with hewikisource memory explosion while building docs is we read 3k docs at a time, and on hewikisource that translates into ~170k completable strings (probably due to subphrase matching)
[20:39:06] well, probably
[20:39:45] we can cut the batch size...but i wonder if we actually want to be generating ~60 completions per title
[20:44:36] could also be something else going on...memory usage keeps building even in my test that builds docs and throws them away
[22:22:15] re: wdqs lag...if the flink-operator was the problem, could the cirrus streaming updater be affected as well?
[22:24:38] I do see a gap in some metrics starting around the time of yesterday's outage (1100 UTC) but the gap only lasts about 90m? https://grafana.wikimedia.org/goto/YLJ9UfPHR?orgId=1
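(Aside on the hewikisource batch-size discussion above: a rough sketch of why subphrase matching multiplies completion strings per title, and how the per-request doc count could be derived from a string budget instead of a fixed 3k docs. The suffix-per-word-boundary expansion and the budget number are assumptions for illustration, not the actual suggester builder.)

```python
def subphrase_variants(title: str) -> list[str]:
    """One completion string per word-boundary suffix of the title,
    a rough model of how subphrase matching expands a single title."""
    words = title.split()
    return [" ".join(words[i:]) for i in range(len(words))]

def batch_size_for_budget(titles: list[str], max_strings_per_batch: int = 50_000) -> int:
    """Pick how many docs to read per batch so the expanded completion
    strings stay under a budget, rather than always reading 3k docs."""
    avg = sum(len(subphrase_variants(t)) for t in titles) / max(len(titles), 1)
    return max(1, int(max_strings_per_batch / avg))

# long multi-part wikisource-style titles expand into many completions
sample = ["Some Work / Volume 1 / Part 2 / Chapter 3 / Section 4"]
print(len(subphrase_variants(sample[0])))  # completions generated for one title
print(batch_size_for_budget(sample))       # docs per batch that fit the budget
```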