[13:13:13] \o [14:10:12] tempted to try and get llama providing embeddings from relforge since the ml-inference is down and still trying to ship something today... [14:12:51] well that's annoying... in mwscript MediaWiki\Http\Telemetry::getInstance()->getRequestId() returns a 25 char string. But when it makes a request to the semanticsearch-test cluster it gets rejected because the envoy (i'm assuming) there only accepts 32 char X-Request-Id headers [15:25:09] also realizing we should probably set up a mediawiki-side envoy so it maintains connections from codfw->eqiad [15:40:41] ebernhardson: hmmmmm that's probably my fault 😅 [15:40:53] I bet I can get that MW codepath fixed next week, can you open a task? [15:51:18] cdanis: sure, i'll file one [16:36:44] I see an-backup-datanode1033 and an-worker1168 marked as Active in Netbox but disappeared from PuppetDB (usually means they are broken), and I don't see open tasks for them; is that known? [16:43:52] volans: Many thanks. an-backup-datanode1033 can be deleted from NetBox. an-worker1168 had puppet left disabled by mistake. I'm just running it manually now. [16:44:29] Will you clean up an-backup-datanode1033 from netbox, or shall I? [16:44:59] why cleanup? [16:45:06] what's the status of the host [16:47:17] don't change Netbox manually for this, there are scripts and automations around the server lifecycle [16:54:56] btullis: ^ [16:56:44] OK, got it. The host doesn't exist any more. It was an old an-worker node that we renamed in T397166, but then we canned the project and decommissioned it. So as far as I know, it has been de-racked. [16:56:45] T397166: Reimage and rename 46 hadoop worker nodes to use in the HDFS backup cluster - https://phabricator.wikimedia.org/T397166 [16:57:52] I don't see that the decommissioning cookbook was ever run on it, am I missing something? https://sal.toolforge.org/production?p=0&q=%22an-backup-datanode1033%22&d= [17:00:34] Oh, you're quite right and I'm totally wrong. 
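The X-Request-Id rejection described above (a 25-char mwscript request id against a backend that only accepts 32-char ids, later traced in this log to opensearch 3.x validation of exactly 32 hex characters) can be sketched in Python. The function and regex here are illustrative only, not the actual server code:

```python
# Illustrative sketch (assumption: not real OpenSearch code) of the
# strict X-Request-Id validation discussed in this log: the value must
# be exactly 32 lowercase hex characters, with no way to relax it.
import re
import uuid

HEX32 = re.compile(r"^[0-9a-f]{32}$")

def request_id_ok(value: str) -> bool:
    """Accept only a 32-char hex string, per opensearch 3.x behavior."""
    return HEX32.fullmatch(value) is not None

# A 25-char mwscript-style id fails:
print(request_id_ok("a" * 25))            # False
# A hyphenated uuidv4 (36 chars, contains '-') also fails:
print(request_id_ok(str(uuid.uuid4())))   # False
# Only the dash-free hex form of a uuid4 passes:
print(request_id_ok(uuid.uuid4().hex))    # True
```

This is why both the mwscript id and the prod uuidv4 (which contains hyphens) get rejected, even though both are perfectly usable correlation ids.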
[17:00:47] and then dcops will take care of the following bits [17:04:07] so basically the next steps depend on whether it has an OS, whether it's powered on, and what you plan to do with it [17:05:26] Thanks. I'm running the decom cookbook against it now. I'll add a note to T404970 to say that I had accidentally missed it. [17:05:26] T404970: Decommission the 46 hadoop workers and 2 namenode servers that were planned for the hadoop-backup cluster - https://phabricator.wikimedia.org/T404970 [17:54:12] doh, it turns out the problem with x-request-id isn't limited to mwscript, i'm getting the same error with web requests :( Thankfully it doesn't seem to apply to the prod request flows, maybe because they have a local envoy [17:57:56] * ebernhardson should probably try and understand how exactly that flow works [18:00:23] We're going to add envoy to the k8s opensearch clusters pretty soon too (assuming it works), ref https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1248865 [18:00:49] inflatador: thanks! That was also on my list of things to check into today [18:01:25] i was also pondering setting up some sort of healthcheck for the semanticsearch cluster, to have something that lets us know if the embedding service is down (as it was yesterday) [18:01:55] probably they should have their own healthcheck on the ml side (maybe they do, uncertain), but it seems reasonable to have our own on our side too [18:02:05] We might be the first who are doing TLS termination + re-encryption back to the backend, hopefully it works [18:02:22] inflatador: does it have to tls terminate? 
The other clusters use http to localhost [18:02:28] i guess i don't have a strong opinion either way [18:03:08] ebernhardson: great question, I think the old version of the chart does force you to use TLS on the pods themselves [18:03:28] I would prefer not to, maybe the new chart will let us get by without it [18:05:24] I'm currently testing the opensearch-cluster 3.3.2 tag https://github.com/opensearch-project/opensearch-k8s-operator/blob/opensearch-cluster-3.2.2/charts/opensearch-cluster/README.md [18:05:33] err...3.2.2 that is [18:10:03] Looks like Ben did get `opensearch-test` codfw working with TLS termination, ref https://grafana.wikimedia.org/goto/dff80s8lp7ke8d?orgId=1 [18:11:27] oh! I think the x-request-id problem isn't envoy, i think that's built in to opensearch [18:11:41] i guess it must have been added after the 1.3 we use in prod [18:13:11] i think that means i need some envoy magic on the instances that strips that header when forwarding to localhost [18:13:42] yes, a quick look suggests that strict validation is part of opensearch 3.x [18:18:23] indeed, poking around the opensearch 3.5 branch, there is no option to disable validation; if it exists it must be exactly 32 hexadecimal chars [18:18:28] (ours has '-' in it) [18:21:34] I guess I don't know much about `x-request-id`, is there some kind of convention that it's always 32 chars? [18:22:09] it's usually a uuid, i think we use uuidv4 in prod, but at a general level it's used so that varied services can attach the same string to their logs to correlate them [18:22:12] at least, that's how i use it [18:22:45] i'm also suspecting now that we don't have envoy on the opensearch clusters, since mesh is disabled [18:22:58] poking through `kubectl get pod ... 
-o yaml` now to try and better understand [18:23:29] mildly annoyed at opensearch for having such strict validation for something like this... it shouldn't blow up a request [18:23:36] it should just ignore it for its own purposes [18:24:04] we def don't have envoy on the k8s clusters ATM [18:24:44] is that because it's not working yet? Could i set one up? Not certain, but currently pondering the least-awkward way to solve this. envoy header stripping might be viable [18:25:24] i suppose i probably thought envoy was there because the server opens 9200, but the public port is 30433 [18:25:41] *30443 [18:31:24] Ben just got it to work a few minutes ago, we can try deploying it on `opensearch-semantic-search-test` if you like. [18:32:17] can't hurt to test, although i guess then i gotta figure out if the templating allows enough flexibility for arbitrary header stripping. Haven't got that far yet :) [18:33:10] NP. Heading to lunch now but will be around for the next ~4h at least if you wanna give it a shot [18:33:26] oddly... i'm still trying to figure out how that gets into the cirrus code path. [18:34:01] oh, actually it's just because it's in the Elastica extension and not Cirrus [18:38:30] looking at the patch, it already strips x-request-id (i think). That sounds wonderful to try out [18:44:13] can verify that -test does strip the header and avoids the issue. But then it also fails TLS validation: SSL certificate problem: certificate has expired [18:52:43] going to guess the expiration is intentional? Expiry date is `notAfter=Mar 5 16:37:00 2026 GMT`, which implies to me it was a test cert generated with a very short expiry date [19:06:23] ebernhardson: one option you have in the MW client code is something like the following [19:07:02] Telemetry::getInstance()->overrideReqId( MediaWikiServices::getInstance()->getGlobalIdGenerator()->getUUIDv4() ); [19:11:20] cdanis: interesting, that might be possible. 
Indeed if we have to munge it, i would want the munged value to show up in mw logging too. I guess that needs to happen fairly early? [19:12:06] earlier is better (and I was looking at doing something like this in ServiceWiring.php) but as long as it's before you initiate the http request, it should be okay [19:12:27] I think all the various MW HTTP clients get a reference to the Telemetry singleton, and don't get a copy of the reqid [19:13:53] it would have to be something like strtr(...->getUUIDv4(), ['-' => '']); I feel a little awkward about breaking the conventions expected there though [19:16:23] uh hmm [19:16:47] maybe it's best to aim to strip that in envoy before it lands on the server instead? [19:16:55] I think you are going to have to [19:17:07] * ebernhardson is separately annoyed opensearch would add strict validation for a generally used header with no RFC... but what are you going to do [19:17:13] yeah that's wild tbh [19:18:18] so, nowadays, any request that starts from the CDN will wind up with a uuidv4 as reqid, generated as soon as the request arrives from the user in our haproxy TLS terminator [19:19:50] that's awesome, this request-id logging has come a long way in the last decade :) [19:21:36] i'll still be filing the ticket for mwscript btw, it's not as critical, but it is different as a 24 char hex string, vs the web which has the full uuidv4 [19:22:25] back [19:22:47] inflatador: it seems our best bet is envoy + stripping (already in the patch). Would love to test [19:22:48] thanks Erik :) [19:29:37] c-danis had more than a little to do with that ;P [19:30:18] ebernhardson: you mean the mw config patch that you needed to test is already deployed? 
[19:30:51] inflatador: yea that's out there, in theory this is the last step to getting the prod api available for apps team testing, although that's what i thought about the last patch :P [19:31:05] i checked with releng and sre and they let me deploy it earlier today [19:32:01] Ah OK, so you're just blocked on the k8s side. We can do a homedir deploy now if you like. [19:32:41] sure, is that basically running the same commands, but from a clone in ~? I guess i haven't tried that yet, but it makes sense [19:33:11] yeah, we can get on a Meet and I can walk you thru it [19:33:14] https://wikitech.wikimedia.org/wiki/User:BKing_(WMF)/Notes/homedir_deploy_k8s are my notes [19:34:11] sure [19:34:11] that's a good trick Brian [19:34:14] sometimes I also sshfs my deploy host homedir's version of deployment-charts 😇 [19:36:58] Nice! I forgot to mention in my notes, but if you're making chart changes you have to point helmfile to `../../../charts/` [19:37:17] ebernhardson up in https://meet.google.com/qtf-yoqt-tmu whenever [19:56:15] req/resp cycle seems to work now. I'm a little suspicious of the current results though: https://fr.wikipedia.org/w/api.php?action=query&format=json&list=search&formatversion=2&srsearch=Quelle%20est%20la%20capitale%20de%20la%20France%3F&cirrusSemanticSearch [20:00:49] oh, of course... i forgot to prepend the query with instructions in CirrusSearch [20:01:23] It's a little silly, but we don't query with "foobar", we query with "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: foobar" [21:21:25] adjusted the ml_connector in opensearch to have an `instruct` parameter; not sure if that's the best long-term location. Maybe [21:45:26] not sure what exactly changed, but perf is much better today. It would previously cap at ~35 qps, even with everything in memory. 
currently seeing 50 qps with 40g of indexes on 17g of disk cache [21:49:36] it actually runs into a new problem: it runs out of heap memory (circuit breakers trigger) if too many concurrent requests are sent. but not a big deal
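The envoy-side stripping that the log settles on (drop the incoming x-request-id before it reaches opensearch, so the strict validation never fires) can be sketched as a route config fragment. The `request_headers_to_remove` field is a real envoy route option, but the virtual host and cluster names here are made up, and the actual deployment-charts templating may express this differently:

```yaml
# Hypothetical envoy route fragment: remove x-request-id before
# forwarding to the local opensearch backend, so opensearch 3.x
# never sees a value that fails its 32-hex validation.
virtual_hosts:
  - name: opensearch_local            # illustrative name
    domains: ["*"]
    request_headers_to_remove:
      - x-request-id
    routes:
      - match: { prefix: "/" }
        route: { cluster: local_opensearch }   # illustrative cluster name
```

One caveat: envoy's HTTP connection manager generates its own x-request-id by default (`generate_request_id`), and that generated value is a hyphenated UUID, so in practice the deployment may also need request-id generation disabled; presumably the patch referenced above handles this.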