[13:13:13] \o [14:10:12] tempted to try and get llama providing embeddings from relforge since the ml-inference is down and still trying to ship something today... [14:12:51] well that's annoying... in mwscript MediaWiki\Http\Telemetry::getInstance()->getRequestId() returns a 25 char string. But when it makes a request to the semanticsearch-test cluster it gets rejected because the envoy (i'm assuming) there only accepts 32 char X-Request-Id headers [15:25:09] also realizing we should probably set up a mediawiki-side envoy so it maintains connections from codfw->eqiad [15:40:41] ebernhardson: hmmmmm that's probably my fault 😅 [15:40:53] I bet I can get that MW codepath fixed next week, can you open a task? [15:51:18] cdanis: sure, i'll file one [16:36:44] I see an-backup-datanode1033 and an-worker1168 marked as Active in Netbox but disappeared from PuppetDB (usually means they are broken), and I don't see open tasks for them; is that known? [16:43:52] volans: Many thanks. an-backup-datanode1033 can be deleted from NetBox. an-worker1168 had puppet left disabled by mistake. I'm just running it manually now. [16:44:29] Will you clean up an-backup-datanode1033 from netbox, or shall I? [16:44:59] why cleanup? [16:45:06] what's the status of the host [16:47:17] don't change Netbox manually for this, there are scripts and automations around the server lifecycle [16:54:56] btullis: ^ [16:56:44] OK, got it. The host doesn't exist any more. It was an old an-worker node that we renamed in T397166, but then we canned the project and decommissioned it. So as far as I know, it has been de-racked. [16:56:45] T397166: Reimage and rename 46 hadoop worker nodes to use in the HDFS backup cluster - https://phabricator.wikimedia.org/T397166 [16:57:52] I don't see that the decommissioning cookbook was ever run on it, am I missing something? https://sal.toolforge.org/production?p=0&q=%22an-backup-datanode1033%22&d= [17:00:34] Oh, you're quite right and I'm totally wrong. 
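The X-Request-Id rejection described above (a 25-char mwscript request id against a backend that only accepts 32-char ids, later traced in this log to opensearch 3.x validation of exactly 32 hex characters) can be sketched in Python. The function and regex here are illustrative only, not the actual server code:

```python
# Illustrative sketch (assumption: not real OpenSearch code) of the
# strict X-Request-Id validation discussed in this log: the value must
# be exactly 32 lowercase hex characters, with no way to relax it.
import re
import uuid

HEX32 = re.compile(r"^[0-9a-f]{32}$")

def request_id_ok(value: str) -> bool:
    """Accept only a 32-char hex string, per opensearch 3.x behavior."""
    return HEX32.fullmatch(value) is not None

# A 25-char mwscript-style id fails:
print(request_id_ok("a" * 25))            # False
# A hyphenated uuidv4 (36 chars, contains '-') also fails:
print(request_id_ok(str(uuid.uuid4())))   # False
# Only the dash-free hex form of a uuid4 passes:
print(request_id_ok(uuid.uuid4().hex))    # True
```

This is why both the mwscript id and the prod uuidv4 (which contains hyphens) get rejected, even though both are perfectly usable correlation ids.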
[17:00:47] and then dcops will take care of the following bits [17:04:07] so basically the next steps depend on whether it has an OS, whether it's powered on, and what you plan to do with it [17:05:26] Thanks. I'm running the decom cookbook against it now. I'll add a note to T404970 to say that I had accidentally missed it. [17:05:26] T404970: Decommission the 46 hadoop workers and 2 namenode servers that were planned for the hadoop-backup cluster - https://phabricator.wikimedia.org/T404970 [17:54:12] doh, it turns out the problem with x-request-id isn't limited to mwscript, i'm getting the same error with web requests :( Thankfully it doesn't seem to apply to the prod request flows, maybe because they have a local envoy [17:57:56] * ebernhardson should probably try and understand how exactly that flow works [18:00:23] We're going to add envoy to the k8s opensearch clusters pretty soon too (assuming it works), ref https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1248865 [18:00:49] inflatador: thanks! That was also on my list of things to check into today [18:01:25] i was also pondering setting up some sort of healthcheck for the semanticsearch cluster, to have something that lets us know if the embedding service is down (as it was yesterday) [18:01:55] probably they should have their own healthcheck on the ml side (maybe they do, uncertain), but it seems reasonable to have our own on our side too [18:02:05] We might be the first who are doing TLS termination + re-encryption back to the backend, hopefully it works [18:02:22] inflatador: does it have to tls terminate? 
The other clusters use http to localhost [18:02:28] i guess i don't have a strong opinion either way [18:03:08] ebernhardson: great question, I think the old version of the chart does force you to use TLS on the pods themselves [18:03:28] I would prefer not to, maybe the new chart will let us get by without it [18:05:24] I'm currently testing the opensearch-cluster 3.3.2 tag https://github.com/opensearch-project/opensearch-k8s-operator/blob/opensearch-cluster-3.2.2/charts/opensearch-cluster/README.md [18:05:33] err...3.2.2 that is [18:10:03] Looks like Ben did get `opensearch-test` codfw working with TLS termination, ref https://grafana.wikimedia.org/goto/dff80s8lp7ke8d?orgId=1 [18:11:27] oh! I think the x-request-id problem isn't envoy, i think that's built in to opensearch [18:11:41] i guess it must have been added after the 1.3 we use in prod [18:13:11] i think that means i need some envoy magic on the instances that strips that header when forwarding to localhost [18:13:42] yes, a quick look suggests that strict validation is part of opensearch 3.x [18:18:23] indeed, poking around the opensearch 3.5 branch, there is no option to disable validation; if it exists it must be exactly 32 hexadecimal chars [18:18:28] (ours has '-' in it) [18:21:34] I guess I don't know much about `x-request-id`, is there some kind of convention that it's always 32 chars? [18:22:09] it's usually a uuid, i think we use uuidv4 in prod, but at a general level it's used so that varied services can attach the same string to their logs to correlate them [18:22:12] at least, that's how i use it [18:22:45] i'm also suspecting now that we don't have envoy on the opensearch clusters, since mesh is disabled [18:22:58] poking through `kubectl get pod ... 
-o yaml` now to try and better understand [18:23:29] mildly annoyed at opensearch for having such strict validation for something like this... it shouldn't blow up a request [18:23:36] it should just ignore it for its own purposes [18:24:04] we def don't have envoy on the k8s clusters ATM [18:24:44] is that because it's not working yet? Could i set one up? Not certain, but currently pondering the least-awkward way to solve this. envoy header stripping might be viable [18:25:24] i suppose i probably thought envoy was there because the server opens 9200, but the public port is 30433 [18:25:41] *30443 [18:31:24] Ben just got it to work a few minutes ago, we can try deploying it on `opensearch-semantic-search-test` if you like. [18:32:17] can't hurt to test, although i guess then i gotta figure out if the templating allows enough flexibility for arbitrary header stripping. Haven't got that far yet :) [18:33:10] NP. Heading to lunch now but will be around for the next ~4h at least if you wanna give it a shot [18:33:26] oddly... i'm still trying to figure out how that gets into the cirrus code path. [18:34:01] oh, actually it's just because it's in the Elastica extension and not Cirrus [18:38:30] looking at the patch, it already strips x-request-id (i think). That sounds wonderful to try out [18:44:13] can verify that -test does strip the header and avoids the issue. But then it also fails TLS validation: SSL certificate problem: certificate has expired [18:52:43] going to guess the expiration is intentional? Expiry date is `notAfter=Mar 5 16:37:00 2026 GMT`, which implies to me it was a test cert generated with a very short expiry date [19:06:23] ebernhardson: one option you have in the MW client code is something like the following [19:07:02] Telemetry::getInstance()->overrideReqId( MediaWikiServices::getInstance()->getGlobalIdGenerator()->getUUIDv4() ); [19:11:20] cdanis: interesting, that might be possible. 
Indeed if we have to munge it, i would want the munged value to show up in mw logging too. I guess that needs to happen fairly early? [19:12:06] earlier is better (and I was looking at doing something like this in ServiceWiring.php) but as long as it's before you initiate the http request, it should be okay [19:12:27] I think all the various MW HTTP clients get a reference to the Telemetry singleton, and don't get a copy of the reqid [19:13:53] it would have to be something like strtr(...->getUUIDv4(), ['-' => '']); I feel a little awkward about breaking the conventions expected there though [19:16:23] uh hmm [19:16:47] maybe it's best to aim to strip that in envoy before it lands on the server instead? [19:16:55] I think you are going to have to [19:17:07] * ebernhardson is separately annoyed opensearch would add strict validation for a generally used header with no RFC... but what are you going to do [19:17:13] yeah that's wild tbh [19:18:18] so, nowadays, any request that starts from the CDN will wind up with a uuidv4 as reqid, generated as soon as the request arrives from the user in our haproxy TLS terminator [19:19:50] that's awesome, this request-id logging has come a long way in the last decade :) [19:21:36] i'll still be filing the ticket for mwscript btw, it's not as critical, but it is different as a 24 char hex string, vs the web which has the full uuidv4 [19:22:25] back [19:22:47] inflatador: it seems our best bet is envoy + stripping (already in the patch). Would love to test [19:22:48] thanks Erik :) [19:29:37] c-danis had more than a little to do with that ;P [19:30:18] ebernhardson: you mean the mw config patch that you needed to test is already deployed? 
[19:30:51] inflatador: yea that's out there, in theory this is the last step to getting the prod api available for apps team testing, although that's what i thought about the last patch :P [19:31:05] i checked with releng and sre and they let me deploy it earlier today [19:32:01] Ah OK, so you're just blocked on the k8s side. We can do a homedir deploy now if you like. [19:32:41] sure, is that basically running the same commands, but from a clone in ~? I guess i haven't tried that yet, but it makes sense [19:33:11] yeah, we can get on a Meet and I can walk you thru it [19:33:14] https://wikitech.wikimedia.org/wiki/User:BKing_(WMF)/Notes/homedir_deploy_k8s are my notes [19:34:11] sure [19:34:11] that's a good trick Brian [19:34:14] sometimes I also sshfs my deploy host homedir's version of deployment-charts 😇 [19:36:58] Nice! I forgot to mention in my notes, but if you're making chart changes you have to point helmfile to `../../../charts/` [19:37:17] ebernhardson up in https://meet.google.com/qtf-yoqt-tmu whenever [19:56:15] req/resp cycle seems to work now. I'm a little suspicious of the current results though: https://fr.wikipedia.org/w/api.php?action=query&format=json&list=search&formatversion=2&srsearch=Quelle%20est%20la%20capitale%20de%20la%20France%3F&cirrusSemanticSearch [20:00:49] oh, of course... i forgot to prepend the query with instructions in CirrusSearch [20:01:23] It's a little silly, but we don't query with "foobar", we query with "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: foobar" [21:21:25] adjusted the ml_connector in opensearch to have an `instruct` parameter; not sure if that's the best long-term location. Maybe [21:45:26] not sure what exactly changed, but perf is much better today. It would previously cap at ~35 qps, even with everything in memory. 
currently seeing 50 qps with 40g of indexes on 17g of disk cache [21:49:36] it actually runs into a new problem: it runs out of heap memory (circuit breakers trigger) if too many concurrent requests are sent. but not a big deal
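The envoy-side stripping that the log settles on (drop the incoming x-request-id before it reaches opensearch, so the strict validation never fires) can be sketched as a route config fragment. The `request_headers_to_remove` field is a real envoy route option, but the virtual host and cluster names here are made up, and the actual deployment-charts templating may express this differently:

```yaml
# Hypothetical envoy route fragment: remove x-request-id before
# forwarding to the local opensearch backend, so opensearch 3.x
# never sees a value that fails its 32-hex validation.
virtual_hosts:
  - name: opensearch_local            # illustrative name
    domains: ["*"]
    request_headers_to_remove:
      - x-request-id
    routes:
      - match: { prefix: "/" }
        route: { cluster: local_opensearch }   # illustrative cluster name
```

One caveat: envoy's HTTP connection manager generates its own x-request-id by default (`generate_request_id`), and that generated value is a hyphenated UUID, so in practice the deployment may also need request-id generation disabled; presumably the patch referenced above handles this.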