[11:34:26] lunch
[13:07:13] \o
[13:29:43] o/
[13:36:56] .o/
[13:52:49] the discovery endpoint returns "Remote service returned error status 404 with empty body" from mediawiki@k8s-mwdebug-codfw
[13:53:53] working fine from mwdebug-eqiad
[13:54:02] :S
[13:54:30] is there some config in e.g. envoy that could explain this?
[13:54:53] fwiw, it works from deploy2002 in codfw
[13:54:57] possibly, although I'm not sure what
[13:55:04] a curl works fine tho from within the cluster
[13:55:28] we could try rolling back the envoy TLS termination, but you'd lose the request ID headers
[13:55:54] we actually need the x-request-id headers to be stripped, or it fails
[13:56:36] damn. That's probably worth a GitHub issue if there isn't one already ;(
[13:57:16] where is this x-request-id hack happening?
[13:57:30] envoy on the opensearch side strips the x-request-id
[13:57:49] so it's being used when using curl I guess?
[13:57:55] I'm not sure why it would work in one DC but not the other
[13:58:10] deploying the envoy termination was a bit of a hack though, we pulled the open patch down to a clone on the deployment host and deployed from that clone
[13:58:17] I guess it works on deploy though
[13:58:29] the "with empty body" is weird in "Remote service returned error status 404 with empty body"
[13:58:42] not sure what could respond 404 without a body
[13:58:54] I wonder if there's some IP range that needs to be allowed to mwdebug or something?
[13:59:01] yea that feels like an intermediary thing rejecting it
[13:59:21] opensearch-ipoid (which doesn't have envoy TLS termination) works fine in CODFW
[14:00:12] What about hitting the codfw endpoint from mwdebug, as in `opensearch-semantic-search.svc.codfw.wmnet`?
[14:01:08] I'll test this after, need to revert the current one now
[14:02:29] inflatador: is the discovery enabled for ipoid?
[14:03:13] dcausse: yes
[14:03:25] it's possible that ipoid does not propagate the x-request-id, could explain why you don't need this envoy hack?
[14:06:03] I think it's because it's on an older version of OpenSearch (2.x)
[14:06:16] ebernhardson might know for sure, but I think that strict enforcement of the headers is new
[14:09:11] it came in opensearch 3.x
[14:09:44] but it's probably not this envoy hack failing in this case, otherwise curl from a codfw host would not work
[14:09:54] yea indeed
[14:10:12] i tried from eqiad and codfw deploy hosts, fine from there...i don't know what to guess :S
[14:11:40] my best guess is something to do with the newness of the dse-k8s-codfw cluster, something is not allowlisted somewhere. If you wanna get a ticket started I can try and run it down with the other SREs/ServiceOps
[14:11:56] it's pointing to k8s-ingress-dse-aa.discovery.wmnet so the host header must be important I guess
[14:12:17] curiously, i can also curl_init/curl_exec it from a mwscript shell.php
[14:12:37] oh, silly..that was mwscript and not mwscript-k8s
[14:13:14] works from mwscript-k8s too, hmm
[14:13:32] getting a 404 with no body with 'curl -v -HHost:foo.bar https://opensearch-semantic-search.discovery.wmnet:30443/'
[14:14:39] hmm
[14:14:55] Does -H work with HTTPS? I thought you had to do something like `curl --resolve example.com:443:127.0.0.1 https://example.com`?
[14:15:32] the host header is part of http and curl does seem to respect the one you provide
[14:15:35] it probably still sends the host header, it sounds like istio or envoy is doing host based routing? But not sure why we would send an odd host header
[14:15:40] ah, OK
[14:16:53] do you have request logs on the ingress?
[14:17:15] BTW, I just set readahead for eqiad, forgot to do that yesterday after redeploying
[14:18:01] there might be request logs in logstash. I think you need root on the deploy hosts to see the ingress logs there, but I'll take a look
[14:31:52] logstash dashboards are in a sad state ;( I don't even see the codfw cluster as an option on any of the k8s dashboards
[14:37:56] school run
[14:39:52] not finding anything useful in logstash
[14:43:52] I couldn't find logs at all in k8s, but I don't know enough about where to look
[14:44:32] Let's ping Balthazar after the deep dive
[14:58:00] back
[15:03:26] seems to work OK using $client = new \Elastica\Client(['servers' => [[ 'host' => 'opensearch-semantic-search.discovery.wmnet', 'transport' => 'Https', 'port' => 30443 ]]]);
[15:03:45] from deploy@codfw with mwscript shell.php
[15:04:04] hmm, isn't that what the config should already be doing? :S
[15:04:21] i guess it uses DeprecationLoggedHttps
[15:04:49] but all that does is inspect response headers
[15:05:01] haven't been able to use this transport, it's complaining
[15:05:17] but possibly for some other reasons
[15:05:25] i suppose, i don't even know if opensearch does the warnings, just assuming they still do
[15:06:16] how weird :S
[15:07:23] but mwscript is running on the deploy machine itself
[15:08:46] trying mwscript-k8s
[15:09:35] working fine as well...
[15:10:10] I'm using bare Elastica connections
[15:11:03] indeed, seems to be working fine from mwscript-k8s :S
[15:14:39] nothing really stands out...seems to work from mwscript-k8s, the aliases all exist as expected
[15:28:06] yes... same using Cirrus connection & HashConfig overriding CirrusSearchClusters, I see codfw indices...
[15:29:04] and can search as well
[15:33:56] going to be hard if we can't repro outside a mw deploy
[15:35:43] add a new cluster but don't make it the default for anything? In theory that should be no different than testing in mwscript-k8s, but who knows
[15:39:30] sure
[15:45:28] back
[15:59:35] integration failures are odd...for example prefix_search_api.feature often fails the first test...adding more waiting
[15:59:42] fails with timeout
[15:59:59] :/
[16:03:12] i ran a bunch of integration tests overnight in a loop, but i think there is too much instability in our existing os 1.3 stuff..going to try and fix some of that to see which things are actually wrong in os 2.19.
[16:04:44] I should be able to publish the trixie plugins deb pkg today
[16:04:50] inflatador: nice! thanks
[16:05:28] i also wasn't sure what debian version to put there...i just arbitrarily picked trixie as latest. The old ones were bullseye which seemed too old
[16:06:42] that works
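
Note on the 14:13:32 test: the "404 with no body" seen with `curl -v -HHost:foo.bar` is what you would expect from an intermediary that routes on the Host header and has no route for the name it receives. The sketch below is only a local simulation of that behaviour using the Python standard library. It is not the actual istio/envoy configuration, and the names `KNOWN_HOSTS`, `HostRoutingHandler`, `probe` and `run_demo` are hypothetical, introduced only for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: a one-entry routing table standing in for the ingress routes.
KNOWN_HOSTS = {"opensearch-semantic-search.discovery.wmnet"}


class HostRoutingHandler(BaseHTTPRequestHandler):
    """Toy stand-in for an ingress doing Host-based routing (not real envoy/istio)."""

    def do_GET(self):
        # Drop any port suffix before matching, e.g. "name:30443" -> "name".
        host = (self.headers.get("Host") or "").split(":")[0]
        if host in KNOWN_HOSTS:
            body = b'{"ok": true}'
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # No matching route: answer 404 with an empty body.
            self.send_response(404)
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def probe(port, host_header):
    """GET / on 127.0.0.1:port with an explicit Host header; return (status, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        # Passing our own Host header stops http.client from adding its default one.
        conn.request("GET", "/", headers={"Host": host_header})
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), HostRoutingHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return {
            "known": probe(port, "opensearch-semantic-search.discovery.wmnet"),
            "unknown": probe(port, "foo.bar"),
        }
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

If the real failure has the same cause, comparing the Host header that the MediaWiki deployment actually sends with the one curl and mwscript-k8s send would be the thing to check. That remains a guess until the ingress request logs are found.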