[07:52:26] pfischer: usual ping on the standup notes...
[07:52:53] 👀
[07:59:28] dcausse: if you have time to add a comment on T339347 (and feel free to disagree with the one I left)
[07:59:29] T339347: qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347
[07:59:30] gehel: done. If I manage to add/update the decision record in time, I’ll link it.
[07:59:41] pfischer: thanks!
[08:02:42] gehel: looking
[08:14:45] weekly update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2023-09-08
[09:14:10] Off, back in 1h
[09:46:34] lunch
[13:15:23] o/
[14:46:48] dcausse LMK if you're around and feel like walking me thru getting a new savepoint for dse-k8s. I know it's pretty late in your day, so don't feel like you have to
[14:48:25] inflatador: sure, no worries
[14:49:01] dcausse cool, I'm at https://meet.google.com/aod-fbxz-joy if you wanna join
[14:53:12] \o
[14:56:19] o/
[15:22:52] ebernhardson: jelto is out but re-opened a thread in slack in the channel #developer-experience, dancy said we could use the tag "wmcs" on everything but the job that publishes to the docker-registry
[15:23:34] dcausse: hmm, can't hurt to try. Any idea what's happening though? I increased the timeout up to 1s, it feels more like a firewall dropping packets than a cpu limit (is that even possible? I dunno how wireshark works)
[15:23:38] s/wireshark/wiremock/
[15:23:57] the java8 job passed tho
[15:24:05] they both pass if i retry enough times :)
[15:24:10] ah :)
[15:24:28] it's unclear how, also i poked through the old logs and the only time we passed before was when running on a memopt instance
[15:24:36] when it runs on an 8G instance it dies every time
[15:24:53] basically the memopt instances are somehow also in the normal job runner pool (based on the reported hostnames in CI logs)
[15:26:21] pushed a patch, see what CI thinks about wmcs
[15:27:51] the instances are bigger at least, it reports 24G memory available (but the container may have lower limits)
[15:30:30] seems to have passed, guess we can ignore whatever wiremock was doing :)
[15:34:12] :)
[17:35:04] ebernhardson I vaguely remember you mentioning something about replacing URIs with localhost in helm charts? Just wondering if that's part of our ZK problem. I nsentered a flink container and I noticed it couldn't resolve the flink-zk host
[17:35:47] inflatador: for zookeeper it shouldn't be necessary, the URIs being replaced with localhost is because envoy is doing https proxying and metrics collection
[17:36:13] when you say can't resolve, that means dns? curious
[17:36:27] Y, no dns
[17:36:37] I'm trying to find another container in this cluster to compare
[17:36:54] i suppose i don't know how dns is working inside there, but i would have expected if it can be looked up from a normal host it would also work there. i guess not :(
[17:37:46] can you pull other random hostnames from prod, like search.svc.eqiad.wmnet or some such?
[17:38:07] I can try that. Should probably try the thanos-swift endpoint too
[17:38:44] basically what i was wondering is if there is something special about flink-zk not resolving, or if all prod names don't resolve
[17:39:05] yeah, that's why I was looking for a different container. Found only control plane containers so far
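A rough sketch of the kind of checks being tried here, comparing resolution on the worker host with resolution inside the pod; the namespace and pod names are placeholders, and the hostnames are simply the ones mentioned above:

    # on the worker host: does the name resolve with the host's own resolver?
    dig +short search.svc.eqiad.wmnet
    # list the cluster DNS (CoreDNS / kube-dns) endpoints
    kubectl get endpoints kube-dns --namespace=kube-system
    # inside the pod: which resolver is the container actually using?
    kubectl exec -n <namespace> <flink-pod> -- cat /etc/resolv.conf
    # and does a lookup work from in there? (assumes getent is present in the image)
    kubectl exec -n <namespace> <flink-pod> -- getent hosts search.svc.eqiad.wmnet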
[17:42:42] so our container isn't resolving flink-main-container. Still looking for a non-control plane container
[17:43:59] nothin' there
[17:44:18] so that suggests there is a dns layer to this helm/k8s stuff that we need to learn :)
[17:44:54] I'm just surprised we're the only active service on the dse cluster ATM. But I'd be pretty surprised if DNS didn't work inside containers
[17:45:09] I guess I can verify that, but my guess is firewall rules
[17:47:22] I suppose i would expect all containers to have the CoreDNS address in /etc/resolv.conf, and for CoreDNS to be configured to forward to our normal nameservers. Trying to see how to verify
[17:47:53] resolv.conf appears to be mounted from the host on dse-k8s-worker1001
[17:48:09] lookups work from host but not from container
[17:48:31] hmm, that does sound like firewalls or some other networking limit
[17:51:09] well...I hopped into the istio container and it doesn't seem to do lookups either. hmm
[17:53:05] same for pods in a different k8s cluster?
[17:53:14] checking staging now
[17:53:50] i suppose i'd be very surprised if k8s pods can't lookup regular dns, i'm sure i've seen domain names used in other helmfiles
[17:54:07] DNS working fine in staging flink container
[17:54:37] so something is wrong with the dse-k8s cluster then, but i don't really know where to start there :(
[17:55:06] yeah me neither, I'll ask around but something tells me this will have to wait 'till Monday ;)
[18:00:38] OK, so I can resolve from within the container using the endpoints listed by `kubectl get endpoints kube-dns --namespace=kube-system`
[18:02:24] does the other k8s cluster also mount the host resolv.conf into the cluster, or does it mount some volume with cluster-specific config perhaps?
[18:03:03] randomly guessing, but i found it surprising that it would mount the host resolv.conf into the container. For no great reason i was expecting it to go through the kube specific dns first
[18:03:27] but perhaps that host resolv.conf is special, again for no great reason i'm assuming the host resolv.conf would be like every other server in the fleet :P
[18:05:01] re: resolv.conf being identical to other hosts, you're right so far as I can tell (checked 4 hosts)
[18:06:52] very curious
[18:09:21] I one-offed resolv.conf on a dse k8s worker to use the k8s internal DNS, but the flink app is getting stuck in the same place. hmmm
[18:10:00] apparently there is a dnsPolicy field in pod specifications that controls this
[18:10:11] if it's set to host it would mount the host's resolv.conf
[18:11:56] I see bidirectional communication between the container and thanos-swift endpoint. hmm
[18:13:12] * ebernhardson is surprised to find 600+ files in /etc/kubernetes
[18:13:19] we love config files :)
[18:15:15] I wonder if there's a problem with the savepoint then. d-causse and I created it manually in yarn this morning
[18:15:36] so it gets past the dns problem, but still gets stuck?
[18:16:17] nothing changes even with valid DNS
[18:16:33] I guess I can force a redeploy after the DNS fix but it doesn't seem to help
[18:16:49] and it is talking to the thanos endpoint, so that doesn't seem to be the problem
[18:17:01] hmm, seems plausible a container might not notice that resolv.conf was updated
[18:17:09] talking to thanos does suggest it was able to lookup a domain name
[18:17:20] i dunno :(
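For reference on the dnsPolicy field mentioned at 18:10: the Kubernetes value that inherits the node's resolv.conf is called Default, while ClusterFirst (the usual default) sends lookups through the cluster DNS. A minimal, illustrative pod spec fragment, with placeholder name and image:

    apiVersion: v1
    kind: Pod
    metadata:
      name: dns-check                # placeholder name
    spec:
      dnsPolicy: ClusterFirst        # route lookups via CoreDNS/kube-dns; "Default" would inherit the node's resolv.conf
      containers:
        - name: check
          image: debian:bullseye     # placeholder image
          command: ["sleep", "3600"] # keep the pod around so lookups can be tested with kubectl exec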
[18:18:28] me neither. I'm going to work on updating the docs around creating that savepoint. After that I might poke around and compare it with known-good savepoints, but that's really grasping at straws
[18:18:48] for now...it's lunch time!
[18:58:41] back
[19:48:00] * ebernhardson watches helm-linter spin...i should find a faster way to run this thing
[20:16:54] ebernhardson: good afternoon, has search/MjoLniR been migrated from Gerrit to Gitlab? I have a fix for mypy pending in Gerrit at https://gerrit.wikimedia.org/r/c/search/MjoLniR/+/954711 :)
[20:17:18] but dcausse mentioned it might have been migrated, if that is the case I will abandon my patch and archive the repo :)
[20:17:25] no rush :]
[20:22:34] hashar: yes it has, sorry thought i responded with that on the gerrit patch
[20:23:19] the gitlab repo probably requires a similar patch :)
[20:24:31] it was migrated to a different dependency environment, it's using conda now so hard to say. will check
[20:24:43] ebernhardson: I will archive the repo eventually :)
[20:24:54] conda noooooooooooo
[20:24:56] :D
[20:25:26] if it fits your needs, I guess it is all fine
[20:25:35] I will archive the Gerrit repo next week
[20:25:38] lol. I'm not the biggest fan of conda, but someone else built out the pipeline that builds conda envs to deploy into the analytics hadoop cluster, so we just use it :)
[20:25:40] thanks
[20:32:08] ebernhardson: I noticed conda being introduced for something analytics, yeah
[20:32:25] it is a poor precedent compared to using Debian packages, but I guess that is how things are being done nowadays :]
[20:32:54] I filed the archival task at T345956, the Gerrit repository will eventually be disposed of soon (tm)
[20:32:54] T345956: Archive the search/MjoLniR repository (moved to gitlab) - https://phabricator.wikimedia.org/T345956
[20:37:35] Hey all, I'm going to be upgrading from MW 1.38 to 1.39 soon and since we use CirrusSearch + AWS OpenSearch, to do this MW upgrade I will also need to upgrade (replace) the current OpenSearch cluster that uses their Elasticsearch 6.5.4 engine with one that uses the 7.10.2 engine. However, I was also interested in whether Serverless OpenSearch
[20:37:35] would be supported. I just had an hour-long call with an Amazon OpenSearch expert and he pointed out that it requires SignatureV4 request signing (with which I'm not familiar) and he provided a link to its supported operations.
[20:37:44] https://docs.aws.amazon.com/sdk-for-php/v3/developer-guide/service_es-data-plane.html
[20:37:52] https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-genref.html#serverless-operations
[20:38:01] Is this something that might work with CirrusSearch?
[20:39:00] justinl: if ebernhardson is still around he is probably the best person to be able to answer :)
[20:39:43] justinl: hmm, for talking to elasticsearch we use the Elastica library (https://github.com/ruflin/elastica). I'm not familiar with SignatureV4 but i would be surprised if elastica had direct support for it
[20:40:48] it looks like they mostly want to attach something to the request (makes sense) so it could probably be done in a custom connection type
[20:41:44] justinl: hmm, actually they have a transport called AwsAuthV4, you could try configuring that
[20:41:59] Yeah, I just searched that code, saw that, so I'll take a look.
[20:42:55] That said, our wikis have so many pieces that if trying to move to Serverless adds too much complexity, then it would certainly do to stay on the regular, provisioned service.
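If the AwsAuthV4 transport suggested at 20:41:44 is tried, the configuration might look roughly like the sketch below. This is untested: CirrusSearch hands these entries to Elastica connections, but whether the extra signing parameters pass through cleanly isn't confirmed here, and the endpoint, region and credential key names are assumptions rather than values from this discussion.

    // LocalSettings.php sketch (untested)
    $wgCirrusSearchServers = [
        [
            'host' => 'example-collection.us-east-1.aoss.amazonaws.com', // placeholder serverless endpoint
            'port' => 443,
            'transport' => 'AwsAuthV4',  // Elastica transport that signs requests with SignatureV4
            'aws_region' => 'us-east-1', // assumed parameter name read by the transport
            // 'aws_access_key_id' and 'aws_secret_access_key' may also be needed here if the
            // host has no instance-profile credentials (assumed parameter names)
        ],
    ];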
[20:44:37] justinl: it might make some things harder, for example in the 6->7 transition if you plan to run a mixed cluster at some point (sounds like not) we added a custom http transport that was compatible with 6 and 7 at the same time (for enough use cases, but not 100%)
[20:47:03] Yeah, that doesn't sound like something I would do, I just use a pretty vanilla setup and really don't even know how to use Elasticsearch directly, just point MW at it and let it do its thing beyond using the maintenance scripts for creating and updating indexes.
[20:50:18] The only odd thing I have to do is set $wgCirrusSearchServers to use localhost:9200 and then have Nginx proxy those requests to the HTTPS cluster endpoint. Turns out I may also need to do a DNS workaround for multi-AZ clusters, but that doesn't really impact the engine compatibility, so I should be fine just sticking with a provisioned
[20:50:19] Elasticsearch 7.10.2 cluster.
[20:51:25] yea sounds reasonable
[21:03:19] shouldn't need a proxy
[21:03:24] supports https directly
[21:04:04] $wgCirrusSearchServers = [ [ 'server' => 'endpoint', 'port' => 1234, 'transport' => 'https' ] ]; (you can also add 'username' and 'password' keys if you have auth)
[21:10:01] I thought I tried that originally but I may have not used the transport parameter. That said, there's also a DNS issue: whenever certain changes are made to OpenSearch clusters, they do blue/green deployments, replacing all of the nodes, and thus IP addresses, with new ones. I had to adjust the Nginx config to use the AWS Route 53 Resolver in the
[21:10:01] VPC to recheck the IP addresses of the endpoint every 60 seconds, i.e. "resolver 169.254.169.253 valid=60s;". Otherwise, Nginx would keep trying to hit the old servers.
[21:25:35] (er slight correction to above, should be 'host' => 'endpoint' if using array-based config, dunno why I mistyped that)
[21:29:27] a transport key of 'AwsAuthV4' should in theory work for the serverless case you mentioned
[21:33:44] That seems too simple, there's gotta be a catch. But I'm certainly interested in testing it!
[21:50:53] Good luck! Let us know how it goes
[21:57:03] Thanks, I will. The actual work to do the upgrade will probably take a couple of months given how much there is to do in this upgrade, but I may check in as I start testing the OpenSearch stuff.
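For the proxy setup described at 21:10:01, a rough sketch of the Nginx side of the workaround; the endpoint hostname is a placeholder. Putting the upstream in a variable makes Nginx resolve it at request time through the configured resolver instead of caching the IPs it saw at startup:

    # illustrative only; the endpoint hostname is a placeholder
    server {
        listen 127.0.0.1:9200;
        resolver 169.254.169.253 valid=60s;   # VPC Route 53 resolver, re-checked every 60 seconds
        location / {
            set $opensearch_upstream "https://vpc-example-abc123.us-east-1.es.amazonaws.com";
            proxy_pass $opensearch_upstream;  # variable upstream is re-resolved via "resolver"
            proxy_ssl_server_name on;         # send SNI so the HTTPS endpoint presents the right cert
        }
    }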