[09:42:52] So here's a question: can services running on, say, k8s-mwdebug-{codfw,eqiad} talk to discovery endpoints that then point to LW staging?
[09:44:43] Context is https://phabricator.wikimedia.org/T364551#9989502
[10:02:15] klausman: via mesh local endpoint, right? If so, I think LW staging is not in the predefined list (in puppet) of mediawiki discovery endpoints
[10:02:33] That may explain it, yeah.
[10:02:42] I'll do some Puppet digging
[10:04:02] LW staging probably shouldn't go in the default list; I guess there should be some override for staging envs
[10:04:16] in that case, it should be in deployment-charts
[10:04:23] but I am not 100% sure, serviceops should know
[10:05:25] ack. In this particular case, one could also argue that even the "customer" staging service should only talk to LW prod, but the model server just isn't there yet.
[10:05:30] anyway, at the bottom of hieradata/common/profile/services_proxy/envoy.yaml you'll find the mw list
[10:06:12] Thanks!
[10:06:17] inference is there, but not -staging, so that's why you can't access it via mesh
[10:06:38] but that is the global list; in deployment-charts there are surely overrides for mwdebug etc.
[11:44:09] there's no listeners override that I can remember on mwdebug
[11:44:26] mwdebug is considered "production", so it talks to production services
[11:54:56] yes, and I would caution against intertwining staging and production things. That way madness lies
[12:24:02] * claime looks at termbox and sighs
[12:30:18] exactly
[12:31:24] tbh, I think I am going to bring to some future k8s-sig a question regarding how we envision staging. I've been mulling it over for a long time now, and I fear staging is going to go the way of deployment-prep (starting from one idea and ending up doing too many things for too many people), and I'd like to avoid that.
[12:36:19] while I agree, I have Additional Opinions™ :)
[12:36:41] Also, the next k8s-sig is tonight ;)
[12:37:22] yeah, I am not ready for tonight
[12:37:39] but my intent is indeed to first capture use cases and opinions
[12:40:08] ack.
[12:41:02] I think in this particular case, both the model server and the service that would use it aren't ready yet, so neither is in prod. And thus it'd be a case of staging talking to staging. I am vaguely confident that that shouldn't be the normal case
[12:48:12] yeah, but staging was envisioned as a pre-deployment safety net, not as an integration-testing environment
[12:48:45] at the beginning it started exactly like that: replacing deployment-prep and integration testing
[12:49:22] and later it became clear that this is futile. It works for very small sets and cases, but it can't really scale much
[12:50:10] which is when we re-envisioned it as a safety net, but overall I am not sure it makes much sense for that use case either. Where it does make sense is for k8s platform development
[12:52:33] and as far as I know, some teams use it as a kind of dev/proof-of-concept environment without having to touch production itself. But do they really need a cluster for it? Or is another helm release sufficient for their use cases?
[12:57:17] In our case, we're using it for iterating rapidly on new isvcs, testing new releases of existing ones, general experimenting (like throwing suspected queries-of-death at a service), and also testing the Istio/Calico/infra stuff.
[12:57:41] For the latter, I think a separate cluster is required, and one with production-ish services on it.
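To make the 10:05:30 pointer concrete, here is a minimal sketch of what the mesh listener list in hieradata/common/profile/services_proxy/envoy.yaml looks like in spirit. The hiera key follows the shape of Wikimedia's services_proxy profile, but the field names, port, and values are illustrative assumptions rather than copies of the real file; the point is only that an "inference" (production) entry exists while no "inference-staging" entry does, which is why mwdebug can reach LW prod but not LW staging via the mesh local endpoint.

    # Sketch of the listener list in
    # hieradata/common/profile/services_proxy/envoy.yaml
    # (key names, port, and timeout are assumed for illustration)
    profile::services_proxy::envoy::listeners:
      - name: inference                       # LW production: has a mesh listener
        port: 6086                            # hypothetical local listener port
        timeout: "60s"
        service: inference
        upstream: inference.discovery.wmnet   # discovery endpoint
    # No "inference-staging" entry exists in the global list, so LW staging
    # is unreachable via the mesh; exposing it would need either a new entry
    # here or an environment-specific override (e.g. in deployment-charts).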
[12:58:44] yeah, the latter falls into the "k8s platform development" bucket; it makes total sense.
[12:59:20] the production-ish part could probably use a clearer definition, otherwise it's going to become an ever-expanding set of services
[13:00:00] Now, we could of course run some of the other testing in a separate NS in prod, but that leads to interesting questions around uncommon resources. E.g. when experimenting with a new LLM, we might accidentally crowd out prod services on GPUs. Not insurmountable to prevent, but physical separation always feels a bit safer.
[13:00:01] for the former, we don't currently have a good solution, ofc, and we kind of improvise, using staging, other helm releases, and so on
[13:01:17] ah yes, you've got the GPU complication, which makes this ... even more interesting?
[13:01:37] In the Chinese sense, yes ;)
[13:01:51] ;-)
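On the 13:00:00 GPU point: if the experimental work did run in a separate namespace in prod rather than a separate cluster, one partial guard against crowding out prod services is a Kubernetes ResourceQuota on the GPU extended resource. A minimal sketch, assuming GPUs are exposed via the usual nvidia.com/gpu extended resource; the namespace name and limit are made up:

    # Cap the total GPUs an experimental namespace can request, so LLM
    # experiments cannot starve production inference services of GPUs.
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: gpu-quota
      namespace: llm-experiments        # hypothetical experiments namespace
    spec:
      hard:
        requests.nvidia.com/gpu: "2"    # aggregate GPU requests capped at 2

This only limits aggregate requests; it does nothing about noisy-neighbour effects on shared nodes, which is part of why physical separation still feels safer.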