[09:42:52] So here's a question: can services running on, say, k8s-mwdebug-{codfw,eqiad} talk to discovery endpoints that then point to LW staging?
[09:44:43] Context is https://phabricator.wikimedia.org/T364551#9989502
[10:02:15] klausman: via mesh local endpoint, right? If so, I think LW staging is not in the predefined list (in puppet) of mediawiki discovery endpoints
[10:02:33] That may explain it, yeah.
[10:02:42] I'll do some Puppet digging
[10:04:02] LW staging probably shouldn't go in the default list; I guess there should be some override for staging envs
[10:04:16] in that case, it should be in deployment-charts
[10:04:23] but I am not 100% sure, serviceops should know
[10:05:25] ack. In this particular case, one could also argue that even the "customer" staging service should only talk to LW prod, but the model server just isn't there yet.
[10:05:30] anyway, at the bottom of hieradata/common/profile/services_proxy/envoy.yaml you'll find the mw list
[10:06:12] Thanks!
[10:06:17] inference is there, but not -staging, so that's why you can't access it via mesh
[10:06:38] but that is the global list; in deployment-charts there are surely overrides for mwdebug etc.
[11:44:09] there's no listeners override that I can remember on mwdebug
[11:44:26] mwdebug is considered "production", so it talks to production services
[11:54:56] yes, and I would caution against intertwining staging and production things. That way madness lies
[12:24:02] * claime looks at termbox and sighs
[12:30:18] exactly
[12:31:24] tbh, I think I am going to bring to some future k8s-sig a question regarding how we envision staging. I've been mulling it over for a long time now, and I fear staging is going to go the way of deployment-prep (starting from one idea and ending up doing too many things for too many people), and I'd like to avoid that.
[12:36:19] while I agree, I have Additional Opinions™ :)
[12:36:41] Also, the next k8s-sig is tonight ;)
[12:37:22] yeah, I am not ready for tonight
[12:37:39] but my intent is indeed to first capture use cases and opinions
[12:40:08] ack.
[12:41:02] I think in this particular case, both the model server and the service that would use it aren't ready yet, so neither is in prod. And thus it'd be a case of staging talking to staging. I am vaguely confident that that shouldn't be the normal case
[12:48:12] yeah, but staging was envisioned as a pre-deployment safety net, not as an integration-testing environment
[12:48:45] at the beginning it started exactly like that: replacing deployment-prep and integration testing
[12:49:22] and later it became clear that this is futile. It works for very small sets and cases, but it can't really scale much
[12:50:10] which is when we re-envisioned it as a safety net, but overall I am not sure it makes much sense for that use case either. Where it does make sense is for k8s platform development
[12:52:33] and as far as I know, some teams use it as a kind of dev/proof-of-concept environment without having to touch production itself. But do they really need a cluster for it? Or is another helm release sufficient for their use cases?
[12:57:17] In our case, we're using it for iterating rapidly on new isvcs, testing new releases of existing ones, general experimenting (like throwing suspected queries-of-death at a service), and also testing the Istio/Calico/infra stuff.
[12:57:41] For the latter, I think a separate cluster is required, and one with production-ish services on it.
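To make the 10:05:30 pointer concrete, here is a minimal sketch of what the mesh listener list in hieradata/common/profile/services_proxy/envoy.yaml looks like in spirit. The hiera key follows the shape of Wikimedia's services_proxy profile, but the field names, port, and values are illustrative assumptions rather than copies of the real file; the point is only that an "inference" (production) entry exists while no "inference-staging" entry does, which is why mwdebug can reach LW prod but not LW staging via the mesh local endpoint.

    # Sketch of the listener list in
    # hieradata/common/profile/services_proxy/envoy.yaml
    # (key names, port, and timeout are assumed for illustration)
    profile::services_proxy::envoy::listeners:
      - name: inference                       # LW production: has a mesh listener
        port: 6086                            # hypothetical local listener port
        timeout: "60s"
        service: inference
        upstream: inference.discovery.wmnet   # discovery endpoint
    # No "inference-staging" entry exists in the global list, so LW staging
    # is unreachable via the mesh; exposing it would need either a new entry
    # here or an environment-specific override (e.g. in deployment-charts).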
[12:58:44] yeah, the latter falls into the "k8s platform development" bucket; it makes total sense.
[12:59:20] the production-ish part could probably use a clearer definition, otherwise it's going to become an ever-expanding set of services
[13:00:00] Now, we could of course run some of the other testing in a separate NS in prod, but that leads to interesting questions around uncommon resources. E.g. when experimenting with a new LLM, we might accidentally crowd out prod services on GPUs. Not insurmountable to prevent, but physical separation always feels a bit safer.
[13:00:01] for the former, we don't currently have a good solution, ofc, and we kind of improvise, using staging, other helm releases, and so on
[13:01:17] ah yes, you've got the GPU complication, which makes this ... even more interesting?
[13:01:37] In the Chinese sense, yes ;)
[13:01:51] ;-)
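On the 13:00:00 GPU point: if the experimental work did run in a separate namespace in prod rather than a separate cluster, one partial guard against crowding out prod services is a Kubernetes ResourceQuota on the GPU extended resource. A minimal sketch, assuming GPUs are exposed via the usual nvidia.com/gpu extended resource; the namespace name and limit are made up:

    # Cap the total GPUs an experimental namespace can request, so LLM
    # experiments cannot starve production inference services of GPUs.
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: gpu-quota
      namespace: llm-experiments        # hypothetical experiments namespace
    spec:
      hard:
        requests.nvidia.com/gpu: "2"    # aggregate GPU requests capped at 2

This only limits aggregate requests; it does nothing about noisy-neighbour effects on shared nodes, which is part of why physical separation still feels safer.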