[10:05:08] <_joe_> claime: I took a look at the opentelemetry helm chart
[10:05:18] <claime> yeah?
[10:05:20] <_joe_> the good news is it doesn't do anything terrible
[10:05:34] <claime> The bad news is it doesn't fit our env?
[10:05:38] <_joe_> the bad news is... it doesn't do much
[10:05:49] <claime> Right
[10:05:51] <_joe_> most of the stuff you might want is injected directly from values.yaml
[10:06:10] <_joe_> which is both ok-ish and kinda defeats the idea of using helm charts
[10:08:14] <claime> Right. So inspiration but rewrite?
[10:09:03] <_joe_> I... don't know
[10:09:16] <_joe_> on one hand, I'd love not to have to maintain another chart
[10:09:31] <_joe_> on the other, I don't remember how we've managed other charts we've imported
[10:09:40] <_joe_> but I would say we can start from there
[10:09:47] <_joe_> at least we can try
[10:09:59] <_joe_> we can always rewrite it later if it does not fit our needs
[10:11:30] <_joe_> elukey: we need to add rate-limiting to our service mesh too
[10:13:05] <elukey> _joe_: seems very easy with Envoy (at least a coarse-grained traffic volume protection is quick to set up)
[10:14:59] <_joe_> elukey: yes, it just needs to be done
[10:23:32] <elukey> _joe_: do you have a specific idea in mind about what rate-limiting should do in the mesh? I mean, would it be a simple coarse-grained protection on traffic volume (not caring about the clients - headers, IPs, etc.) or something more elaborate?
[10:24:34] <_joe_> that would be step 0 I guess
[10:24:51] <_joe_> but that's already covered by circuit breaking, which we should tune too
[10:25:55] <_joe_> one of our issues is that we don't know which user to attribute a request to until we reach mediawiki
[10:26:33] <_joe_> so there's a few onions to peel here
[10:26:57] <_joe_> one is async traffic between services, which we should specifically rate-limit
[10:27:33] <_joe_> another is user traffic, which I think should be rate-limited nearer to the edge
[10:27:46] <_joe_> and finally sync traffic between services
[10:30:07] <elukey> ack, makes sense
[10:30:20] <elukey> in my case I'd like to avoid people on hadoop hammering Lift Wing by mistake
[10:30:32] <elukey> or similar use cases, since the majority of ORES traffic is internal
[10:30:52] <elukey> (and external traffic will hopefully be rate-limited by the api-gw)
[10:31:19] <elukey> so as a starter I'd just add a basic rate limit on the whole traffic, with a high ceiling, to return 429 when needed
[10:31:28] <elukey> but a coarse-grained solution has its drawbacks etc.
[10:31:35] <elukey> so I am still trying to figure out what is best
[10:31:48] <elukey> (we also have circuit breaking to avoid hammering the mw api)
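As an illustration of the coarse-grained, client-agnostic volume cap discussed above, here is a minimal sketch using Envoy's local rate limit HTTP filter. The filter name and `@type` are standard Envoy config; the `stat_prefix`, token-bucket numbers and exact placement in the listener are placeholder assumptions, not values taken from the deployment-charts envoyproxy configuration. Once the bucket is exhausted the filter answers 429, matching the "high ceiling" idea; circuit-breaker thresholds on the upstream cluster would be tuned separately.

```yaml
# Illustrative sketch only: a coarse, client-agnostic volume cap with
# Envoy's local rate limit HTTP filter. Numbers and stat_prefix are
# placeholders, not values from the deployment-charts repo.
http_filters:
  - name: envoy.filters.http.local_ratelimit
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
      stat_prefix: service_local_ratelimit
      # High ceiling: up to 1000 requests per second; once the bucket is
      # empty, Envoy responds with 429 (the default status for this filter).
      token_bucket:
        max_tokens: 1000
        tokens_per_fill: 1000
        fill_interval: 1s
      # Enable and enforce the limit for 100% of requests.
      filter_enabled:
        default_value: {numerator: 100, denominator: HUNDRED}
      filter_enforced:
        default_value: {numerator: 100, denominator: HUNDRED}
  - name: envoy.filters.http.router
```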
[11:49:18] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Joe)
[11:49:37] serviceops, MW-on-K8s, Release-Engineering-Team (Seen): Improve performance of deployment to mw on k8s - https://phabricator.wikimedia.org/T323349 (Joe) Open→Resolved Status update: with the pre-pulling activated, the deployment times for a small patch are in the order of 2-3 minutes on k8s,...
[13:35:46] <btullis> I've just read this about the required jsonschema for any CustomResourceDefinitions that are created by a helm chart: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/jsonschema/README.md
[13:37:46] <btullis> I realise that my spark-operator change (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/855674) doesn't do this yet, so I'll need to update it, right?
[13:39:11] <btullis> I believe that the CRDs were copied into the operator image from here: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/charts/spark-operator-chart/crds
[13:53:50] serviceops, Arc-Lamp, Performance-Team (Radar), SRE Observability (FY2022/2023-Q3): Expand RAM on arclamp hosts and move them to baremetal - https://phabricator.wikimedia.org/T316223 (lmata)
[14:40:09] <_joe_> btullis: yes, in order to be able to validate your crds later you'd need to do so
[15:08:41] <btullis> _joe_: thanks for confirming. I'll try to add that now.
[15:15:04] <btullis> The strange thing is that the chart itself doesn't include the CRDs, unlike calico and cert-manager and knative-serving etc.
[15:15:23] <btullis> In the spark-operator case, they just seem to have been baked into the operator at compile time. So I'm still a bit confused as to whether I am *required* to add them to the deployment-charts repo as well.
[15:20:14] <_joe_> uhhh that is quite horrifying
[15:20:39] <_joe_> now the problem is, if you are *using* those CRDs
[15:20:50] <_joe_> you will need the corresponding jsonschema
[15:21:01] <_joe_> I am not sure how that would be generated in this case
[15:34:53] <btullis> Right. Sorry about that. We are definitely planning to use the CRDs, but not as part of the deployment with helm/helmfile. Here's an example of a SparkApplication resource being submitted afterwards, with kubectl: https://phabricator.wikimedia.org/T318926#8389971
[15:39:51] <btullis> s/submitted/created/
[16:06:00] <_joe_> ok then you probably don't need the definition for validation!
[16:08:24] <btullis> Great, that works for me. Thanks.
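For context on the "created afterwards with kubectl" workflow: below is a hypothetical, minimal SparkApplication manifest of the general kind referenced in T318926#8389971, not the actual one from that task. It targets the CRDs baked into the spark-operator image rather than anything installed by the chart; the name, namespace, image, jar path and resource sizes are made-up placeholders.

```yaml
# Hypothetical sketch of a SparkApplication resource created directly with
# kubectl (e.g. kubectl apply -f spark-pi.yaml), outside of helm/helmfile.
# Namespace, image, jar path and sizes are placeholders for illustration.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: docker-registry.example/spark:3.1.2          # placeholder image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar
  sparkVersion: "3.1.2"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:
    cores: 1
    instances: 2
    memory: 512m
```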