[10:05:08] <_joe_> claime: I took a look at the opentelemetry helm chart
[10:05:18] <claime> yeah?
[10:05:20] <_joe_> the good news is it doesn't do anything terrible
[10:05:34] <claime> The bad news is it doesn't fit our env?
[10:05:38] <_joe_> the bad news is... it doesn't do much
[10:05:49] <claime> Right
[10:05:51] <_joe_> most of the stuff you might want is injected directly from values.yaml
[10:06:10] <_joe_> which is both ok-ish and kinda defeats the idea of using helm charts
[10:08:14] <claime> Right. So inspiration but rewrite?
[10:09:03] <_joe_> I... don't know
[10:09:16] <_joe_> on one hand, I'd love not to have to maintain another chart
[10:09:31] <_joe_> on the other, I don't remember how we've managed other charts we've imported
[10:09:40] <_joe_> but I would say we can start from there
[10:09:47] <_joe_> at least we can try
[10:09:59] <_joe_> we can always rewrite it later if it does not fit our needs
[10:11:30] <_joe_> elukey: we need to add rate-limiting to our service mesh too
[10:13:05] <elukey> _joe_: seems very easy with Envoy (at least a coarse-grained traffic volume protection is quick to set up)
[10:14:59] <_joe_> elukey: yes, it just needs to be done
[10:23:32] <elukey> _joe_: do you have a specific idea in mind about what rate-limiting should do in the mesh? I mean, would it be a simple coarse-grained protection on traffic volume (not caring about the clients - headers, IPs, etc.) or something more elaborate?
[10:24:34] <_joe_> that would be step 0 I guess
[10:24:51] <_joe_> but that's already covered by circuit breaking, which we should tune too
[10:25:55] <_joe_> one of our issues is that we don't know which user to attribute a request to until we reach mediawiki
[10:26:33] <_joe_> so there's a few onions to peel here
[10:26:57] <_joe_> one is async traffic between services, which we should specifically rate-limit
[10:27:33] <_joe_> another is user traffic, which I think should be rate-limited nearer to the edge
[10:27:46] <_joe_> and finally sync traffic between services
[10:30:07] <elukey> ack, makes sense
[10:30:20] <elukey> in my case I'd like to avoid people on hadoop hammering Lift Wing by mistake
[10:30:32] <elukey> or similar use cases, since the majority of ORES traffic is internal
[10:30:52] <elukey> (and external traffic will hopefully be rate-limited by the api-gw)
[10:31:19] <elukey> so as a starter I'd just add a basic rate limit on the whole traffic, with a high ceiling, to return 429 when needed
[10:31:28] <elukey> but a coarse-grained solution has its drawbacks etc.
[10:31:35] <elukey> so I am still trying to figure out what is best
[10:31:48] <elukey> (we also have circuit breaking to avoid hammering the mw api)
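As an illustration of the coarse-grained, client-agnostic volume cap discussed above, here is a minimal sketch using Envoy's local rate limit HTTP filter. The filter name and `@type` are standard Envoy config; the `stat_prefix`, token-bucket numbers and exact placement in the listener are placeholder assumptions, not values taken from the deployment-charts envoyproxy configuration. Once the bucket is exhausted the filter answers 429, matching the "high ceiling" idea; circuit-breaker thresholds on the upstream cluster would be tuned separately.

```yaml
# Illustrative sketch only: a coarse, client-agnostic volume cap with
# Envoy's local rate limit HTTP filter. Numbers and stat_prefix are
# placeholders, not values from the deployment-charts repo.
http_filters:
  - name: envoy.filters.http.local_ratelimit
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
      stat_prefix: service_local_ratelimit
      # High ceiling: up to 1000 requests per second; once the bucket is
      # empty, Envoy responds with 429 (the default status for this filter).
      token_bucket:
        max_tokens: 1000
        tokens_per_fill: 1000
        fill_interval: 1s
      # Enable and enforce the limit for 100% of requests.
      filter_enabled:
        default_value: {numerator: 100, denominator: HUNDRED}
      filter_enforced:
        default_value: {numerator: 100, denominator: HUNDRED}
  - name: envoy.filters.http.router
```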
[11:49:18] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Joe)
[11:49:37] serviceops, MW-on-K8s, Release-Engineering-Team (Seen): Improve performance of deployment to mw on k8s - https://phabricator.wikimedia.org/T323349 (Joe) Open→Resolved Status update: with the pre-pulling activated, the deployment times for a small patch are in the order of 2-3 minutes on k8s,...
[13:35:46] <btullis> I've just read this about the required jsonschema for any CustomResourceDefinitions that are created by a helm chart: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/jsonschema/README.md
[13:37:46] <btullis> I realise that my spark-operator change (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/855674) doesn't do this yet, so I'll need to update it, right?
[13:39:11] <btullis> I believe that the CRDs were copied into the operator image from here: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/charts/spark-operator-chart/crds
[13:53:50] serviceops, Arc-Lamp, Performance-Team (Radar), SRE Observability (FY2022/2023-Q3): Expand RAM on arclamp hosts and move them to baremetal - https://phabricator.wikimedia.org/T316223 (lmata)
[14:40:09] <_joe_> btullis: yes, in order to be able to validate your crds later you'd need to do so
[15:08:41] <btullis> _joe_: thanks for confirming. I'll try to add that now.
[15:15:04] <btullis> The strange thing is that the chart itself doesn't include the CRDs, unlike calico and cert-manager and knative-serving etc.
[15:15:23] <btullis> In the spark-operator case, they just seem to have been baked into the operator at compile time. So I'm still a bit confused as to whether I am *required* to add them to the deployment-charts repo as well.
[15:20:14] <_joe_> uhhh that is quite horrifying
[15:20:39] <_joe_> now the problem is, if you are *using* those CRDs
[15:20:50] <_joe_> you will need the corresponding jsonschema
[15:21:01] <_joe_> I am not sure how that would be generated in this case
[15:34:53] <btullis> Right. Sorry about that. We are definitely planning to use the CRDs, but not as part of the deployment with helm/helmfile. Here's an example of a SparkApplication resource being submitted afterwards, with kubectl: https://phabricator.wikimedia.org/T318926#8389971
[15:39:51] <btullis> s/submitted/created/
[16:06:00] <_joe_> ok then you probably don't need the definition for validation!
[16:08:24] <btullis> Great, that works for me. Thanks.
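For context on the "created afterwards with kubectl" workflow: below is a hypothetical, minimal SparkApplication manifest of the general kind referenced in T318926#8389971, not the actual one from that task. It targets the CRDs baked into the spark-operator image rather than anything installed by the chart; the name, namespace, image, jar path and resource sizes are made-up placeholders.

```yaml
# Hypothetical sketch of a SparkApplication resource created directly with
# kubectl (e.g. kubectl apply -f spark-pi.yaml), outside of helm/helmfile.
# Namespace, image, jar path and sizes are placeholders for illustration.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: docker-registry.example/spark:3.1.2          # placeholder image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar
  sparkVersion: "3.1.2"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:
    cores: 1
    instances: 2
    memory: 512m
```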