[07:45:14] 10serviceops, 10Arc-Lamp, 10Performance-Team (Radar): Expand RAM on arclamp hosts and move them to baremetal - https://phabricator.wikimedia.org/T316223 (10akosiaris) >>! In T316223#8362915, @jbond wrote: > Is there a more specific tag we can use for this instead of SRE? perhaps `serviceops`? Yeah, this...
[08:00:15] good morning, I have an old change in my dashboard for `docker-pkg` which is to pass `PATH` from the environment when running the image's `test.sh`. That solved the issue of `docker` being in `/usr/local/bin`. Could someone review it and get it merged? ;) https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-pkg/+/692995
[08:00:31] I did write some tests to cover the feature Piotr wrote at the time
[08:06:36] <_joe_> Petr, not Piotr
[08:06:52] <_joe_> I'll take a look when I have time
[08:08:00] errr sorry
[08:08:03] for him
[08:08:14] thanks _joe_ :]
[08:08:21] <_joe_> jayme: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/854943 does the right thing re: not patching the chart entries with a version declared
[08:08:35] <_joe_> sadly that means we need to install the "wmf-stable" repo first
[08:08:42] <_joe_> can I leave that part to you?
[08:08:55] <_joe_> I hated this part of the patching very much already
[08:10:23] 10serviceops, 10Observability-Logging, 10Shellbox: Shellbox's http container does not log in wmfjson or ecs format - https://phabricator.wikimedia.org/T301757 (10Joe) 05Open→03Resolved We can still activate ecs logs later, but for now I'll consider this task resolved.
[08:35:00] thanks _joe_, I will take a look. IIRC we had code to set up the repo already in an older version of the Rakefile
[08:35:18] <_joe_> jayme: yes
[08:35:27] <_joe_> it's a few lines of code I would think
[08:35:40] yep
[09:04:24] Mornin'
[09:11:59] _joe_: if I'm not mistaken that approach does treat all releases of a chart as pinned if one is pinned, right?
[09:12:23] <_joe_> only if they also have a version: stanza near them
[09:12:36] ah, now I see - sorry
[09:12:38] <_joe_> which means ofc that it can happen that we treat one as pinned that should not be
[09:12:45] <_joe_> but it should be relatively rare
[09:24:37] hi folks, we got this request re: a maintenance script cronjob T320403, I'm assuming it is ok to send it your way
[10:46:43] <_joe_> godog: I assume, yes
[10:49:12] cheers _joe_
[10:49:15] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, 10Campaign-Tools (Campaign-Tools-Sprint-24): Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10fgiunchedi) Thank you for reaching out @Daimona, I'll move this t...
[10:50:52] 10serviceops, 10ContentTranslation, 10Machine-Learning-Team, 10Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (10akosiaris) >>! In T321781#8366443, @elukey wrote: > @LSobanski this is the first example of AWS microservice built outside our production...
[10:56:22] another little bit of thumbor archaeology in the context of https://gerrit.wikimedia.org/r/c/operations/software/thumbor-plugins/+/854563/ - currently the swift connection objects don't take a cacert file as an arg, yet the metal instances appear to have no issue connecting to swift over https
[10:56:52] however, unsurprisingly, I am getting "unable to get local issuer certificate" when connecting to swift from thumbor in k8s. I can't really figure out how/why there'd be different behaviour (although there have been lots of version bumps for dependencies etc)
[10:57:32] <_joe_> I am going afk so I can't help now, but check that your container has wmf-certificates installed
[10:58:08] <_joe_> that should add the puppet cert and the new pki cert to the system certs
[10:58:11] <_joe_> jayme: ^^
[10:58:36] it has the certificates installed yep
[10:59:10] and it's directly configured for the HTTP loader (entirely independent from the swift connections)
[11:01:08] do you know which file/path is passed exactly?
[11:02:26] for the http loader /etc/ssl/certs/wmf-ca-certificates.crt is passed, which I think is correct?
[11:02:54] yeah, that should contain the puppet and pki CAs
[11:02:59] the above CR is to pass it explicitly to the swift client also (which seems reasonable)
[11:03:08] but I'm wondering why it hasn't been needed up until now
[11:03:59] maybe because the metal instances were not using the service proxy to talk to swift?
[11:04:25] this is in staging which is bypassing the proxy
[11:04:36] hmpf
[11:06:53] yeah :D
[11:07:35] I'd also assume something, something down in the libraries used...
[11:08:14] yeah I just checked whether the earlier swiftclient forces insecure=True or something similar, but no dice. I guess for now there's limited harm in explicitly passing the bundle
[11:13:53] yeah
[12:37:37] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 2 others: Upgrade maps servers to node >= 14 - https://phabricator.wikimedia.org/T321789 (10jijiki)
[12:39:22] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 2 others: Create helm chart for kartotherian k8s deployment - https://phabricator.wikimedia.org/T231006 (10jijiki)
[12:40:40] 10serviceops, 10Platform Engineering, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, and 4 others: Move Kartotherian to Kubernetes - https://phabricator.wikimedia.org/T216826 (10WMDE-Fisch)
[12:41:47] 10serviceops, 10Platform Engineering, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, and 4 others: Move Kartotherian to Kubernetes - https://phabricator.wikimedia.org/T216826 (10jijiki)
[12:41:50] <_joe_> the service proxy will use an http url and the secure connection is managed by envoy, so I don't think it would be a problem
[12:42:01] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 3 others: Upgrade maps servers to node >= 14 - https://phabricator.wikimedia.org/T321789 (10WMDE-Fisch)
[12:42:10] very true
[12:49:10] yep, sure. But for now I'm both curious as to why this hasn't been an issue until now on the thumbor* instances, and am using the discovery endpoint in staging
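For context, the change in the CR above lives in the thumbor-plugins code, not in the chart; conceptually it just hands the swift connection the same bundle the HTTP loader already gets. A hypothetical values-style sketch of that idea, with key names invented for illustration (this is not the real thumbor chart schema):

    swift:
      # Illustrative keys only. The concrete point from the discussion: the swift
      # client should be given the same CA bundle the HTTP loader already uses,
      # i.e. the puppet CA plus the new PKI CA shipped by wmf-certificates.
      authurl: https://swift.example.discovery.wmnet/auth/v1.0   # placeholder endpoint
      cacert: /etc/ssl/certs/wmf-ca-certificates.crt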
[12:49:18] (fwiw that patch has resolved the issue)
[12:55:14] nice
[12:55:29] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki)
[12:56:54] _joe_: ci stuff looks good btw. https://integration.wikimedia.org/ci/job/helm-lint/8271/console
[12:57:00] * https://integration.wikimedia.org/ci/job/helm-lint/8271/console
[12:58:35] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 3 others: Create helm chart for kartotherian k8s deployment - https://phabricator.wikimedia.org/T231006 (10WMDE-Fisch)
[15:30:34] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Import istio 1.1x (k8s 1.23 dependency) - https://phabricator.wikimedia.org/T322193 (10elukey) ` root@apt1001:/srv/wikimedia# reprepro lsbycomponent istio-cni istio-cni | 1.9.5-1 | bullseye-wikimedia | component/istio195 | amd64 istio-cni...
[15:30:45] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Import istio 1.1x (k8s 1.23 dependency) - https://phabricator.wikimedia.org/T322193 (10elukey)
[16:49:08] Hello. Could anyone give me some quick pointers about when I might need to add a ClusterRole into `helmfile.d/admin_ng/helmfile_rbac.yaml` versus having it defined within a particular chart?
[16:54:07] I'm working on translating the spark-operator RBAC policies to the least privilege required on our platform, scoped to two namespaces: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/charts/spark-operator-chart/templates/rbac.yaml
[16:55:25] I'm just not sure what, if anything, should be added to the kube-system namespace in the way that flink has done here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/helmfile_rbac.yaml#54
[16:55:29] Thanks.
[17:02:39] hmm so that operator wants access to everything?
[17:03:25] the way I read it, the spark operator's service account (which gets auto created by the k8s platform) gets a ClusterRoleBinding (which is NOT namespaced) to the ClusterRole that has a ton of rights
[17:03:47] including doing anything to any pod
[17:03:58] creating/deleting/updating services
[17:04:14] in any namespace
[17:04:51] it effectively requires owning the cluster IMHO
[17:05:35] there's a few things it doesn't get access to, like being able to delete node objects or mess with resource quotas
[17:06:20] btullis: is that understanding ^ correct? Does the operator document that it expects to have such broad access?
[17:08:43] to answer your question btw, that file is practically a set of simple if clauses right now
[17:09:22] what essentially happens is that e.g. for wikikube, when https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/values/main.yaml#5 is sourced
[17:09:26] That understanding is correct insofar as the upstream helm chart *permits* the operator to have this broad level of access. I am trying to restrict our deployment from the outset, so that the operator may only operate on one namespace.
[17:09:48] the deployExtraClusterRoles variable is set to have "flink" in it and then the entirety of the ClusterRole is applied
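For reference, the broad upstream grant described at 17:03-17:05 boils down to a cluster-wide ClusterRoleBinding tying the operator's service account to a very permissive ClusterRole. This is a paraphrased, abridged sketch of that shape, not a verbatim copy of the upstream rbac.yaml:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: spark-operator                  # names here are illustrative
    rules:
      - apiGroups: [""]
        resources: ["pods", "services", "configmaps"]
        verbs: ["*"]                        # "doing anything to any pod", etc.
      - apiGroups: ["sparkoperator.k8s.io"]
        resources: ["sparkapplications", "scheduledsparkapplications"]
        verbs: ["*"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding                # not namespaced, so the rights apply cluster-wide
    metadata:
      name: spark-operator
    subjects:
      - kind: ServiceAccount
        name: spark-operator
        namespace: spark-operator           # wherever the operator runs; illustrative
    roleRef:
      kind: ClusterRole
      name: spark-operator
      apiGroup: rbac.authorization.k8s.io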
[17:10:13] ah, good
[17:10:17] https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md#about-the-spark-job-namespace
[17:10:42] so in that case you can create the ClusterRole in admin_ng and you will need a RoleBinding (not a ClusterRoleBinding as the chart does)
[17:10:58] https://usercontent.irccloud-cdn.com/file/Kie8sQoR/image.png
[17:11:00] ClusterRoleBindings are not namespaced, RoleBindings are namespaced
[17:11:14] let me jump in real quick
[17:11:39] I'm not sure it makes sense to treat the operator like a service tbh
[17:12:28] jayme: wdym?
[17:12:32] the distinction we make is that every chart that is deployed via admin_ng can basically create its own (cluster)roles, as the deployment process has root in the cluster
[17:13:01] operators IMHO are more on the admin side of things than on the service side
[17:13:29] sure, but the question remains about how broad one wants their access to be
[17:13:31] whereas on the service side we have deploy-users which usually do not have broad privileges
[17:13:45] sure, sure
[17:14:08] but if deployed via admin_ng, nothing needs to be added to helmfile_rbac
[17:14:16] everything can be done in the chart then
[17:14:21] (and should be IMHO)
[17:14:55] does that make sense?
[17:15:30] not disagreeing on that one. Not entirely sold that the various operators we will have need to be deployed via admin_ng though
[17:16:00] the main caveat being that an SRE will end up being the focal point for any kind of operator deployment
[17:16:23] I was not aware that the deploy users are effectively root on the clusters.
[17:16:30] they are not
[17:16:32] they are not
[17:16:39] but the admin_ng deploy user is
[17:16:41] they are not even root in their namespaces
[17:16:49] for some definition of "root"
[17:16:54] root isn't a thing in kubernetes
[17:17:34] Oh right. Sorry, misunderstood "deployment process has root in the cluster" :-)
[17:17:35] btullis: I think the easy question to pose is: who will be deploying the spark operator 18 months from now?
[17:17:57] the reference to flink you posted is adding more privileges to the deployment user for flink, not to flink itself
[17:18:02] I would be happiest for this to be limited to SRE.
[17:18:50] if it is always going to be SREs, then you probably want to do the deployment via admin_ng and keep it contained there. That makes it easier to give the operator more broad rights
[17:19:15] akosiaris: I'm also not sure if it is maybe desired to have SRE do operator deployments as they are kind of building blocks, like knative
[17:19:16] if you expect that some dev will want to deploy that operator down the line, then you probably want to limit the operator per namespace
[17:19:56] not sure either. We are new to this operator pattern. The logic is easy to understand, but the day-to-day operations haven't been defined yet
[17:20:15] indeed
[17:22:01] btullis: I know we aren't helping a lot, operators are a bit of a green field yet.
[17:22:13] jayme: Are you saying that building block components /should/ be SRE-deployed, or should be less restricted? I wasn't quite sure what you meant.
[17:22:32] You are both helping a great deal, thanks. :-)
[17:22:36] I was leaning towards "they should be deployed by SRE"
[17:22:59] as in "it's a functionality the k8s cluster offers"
[17:23:25] that's a sound approach for now ^. At least we aren't overcomplicating it
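A minimal sketch of the namespace-scoped alternative suggested at 17:10: keep the ClusterRole (defined from admin_ng or the chart), but bind it with a namespaced RoleBinding instead of a ClusterRoleBinding, so the rights only apply inside the target namespace. Names and the exact rule set are illustrative, assuming a `spark` job namespace:

    # ClusterRole: defines what the operator may do, not where.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: spark-operator-restricted       # illustrative
    rules:
      - apiGroups: [""]
        resources: ["pods", "services", "configmaps"]
        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
      - apiGroups: ["sparkoperator.k8s.io"]
        resources: ["sparkapplications", "sparkapplications/status"]
        verbs: ["get", "list", "watch", "update", "patch"]
    ---
    # RoleBinding (namespaced): grants the above only within the "spark" namespace.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: spark-operator-restricted
      namespace: spark
    subjects:
      - kind: ServiceAccount
        name: spark-operator
        namespace: spark-operator           # illustrative; the operator's own namespace
    roleRef:
      kind: ClusterRole
      name: spark-operator-restricted
      apiGroup: rbac.authorization.k8s.io

The same ClusterRole can be bound into a second namespace with another RoleBinding, which would match the "scoped to two namespaces" goal mentioned earlier.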
[17:23:37] but it should also be communicated to devs I guess
[17:24:08] I have not thought this through, though. It's just my gut feeling :)
[17:24:09] jayme: Me too. Thanks.
[17:25:15] tbh I think you will have kind of a hard time limiting the operator. It creates webhook configs on the fly, for example, which are not namespaced objects
[17:26:19] What about if I make the ClusterRole and RoleBinding part of the chart, but limit access to the /etc/kubernetes/spark-operator-dse-k8s-eqiad-admin.config file?
[17:27:12] limiting access to that file is going to happen anyway
[17:27:13] jayme: OK, thanks. Will think about that. If it turns out that the operator just won't fly, then we'll have to look at other options, but I'd like to try first.
[17:27:16] that gets limited per namespace (sparkJobNamespace) is the scope of a spark cluster AIUI
[17:27:22] *what gets...
[17:28:03] https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/charts/spark-operator-chart/templates/rbac.yaml vs https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/charts/spark-operator-chart/templates/spark-rbac.yaml
[17:28:31] but I'm seriously unsure here :D
[17:29:38] We can take it to: https://phabricator.wikimedia.org/T322635
[17:29:38] I had envisaged that a member of `analytics-privatedata-users` launches a SparkApplication into the `spark` namespace.
[17:31:00] Within this namespace, a spark-driver pod is created, which then spawns several spark-executor pods. They are all within this namespace. I'm not yet sure how the webhook actually functions.
[17:31:01] yeah...I guess that is actually a matter of allowing those users to deploy a custom resource to the namespace (which the operator picks up and does its thing)
[17:31:43] AIUI the webhooks are created by the operator on startup and used to validate/mutate said custom resources
[17:32:34] Yep. I've got it working on minikube and my WIP helm chart using the common_templates is starting to come together.
[17:33:38] btw. we have something that is close to an operator in the clusters: cert-manager, that is
[17:33:51] if you want to look into how that is deployed currently
[17:34:22] Cool. Many thanks again.
[17:36:23] good luck :)
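Following on from the SparkApplication discussion at 17:29-17:31, letting `analytics-privatedata-users` submit jobs would then be a separate, namespaced grant on the custom resource itself. A rough sketch, assuming the group name maps through to a Kubernetes group via the cluster's authentication setup; the role names are illustrative and this is not an actual deployment-charts snippet:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: spark-app-submitter             # illustrative
      namespace: spark
    rules:
      - apiGroups: ["sparkoperator.k8s.io"]
        resources: ["sparkapplications"]
        verbs: ["get", "list", "watch", "create", "delete"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: spark-app-submitter
      namespace: spark
    subjects:
      - kind: Group
        name: analytics-privatedata-users   # assumption: exposed to the API server as a group
        apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: Role
      name: spark-app-submitter
      apiGroup: rbac.authorization.k8s.io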
[17:39:44] 10serviceops, 10API Platform, 10SRE: Block non-browser requests that use generic user agent (UA) headers - https://phabricator.wikimedia.org/T319423 (10daniel)
[17:55:39] 10serviceops, 10API Platform, 10SRE: Block non-browser requests that use generic user agent (UA) headers - https://phabricator.wikimedia.org/T319423 (10daniel) We have rate limits in place for some generic UA strings: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/m...
[17:57:23] 10serviceops, 10SRE: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (10Dzahn) per T316223#8381863 serviceops-core is taking this over
[17:57:34] 10serviceops, 10SRE: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (10Dzahn) 05Stalled→03Open
[17:57:40] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Dzahn)
[17:57:54] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10Dzahn)
[17:58:00] 10serviceops, 10SRE: service implementation tracking: arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319429 (10Dzahn) 05Stalled→03Open per T316223#8381863 serviceops-core is taking this over
[18:02:20] 10serviceops, 10Arc-Lamp, 10Performance-Team (Radar): Expand RAM on arclamp hosts and move them to baremetal - https://phabricator.wikimedia.org/T316223 (10Dzahn) @akosiaris Thank you! So what this is is: 1) hardware has been procured (I reviewed/approved in T316906 (eqiad) and T316907 (codfw). done -> 2...
[18:05:18] I'm off, see you tomorrow :)