[07:45:14] 10serviceops, 10Arc-Lamp, 10Performance-Team (Radar): Expand RAM on arclamp hosts and move them to baremetal - https://phabricator.wikimedia.org/T316223 (10akosiaris) >>! In T316223#8362915, @jbond wrote: > Is there a more specific tag we can use for this instead of SRE? perhaps `serviceops`? Yeah, this...
[08:00:15] good morning, I have an old change in my dashboard for `docker-pkg` which is to pass `PATH` from the environment when running the image's `test.sh`. That solved the issue of `docker` being in `/usr/local/bin`. Could someone review it and get it merged? ;) https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-pkg/+/692995
[08:00:31] I did write some tests to cover the feature Piotr wrote at the time
[08:06:36] <_joe_> Petr, not Piotr
[08:06:52] <_joe_> I'll take a look when I have time
[08:08:00] errr sorry
[08:08:03] for him
[08:08:14] thanks _joe_ :]
[08:08:21] <_joe_> jayme: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/854943 does the right thing re: not patching the chart entries with a version declared
[08:08:35] <_joe_> sadly that means we need to install the "wmf-stable" repo first
[08:08:42] <_joe_> can I leave that part to you?
[08:08:55] <_joe_> I hated this part of the patching very much already
[08:10:23] 10serviceops, 10Observability-Logging, 10Shellbox: Shellbox's http container does not log in wmfjson or ecs format - https://phabricator.wikimedia.org/T301757 (10Joe) 05Open→03Resolved We can still activate ecs logs later, but for now I'll consider this task resolved.
[08:35:00] thanks _joe_, I will take a look. IIRC we had code to set up the repo already in an older version of the Rakefile
[08:35:18] <_joe_> jayme: yes
[08:35:27] <_joe_> it's a few lines of code I would think
[08:35:40] yep
[09:04:24] Mornin'
[09:11:59] _joe_: if I'm not mistaken that approach does treat all releases of a chart as pinned if one is pinned, right?
[09:12:23] <_joe_> only if they also have a version: stanza near them
[09:12:36] ah, now I see - sorry
[09:12:38] <_joe_> which means ofc that it can happen that we treat one as pinned that should not be
[09:12:45] <_joe_> but it should be relatively rare
[09:24:37] hi folks, we got this request re: a maintenance script cronjob T320403, I'm assuming it is ok to send it your way
[10:46:43] <_joe_> godog: I assume, yes
[10:49:12] cheers _joe_
[10:49:15] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, 10Campaign-Tools (Campaign-Tools-Sprint-24): Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10fgiunchedi) Thank you for reaching out @Daimona, I'll move this t...
[10:50:52] 10serviceops, 10ContentTranslation, 10Machine-Learning-Team, 10Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (10akosiaris) >>! In T321781#8366443, @elukey wrote: > @LSobanski this is the first example of AWS microservice built outside our production...
[10:56:22] another little bit of thumbor archaeology in the context of https://gerrit.wikimedia.org/r/c/operations/software/thumbor-plugins/+/854563/ - currently the swift connection objects don't take a cacert file as an arg, yet the metal instances appear to have no issue connecting to swift over https
[10:56:52] however, unsurprisingly, I am getting "unable to get local issuer certificate" when connecting to swift from thumbor in k8s. I can't really figure out how/why there'd be different behaviour (although there have been lots of version bumps for dependencies etc)
[10:57:32] <_joe_> I am going afk so I can't help now, but check that your container has wmf-certificates installed
[10:58:08] <_joe_> that should add the puppet cert and the new pki cert to the system certs
[10:58:11] <_joe_> jayme: ^^
[10:58:36] it has the certificates installed yep
[10:59:10] and it's directly configured for the HTTP loader (entirely independent from the swift connections)
[11:01:08] do you know which file/path is passed exactly?
[11:02:26] for the http loader /etc/ssl/certs/wmf-ca-certificates.crt is passed, which I think is correct?
[11:02:54] yeah, that should contain the puppet and pki CAs
[11:02:59] the above CR is to pass it explicitly to the swift client also (which seems reasonable)
[11:03:08] but I'm wondering why it hasn't been needed up until now
[11:03:59] maybe because the metal instances were not using the service proxy to talk to swift?
[11:04:25] this is in staging which is bypassing the proxy
[11:04:36] hmpf
[11:06:53] yeah :D
[11:07:35] I'd also assume something, something down in the libraries used...
[11:08:14] yeah I just checked whether the earlier swiftclient forces insecure=True or something similar, but no dice. I guess for now there's limited harm in explicitly passing the bundle
[11:13:53] yeah
[12:37:37] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 2 others: Upgrade maps servers to node >= 14 - https://phabricator.wikimedia.org/T321789 (10jijiki)
[12:39:22] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 2 others: Create helm chart for kartotherian k8s deployment - https://phabricator.wikimedia.org/T231006 (10jijiki)
[12:40:40] 10serviceops, 10Platform Engineering, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, and 4 others: Move Kartotherian to Kubernetes - https://phabricator.wikimedia.org/T216826 (10WMDE-Fisch)
[12:41:47] 10serviceops, 10Platform Engineering, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, and 4 others: Move Kartotherian to Kubernetes - https://phabricator.wikimedia.org/T216826 (10jijiki)
[12:41:50] <_joe_> the service proxy will use an http url and the secure connection is managed by envoy, so I don't think it would be a problem
[12:42:01] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 3 others: Upgrade maps servers to node >= 14 - https://phabricator.wikimedia.org/T321789 (10WMDE-Fisch)
[12:42:10] very true
[12:49:10] yep, sure. But for now I'm both curious as to why this hasn't been an issue until now on the thumbor* instances, and am using the discovery endpoint in staging
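For context, the change in the CR above lives in the thumbor-plugins code, not in the chart; conceptually it just hands the swift connection the same bundle the HTTP loader already gets. A hypothetical values-style sketch of that idea, with key names invented for illustration (this is not the real thumbor chart schema):

    swift:
      # Illustrative keys only. The concrete point from the discussion: the swift
      # client should be given the same CA bundle the HTTP loader already uses,
      # i.e. the puppet CA plus the new PKI CA shipped by wmf-certificates.
      authurl: https://swift.example.discovery.wmnet/auth/v1.0   # placeholder endpoint
      cacert: /etc/ssl/certs/wmf-ca-certificates.crt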
[12:49:18] (fwiw that patch has resolved the issue)
[12:55:14] nice
[12:55:29] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki)
[12:56:54] _joe_: ci stuff looks good btw. https://integration.wikimedia.org/ci/job/helm-lint/8271/console
[12:57:00] * https://integration.wikimedia.org/ci/job/helm-lint/8271/console
[12:58:35] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 3 others: Create helm chart for kartotherian k8s deployment - https://phabricator.wikimedia.org/T231006 (10WMDE-Fisch)
[15:30:34] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Import istio 1.1x (k8s 1.23 dependency) - https://phabricator.wikimedia.org/T322193 (10elukey) ` root@apt1001:/srv/wikimedia# reprepro lsbycomponent istio-cni istio-cni | 1.9.5-1 | bullseye-wikimedia | component/istio195 | amd64 istio-cni...
[15:30:45] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Import istio 1.1x (k8s 1.23 dependency) - https://phabricator.wikimedia.org/T322193 (10elukey)
[16:49:08] Hello. Could anyone give me some quick pointers about when I might need to add a ClusterRole into `helmfile.d/admin_ng/helmfile_rbac.yaml` versus having it defined within a particular chart?
[16:54:07] I'm working on translating the spark-operator RBAC policies to the least privilege required on our platform, scoped to two namespaces: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/charts/spark-operator-chart/templates/rbac.yaml
[16:55:25] I'm just not sure what, if anything, should be added to the kube-system namespace in the way that flink has done here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/helmfile_rbac.yaml#54
[16:55:29] Thanks.
[17:02:39] hmm so that operator wants access to everything?
[17:03:25] the way I read it, the spark operator's service account (which gets auto created by the k8s platform) gets a ClusterRoleBinding (which is NOT namespaced) to the ClusterRole that has a ton of rights
[17:03:47] including doing anything to any pod
[17:03:58] creating/deleting/updating services
[17:04:14] in any namespace
[17:04:51] it effectively requires owning the cluster IMHO
[17:05:35] there's a few things it doesn't get access to, like being able to delete node objects or mess with resource quotas
[17:06:20] btullis: is that understanding ^ correct? Does the operator document that it expects to have such broad access?
[17:08:43] to answer your question btw, that file is practically a set of simple if clauses right now
[17:09:22] what essentially happens is that e.g. for wikikube, when https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/values/main.yaml#5 is sourced
[17:09:26] That understanding is correct insofar as the upstream helm chart *permits* the operator to have this broad level of access. I am trying to restrict our deployment from the outset, so that the operator may only operate on one namespace.
[17:09:48] the deployExtraClusterRoles variable is set to have "flink" in it and then the entirety of the ClusterRole is applied
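For reference, the broad upstream grant described at 17:03-17:05 boils down to a cluster-wide ClusterRoleBinding tying the operator's service account to a very permissive ClusterRole. This is a paraphrased, abridged sketch of that shape, not a verbatim copy of the upstream rbac.yaml:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: spark-operator                  # names here are illustrative
    rules:
      - apiGroups: [""]
        resources: ["pods", "services", "configmaps"]
        verbs: ["*"]                        # "doing anything to any pod", etc.
      - apiGroups: ["sparkoperator.k8s.io"]
        resources: ["sparkapplications", "scheduledsparkapplications"]
        verbs: ["*"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding                # not namespaced, so the rights apply cluster-wide
    metadata:
      name: spark-operator
    subjects:
      - kind: ServiceAccount
        name: spark-operator
        namespace: spark-operator           # wherever the operator runs; illustrative
    roleRef:
      kind: ClusterRole
      name: spark-operator
      apiGroup: rbac.authorization.k8s.io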
[17:10:13] ah, good
[17:10:17] https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md#about-the-spark-job-namespace
[17:10:42] so in that case you can create the ClusterRole in admin_ng and you will need a RoleBinding (not a ClusterRoleBinding as the chart does)
[17:10:58] https://usercontent.irccloud-cdn.com/file/Kie8sQoR/image.png
[17:11:00] ClusterRoleBindings are not namespaced, RoleBindings are namespaced
[17:11:14] let me jump in real quick
[17:11:39] I'm not sure it makes sense to treat the operator like a service tbh
[17:12:28] jayme: wdym?
[17:12:32] the distinction we make is that every chart that is deployed via admin_ng can basically create its own (cluster)roles, as the deployment process has root in the cluster
[17:13:01] operators IMHO are more on the admin side of things than on the service side
[17:13:29] sure, but the question remains about how broad one wants their access to be
[17:13:31] whereas on the service side we have deploy-users which usually do not have broad privileges
[17:13:45] sure, sure
[17:14:08] but if deployed via admin_ng, nothing needs to be added to helmfile_rbac
[17:14:16] everything can be done in the chart then
[17:14:21] (and should be IMHO)
[17:14:55] does that make sense?
[17:15:30] not disagreeing on that one. Not entirely sold that the various operators we will have need to be deployed via admin_ng though
[17:16:00] the main caveat being that an SRE will end up being the focal point for any kind of operator deployment
[17:16:23] I was not aware that the deploy users are effectively root on the clusters.
[17:16:30] they are not
[17:16:32] they are not
[17:16:39] but the admin_ng deploy user is
[17:16:41] they are not even root in their namespaces
[17:16:49] for some definition of "root"
[17:16:54] root isn't a thing in kubernetes
[17:17:34] Oh right. Sorry, misunderstood "deployment process has root in the cluster" :-)
[17:17:35] btullis: I think the easy question to pose is: who will be deploying the spark operator 18 months from now?
[17:17:57] the reference to flink you posted is adding more privileges to the deployment user for flink, not to flink itself
[17:18:02] I would be happiest for this to be limited to SRE.
[17:18:50] if it is always going to be SREs, then you probably want to do the deployment via admin_ng and keep it contained there. That makes it easier to give the operator more broad rights
[17:19:15] akosiaris: I'm also not sure if it is maybe desired to have SRE do operator deployments as they are kind of building blocks, like knative
[17:19:16] if you expect that some dev will want to deploy that operator down the line, then you probably want to limit the operator per namespace
[17:19:56] not sure either. We are new to this operator pattern. The logic is easy to understand, but the day-to-day operations haven't been defined yet
[17:20:15] indeed
[17:22:01] btullis: I know we aren't helping a lot, operators are a bit of a green field yet.
[17:22:13] jayme: Are you saying that building block components /should/ be SRE-deployed, or should be less restricted? I wasn't quite sure what you meant.
[17:22:32] You are both helping a great deal, thanks. :-)
[17:22:36] I was leaning towards "they should be deployed by SRE"
[17:22:59] as in "it's a functionality the k8s cluster offers"
[17:23:25] that's a sound approach for now ^. At least we aren't overcomplicating it
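A minimal sketch of the namespace-scoped alternative suggested at 17:10: keep the ClusterRole (defined from admin_ng or the chart), but bind it with a namespaced RoleBinding instead of a ClusterRoleBinding, so the rights only apply inside the target namespace. Names and the exact rule set are illustrative, assuming a `spark` job namespace:

    # ClusterRole: defines what the operator may do, not where.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: spark-operator-restricted       # illustrative
    rules:
      - apiGroups: [""]
        resources: ["pods", "services", "configmaps"]
        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
      - apiGroups: ["sparkoperator.k8s.io"]
        resources: ["sparkapplications", "sparkapplications/status"]
        verbs: ["get", "list", "watch", "update", "patch"]
    ---
    # RoleBinding (namespaced): grants the above only within the "spark" namespace.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: spark-operator-restricted
      namespace: spark
    subjects:
      - kind: ServiceAccount
        name: spark-operator
        namespace: spark-operator           # illustrative; the operator's own namespace
    roleRef:
      kind: ClusterRole
      name: spark-operator-restricted
      apiGroup: rbac.authorization.k8s.io

The same ClusterRole can be bound into a second namespace with another RoleBinding, which would match the "scoped to two namespaces" goal mentioned earlier.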
[17:23:37] but it should also be communicated to devs I guess
[17:24:08] I have not thought this through, though. It's just my gut feeling :)
[17:24:09] jayme: Me too. Thanks.
[17:25:15] tbh I think you will have kind of a hard time limiting the operator. It creates webhook configs on the fly, for example, which are not namespaced objects
[17:26:19] What about if I make the ClusterRole and RoleBinding part of the chart, but limit access to the /etc/kubernetes/spark-operator-dse-k8s-eqiad-admin.config file?
[17:27:12] limiting access to that file is going to happen anyway
[17:27:13] jayme: OK, thanks. Will think about that. If it turns out that the operator just won't fly, then we'll have to look at other options, but I'd like to try first.
[17:27:16] that gets limited per namespace (sparkJobNamespace) is the scope of a spark cluster AIUI
[17:27:22] *what gets...
[17:28:03] https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/charts/spark-operator-chart/templates/rbac.yaml vs https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/charts/spark-operator-chart/templates/spark-rbac.yaml
[17:28:31] but I'm seriously unsure here :D
[17:29:38] We can take it to: https://phabricator.wikimedia.org/T322635
[17:29:38] I had envisaged that a member of `analytics-privatedata-users` launches a SparkApplication into the `spark` namespace.
[17:31:00] Within this namespace, a spark-driver pod is created, which then spawns several spark-executor pods. They are all within this namespace. I'm not yet sure how the webhook actually functions.
[17:31:01] yeah...I guess that is actually a matter of allowing those users to deploy a custom resource to the namespace (which the operator picks up and does its thing)
[17:31:43] AIUI the webhooks are created by the operator on startup and used to validate/mutate said custom resources
[17:32:34] Yep. I've got it working on minikube and my WIP helm chart using the common_templates is starting to come together.
[17:33:38] btw. we have something that is close to an operator in the clusters: cert-manager, that is
[17:33:51] if you want to look into how that is deployed currently
[17:34:22] Cool. Many thanks again.
[17:36:23] good luck :)
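Following on from the SparkApplication discussion at 17:29-17:31, letting `analytics-privatedata-users` submit jobs would then be a separate, namespaced grant on the custom resource itself. A rough sketch, assuming the group name maps through to a Kubernetes group via the cluster's authentication setup; the role names are illustrative and this is not an actual deployment-charts snippet:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: spark-app-submitter             # illustrative
      namespace: spark
    rules:
      - apiGroups: ["sparkoperator.k8s.io"]
        resources: ["sparkapplications"]
        verbs: ["get", "list", "watch", "create", "delete"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: spark-app-submitter
      namespace: spark
    subjects:
      - kind: Group
        name: analytics-privatedata-users   # assumption: exposed to the API server as a group
        apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: Role
      name: spark-app-submitter
      apiGroup: rbac.authorization.k8s.io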
[17:39:44] 10serviceops, 10API Platform, 10SRE: Block non-browser requests that use generic user agent (UA) headers - https://phabricator.wikimedia.org/T319423 (10daniel)
[17:55:39] 10serviceops, 10API Platform, 10SRE: Block non-browser requests that use generic user agent (UA) headers - https://phabricator.wikimedia.org/T319423 (10daniel) We have rate limits in place for some generic UA strings: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/m...
[17:57:23] 10serviceops, 10SRE: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (10Dzahn) per T316223#8381863 serviceops-core is taking this over
[17:57:34] 10serviceops, 10SRE: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (10Dzahn) 05Stalled→03Open
[17:57:40] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Dzahn)
[17:57:54] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10Dzahn)
[17:58:00] 10serviceops, 10SRE: service implementation tracking: arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319429 (10Dzahn) 05Stalled→03Open per T316223#8381863 serviceops-core is taking this over
[18:02:20] 10serviceops, 10Arc-Lamp, 10Performance-Team (Radar): Expand RAM on arclamp hosts and move them to baremetal - https://phabricator.wikimedia.org/T316223 (10Dzahn) @akosiaris Thank you! So what this is is: 1) hardware has been procured (I reviewed/approved in T316906 (eqiad) and T316907 (codfw). done -> 2...
[18:05:18] I'm off, see you tomorrow :)