[09:08:16] Morning :)
[09:08:58] hi
[09:09:47] hello folks
[09:41:02] Morning all.
[10:00:41] elukey: fixed the ml_k8s pod ranges. Thanks for checking!
[10:28:39] jayme │ if the resource limits are not different for canaries - I have not checked < They're not, afaik
[10:41:21] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 3 others: Upgrade maps servers to node >= 14 - https://phabricator.wikimedia.org/T321789 (10jijiki) @awight If I understand correctly, we are planning to deploy Karthotherian to Kubernetes using node 16?
[11:01:02] If I count correctly it's 5 CPU/pod and 1900MiB (say 2GiB) per pod requested, x(8 main + 2 canary) +25% = 75 CPU, 25GiB (going by requests), so we shouldn't be hitting the cap, unless we are somewhere between requests and limits; then yeah, we can hit namespace quotas (100+ CPU, 50+ GiB if everything hits limits)
[11:03:44] Or is there something I'm not understanding correctly?
[11:14:54] Limit is what counts here, which is 8250m CPU
[11:15:31] Right
[11:15:41] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 3 others: Upgrade maps servers to node >= 14 - https://phabricator.wikimedia.org/T321789 (10awight) >>! In T321789#8391863, @jijiki wrote: > @awight If I understand correctly, we are planning to deploy Karthothe...
[11:15:53] and it's 25% of the number of replicas, which will probably be rounded up to 3
[11:16:32] ack, so my calculations for the quota are off, checking
[11:19:35] Not that off, there's a bit of leeway (I'd gone with 120 CPU in the new quota, 13*8250m is 107.25 CPU)
[11:20:00] Since it'll undoubtedly be changed as we accommodate more traffic, I'd say that's good enough, yeah?
[11:22:27] maxUnavailable is 25% as well IIRC, which means that 2 pods (rounded down in that case) will enter terminated state right away, but they still count towards quota until they are actually terminated/gone
[11:23:22] Oh, so if they actually take too much time you can end up with (10+3+2) pods, right?
[11:23:32] so I'd say you should at least accommodate for 5 additional pods... although that's not completely true either, as canary and main are two different deployments, so it's not 25% of 10 but 25% of 8 + 25% of 2 :)
[11:23:39] Momentarily, but counting against quota
[11:24:23] jayme: Yeah, I gathered it was per release, but in that case it doesn't change much :P
[11:24:32] Hello. I wonder if anyone could help with this helmfile issue. I can't see what I'm doing wrong. On my spark-operator deployment to dse-k8s I'm trying to set the namespace here: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/855674/4/helmfile.d/admin_ng/spark-operator/helmfile.yaml#9
[11:25:15] ...but when it runs through CI the namespace comes out as default: https://integration.wikimedia.org/ci/job/helm-lint/8343/consoleFull
[11:25:18] claime: no, it's kubernetes doing the "rolling", which has no understanding of releases
[11:25:32] https://usercontent.irccloud-cdn.com/file/9E6a5HF2/image.png
[11:25:36] it's just deployments for k8s
[11:26:05] baseline is: don't be shy about bumping quota for mw namespaces :D
[11:26:11] lol ok
[11:27:26] btullis: you set the namespace where helm should deploy stuff into
[11:28:41] jayme: OK, but doesn't that allow me to use `namespace: {{ .Release.Namespace }}` in a template?
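(For context on the pattern being asked about, a minimal sketch with illustrative names, not the actual spark-operator template: a ClusterRoleBinding is cluster-scoped, so Helm does not namespace the object itself, and the only namespace in play is the one on its ServiceAccount subject, which has to be templated explicitly.)

```yaml
# Minimal sketch (illustrative names, not the real chart): the subject's
# namespace is the only namespaced reference in a ClusterRoleBinding.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: spark-operator
subjects:
  - kind: ServiceAccount
    name: spark-operator
    # If no namespace reaches the helm invocation (e.g. rendering with no
    # --namespace and no namespace in the kube context), this renders as
    # "default".
    namespace: {{ .Release.Namespace }}
```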
[11:28:43] the screenshot you posted is a clusterrolebinding (which is a non-namespaced object) that references a serviceaccount (a namespaced object) in the default namespace
[11:29:20] it should... I have not looked at the template of the clusterrolebinding tbh
[11:30:32] It's here: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/855674/4/charts/spark-operator/templates/rbac.yaml#72
[11:30:37] Sorry for the trouble.
[11:34:58] ah, okay.
[11:35:07] yes, that looks strange
[11:36:24] When I use `helm template` or `helm install` with this chart it all uses the correct (`spark-operator` and `spark`) namespaces. This is the first time that I've seen the helmfile output, when running it through CI.
[11:44:22] helmfile -e dse-k8s-eqiad template is what is run there
[11:45:29] OK, maybe it's worth installing helmfile on my workstation, so I can try to replicate it before sending it up to gerrit.
[11:53:34] tbh I'm not sure if this is maybe an issue with how helmfile template works
[11:55:59] self-merging mw-web quota update https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/856516
[12:04:56] jayme: OK, thanks. I'll continue investigating.
[12:04:56] I have an additional diff for cert-manager, clusterissuers.cfssl-issuer.wikimedia.org, CustomResourceDefinition: `- creationTimestamp: null`, that doesn't appear in CI
[12:05:37] (on admin_ng helmfile diff for both eqiad and codfw)
[12:06:12] That seems related to a note in ../../charts/cfssl-issuer-crds/README.md: "creationTimestamp: null" fields need to be removed from updated CRDs as those will trigger validation errors in kubeconform.
[12:14:17] I'm having trouble finding out what's causing that change
[12:25:16] 10serviceops, 10SRE: Deploy etcddump (or another etcd dump & load tool) to production - https://phabricator.wikimedia.org/T135124 (10jcrespo) I believe this was mislabeled, although please ask for help for dumping scheduling and monitoring, we have tooling we want to extend to services other than databases.
[12:25:56] 10serviceops, 10SRE, 10Technical-Debt: Reduce etcd technical debt - https://phabricator.wikimedia.org/T135122 (10jcrespo)
[12:26:24] I'll apply anyways, but if anyone has an idea of why this change appears in a totally unrelated deployment, I'm curious
[12:27:04] 10serviceops, 10SRE: Deploy etcddump (or another etcd dump & load tool) to production - https://phabricator.wikimedia.org/T135124 (10jcrespo) Is this related to T281447?
[12:29:49] Actually, I'll apply only my namespace quota change until I find out what's up with this definition disappearing
[13:13:18] jayme: I'm getting nowhere trying to track down why that metadata disappeared. I suspect it's no big deal and I can apply, but if you have any idea how that came to be, I'll take it.
[13:20:19] 10serviceops, 10Dumps-Generation, 10SRE, 10MW-1.39-notes, and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jijiki) 05Open→03Resolved a:03jijiki This task itself looks like it is done, please reopen if you disagreen or if I am missing somet...
[13:23:14] hnowlan: can I mark this as resolved https://phabricator.wikimedia.org/T319279 (Increased session loss since 20221001) ?
[13:23:36] latest graphs show that things are back to normal?
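(On the creationTimestamp diff being chased above: CRD manifests produced by generators such as controller-gen typically carry a literal `creationTimestamp: null`, which kubeconform's schema validation rejects even though the k8s API server accepts it, hence the README note about stripping the field. Roughly what the offending metadata looks like; the spec body is elided:)

```yaml
# Typical generated CRD header (sketch; spec elided). The explicit null
# timestamp is what trips kubeconform, while the API server simply treats
# creationTimestamp as a server-managed field.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  creationTimestamp: null  # remove before committing updated CRDs
  name: clusterissuers.cfssl-issuer.wikimedia.org
spec:
  group: cfssl-issuer.wikimedia.org
  # ...
```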
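(And stepping back to the morning's quota arithmetic, the numbers sketched as a ResourceQuota. All values here are illustrative rather than the actual object from deployment-charts:)

```yaml
# Sketch of the quota math discussed above (illustrative values only).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources  # hypothetical name
  namespace: mw-web        # hypothetical namespace
spec:
  hard:
    # Steady state: (8 main + 2 canary) pods x 8250m CPU limit = 82.5 CPU.
    # maxSurge is 25% per Deployment, rounded up:
    #   ceil(8 * 0.25) + ceil(2 * 0.25) = 2 + 1 = 3 extra pods.
    # 13 pods x 8250m = 107.25 CPU, so 120 leaves some headroom; note that
    # maxUnavailable pods still count against quota until actually gone.
    limits.cpu: "120"
    limits.memory: 50Gi  # illustrative; memory wasn't settled in the thread
```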
[13:31:27] 10serviceops, 10Sustainability: Automate the provisioning and management of MediaWiki clusters - https://phabricator.wikimedia.org/T118829 (10jijiki) 05Open→03Invalid I feel like this task is not relevant anymore, or if it is, it need to be rewritten in a way to reflect our current needs and infra. Closing:)
[13:31:40] jayme: Sorry to trouble you again. Do you think that this could be related to the version of helm in use? I cannot replicate it with the same version of helmfile on my workstation, but I'm using `helm` (3.7.0) instead of `helm3`.
[14:18:42] btullis: helm and helm3 should be the same if you installed helm3 on your machine. So `helm3 version` and `helm version` should return the same output. We are running helm v3.9.4 in production afaics.
[14:18:42] Regarding the missing namespace field: admin environments don't have a default namespace set in the kubeconfig/context (in contrast to service environments also used with kube_env). So I think the resources in that chart need to set their metadata.namespace explicitly to something like namespace: {{ .Release.Namespace }}.
[14:18:42] Maybe you can compare that with other admin services like cert-manager, helm-state-metrics or calico. jayme: correct me if I'm wrong ;)
[14:21:56] jelto: Thanks ever so much. On my workstation, helm and helm3 are different binaries with different versions at the moment.
[14:22:00] https://www.irccloud.com/pastebin/916xQWYl/
[14:23:54] I am already setting metadata.namespace to `{{ .Release.Namespace }}` in the resources: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/855674/4/charts/spark-operator/templates/rbac.yaml#72, unless I've misunderstood what you mean.
[14:25:49] Would anyone be strongly opposed to my merging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/855674 so that I could try a `helmfile diff` from a deployment host? If it still shows the incorrect namespace then I could revert it. If it doesn't, then it points to a CI problem of some sort.
[14:34:17] effie: yep, I guess so - little chance we'll find anything interesting out now
[14:34:40] cool tx
[14:35:36] 10serviceops, 10MediaWiki-Authentication-and-authorization, 10Platform Engineering, 10SRE: Increased session loss since 20221001 - https://phabricator.wikimedia.org/T319279 (10jijiki) 05Open→03Resolved a:03jijiki Per @hnowlan's latest comment, I am marking this as resolved
[14:36:49] claime: sorry, was afk for some woodwork - that creationTimestamp mess, did you see it by running helmfile on deploy1002 or in CI?
[14:40:44] btullis: I would argue to try to use the same helm version as we do in CI and prod if you want to make sure. I can replicate that exact same output on my machine
[14:40:54] jayme: can you run pcc again for https://gerrit.wikimedia.org/r/c/operations/puppet/+/855997 ?
[14:41:24] elukey: absolutely, sorry
[14:41:30] <3
[14:42:28] btullis: jelto's note is a good one, though. I do see the same behaviour for cert-manager. That makes it even more likely to be a helmfile limitation
[14:44:25] at this time, without any reviews, I would also definitely not merge that chart!
[14:44:35] > I can replicate that exact same output on my machine
[14:44:35] Oh, you can see the `namespace: default` when running on your workstation? I've installed the .deb for helmfile so that's the same, but I'll try the 3.9.4 deb instead of my local version.
[14:45:02] > I can replicate that exact same output on my machine
[14:45:02] Yep, understood. Thanks.
[14:45:26] Sorry, wrong paste. That was meant to reply to `I would also definitely not merge that chart!`
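(For reference on what's being replicated here: in admin_ng the release's target namespace is declared on the helmfile release entry, roughly like the sketch below, with illustrative values rather than the actual admin_ng/spark-operator/helmfile.yaml. The open question in this thread is why, under `helmfile -e dse-k8s-eqiad template`, that namespace doesn't make it into `{{ .Release.Namespace }}`.)

```yaml
# Sketch of a helmfile release entry (illustrative values only).
releases:
  - name: spark-operator
    namespace: spark-operator   # where helm should install the release
    chart: wmf-stable/spark-operator
    # On install/upgrade this namespace is handed to helm, so
    # .Release.Namespace resolves correctly; in this thread,
    # `helmfile -e dse-k8s-eqiad template` rendered "default" instead.
```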
[14:46:23] yes. When running "helmfile -e dse-k8s-eqiad template" I do get "namespace: default" for that serviceaccount subject
[14:47:57] OK, thanks. Yes, I have now tried it with helm 3.9.4 and I can replicate.
[15:05:34] jayme: On deploy1002; it doesn't appear in CI
[15:06:06] (and no worries, there was nothing urgent, since I could still apply my change on the namespace release)
[15:07:04] claime: it's a kubeconform quirk anyways. "null" is a valid value for that field in the k8s api. So applying should just work
[15:09:26] jayme: Yeah, it's just that as far as I can tell it's removing that key from the CRD
[15:10:07] yep. That's fine. It's a k8s-managed field anyways
[15:10:17] ok, applying then
[15:11:13] fingers crossed :D
[15:14:47] Doesn't seem to have changed the creationTimestamp of the existing CRD
[15:14:57] So I guess that's good
[15:15:14] cool. Thanks for applying
[15:42:21] 10serviceops, 10serviceops-collab, 10GitLab (Infrastructure): Reduce usage of public IPv4 addresses on GitLab hosts - https://phabricator.wikimedia.org/T310265 (10Jelto) @Volans, @ayounsi, @cmooney, @BBlack and I had a chat about this topic during the SRE summit. We talked about multiple options which wo...
[15:59:22] <_joe_> i'm going to be 1-2 minutes late
[15:59:28] <_joe_> start without me please
[16:03:29] argh trying to join one sec
[16:04:26] 10serviceops, 10serviceops-collab, 10GitLab (Infrastructure): Reduce usage of public IPv4 addresses on GitLab hosts - https://phabricator.wikimedia.org/T310265 (10thcipriani) Thanks for the detailed write up as always @Jelto 🎉 >>! In T310265#8392890, @Jelto wrote: > For this and the previous option I have...
[16:09:38] 10serviceops, 10serviceops-collab, 10GitLab (Infrastructure): Reduce usage of public IPv4 addresses on GitLab hosts - https://phabricator.wikimedia.org/T310265 (10dancy)
[16:21:02] 10serviceops, 10serviceops-collab, 10GitLab (Infrastructure): Reduce usage of public IPv4 addresses on GitLab hosts - https://phabricator.wikimedia.org/T310265 (10dancy) Seconding what @thcipriani said, I'm strongly against disabling git over ssh. Using HTTP only requires plaintext passwords to be stored on...
[16:42:03] so the first task in our queue is https://phabricator.wikimedia.org/T320403 (Run the timezone update script periodically in prod and in beta)
[16:50:14] alright, no comments, moving along
[16:50:34] effie: That may fall under "things I should learn how to do", but I am not sure where to start
[16:51:00] Regarding priority, I have no idea
[16:51:02] I think you could start with Amir
[16:51:42] I believe he will be able to get you started on this; it should not be much work
[16:52:42] * claime side-eyes footgun
[16:54:26] claime: we could put it in this year's backlog and chat with the folks about details; if Amir can't help, we can figure it out
[16:54:34] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, 10Campaign-Tools (Campaign-Tools-Sprint-24): Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10jijiki) @Daimona Do you have a timeline as to when you need this...
[16:54:45] sounds good?
[16:54:49] yep
[16:55:03] great
[16:55:33] https://phabricator.wikimedia.org/T320241 (Incorrect handling of ETags taking precedence over timestamps in conditional requests)
[16:56:42] I can take that and figure out what is up
[16:57:00] 10serviceops, 10SRE, 10Wikibase Product Platform, 10Wikimedia-Apache-configuration: Incorrect handling of ETags taking precedence over timestamps in conditional requests - https://phabricator.wikimedia.org/T320241 (10jijiki)
[16:58:38] _joe_: I think we need your help with https://phabricator.wikimedia.org/T320929 (Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing)
[17:00:19] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, 10Campaign-Tools (Campaign-Tools-Sprint-24): Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10jijiki) a:03Clement_Goubert
[17:14:53] 10serviceops, 10SRE, 10conftool: Not all confd errors throw icinga alerts - https://phabricator.wikimedia.org/T110933 (10jijiki) 05Open→03Declined Bluntly closing this as there has been no update for quite some years now
[17:26:29] 10serviceops, 10SRE, 10conftool: confctl no longer logs a non-changing state change - https://phabricator.wikimedia.org/T161096 (10MoritzMuehlenhoff) 05Open→03Declined After five years we can now consider the established status quo, let's just keep it as-is.
[18:09:57] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10Papaul)
[18:33:31] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10Dzahn)
[18:34:07] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10Dzahn) updated subteam contacts based on T316223#8381863
[18:34:51] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Dzahn)
[18:35:02] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Dzahn) updated sub-team contacts based on T316223#8381863
[18:40:00] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, 10Campaign-Tools (Campaign-Tools-Sprint-24): Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Daimona) >>! In T320403#8393213, @jijiki wrote: > @Daimona Do you...
[18:49:28] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 3 others: Create helm chart for kartotherian k8s deployment - https://phabricator.wikimedia.org/T231006 (10MSantos) > Where should the OSM sync to master postgres be run? Perhaps in a specialized variant of the...
[23:48:22] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10Papaul)
[23:49:44] 10serviceops, 10Arc-Lamp, 10Performance-Team (Radar): Expand RAM on arclamp hosts and move them to baremetal - https://phabricator.wikimedia.org/T316223 (10Papaul)
[23:52:15] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host arclamp2001.codfw.wmnet with OS bullseye