[09:08:16] Morning :)
[09:08:58] hi
[09:09:47] hello folks
[09:41:02] Morning all.
[10:00:41] elukey: fixed the ml_k8s pod ranges. Thanks for checking!
[10:28:39] jayme │ if the resource limits are not different for canaries - I have not checked < They're not, afaik
[10:41:21] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 3 others: Upgrade maps servers to node >= 14 - https://phabricator.wikimedia.org/T321789 (10jijiki) @awight If I understand correctly, we are planning to deploy Karthotherian to Kubernetes using node 16?
[11:01:02] If I count correctly it's 5 CPU/pod and 1900MiB (say 2GiB) per pod requested, x(8 main + 2 canary) +25% = 75 CPU, 25GiB (going by requests), so we shouldn't be hitting the cap, unless we are somewhere between requests and limits; then yeah, we can hit namespace quotas (100+ CPU, 50+ GiB if everything hits limits)
[11:03:44] Or is there something I'm not understanding correctly?
[11:14:54] Limit is what counts here, which is 8250m CPU
[11:15:31] Right
[11:15:41] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 3 others: Upgrade maps servers to node >= 14 - https://phabricator.wikimedia.org/T321789 (10awight) >>! In T321789#8391863, @jijiki wrote: > @awight If I understand correctly, we are planning to deploy Karthothe...
[11:15:53] and it's 25% of the number of replicas, which will probably be rounded up to 3
[11:16:32] ack, so my calculations for the quota are off, checking
[11:19:35] Not that off, there's a bit of leeway (I'd gone with 120 CPU in the new quota, 13*8250m is 107.25 CPU)
[11:20:00] Since it'll undoubtedly be changed as we accommodate more traffic, I'd say that's good enough, yeah?
[11:22:27] maxUnavailable is 25% as well IIRC, which means that 2 pods (rounded down in that case) will enter terminated state right away, but they still count towards quota until they are actually terminated/gone
[11:23:22] Oh, so if they actually take too much time you can end up with (10+3+2) pods, right?
[11:23:32] so I'd say you should at least accommodate for 5 additional pods... although that's not completely true either, as canary and main are two different deployments, so it's not 25% of 10 but 25% of 8 + 25% of 2 :)
[11:23:39] Momentarily, but counting against quota
[11:24:23] jayme: Yeah, I gathered it was per release, but in that case it doesn't change much :P
[11:24:32] Hello. I wonder if anyone could help with this helmfile issue. I can't see what I'm doing wrong. On my spark-operator deployment to dse-k8s I'm trying to set the namespace here: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/855674/4/helmfile.d/admin_ng/spark-operator/helmfile.yaml#9
[11:25:15] ...but when it runs through CI the namespace comes out as default: https://integration.wikimedia.org/ci/job/helm-lint/8343/consoleFull
[11:25:18] claime: no, it's kubernetes doing the "rolling", which has no understanding of releases
[11:25:32] https://usercontent.irccloud-cdn.com/file/9E6a5HF2/image.png
[11:25:36] it's just deployments for k8s
[11:26:05] baseline is: don't be shy about bumping quota for mw namespaces :D
[11:26:11] lol ok
[11:27:26] btullis: you set the namespace where helm should deploy stuff into
[11:28:41] jayme: OK, but doesn't that allow me to use `namespace: {{ .Release.Namespace }}` in a template?
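(For context on the pattern being asked about, a minimal sketch with illustrative names, not the actual spark-operator template: a ClusterRoleBinding is cluster-scoped, so Helm does not namespace the object itself, and the only namespace in play is the one on its ServiceAccount subject, which has to be templated explicitly.)

```yaml
# Minimal sketch (illustrative names, not the real chart): the subject's
# namespace is the only namespaced reference in a ClusterRoleBinding.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: spark-operator
subjects:
  - kind: ServiceAccount
    name: spark-operator
    # If no namespace reaches the helm invocation (e.g. rendering with no
    # --namespace and no namespace in the kube context), this renders as
    # "default".
    namespace: {{ .Release.Namespace }}
```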
[11:28:43] the screenshot you posted is a clusterrolebinding (which is a non-namespaced object) that references a serviceaccount (a namespaced object) in the default namespace
[11:29:20] it should... I have not looked at the template of the clusterrolebinding tbh
[11:30:32] It's here: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/855674/4/charts/spark-operator/templates/rbac.yaml#72
[11:30:37] Sorry for the trouble.
[11:34:58] ah, okay.
[11:35:07] yes, that looks strange
[11:36:24] When I use `helm template` or `helm install` with this chart it all uses the correct (`spark-operator` and `spark`) namespaces. This is the first time that I've seen the helmfile output, when running it through CI.
[11:44:22] helmfile -e dse-k8s-eqiad template is what is run there
[11:45:29] OK, maybe it's worth installing helmfile on my workstation, so I can try to replicate it before sending it up to gerrit.
[11:53:34] tbh I'm not sure if this is maybe an issue with how helmfile template works
[11:55:59] self-merging mw-web quota update https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/856516
[12:04:56] jayme: OK, thanks. I'll continue investigating.
[12:04:56] I have an additional diff for cert-manager, clusterissuers.cfssl-issuer.wikimedia.org, CustomResourceDefinition: `- creationTimestamp: null`, that doesn't appear in CI
[12:05:37] (on admin_ng helmfile diff for both eqiad and codfw)
[12:06:12] That seems related to a note in ../../charts/cfssl-issuer-crds/README.md: "creationTimestamp: null" fields need to be removed from updated CRDs as those will trigger validation errors in kubeconform.
[12:14:17] I'm having trouble finding out what's causing that change
[12:25:16] 10serviceops, 10SRE: Deploy etcddump (or another etcd dump & load tool) to production - https://phabricator.wikimedia.org/T135124 (10jcrespo) I believe this was mislabeled, although please ask for help for dumping scheduling and monitoring, we have tooling we want to extend to services other than databases.
[12:25:56] 10serviceops, 10SRE, 10Technical-Debt: Reduce etcd technical debt - https://phabricator.wikimedia.org/T135122 (10jcrespo)
[12:26:24] I'll apply anyways, but if anyone has an idea of why this change appears in a totally unrelated deployment, I'm curious
[12:27:04] 10serviceops, 10SRE: Deploy etcddump (or another etcd dump & load tool) to production - https://phabricator.wikimedia.org/T135124 (10jcrespo) Is this related to T281447?
[12:29:49] Actually, I'll apply only my namespace quota change until I find out what's up with this definition disappearing
[13:13:18] jayme: I'm getting nowhere trying to track down why that metadata disappeared. I suspect it's no big deal and I can apply, but if you have any idea how that came to be, I'll take it.
[13:20:19] 10serviceops, 10Dumps-Generation, 10SRE, 10MW-1.39-notes, and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jijiki) 05Open→03Resolved a:03jijiki This task itself looks like it is done, please reopen if you disagreen or if I am missing somet...
[13:23:14] hnowlan: can I mark this as resolved https://phabricator.wikimedia.org/T319279 (Increased session loss since 20221001) ?
[13:23:36] latest graphs show that things are back to normal?
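(On the creationTimestamp diff being chased above: CRD manifests produced by generators such as controller-gen typically carry a literal `creationTimestamp: null`, which kubeconform's schema validation rejects even though the k8s API server accepts it, hence the README note about stripping the field. Roughly what the offending metadata looks like; the spec body is elided:)

```yaml
# Typical generated CRD header (sketch; spec elided). The explicit null
# timestamp is what trips kubeconform, while the API server simply treats
# creationTimestamp as a server-managed field.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  creationTimestamp: null  # remove before committing updated CRDs
  name: clusterissuers.cfssl-issuer.wikimedia.org
spec:
  group: cfssl-issuer.wikimedia.org
  # ...
```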
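(And stepping back to the morning's quota arithmetic, the numbers sketched as a ResourceQuota. All values here are illustrative rather than the actual object from deployment-charts:)

```yaml
# Sketch of the quota math discussed above (illustrative values only).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources  # hypothetical name
  namespace: mw-web        # hypothetical namespace
spec:
  hard:
    # Steady state: (8 main + 2 canary) pods x 8250m CPU limit = 82.5 CPU.
    # maxSurge is 25% per Deployment, rounded up:
    #   ceil(8 * 0.25) + ceil(2 * 0.25) = 2 + 1 = 3 extra pods.
    # 13 pods x 8250m = 107.25 CPU, so 120 leaves some headroom; note that
    # maxUnavailable pods still count against quota until actually gone.
    limits.cpu: "120"
    limits.memory: 50Gi  # illustrative; memory wasn't settled in the thread
```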
[13:31:27] 10serviceops, 10Sustainability: Automate the provisioning and management of MediaWiki clusters - https://phabricator.wikimedia.org/T118829 (10jijiki) 05Open→03Invalid I feel like this task is not relevant anymore, or if it is, it need to be rewritten in a way to reflect our current needs and infra. Closing:)
[13:31:40] jayme: Sorry to trouble you again. Do you think that this could be related to the version of helm in use? I cannot replicate it with the same version of helmfile on my workstation, but I'm using `helm` (3.7.0) instead of `helm3`.
[14:18:42] btullis: helm and helm3 should be the same if you installed helm3 on your machine. So `helm3 version` and `helm version` should return the same output. We are running helm v3.9.4 in production afaics.
[14:18:42] Regarding the missing namespace field: admin environments don't have a default namespace set in the kubeconfig/context (in contrast to service environments also used with kube_env). So I think the resources in that chart need to set their metadata.namespace explicitly to something like namespace: {{ .Release.Namespace }}.
[14:18:42] Maybe you can compare that with other admin services like cert-manager, helm-state-metrics or calico. jayme: correct me if I'm wrong ;)
[14:21:56] jelto: Thanks ever so much. On my workstation, helm and helm3 are different binaries with different versions at the moment.
[14:22:00] https://www.irccloud.com/pastebin/916xQWYl/
[14:23:54] I am already setting metadata.namespace to `{{ .Release.Namespace }}` in the resources: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/855674/4/charts/spark-operator/templates/rbac.yaml#72, unless I've misunderstood what you mean.
[14:25:49] Would anyone be strongly opposed to my merging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/855674 so that I could try a `helmfile diff` from a deployment host? If it still shows the incorrect namespace then I could revert it. If it doesn't, then it points to a CI problem of some sort.
[14:34:17] effie: yep, I guess so - little chance we'll find anything interesting out now
[14:34:40] cool tx
[14:35:36] 10serviceops, 10MediaWiki-Authentication-and-authorization, 10Platform Engineering, 10SRE: Increased session loss since 20221001 - https://phabricator.wikimedia.org/T319279 (10jijiki) 05Open→03Resolved a:03jijiki Per @hnowlan's latest comment, I am marking this as resolved
[14:36:49] claime: sorry, was afk for some woodwork - that creationTimestamp mess, did you see it by running helmfile on deploy1002 or in CI?
[14:40:44] btullis: I would argue to try to use the same helm version as we do in CI and prod if you want to make sure. I can replicate that exact same output on my machine
[14:40:54] jayme: can you run pcc again for https://gerrit.wikimedia.org/r/c/operations/puppet/+/855997 ?
[14:41:24] elukey: absolutely, sorry
[14:41:30] <3
[14:42:28] btullis: jelto's note is a good one, though. I do see the same behaviour for cert-manager. That makes it even more likely to be a helmfile limitation
[14:44:25] at this time, without any reviews, I would also definitely not merge that chart!
[14:44:35] > I can replicate that exact same output on my machine
[14:44:35] Oh, you can see the `namespace: default` when running on your workstation? I've installed the .deb for helmfile so that's the same, but I'll try the 3.9.4 deb instead of my local version.
[14:45:02] > I can replicate that exact same output on my machine
[14:45:02] Yep, understood. Thanks.
[14:45:26] Sorry, wrong paste. That was meant to reply to `I would also definitely not merge that chart!`
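(For reference on what's being replicated here: in admin_ng the release's target namespace is declared on the helmfile release entry, roughly like the sketch below, with illustrative values rather than the actual admin_ng/spark-operator/helmfile.yaml. The open question in this thread is why, under `helmfile -e dse-k8s-eqiad template`, that namespace doesn't make it into `{{ .Release.Namespace }}`.)

```yaml
# Sketch of a helmfile release entry (illustrative values only).
releases:
  - name: spark-operator
    namespace: spark-operator   # where helm should install the release
    chart: wmf-stable/spark-operator
    # On install/upgrade this namespace is handed to helm, so
    # .Release.Namespace resolves correctly; in this thread,
    # `helmfile -e dse-k8s-eqiad template` rendered "default" instead.
```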
[14:46:23] yes. When running "helmfile -e dse-k8s-eqiad template" I do get "namespace: default" for that serviceaccount subject
[14:47:57] OK, thanks. Yes, I have now tried it with helm 3.9.4 and I can replicate.
[15:05:34] jayme: On deploy1002; it doesn't appear in CI
[15:06:06] (and no worries, there was nothing urgent, since I could still apply my change on the namespace release)
[15:07:04] claime: it's a kubeconform quirk anyways. "null" is a valid value for that field in the k8s api. So applying should just work
[15:09:26] jayme: Yeah, it's just that as far as I can tell it's removing that key from the CRD
[15:10:07] yep. That's fine. It's a k8s-managed field anyways
[15:10:17] ok, applying then
[15:11:13] fingers crossed :D
[15:14:47] Doesn't seem to have changed the creationTimestamp of the existing CRD
[15:14:57] So I guess that's good
[15:15:14] cool. Thanks for applying
[15:42:21] 10serviceops, 10serviceops-collab, 10GitLab (Infrastructure): Reduce usage of public IPv4 addresses on GitLab hosts - https://phabricator.wikimedia.org/T310265 (10Jelto) @Volans, @ayounsi, @cmooney, @BBlack and I had a chat about this topic during the SRE summit. We talked about multiple options which wo...
[15:59:22] <_joe_> i'm going to be 1-2 minutes late
[15:59:28] <_joe_> start without me please
[16:03:29] argh trying to join one sec
[16:04:26] 10serviceops, 10serviceops-collab, 10GitLab (Infrastructure): Reduce usage of public IPv4 addresses on GitLab hosts - https://phabricator.wikimedia.org/T310265 (10thcipriani) Thanks for the detailed write up as always @Jelto 🎉 >>! In T310265#8392890, @Jelto wrote: > For this and the previous option I have...
[16:09:38] 10serviceops, 10serviceops-collab, 10GitLab (Infrastructure): Reduce usage of public IPv4 addresses on GitLab hosts - https://phabricator.wikimedia.org/T310265 (10dancy)
[16:21:02] 10serviceops, 10serviceops-collab, 10GitLab (Infrastructure): Reduce usage of public IPv4 addresses on GitLab hosts - https://phabricator.wikimedia.org/T310265 (10dancy) Seconding what @thcipriani said, I'm strongly against disabling git over ssh. Using HTTP only requires plaintext passwords to be stored on...
[16:42:03] so the first task in our queue is https://phabricator.wikimedia.org/T320403 (Run the timezone update script periodically in prod and in beta)
[16:50:14] alright, no comments, moving along
[16:50:34] effie: That may fall under "things I should learn how to do", but I am not sure where to start
[16:51:00] Regarding priority, I have no idea
[16:51:02] I think you could start with Amir
[16:51:42] I believe he will be able to get you started on this; it should not be much work
[16:52:42] * claime side-eyes footgun
[16:54:26] claime: we could put it in this year's backlog and chat with the folks about details; if Amir can't help, we can figure it out
[16:54:34] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, 10Campaign-Tools (Campaign-Tools-Sprint-24): Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10jijiki) @Daimona Do you have a timeline as to when you need this...
[16:54:45] sounds good?
[16:54:49] yep
[16:55:03] great
[16:55:33] https://phabricator.wikimedia.org/T320241 (Incorrect handling of ETags taking precedence over timestamps in conditional requests)
[16:56:42] I can take that and figure out what is up
[16:57:00] 10serviceops, 10SRE, 10Wikibase Product Platform, 10Wikimedia-Apache-configuration: Incorrect handling of ETags taking precedence over timestamps in conditional requests - https://phabricator.wikimedia.org/T320241 (10jijiki)
[16:58:38] _joe_: I think we need your help with https://phabricator.wikimedia.org/T320929 (Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing)
[17:00:19] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, 10Campaign-Tools (Campaign-Tools-Sprint-24): Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10jijiki) a:03Clement_Goubert
[17:14:53] 10serviceops, 10SRE, 10conftool: Not all confd errors throw icinga alerts - https://phabricator.wikimedia.org/T110933 (10jijiki) 05Open→03Declined Bluntly closing this as there has been no update for quite some years now
[17:26:29] 10serviceops, 10SRE, 10conftool: confctl no longer logs a non-changing state change - https://phabricator.wikimedia.org/T161096 (10MoritzMuehlenhoff) 05Open→03Declined After five years we can now consider the established status quo, let's just keep it as-is.
[18:09:57] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10Papaul)
[18:33:31] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10Dzahn)
[18:34:07] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10Dzahn) updated subteam contacts based on T316223#8381863
[18:34:51] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Dzahn)
[18:35:02] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Dzahn) updated sub-team contacts based on T316223#8381863
[18:40:00] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, 10Campaign-Tools (Campaign-Tools-Sprint-24): Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Daimona) >>! In T320403#8393213, @jijiki wrote: > @Daimona Do you...
[18:49:28] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 3 others: Create helm chart for kartotherian k8s deployment - https://phabricator.wikimedia.org/T231006 (10MSantos) > Where should the OSM sync to master postgres be run? Perhaps in a specialized variant of the...
[23:48:22] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10Papaul)
[23:49:44] 10serviceops, 10Arc-Lamp, 10Performance-Team (Radar): Expand RAM on arclamp hosts and move them to baremetal - https://phabricator.wikimedia.org/T316223 (10Papaul)
[23:52:15] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host arclamp2001.codfw.wmnet with OS bullseye