[06:26:16] <_joe_> jayme: Ia5f55c5703e7f878640a61613af08a196ae17cbd broke sextant
[06:26:27] <_joe_> you re-introduced a file only in the vendor directory of flink
[06:26:30] <_joe_> not as a module
[06:33:07] <_joe_> jayme: as punishment, you need to go review https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/901767 and followups
[07:38:39] oh, yw :p
[07:58:02] _joe_: there are no chart version bumps in those changes. Is that on purpose?
[07:58:26] <_joe_> jayme: yes, they're all practical noops
[07:58:35] <_joe_> I don't think it warrants a deployment
[07:58:45] <_joe_> all changes are whitespace changes
[08:00:05] well...after what happened with the flink chart after upgrading mesh.configuration I'm a bit wary
[08:00:48] <_joe_> what happened?
[08:01:25] <_joe_> didn't we actually already upgrade it because it doesn't need a tls listener?
[08:01:28] see my comment on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/895765
[08:01:43] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/895765/1#message-c45355ff35344dcf085589858fb75f56bfe30818
[08:02:19] might be a thing with flink only, though
[08:03:19] <_joe_> that... is not a factual change?
[08:04:30] <_joe_> https://integration.wikimedia.org/ci/job/helm-lint/9771/consoleFull seems to agree btw, nothing should really change
[08:04:52] <_joe_> it's just whitespace changes I didn't find a way to avoid
[08:08:27] yeah, well...maybe it's just a thing in flink then. tbh I've no idea why helm acted that way; maybe it has to do with which template files the objects are rendered in
[08:08:51] I did not investigate further
[08:09:10] but it made flink undeployable for sure
[08:10:00] <_joe_> sigh
[08:10:06] <_joe_> ok we can try one :P
[08:10:16] and it looks very harmless in the CI diff as well https://integration.wikimedia.org/ci/job/helm-lint/9536/console
[08:11:02] <_joe_> I mean that is clearly a defect in help
[08:11:04] <_joe_> *helm
[08:11:10] the difference there is that in the flink diff some new object seems to appear: "+# Source: flink-session-cluster/templates/configmap.yaml"
[08:11:12] <_joe_> the configmap didn't change name
[08:11:23] yeah, sure
[08:11:29] <_joe_> ah yeah I was about to say
[08:11:37] <_joe_> that's the only thing I can see there
[08:12:08] still no reason for it to behave that way, but it could be an indicator that it has to do with how templates are organized in the chart
[08:13:10] <_joe_> it indicates a helm bug to me
[08:13:14] absolutely
[08:13:27] but I had no strong intention of figuring that out tbh
[08:13:56] but let's bump at least one of the charts to verify
[08:21:10] aah, wait
[08:21:19] maybe you sneaked in the fix actually
[08:21:27] hello folks, I am going to stop/upgrade/etc. kafka-main1004 today
[08:21:45] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/901679/4/charts/tegola-vector-tiles/templates/vendor/mesh/configuration_1.1.1.tpl lines 27-28
[08:24:15] this has maybe fixed the whitespace chomping, making sure the objects are properly separated
[08:37:26] jayme, hnowlan - o/ I was about to deploy changeprop in eqiad/codfw for a lift wing change, but I noticed a chart bump + some network policies dropped (likely due to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/893019). Is that expected? Mind taking a look, just to verify that those are the intended changes?
[09:05:49] <_joe_> elukey: I will look
[09:06:02] <_joe_> also, we might want to piggyback another change onto your release
[09:06:29] <_joe_> elukey: what's your change?
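A minimal sketch of the whitespace-chomping failure mode discussed above, using a throwaway chart; the chart and file names here are made up for illustration and are not the actual mesh.configuration template. The point is only that Helm separates rendered objects with a literal "---" line, and a trailing "-}}" trims the newline in front of that separator, so two objects can end up glued together.

helm create chomp-demo            # throwaway chart, only used for rendering
rm -rf chomp-demo/templates/*

cat > chomp-demo/templates/objects.yaml <<'EOF'
{{- if true }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: first
{{- end -}}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: second
EOF

# The "-}}" on the end action chomps the newline before "---", so the rendered
# output contains "name: first---" and the two ConfigMaps are no longer
# separate documents.
helm template demo ./chomp-demo

# Removing the right-hand chomp keeps "---" on its own line and the two objects
# render as properly separated documents again.
sed -i 's/{{- end -}}/{{- end }}/' chomp-demo/templates/objects.yaml
helm template demo ./chomp-demo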
[09:06:37] <_joe_> you didn't link it :P
[09:07:29] _joe_ I just merged it, it is not related to the network policy change
[09:07:42] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/886918
[09:08:00] <_joe_> elukey: ok, can you wait to deploy until I've merged my change too?
[09:08:00] but in the diff I found the network policy changes and didn't proceed
[09:08:11] sure sure, not really urgent
[09:08:17] I can deploy tomorrow as well
[09:08:41] <_joe_> can you paste the diff you're seeing?
[09:08:54] <_joe_> helmfile -e diff | phaste
[09:08:56] <_joe_> :)
[09:10:57] https://phabricator.wikimedia.org/P45909
[09:11:01] <_joe_> <3
[09:12:29] <_joe_> that definitely DOESN'T look right
[09:12:32] <_joe_> jayme: ^^
[09:13:18] <_joe_> jayme: I'm looking at that change of yours and it looks like a ticking time bomb tbh
[09:15:13] <_joe_> jayme: like, I picked zotero - it would fail to work anymore
[09:15:59] <_joe_> ah it's now in a global network policy
[09:16:46] <_joe_> elukey: I think it's safe to deploy, but let's wait for jayme to confirm
[09:16:59] yes yes
[09:22:33] <_joe_> elukey: what I mean is that now I do see the global network policies
[09:22:56] so in theory nothing changes if I remove those, ack
[09:22:58] BUT
[09:23:05] let's wait in any case, it is changeprop :D
[09:24:03] <_joe_> changeprop is the one thing that doesn't use envoy to reach restbase
[09:24:11] <_joe_> so it needs the global network policy :)
[09:24:38] <_joe_> oh btw, are you changing the chart or just the deployment?
[09:26:38] just the deployment, the chart bump was from Janis' patch IIUC
[09:26:40] <_joe_> in the latter case, we can work independently
[09:26:44] <_joe_> ack
[09:29:08] <_joe_> ok, I'm going afk for a bit while we wait for a response
[09:30:23] sorry, had to pick up my new passport
[09:31:29] yeah, those network policy changes should be fine. I only deployed all services in staging last week because of the sprint
[09:33:40] I did not come back to re-deploying everything in prod until now, but it should be safe
[09:34:22] I am fine with holding off the deployment until we want to consistently roll out the changes in all of prod, nothing really urgent
[09:35:48] tbh I was planning not to do all services at once but rather trickle them in
[09:36:00] no reason to hold off from my end
[09:38:20] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main1004.eqiad.wmnet with OS bullseye
[10:03:04] weird that those changes are showing up now - changeprop got redeployed on Monday
[10:16:07] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main1004.eqiad.wmnet with OS bullseye completed: - kafka-main1004 (**PASS**) - Downtimed on Icinga/Alertmanager - Dis...
[10:18:36] hnowlan: ok if I proceed with changeprop's deployment?
[10:20:04] <_joe_> hnowlan: and, can we also deploy -jobqueue?
[10:20:15] <_joe_> specifically - anything I should be mindful of?
[10:22:02] _joe_ kafka-main1004 reimaged, I'd like to fix idrac+nic+bios of kafka-main200[4,5] as well (need to stop kafka on them, but I'll not reimage right now). Is it ok if I do it?
[10:22:27] <_joe_> elukey: yes, ofc
[10:22:41] super thanks
[10:23:39] <_joe_> elukey: can you wait for changeprop to be deployed everywhere?
[10:23:50] <_joe_> (the jobqueue version)
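For reference, the diff-and-paste step suggested above looks roughly like this on a deployment host; the service directory is an assumption, and phaste is the WMF helper that uploads stdin as a Phabricator paste (P45909 above is such a paste).

cd /srv/deployment-charts/helmfile.d/services/changeprop   # path is an assumption
helmfile -e eqiad diff | phaste                            # upload the rendered diff for review
# Check the pasted diff for surprises (e.g. NetworkPolicy objects being dropped)
# before running: helmfile -e eqiad apply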
[10:24:14] _joe_ already stopped kafka on 2005, I can bring it back if you want
[10:24:40] it should take around 10/15 mins max though
[10:25:01] <_joe_> no it's ok
[10:25:14] <_joe_> you just won a changeprop-jobqueue deployment
[10:25:19] <_joe_> :D
[10:26:17] I thought that reimaging both kafka main clusters was the right punishment for my sins :D
[10:28:37] serviceops, SRE-Sprint-Week-Sustainability-March2023, Traffic, envoy, and 2 others: Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Joe) a: Joe
[10:36:26] yep, all good for both
[10:37:54] for jobqueue it's probably best to manually bump replicas to 29 on deploy as it's hitting our limits
[10:38:01] if you're willing to wait a few minutes I can get a fix in for that
[10:40:28] <_joe_> hnowlan: sure, sync with elukey, he's deploying :D
[10:40:41] <_joe_> jokes aside, sure, go on and you can test your changes with my patch :)
[10:53:08] fwiw, on Monday I pinned the chart version in helmfile.yaml to the one already deployed (0.10.21 IIRC) and lowered replicas to 20 to avoid hitting the resource-quota limits
[10:53:21] then after the upgrade was done, went back to 30 replicas
[10:56:13] ahh ok, that explains the diff
[10:56:35] I've pushed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/902048 to fix the replica messiness
[11:29:08] should we just bump the resource quota limits for that namespace btw?
[11:30:27] <_joe_> +1
[11:31:05] We can, but I don't think it's necessary for now. It's pretty overprovisioned
[11:31:12] kafka-main2005 up and running with new firmware etc., kafka recovered nicely
[11:31:30] going afk for a couple of hours, will keep going later on
[12:19:47] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/902067
[12:56:21] serviceops, SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, Wikimedia-Incident: Add etcdmirror connection retry on etcd-tls-proxy unavailability - https://phabricator.wikimedia.org/T317535 (Volans) a: Volans
[12:58:52] <_joe_> jayme: bear with me - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/901679 now has a different sequence of patches
[12:59:08] <_joe_> I realized what the bug was indeed and I think we can just remove base.kubernetes directly
[13:00:43] _joe_: was it the whitespace chomping as I suggested?
[13:00:53] <_joe_> yes
[13:01:28] feels like we need some way to test all if-guards in modules
[13:01:42] <_joe_> you mean test coverage
[13:01:52] * _joe_ mumbles again something about writing actual code
[13:21:25] serviceops, Maps, Product-Infrastructure-Team-Backlog-Deprecated, Patch-For-Review, User-jijiki: Maps 2.0 roll-out plan - https://phabricator.wikimedia.org/T280767 (MoritzMuehlenhoff)
[13:22:06] serviceops, Maps, Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), and 2 others: Disable unused services on maps nodes - https://phabricator.wikimedia.org/T298246 (MoritzMuehlenhoff) Resolved→Open I think we should still remove Cassandra and Java packages on...
[14:49:14] serviceops, ChangeProp, SRE-Sprint-Week-Sustainability-March2023, Kubernetes, Sustainability (Incident Followup): Raise an alarm on container restarts/OOMs in kubernetes - https://phabricator.wikimedia.org/T256256 (akosiaris) a: akosiaris
[14:50:21] <_joe_> hnowlan: do you want to try and deploy your change?
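A rough sketch of checking the namespace quota headroom that the replica numbers above keep bumping into; the namespace name is a guess for illustration, not a verified value.

# Compare "Used" vs "Hard" before raising replicas, so new pods are not rejected.
kubectl -n changeprop-jobqueue describe resourcequota
kubectl -n changeprop-jobqueue get deployments
# If a rollout would exceed the hard limit, either lower replicas for the duration
# of the upgrade (as done above) or raise the quota in deployment-charts first.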
[14:51:04] _joe_: sure
[14:58:12] _joe_: this will apply your change, but I assume that's okay
[14:58:16] serviceops, Kubernetes: Set scaling_governor to performance for wikikube workers - https://phabricator.wikimedia.org/T332788 (Joe)
[14:58:25] <_joe_> hnowlan: yes, I was waiting for you :)
[15:00:24] _joe_: done
[15:00:54] serviceops, Kubernetes: Set scaling_governor to performance for wikikube workers - https://phabricator.wikimedia.org/T332788 (Joe) As an alternative, we should probably enable hardware P-states on most of those servers.
[15:00:55] <_joe_> hnowlan: thanks!
[15:19:35] hey folks, going to upgrade firmware etc. on kafka-main2004
[15:19:43] should be the last one in need of this
[15:24:17] serviceops, Maps, Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), and 2 others: Disable unused services on maps nodes - https://phabricator.wikimedia.org/T298246 (hnowlan) Cassandra and Java packages removed!
[15:25:54] sigh, we only did the mw hosts, never the k8s hosts, for the scaling governor?
[15:47:22] <_joe_> akosiaris: nope
[15:47:32] <_joe_> and I suspect our throttling issues are also caused by it
[16:07:51] serviceops, Release Pipeline (Blubber), Release-Engineering-Team (Priority Backlog 📥): Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push - https://phabricator.wikimedia.org/T322453 (dduvall) I'm finally circling back to this, and I ran another test yesterday...
[16:25:29] https://gerrit.wikimedia.org/r/c/operations/puppet/+/900645
[16:25:39] Re: k8s performance
[16:40:16] folks is it ok if I deploy changeprop?
[16:40:32] Cc: hnowlan: --^
[16:40:33] elukey: okay from me end
[16:40:36] *my end
[16:40:36] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Metrics changes with Kubernetes v1.23 - https://phabricator.wikimedia.org/T322919 (JMeybohm) a: JMeybohm
[16:40:51] super, I'll report when it is deployed :)
[16:41:54] hnowlan: ah I see that there is also the new rolling strategy, ok to deploy it, right?
[16:42:16] elukey: yep! That's our default so it shouldn't change anything
[16:57:57] serviceops, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Relax nodeAffinity of sessionstore - https://phabricator.wikimedia.org/T325139 (eoghan) We've deployed the change to relax the `nodeAffinity` setting, tomorrow morning we'll drain one of the nodes to test that t...
[17:20:37] serviceops, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398 (akosiaris) >>! In T320398#8711722, @Eevans wrote: > TL;DR Is there someone(s) —who isn't as close to this as I am— who has...
[17:26:03] serviceops, Kubernetes: Kubernetes hosts lack free space when pulling MediaWiki images - https://phabricator.wikimedia.org/T332803 (hashar)
[18:25:05] serviceops, Kubernetes: Kubernetes hosts lack free space when pulling MediaWiki images - https://phabricator.wikimedia.org/T332803 (Clement_Goubert) `kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet` are using devicemapper instead of overlay2 for some reason. I've cleaned up some spac...
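To make the scaling_governor thread above (T332788) concrete, this is roughly what the change means at the OS level; in production it would be managed by puppet rather than set by hand, and the paths assume the standard Linux cpufreq sysfs interface.

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor    # current governor, e.g. "powersave"
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Lower-power governors keep clock frequencies down under bursty load, which is
# what the conversation above suspects is contributing to the container throttling.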
[18:27:20] serviceops, Kubernetes: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 (Clement_Goubert) p: Triage→High
[19:11:35] serviceops, Kubernetes: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 (hashar) Thank you Clément for the quick assessment :]
[20:10:16] serviceops, Kubernetes: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 (dancy)
[20:34:55] serviceops, Kubernetes: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 (akosiaris) configuration looked fine, a drain, stop of docker, delete of /var/lib/docker and reboot fixed it for kubernetes10...
[20:48:24] serviceops, Kubernetes: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 (akosiaris) I have a theory about this. Role was applied to already imaged nodes, but without a full re-image. The end result...
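The remediation described in the last two comments maps roughly onto the steps below; the node name is a placeholder and WMF's actual drain/repool tooling involves more than what is shown here.

NODE=kubernetes1023.eqiad.wmnet                        # placeholder
ssh "$NODE" 'docker info --format "{{.Driver}}"'       # reports "devicemapper" before the fix
kubectl drain "$NODE" --ignore-daemonsets              # move pods off the node
ssh "$NODE" 'sudo systemctl stop docker && sudo rm -rf /var/lib/docker && sudo reboot'
# After the reboot docker recreates /var/lib/docker with the configured overlay2
# storage driver; verify with "docker info" again, then: kubectl uncordon "$NODE"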