[06:26:16] <_joe_> jayme: Ia5f55c5703e7f878640a61613af08a196ae17cbd broke sextant
[06:26:27] <_joe_> you re-introduced a file only in the vendor directory of flink
[06:26:30] <_joe_> not as a module
[06:33:07] <_joe_> jayme: as punishment, you need to go review https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/901767 and followups
[07:38:39] oh, yw :p
[07:58:02] _joe_: there are no chart version bumps in those changes. Is that on purpose?
[07:58:26] <_joe_> jayme: yes, they're all practical noops
[07:58:35] <_joe_> I don't think it warrants a deployment
[07:58:45] <_joe_> all changes are whitespace changes
[08:00:05] well...after what happened with the flink chart after upgrading mesh.configuration I'm a bit wary
[08:00:48] <_joe_> what happened?
[08:01:25] <_joe_> didn't we actually already upgrade it because it doesn't need a tls listener?
[08:01:28] see my comment on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/895765
[08:01:43] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/895765/1#message-c45355ff35344dcf085589858fb75f56bfe30818
[08:02:19] might be a thing with flink only, though
[08:03:19] <_joe_> that... is not a factual change?
[08:04:30] <_joe_> https://integration.wikimedia.org/ci/job/helm-lint/9771/consoleFull seems to agree btw, nothing should really change
[08:04:52] <_joe_> it's just whitespace changes I didn't find a way to avoid
[08:08:27] yeah, well...maybe it's just a thing in flink then. tbh I've no idea why helm acted that way; maybe it has to do with which template files the objects are rendered in
[08:08:51] I did not investigate further
[08:09:10] but it made flink undeployable for sure
[08:10:00] <_joe_> sigh
[08:10:06] <_joe_> ok we can try one :P
[08:10:16] and it looks very harmless in the CI diff as well https://integration.wikimedia.org/ci/job/helm-lint/9536/console
[08:11:02] <_joe_> I mean that is clearly a defect in help
[08:11:04] <_joe_> *helm
[08:11:10] the difference there is that in the flink diff some new object seems to appear: "+# Source: flink-session-cluster/templates/configmap.yaml"
[08:11:12] <_joe_> the configmap didn't change name
[08:11:23] yeah, sure
[08:11:29] <_joe_> ah yeah I was about to say
[08:11:37] <_joe_> that's the only thing I can see there
[08:12:08] still no reason for it to behave that way, but it could be an indicator that it has to do with how templates are organized in the chart
[08:13:10] <_joe_> it indicates a helm bug to me
[08:13:14] absolutely
[08:13:27] but I had no strong intention of figuring that out tbh
[08:13:56] but let's bump at least one of the charts to verify
[08:21:10] aah, wait
[08:21:19] maybe you sneaked in the fix actually
[08:21:27] hello folks, I am going to stop/upgrade/etc. kafka-main1004 today
[08:21:45] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/901679/4/charts/tegola-vector-tiles/templates/vendor/mesh/configuration_1.1.1.tpl lines 27-28
[08:24:15] this has maybe fixed the whitespace chomping, making sure the objects are properly separated
[08:37:26] jayme, hnowlan - o/ I was about to deploy changeprop in eqiad/codfw for a lift wing change, but I noticed a chart bump + some network policies dropped (likely due to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/893019). Is that expected? Mind taking a look, just to verify that those are the intended changes?
[09:05:49] <_joe_> elukey: I will look
[09:06:02] <_joe_> also, we might want to piggyback another change onto your release
[09:06:29] <_joe_> elukey: what's your change?
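A minimal sketch of the whitespace-chomping failure mode discussed above, using a throwaway chart; the chart and file names here are made up for illustration and are not the actual mesh.configuration template. The point is only that Helm separates rendered objects with a literal "---" line, and a trailing "-}}" trims the newline in front of that separator, so two objects can end up glued together.

helm create chomp-demo            # throwaway chart, only used for rendering
rm -rf chomp-demo/templates/*

cat > chomp-demo/templates/objects.yaml <<'EOF'
{{- if true }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: first
{{- end -}}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: second
EOF

# The "-}}" on the end action chomps the newline before "---", so the rendered
# output contains "name: first---" and the two ConfigMaps are no longer
# separate documents.
helm template demo ./chomp-demo

# Removing the right-hand chomp keeps "---" on its own line and the two objects
# render as properly separated documents again.
sed -i 's/{{- end -}}/{{- end }}/' chomp-demo/templates/objects.yaml
helm template demo ./chomp-demo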
[09:06:37] <_joe_> you didn't link it :P
[09:07:29] _joe_ I just merged it, it is not related to the network policy change
[09:07:42] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/886918
[09:08:00] <_joe_> elukey: ok, can you wait to deploy until I've merged my change too?
[09:08:00] but in the diff I found the network policy changes and didn't proceed
[09:08:11] sure sure, not really urgent
[09:08:17] I can deploy tomorrow as well
[09:08:41] <_joe_> can you paste the diff you're seeing?
[09:08:54] <_joe_> helmfile -e diff | phaste
[09:08:56] <_joe_> :)
[09:10:57] https://phabricator.wikimedia.org/P45909
[09:11:01] <_joe_> <3
[09:12:29] <_joe_> that definitely DOESN'T look right
[09:12:32] <_joe_> jayme: ^^
[09:13:18] <_joe_> jayme: I'm looking at that change of yours and it looks like a ticking time bomb tbh
[09:15:13] <_joe_> jayme: like, I picked zotero - it would fail to work anymore
[09:15:59] <_joe_> ah it's now in a global network policy
[09:16:46] <_joe_> elukey: I think it's safe to deploy, but let's wait for jayme to confirm
[09:16:59] yes yes
[09:22:33] <_joe_> elukey: what I mean is that now I do see the global network policies
[09:22:56] so in theory nothing changes if I remove those, ack
[09:22:58] BUT
[09:23:05] let's wait in any case, it is changeprop :D
[09:24:03] <_joe_> changeprop is the one thing that doesn't use envoy to reach restbase
[09:24:11] <_joe_> so it needs the global network policy :)
[09:24:38] <_joe_> oh btw, are you changing the chart or just the deployment?
[09:26:38] just the deployment, the chart bump was from Janis' patch IIUC
[09:26:40] <_joe_> in the latter case, we can work independently
[09:26:44] <_joe_> ack
[09:29:08] <_joe_> ok, I'm going afk for a bit while we wait for a response
[09:30:23] sorry, had to pick up my new passport
[09:31:29] yeah, those network policy changes should be fine. I only deployed all services in staging last week because of the sprint
[09:33:40] I did not come back to re-deploying everything in prod until now, but it should be safe
[09:34:22] I am fine with holding off the deployment until we want to consistently roll out the changes in all of prod, nothing really urgent
[09:35:48] tbh I was planning not to do all services at once but rather trickle them in
[09:36:00] no reason to hold off from my end
[09:38:20] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main1004.eqiad.wmnet with OS bullseye
[10:03:04] weird that those changes are showing up now - changeprop got redeployed on Monday
[10:16:07] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main1004.eqiad.wmnet with OS bullseye completed: - kafka-main1004 (**PASS**) - Downtimed on Icinga/Alertmanager - Dis...
[10:18:36] hnowlan: ok if I proceed with changeprop's deployment?
[10:20:04] <_joe_> hnowlan: and, can we also deploy -jobqueue?
[10:20:15] <_joe_> specifically - anything I should be mindful of?
[10:22:02] _joe_ kafka-main1004 reimaged, I'd like to fix idrac+nic+bios of kafka-main200[4,5] as well (need to stop kafka on them, but I'll not reimage right now). Is it ok if I do it?
[10:22:27] <_joe_> elukey: yes, ofc
[10:22:41] super thanks
[10:23:39] <_joe_> elukey: can you wait for changeprop to be deployed everywhere?
[10:23:50] <_joe_> (the jobqueue version)
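For reference, the diff-and-paste step suggested above looks roughly like this on a deployment host; the service directory is an assumption, and phaste is the WMF helper that uploads stdin as a Phabricator paste (P45909 above is such a paste).

cd /srv/deployment-charts/helmfile.d/services/changeprop   # path is an assumption
helmfile -e eqiad diff | phaste                            # upload the rendered diff for review
# Check the pasted diff for surprises (e.g. NetworkPolicy objects being dropped)
# before running: helmfile -e eqiad apply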
[10:24:14] _joe_ already stopped kafka on 2005, I can bring it back if you want
[10:24:40] it should take around 10/15 mins max though
[10:25:01] <_joe_> no it's ok
[10:25:14] <_joe_> you just won a changeprop-jobqueue deployment
[10:25:19] <_joe_> :D
[10:26:17] I thought that reimaging both kafka main clusters was the right punishment for my sins :D
[10:28:37] serviceops, SRE-Sprint-Week-Sustainability-March2023, Traffic, envoy, and 2 others: Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Joe) a: Joe
[10:36:26] yep, all good for both
[10:37:54] for jobqueue it's probably best to manually bump replicas to 29 on deploy as it's hitting our limits
[10:38:01] if you're willing to wait a few minutes I can get a fix in for that
[10:40:28] <_joe_> hnowlan: sure, sync with elukey, he's deploying :D
[10:40:41] <_joe_> jokes aside, sure, go on and you can test your changes with my patch :)
[10:53:08] fwiw, on Monday I pinned the chart version in helmfile.yaml to the one already deployed (0.10.21 IIRC) and lowered replicas to 20 to avoid hitting the resource-quota limits
[10:53:21] then after the upgrade was done, went back to 30 replicas
[10:56:13] ahh ok, that explains the diff
[10:56:35] I've pushed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/902048 to fix the replica messiness
[11:29:08] should we just bump the resource quota limits for that namespace btw?
[11:30:27] <_joe_> +1
[11:31:05] We can, but I don't think it's necessary for now. It's pretty overprovisioned
[11:31:12] kafka-main2005 up and running with new firmware etc., kafka recovered nicely
[11:31:30] going afk for a couple of hours, will keep going later on
[12:19:47] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/902067
[12:56:21] serviceops, SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, Wikimedia-Incident: Add etcdmirror connection retry on etcd-tls-proxy unavailability - https://phabricator.wikimedia.org/T317535 (Volans) a: Volans
[12:58:52] <_joe_> jayme: bear with me - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/901679 now has a different sequence of patches
[12:59:08] <_joe_> I realized what the bug was indeed and I think we can just remove base.kubernetes directly
[13:00:43] _joe_: was it the whitespace chomping as I suggested?
[13:00:53] <_joe_> yes
[13:01:28] feels like we need some way to test all if-guards in modules
[13:01:42] <_joe_> you mean test coverage
[13:01:52] * _joe_ mumbles again something about writing actual code
[13:21:25] serviceops, Maps, Product-Infrastructure-Team-Backlog-Deprecated, Patch-For-Review, User-jijiki: Maps 2.0 roll-out plan - https://phabricator.wikimedia.org/T280767 (MoritzMuehlenhoff)
[13:22:06] serviceops, Maps, Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), and 2 others: Disable unused services on maps nodes - https://phabricator.wikimedia.org/T298246 (MoritzMuehlenhoff) Resolved→Open I think we should still remove Cassandra and Java packages on...
[14:49:14] serviceops, ChangeProp, SRE-Sprint-Week-Sustainability-March2023, Kubernetes, Sustainability (Incident Followup): Raise an alarm on container restarts/OOMs in kubernetes - https://phabricator.wikimedia.org/T256256 (akosiaris) a: akosiaris
[14:50:21] <_joe_> hnowlan: do you want to try and deploy your change?
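A rough sketch of checking the namespace quota headroom that the replica numbers above keep bumping into; the namespace name is a guess for illustration, not a verified value.

# Compare "Used" vs "Hard" before raising replicas, so new pods are not rejected.
kubectl -n changeprop-jobqueue describe resourcequota
kubectl -n changeprop-jobqueue get deployments
# If a rollout would exceed the hard limit, either lower replicas for the duration
# of the upgrade (as done above) or raise the quota in deployment-charts first.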
[14:51:04] _joe_: sure
[14:58:12] _joe_: this will apply your change, but I assume that's okay
[14:58:16] serviceops, Kubernetes: Set scaling_governor to performance for wikikube workers - https://phabricator.wikimedia.org/T332788 (Joe)
[14:58:25] <_joe_> hnowlan: yes, I was waiting for you :)
[15:00:24] _joe_: done
[15:00:54] serviceops, Kubernetes: Set scaling_governor to performance for wikikube workers - https://phabricator.wikimedia.org/T332788 (Joe) As an alternative, we should probably enable hardware P-states on most of those servers.
[15:00:55] <_joe_> hnowlan: thanks!
[15:19:35] hey folks, going to upgrade firmware etc. on kafka-main2004
[15:19:43] should be the last one in need of this
[15:24:17] serviceops, Maps, Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), and 2 others: Disable unused services on maps nodes - https://phabricator.wikimedia.org/T298246 (hnowlan) Cassandra and Java packages removed!
[15:25:54] sigh, we only did the mw hosts, never the k8s hosts, for the scaling governor?
[15:47:22] <_joe_> akosiaris: nope
[15:47:32] <_joe_> and I suspect our throttling issues are also caused by it
[16:07:51] serviceops, Release Pipeline (Blubber), Release-Engineering-Team (Priority Backlog 📥): Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push - https://phabricator.wikimedia.org/T322453 (dduvall) I'm finally circling back to this, and I ran another test yesterday...
[16:25:29] https://gerrit.wikimedia.org/r/c/operations/puppet/+/900645
[16:25:39] Re: k8s performance
[16:40:16] folks is it ok if I deploy changeprop?
[16:40:32] Cc: hnowlan: --^
[16:40:33] elukey: okay from me end
[16:40:36] *my end
[16:40:36] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Metrics changes with Kubernetes v1.23 - https://phabricator.wikimedia.org/T322919 (JMeybohm) a: JMeybohm
[16:40:51] super, I'll report when it is deployed :)
[16:41:54] hnowlan: ah I see that there is also the new rolling strategy, ok to deploy it, right?
[16:42:16] elukey: yep! That's our default so it shouldn't change anything
[16:57:57] serviceops, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Relax nodeAffinity of sessionstore - https://phabricator.wikimedia.org/T325139 (eoghan) We've deployed the change to relax the `nodeAffinity` setting, tomorrow morning we'll drain one of the nodes to test that t...
[17:20:37] serviceops, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398 (akosiaris) >>! In T320398#8711722, @Eevans wrote: > TL;DR Is there someone(s) —who isn't as close to this as I am— who has...
[17:26:03] serviceops, Kubernetes: Kubernetes hosts lack free space when pulling MediaWiki images - https://phabricator.wikimedia.org/T332803 (hashar)
[18:25:05] serviceops, Kubernetes: Kubernetes hosts lack free space when pulling MediaWiki images - https://phabricator.wikimedia.org/T332803 (Clement_Goubert) `kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet` are using devicemapper instead of overlay2 for some reason. I've cleaned up some spac...
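To make the scaling_governor thread above (T332788) concrete, this is roughly what the change means at the OS level; in production it would be managed by puppet rather than set by hand, and the paths assume the standard Linux cpufreq sysfs interface.

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor    # current governor, e.g. "powersave"
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Lower-power governors keep clock frequencies down under bursty load, which is
# what the conversation above suspects is contributing to the container throttling.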
[18:27:20] serviceops, Kubernetes: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 (Clement_Goubert) p: Triage→High
[19:11:35] serviceops, Kubernetes: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 (hashar) Thank you Clément for the quick assessment :]
[20:10:16] serviceops, Kubernetes: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 (dancy)
[20:34:55] serviceops, Kubernetes: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 (akosiaris) configuration looked fine, a drain, stop of docker, delete of /var/lib/docker and reboot fixed it for kubernetes10...
[20:48:24] serviceops, Kubernetes: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 (akosiaris) I have a theory about this. Role was applied to already imaged nodes, but without a full re-image. The end result...
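The remediation described in the last two comments maps roughly onto the steps below; the node name is a placeholder and WMF's actual drain/repool tooling involves more than what is shown here.

NODE=kubernetes1023.eqiad.wmnet                        # placeholder
ssh "$NODE" 'docker info --format "{{.Driver}}"'       # reports "devicemapper" before the fix
kubectl drain "$NODE" --ignore-daemonsets              # move pods off the node
ssh "$NODE" 'sudo systemctl stop docker && sudo rm -rf /var/lib/docker && sudo reboot'
# After the reboot docker recreates /var/lib/docker with the configured overlay2
# storage driver; verify with "docker info" again, then: kubectl uncordon "$NODE"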