[07:01:25] <_joe_> akosiaris: I just realized this isn't the first time it happens [07:01:26] 10serviceops, 10Kubernetes: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 (10Joe) >>! In T332803#8719449, @akosiaris wrote: > I have a theory about this. Role was applied to already imaged nodes, but wi... [07:03:50] 10serviceops, 10Kubernetes: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 (10elukey) Alex I think that your theory is correct, I definitely applied the role without rebooting IIRC, didn't check the dock... [07:32:34] hi folks, I see that kubernetes1024 and 20[23,24] are still on device mapper, I can fix them if you are ok (one at a time) [07:32:53] the codfw ones are probably my bad, we added them in an emergency and I didn't check docker info after applying the role [07:33:04] (the procedure is very simple, I've done it in the past) [07:33:58] _joe_ --^ [07:34:22] <_joe_> elukey: if you want to, go ahead :) [07:34:36] ack [07:34:46] <_joe_> basically it amounts to cordon the nodes, stop the kubelet, stop docker, rm /var/lib/docker [07:35:06] <_joe_> (not the dir, just the contents) [07:35:36] <_joe_> elukey: we can take care of this anyways, I prefer if you work on dismissing ORES :) [07:36:08] yes yes I've done it a lot of times when we were testing the move to overlay, I feel bad for the two codfw nodes since I added them [07:36:27] also if Alex wakes up with the work done he may hate me less when it comes to Redis [07:38:43] 10serviceops, 10SRE-Sprint-Week-Sustainability-March2023, 10TimedMediaHandler-Transcode, 10WMF-JobQueue, 10Sustainability (Incident Followup): Add rate limiting to the jobqueue vidoscalers to prevent overloads - https://phabricator.wikimedia.org/T278945 (10Joe) 05Open→03Resolved [07:47:15] <_joe_> elukey: btw I'm not sure
we need to upgrade the redis thing [07:48:09] I'd be happy if we don't have to upgrade :) [07:48:23] (kubernetes1024 done, proceeding with kubernetes202[34] as well) [08:01:32] 2023 done [08:02:10] going to wait a bit for alerts to stabilize etc. and I'll do 2024 (the last one) [08:15:02] <_joe_> elukey: <3 [08:26:44] 2024 in service :) [08:32:54] 10serviceops, 10Kubernetes: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 (10elukey) All nodes should be running overlay now :) Before closing - should we add an alarm for this use case? [08:34:55] * elukey commute, back in a bit [08:45:07] 10serviceops, 10Kubernetes: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 (10akosiaris) >>! In T332803#8720439, @elukey wrote: > All nodes should be running overlay now :) Awesome thanks! Happy that m... [09:10:47] 10serviceops, 10Kubernetes: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 (10JMeybohm) >>! In T332803#8720455, @akosiaris wrote: > It's quite possibly better to just force `overlay2` in our config. If t...
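The per-node conversion procedure _joe_ outlines above (cordon, stop the kubelet, stop docker, clear the contents of /var/lib/docker, restart) can be sketched roughly as below. The host name, the ssh/sudo wiring, and the drain flags are illustrative assumptions, not the exact tooling used; `DRY_RUN=1` prints the commands instead of executing them.

```shell
set -u

run() {  # print instead of executing when DRY_RUN=1
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

convert_node() {
  local node="$1"
  # Keep new pods off the node and evict the running ones.
  run kubectl cordon "$node"
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  # Stop the container runtime stack.
  run ssh "$node" sudo systemctl stop kubelet
  run ssh "$node" sudo systemctl stop docker
  # Remove the *contents* of /var/lib/docker, not the directory itself;
  # Docker recreates its layout with the configured storage driver.
  run ssh "$node" sudo find /var/lib/docker -mindepth 1 -delete
  # Bring everything back and verify the storage driver before
  # putting the node back in service.
  run ssh "$node" sudo systemctl start docker
  run ssh "$node" sudo systemctl start kubelet
  run ssh "$node" sudo docker info --format '{{.Driver}}'
  run kubectl uncordon "$node"
}
```

The `docker info --format '{{.Driver}}'` check at the end is the same verification elukey mentions skipping when the codfw nodes were added in a hurry.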
[09:20:17] 10serviceops, 10Kubernetes: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 (10elukey) Ack makes sense, let's do it :) [09:57:02] folks I am going to reimage kafka-main2005 [10:01:32] 10serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main2005.codfw.wmnet with OS bullseye [10:33:46] 10serviceops, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Priority Backlog 📥): Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push - https://phabricator.wikimedia.org/T322453 (10JMeybohm) We don't have access to the actual requests obviously, but the log... [10:38:20] 10serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main2005.codfw.wmnet with OS bullseye completed: - kafka-main2005 (**PASS**) - Downtimed on Icinga/Alertmanager - Dis... [10:39:35] kafka-main2005 on bullseye [10:39:38] 15 [10:46:05] folks I see instance 6380 on rdb1011 with memory full, netstat points to ores nodes and the pair 2 doc suggests that it is related to ores queue [10:46:37] other alerts have been downtimed with "ORES cache redises. They're expected to be full" [10:46:40] should I do the same? 
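Forcing `overlay2` in the config, as agreed in the task above, amounts to pinning the storage driver in Docker's daemon configuration. At WMF this would be managed via puppet/hiera rather than edited by hand; the equivalent standalone /etc/docker/daemon.json fragment is just:

```json
{
  "storage-driver": "overlay2"
}
```

With this set, dockerd fails to start rather than silently falling back to devicemapper, which addresses the "should we add an alarm for this use case?" concern at the source.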
[10:46:53] _joe_ --^ (IIUC it was you adding the downtimes) [10:47:10] <_joe_> elukey: yes, those are for queues [10:47:16] <_joe_> I should've added the downtime [10:47:20] <_joe_> thanks [10:47:29] <_joe_> elukey: you have 4 months to replace ORES btw [10:47:57] <_joe_> elukey: wait, actually the queue redises being full *is a problem* [10:51:09] _joe_ from the graphs it seems stable during the past month, not sure if ORES removes the task once it gets completed [10:51:20] <_joe_> amazing [10:51:45] yes definitely weird [10:53:32] <_joe_> anyways, downtime them [10:57:36] done [10:57:41] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 (10JMeybohm) [11:03:18] jayme: o/ [11:03:27] the only hosts using device mapper seem to be alert[1001,2001].wikimedia.org,cloudweb2002-dev.wikimedia.org afaics [11:03:37] just saw your comment [11:06:44] going to reimage kafka-main2004 as well :) [11:07:55] elukey: maybe merge the config change for k8s now (to be on the safe side) and open an issue tracking the migration of the remaining devicemapper hosts? I'm not sure how complicated it is to migrate alert hosts tbh [11:08:46] 10serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main2004.codfw.wmnet with OS bullseye [11:09:34] jayme: yes yes I +1ed your change, go ahead. I spoke with Filippo and it should be easy to do next week [11:12:43] well, in that case we can also wait I suppose.
There won't be any new nodes coming next week [11:13:20] and then we don't have a "useless" hiera key that needs to be removed again :) [11:20:54] I'll ask around regarding cloudweb [11:52:31] 10serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main2004.codfw.wmnet with OS bullseye completed: - kafka-main2004 (**PASS**) - Downtimed on Icinga/Alertmanager - Dis... [11:53:55] kafka-main2004 back in service [11:54:14] 4 nodes out of 10 moved to bullseye (the sneaky ones, in need of firmware upgrades etc..) [11:54:18] the others should be easier [11:54:20] in theory [11:59:41] 10serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (10elukey) Status: * kafka-main[12]00[45] upgraded to Bullseye. Next steps: * Reimage kafka-main[12]00[1-3] The above nodes should already have a working reuse recipe (different from the more recent nodes [12]00[45]), bu... [12:07:13] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (10JMeybohm) [12:07:31] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (10JMeybohm) [12:07:59] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Metrics changes with Kubernetes v1.23 - https://phabricator.wikimedia.org/T322919 (10JMeybohm) 05Open→03Resolved Cleaned up the dashboards and alerts from k8s 1.16 metrics [15:08:59] If anyone would be available to review this, I'd be grateful. https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/902391 [15:09:29] It relates to the spark-operator and the efforts to get it working with Kerberos/HDFS/Hive. 
Thanks. [15:16:31] I can do next week but it does not immediately make sense as deploy users do have rights for secret objects in their namespace [15:17:09] maybe a problem description/phab task would help :) [15:17:15] OK, thanks. Will double-check. [15:19:28] 10serviceops, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10colewhite) [15:31:47] jayme: Ah, I know why it is. This deploy user only has rights on SparkApplications and ScheduledSparkApplications and their status. I agree, more docs would be helpful. [16:03:38] folks I am going to stop kafka on kafka-main2002 and reimage it to bullseye [16:06:34] 10serviceops, 10SRE: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (10jhathaway) a:03jhathaway [16:07:29] 10serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main2002.codfw.wmnet with OS bullseye [16:17:05] 10serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main2002.codfw.wmnet with OS bullseye executed with errors: - kafka-main2002 (**FAIL**) - Downtimed on Icinga/Alertmana... [16:17:29] going to bring kafka-main2002 back up, dhcp failures.. [16:17:52] probably all the other nodes (not only the newer ones) need idrac+nic+bios upgrades [16:17:55] sigh [16:18:51] jayme: I would like to perform the bullseye upgrade on our chartmuseum servers, if you think that is ill-advised, or needs more careful planning, please let me know. [16:20:03] jhathaway: Thank you! chartmuseum is a go binary so you should be fine. Also the servers are active/active so you can depool and break one :) [16:20:45] jayme: sounds good, thanks!
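The permissions jayme and the deployer settle on above (rights on SparkApplications, ScheduledSparkApplications, and their status, but not on Secrets) would correspond to a namespaced RBAC Role along these lines. The role/namespace names are illustrative assumptions; the `sparkoperator.k8s.io` API group is the one used by the spark-on-k8s-operator CRDs.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deploy-spark          # assumed name
  namespace: spark            # assumed namespace
rules:
  - apiGroups: ["sparkoperator.k8s.io"]
    resources:
      - sparkapplications
      - scheduledsparkapplications
      - sparkapplications/status
      - scheduledsparkapplications/status
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
# Note: no rule for "" (core) apiGroup / secrets, which is why the
# deploy user cannot read Secret objects in the namespace.
```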
[16:21:09] the debian package is built by us, though. I'm not 100% sure about the apt repo structure [16:21:18] you might need to copy the deb to bullseye [16:21:33] "should work" ;) [16:22:53] 10serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (10elukey) Attempted kafka-main2002, but I got the same dhcp issues that I found on the newer nodes. This means that we probably need to update idrac+nic+bios on all nodes before being able to reimage them.. [16:24:18] jhathaway: possibly https://wikitech.wikimedia.org/wiki/Reprepro#Copying_between_distributions then would work [16:24:54] ah, heh, it even says "e.g. for Go packages" [16:25:19] jayme & mutante thanks I might just do that, or if it is trivial, I'll just rebuild [18:46:31] 10serviceops, 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Wikimedia-Incident: Add etcdmirror connection retry on etcd-tls-proxy unavailability - https://phabricator.wikimedia.org/T317535 (10Volans) a:05Volans→03None
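For reference, the reprepro copy suggested for the chartmuseum .deb takes the form `reprepro copy <target-distribution> <source-distribution> <package>`. The sketch below only builds the command string (reprepro has to run on the apt repository host itself); the `<codename>-wikimedia` distribution names are assumptions to be checked against the wikitech page linked above.

```shell
# Hedged sketch: build the reprepro invocation for promoting a
# locally-built package from buster to bullseye.
reprepro_copy_cmd() {
  local pkg="$1"
  # reprepro copy <target-distribution> <source-distribution> <package>
  echo "reprepro copy bullseye-wikimedia buster-wikimedia ${pkg}"
}

# Print the command rather than executing it.
reprepro_copy_cmd chartmuseum
```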