[09:18:10] there are some alerts about nova-api being down, looking
[09:41:43] FYI, I'm looking at the runc update. Anyone want to collaborate on that?
[09:47:24] hmm, nova-api on 1005 is having issues connecting to rabbit it seems
[09:47:48] https://www.irccloud.com/pastebin/ODXaWL8X/
[09:48:18] arturo: I'm around if you need anything, should be 'drain - upgrade - undrain' right?
[09:48:41] dcaro: I don't know yet, I'm researching
[09:48:53] ack
[09:51:18] is the old `projectadmin` openstack role now `admin`?
[09:51:26] I need to add myself to toolsbeta
[09:52:47] I think it's `member` no?
[09:52:58] might have changed
[09:53:10] dcaro, arturo: running pods should not be impacted, so upgrading in general and then drain/undraining is also an option
[09:53:21] but best to doublecheck with a single node first
[09:53:29] moritzm: ACK
[09:54:09] docker still needs to be restarted though no?
[10:08:14] how do I add myself to the toolsbeta.admin unix group? LDAP update by hand? This can't be done via striker, no?
[10:14:14] hmmm, I guess so, I am part of toolsbeta.admin
[10:14:39] ack
[10:15:13] hmpf.... I started a tcpdump on cloudrabbit to debug the connection issues, but restarting nova-api this time seems not to fail anymore :p
[10:15:13] I'm not familiar with the details of the toolforge container setup, but docker (as packaged in Debian; most probably not the legacy docker-ce packages from Docker Inc we use on buster for some setups) depends on runc, as such anything started via docker will most certainly also need a restart to fully effect the rollout
[10:16:21] ack thanks, we also use runc with containerd instead of docker on a few nodes (bookworm), that'd be similar I guess
[10:17:19] ha! just said that there's no errors and the errors start :)
[10:19:04] I think I found how to add myself to toolsbeta.admin
[10:19:25] arturo: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin
[10:19:45] there's a note about toolsbeta.admin
[10:20:23] dhinus: yeah
[10:20:42] dhinus: I was mostly reading https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolsbeta#ssh which lacks this info
[10:22:21] I still cannot ssh :-(
[10:22:36] oh, it worked now!
[10:22:46] some cache expired I guess
[10:22:52] caches I guess yep
[10:25:24] ok, now back to the runc update
[10:26:32] between cloudcontrol and cloudrabbit.private, the router is the cloudsw? (as in that's the hop linking the networks), is there a chance that the switch is terminating connections?
[10:30:31] hmm, I see a lot of retransmissions/duplicated acks on the communication between cloudrabbit1003<->cloudcontrol1005
[10:32:10] https://usercontent.irccloud-cdn.com/file/Id1XgHDx/image.png
[10:32:21] I'll open a task
[10:33:10] dcaro: yes, per the IP addresses, they are connecting using cloud-private, thus cloudsw
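(a minimal sketch of how a capture like the one above can be taken and checked for those retransmissions/duplicate acks; it assumes the rabbit traffic uses the standard AMQP port 5672, and the host and file names are placeholders)

    # on the cloudcontrol side; replace the placeholder with the cloudrabbit address
    sudo tcpdump -i any -nn -w /tmp/rabbit.pcap 'host <cloudrabbit-addr> and port 5672'
    # then look for the TCP-level anomalies in the capture
    tshark -r /tmp/rabbit.pcap -Y 'tcp.analysis.retransmission or tcp.analysis.duplicate_ack'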
[10:37:45] T356621
[10:37:46] T356621: [nova-api,cloudrabbit] Connectivity issues from all cloudcontrols to all cloudrabbit nodes - https://phabricator.wikimedia.org/T356621
[10:37:58] it seems I'm missing phabricator permissions to update the task tags
[10:38:11] https://usercontent.irccloud-cdn.com/file/FP94MbSP/image.png
[10:39:08] oh, the task has nothing special no?
[10:39:15] arturo: the "projectadmin" role is now "member", not "admin". I've updated https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Projects_lifecycle#Add_access
[10:39:48] dhinus: ok, thanks
[10:40:02] I can edit that phab task, not sure where the "custom policy" is coming from
[10:40:15] dcaro: I don't think so. I bet my phab account is still in a bit of a weird state
[10:40:23] arturo: I've added you to NDA
[10:41:05] can you try again?
[10:41:17] still no
[10:41:45] hmm, maybe logout and login?
[10:41:56] can you see yourself in https://phabricator.wikimedia.org/project/members/974/ ?
[10:42:03] oh wow, it allowed me when I clicked subscribe
[10:42:25] and if I unsubscribe, then I cannot edit the task again
[10:44:37] I added you to a couple more groups I'm part of and you are not (security_sre, trusted contributors)
[10:44:45] thanks
[10:45:11] FYI I'm about to update containerd.io and related packages on apt1001
[10:45:21] (the paste by t.aavi https://phabricator.wikimedia.org/P56134 )
[10:57:05] given each k8s worker node runs a number of system containers, maybe the most elegant way to restart all of them, and to make sure the runc update is applied, is to reboot the worker node itself. What do you think?
[11:00:54] sgtm!
[11:10:24] docker has a jump from 23 -> 25, there are some differences and backward-incompatible changes it seems, can we do it one-by-one? (and make sure the containers run ok on the new ones)
[11:19:04] hmm, the retransmissions are like 0.000002 seconds after the original packet :/, that's weird
[11:24:13] dcaro: I can't find any reference to the backward-incompatible change, do you have a link?
[11:25:15] https://github.com/moby/moby/issues/47215 maybe this?
[11:28:00] https://docs.docker.com/engine/release-notes/24.0/#deprecated and https://docs.docker.com/engine/release-notes/25.0/#deprecated
[11:29:20] there's a storage driver (overlay) that we don't use (we use overlay2 iirc), and some other minor stuff, that I don't think will break anything but I'm not very familiar with how k8s uses docker
[11:29:38] https://github.com/moby/moby/issues/47215 this is definitely concerning
[11:30:08] the kubelet can't work with docker v25 and there is no fix https://github.com/kubernetes/kubelet/issues/49
[11:31:20] that's not good :/
[11:31:32] I will try to limit the repository update
[11:32:47] that does stop us from upgrading no?
[11:33:29] I will check if containerd requires docker v25
[11:33:31] (we can try on a worker, see if we hit the same issue, but seems likely)
[11:33:40] aaahhh, okok, that sounds good yes
[11:34:46] created T356629 to track this, anyway
[11:34:46] T356629: kubelet: cannot work with docker >= v25 - https://phabricator.wikimedia.org/T356629
[11:36:04] I don't think they depend on each other, so I'll try to pin the docker version in the apt repo
[11:36:05] hmm, the issue seems not to happen on all k8s distributions though "This is still happening with Minikube. I was seeing it previously with k3d but since I updated again to version 25, k3d appears to be working."
[11:38:00] changing subject: I see the retransmissions for incoming packets on both sides, I think the router (cloudsw) might be duplicating traffic, could that be? (maybe some routing+switching thing?)
[11:38:47] dcaro: yes, that is definitely possible, but I cannot explain why it was working... last week
[11:39:04] it kind of works, it just starts bouncing around the connection
[11:39:20] it could be some asymmetric routing, and stuff like that
[11:39:25] I will be happy to debug later
[11:39:35] ack
[11:39:38] also, maybe ping cathal if you want it sorted earlier
[11:41:39] https://gerrit.wikimedia.org/r/c/operations/puppet/+/997345 <-- prevent docker v25 from slipping in for this round of runc updates
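(that patch limits the version published in the apt repo itself; purely as a hypothetical alternative, a host-level apt pin could achieve something similar, assuming the Docker Inc package names and version scheme)

    # /etc/apt/preferences.d/99-docker-pin  (hypothetical, not what the patch above does)
    # docker-ce versions from the Docker Inc repo carry a "5:" epoch, e.g. 5:23.0.6-...
    Package: docker-ce docker-ce-cli
    Pin: version 5:23.*
    Pin-Priority: 1001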
[11:48:59] dhinus: https://gerrit.wikimedia.org/r/997353
[11:49:22] dcaro: good catch with the upgrade ^^^
[11:49:23] hmm see my next comment, maybe we need v24
[11:50:21] which comment?
[11:50:48] ah, it's in my gerrit draft :D
[11:51:07] v23 does not have a patched version with the updated deps
[11:51:21] but maybe we can install the containerd dep manually?
[11:51:21] ok, so v24 then?
[11:52:02] I'm not sure what's better, if going with v24, or sticking on v23 and manually upgrading the containerd package
[11:52:22] I don't know if there is some dep version constraint
[11:52:50] docker depends on containerd like this
[11:53:04] Depends: containerd.io (>= 1.6.4)
[11:53:08] (it's pretty loose)
[11:53:25] containerd.io does not depend on docker:
[11:53:26] Depends: libc6 (>= 2.14), libseccomp2 (>= 2.3.0)
[11:53:39] then upgrading containerd only should work
[11:53:41] so I think if we keep docker v23 we will be fine upgrading containerd
[11:53:42] and we can stick on docker v23
[11:53:45] nice
[11:54:01] I think we can abandon https://gerrit.wikimedia.org/r/997353
[11:54:13] sgtm
[11:54:15] err.. no, actually, apply it
[11:54:28] abandon the other one :D
[11:54:39] the other is already merged heh
[11:54:43] LOL
[11:54:58] merging this one now
[11:59:19] ok, successfully downgraded the docker package _in the repo_ to v23, so it is no longer a pending update
[12:00:11] well, it's just a minor point release:
[12:00:14] https://www.irccloud.com/pastebin/nYcfr79F/
[12:01:05] be back in a bit, if you want to continue from here, feel free
[12:10:38] upgrading from 23.0.3 to 23.0.6 seems fine
[12:24:43] * dcaro going for lunch, things got delayed, will be late for the collab
[14:00:31] I have been testing the update in toolsbeta, I think we can do:
[14:01:04] * apt-get update
[14:01:04] * apt-get install containerd.io
[14:01:04] * reboot
[14:01:17] I don't think we need the depool/repool dancing
[14:01:30] kubernetes will just detect the kubelet reboot and handle the containers accordingly
[14:07:18] this is for buster
[14:07:22] for bookworm nodes
[14:07:46] both packages (runc, containerd) come from the debian archive
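(putting the pieces together, a per-node sketch of that procedure; the node name is a placeholder and the drain/uncordon steps are the optional, more cautious variant mentioned earlier in the morning)

    # optional: move the pods away first instead of relying on the reboot alone
    kubectl drain <worker-node> --ignore-daemonsets --delete-emptydir-data

    # on the worker itself
    sudo apt-get update
    sudo apt-get install containerd.io        # buster: runc comes bundled via the Docker Inc packaging
    # sudo apt-get install runc containerd    # bookworm: both packages come from the Debian archive
    sudo reboot

    # once it is back, confirm the new runc and re-enable scheduling (if it was drained)
    runc --version
    kubectl uncordon <worker-node>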
[14:09:32] * arturo food time
[15:27:49] ssh-key-ldap-lookup doesn't work on bookworm?
[15:27:57] are you able to ssh into bookworm VMs?
[15:32:16] can you ssh as your user to `toolsbeta-test-k8s-worker-10` ?
[15:41:29] the file /etc/block-ldap-key-lookup is present on toolsbeta-test-k8s-worker-10, that's why it fails
[15:41:44] content:
[15:41:45] https://www.irccloud.com/pastebin/VFLJW8ib/
[15:42:55] there was a bug where cookbooks run from cloudcumins blocked the first puppet run from completing and that file thus never got deleted, but I thought it was fixed already
[15:43:33] ah, https://gerrit.wikimedia.org/r/c/operations/puppet/+/992677 still needs to be merged and a new image to be built with that included
[15:51:17] it is not 100% clear to me why we need all these safeguards, but ok
[15:56:17] Can everyone live without codfw1dev today? I'm seized with the urge to move designate to cloudcontrols which will probably not work perfectly on my first attempt. First attempt: https://gerrit.wikimedia.org/r/c/operations/puppet/+/995369
[16:06:32] andrewbogott: go for it! you're missing the cloudlb config changes at least
[16:48:41] last reminder to update the outgoing notices for the SRE meeting
[16:52:09] blocking file: /etc/block-ldap-key-lookup
[17:01:04] arturo: no promises that this works but this seems promising
[17:01:06] https://www.irccloud.com/pastebin/tekYy8ij/
[17:02:35] andrewbogott: ACK
[17:14:49] and it should be possible to run that from cloudcumin1001 where you can start a screen/tmux session and just let it run for hours
[17:16:46] dhinus: ok, since I haven't ever used that before, I will probably ping you tomorrow for first-time instructions
[17:17:55] no prob. it should be straightforward but let me know
[17:18:12] 👍
[17:19:57] basic instructions are here: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Cookbooks#Running_a_cookbook_from_a_cloudcumin_host
[17:20:08] they don't mention screen/tmux though, I should probably add a section about it
[17:22:25] ok, will follow up tomorrow
[17:22:33] going offline now!
[17:22:35] * arturo off
[17:45:40] I've added some screen/tmux examples to that wiki page
[17:48:56] What kind of odd being is set_proxy?
[17:48:58] https://www.irccloud.com/pastebin/jKDJ47Ef/
[17:52:07] shell function probably
[17:53:28] yeah, https://phabricator.wikimedia.org/P56254
[17:53:34] ah yeah, here it is
[17:53:47] I was hoping it would already have logic for excluding certain URLs but... nope
[17:55:04] andrewbogott: https://wikitech.wikimedia.org/wiki/HTTP_proxy#Manual_config might be helpful
[17:56:26] yep!
[17:59:24] The profile::environment puppet module looks like it would try to set a reasonable default for $no_proxy.
[18:00:01] NO_PROXY="wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wmnet,127.0.0.1,::1" for me on mwmaint2002
[18:05:14] * dcaro off
[18:21:36] that works but oddly not in combination with set_proxy
[19:10:05] * bd808 lunch
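(a hypothetical shell helper in the spirit of the set_proxy discussed above, not the actual function from P56254; the proxy URL is a placeholder and the exclusion list is the one pasted from mwmaint2002)

    set_proxy() {
        local proxy="http://<site-webproxy>:8080"   # placeholder, use the proxy from the wikitech HTTP_proxy page
        local noproxy="wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wmnet,127.0.0.1,::1"
        # export both spellings, since different tools honour different ones
        export http_proxy="$proxy" https_proxy="$proxy" HTTP_PROXY="$proxy" HTTPS_PROXY="$proxy"
        export no_proxy="$noproxy" NO_PROXY="$noproxy"
    }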