[07:52:45] good morning!
[07:53:11] for some reason that I don't understand, docker on bullseye nodes uses device mapper
[07:54:50] ah yeah the overlay module is not present in lsmod's output
[07:54:55] lovely, checking hwy
[07:54:57] *why
[08:14:01] serviceops, MW-on-K8s, SRE-swift-storage, Shellbox, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (Joe) After more digging: I have no idea why envoy would report the upstream time spent as 2 seconds, when it really is 20. Looks like a bug there. So: mos...
[08:16:09] serviceops, Observability-Logging, Shellbox: Shellbox's http container does not log in wmfjson or ecs format - https://phabricator.wikimedia.org/T301757 (Joe)
[08:16:17] serviceops, Observability-Logging, Shellbox: Shellbox's http container does not log in wmfjson or ecs format - https://phabricator.wikimedia.org/T301757 (Joe) p:Triage→Medium
[08:30:59] so if I explicitly modprobe overlay, stop docker, clean up /var/lib/docker and start docker, I see overlay being used in docker info
[08:31:15] this is different from what I've tested on buster
[08:31:36] trying to reboot a node to see if it comes up with overlay mounted
[08:40:27] serviceops, SRE, Wikimedia-Etherpad, vm-requests, Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it to 1.8.16) - https://phabricator.wikimedia.org/T300568 (Volans) >>! In T300568#7708409, @Dzahn wrote: > @Volans Yes, it has been fixed by making etherpad listen on "::"...
[08:47:24] <_joe_> elukey: if docker has started with a driver, it will still use it as long as you don't remove /var/lib/docker indeed
[08:48:32] _joe_ yep yep, for some reason overlay was loaded when I tested the flag on buster, with bullseye I have to modprobe it
[08:48:48] but I haven't tested it with a reboot before applying the k8s role
[08:48:57] <_joe_> I see
[08:49:05] <_joe_> so we do have a race condition here
[08:49:19] <_joe_> we do by default install nodes blacklisting overlay
[08:49:39] <_joe_> so if we apply the role, we unblacklist overlay but I don't think we load it
[08:49:56] <_joe_> I hoped it would be enough to make docker DTRT
[08:50:10] <_joe_> but clearly it's not
[08:50:40] Janis and I discussed the `profile::base::overlayfs: true` flag, I tried to avoid the extra step of deploying it before the k8s role and it worked on buster, but it may have been only luck
[08:50:54] <_joe_> we might add a modprobe to ExecStartPre= for the docker engine
[08:51:09] one thing that I am wondering is if it is better to explicitly add the storage-driver: overlay option to the docker config
[08:51:15] rather than relying on the defaults
[08:51:30] <_joe_> possibly yes
[08:51:31] so if, for some reason, devicemapper is the only one available, then docker fails
[08:51:37] <_joe_> does it?
[08:51:44] I hope so :D
[08:54:23] anyway, 3 nodes on bullseye on the ml-serve-codfw cluster atm
[08:54:35] I am going to report warnings and findings in the task
[12:19:24] serviceops, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move Docker settings for kubernetes workers to overlay fs - https://phabricator.wikimedia.org/T300744 (elukey) Some notes: * On bullseye nodes the device mapper storage driver was picked up by Docker when I applied the k8s node role...
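The checks done by hand between 08:30 and 08:51 (is the overlay module loaded, and which storage driver does Docker report) can be sketched as below. This is a minimal illustration, not anything taken from the task or the puppet repo: the function names are hypothetical, the module list is assumed to be in the `/proc/modules` format (module name as the first field), and the Docker side is assumed to be the JSON printed by `docker info --format '{{json .}}'`, whose `Driver` field holds the storage driver.

```python
import json


def overlay_module_loaded(proc_modules_text: str) -> bool:
    """Return True if a module named 'overlay' is listed.

    Expects text in the /proc/modules format, where the module name is
    the first whitespace-separated field on each line.
    """
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "overlay":
            return True
    return False


def docker_storage_driver(docker_info_json: str) -> str:
    """Extract the storage driver name from `docker info` JSON output."""
    return json.loads(docker_info_json)["Driver"]


def uses_overlay(proc_modules_text: str, docker_info_json: str) -> bool:
    """True only when the module is loaded and Docker reports an overlay driver."""
    driver = docker_storage_driver(docker_info_json)
    return overlay_module_loaded(proc_modules_text) and driver.startswith("overlay")
```

On a node, the two inputs would presumably come from reading `/proc/modules` and from capturing the `docker info` command output. The sketch only reports state; it does not answer the open question in the chat of whether Docker refuses to start when the configured driver is unavailable.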
[12:40:09] very interesting
[12:40:10] elukey@ml-serve2007:~$ sudo grep overlay /var/log/kern.log
[12:40:10] Feb 15 07:45:35 ml-serve2007 kernel: [ 163.688043] request_module fs-overlay succeeded, but still no fs?
[12:40:40] IIUC overlay is not loaded by the os if it is not being used
[12:41:03] so this could lead to the docker issue mentioned above
[12:41:18] I can add a specific setting to force the overlay module to be loaded
[12:41:29] (on boot I mean)
[12:49:14] serviceops, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move Docker settings for kubernetes workers to overlay fs - https://phabricator.wikimedia.org/T300744 (elukey) I think I got what led to Docker starting with device-mapper instead of overlay. I found this log: ` elukey@ml-serve2007:...
[12:55:30] https://gerrit.wikimedia.org/r/c/operations/puppet/+/762806 :)
[14:10:40] serviceops, MW-on-K8s, SRE-swift-storage, Shellbox, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (akosiaris) >>! In T292322#7710140, @Joe wrote: > After more digging: I have no idea why envoy would report the upstream time spent as 2 seconds, when it r...
[14:39:54] serviceops, GitLab (Infrastructure), Patch-For-Review: Migrate gitlab-test instance to puppet - https://phabricator.wikimedia.org/T297411 (Jelto)
[14:46:55] serviceops, GitLab (Infrastructure), Patch-For-Review: Migrate gitlab-test instance to puppet - https://phabricator.wikimedia.org/T297411 (Jelto) >>! In T297411#7669397, @Dzahn wrote: > [] installed gitlab-ce package post-installation script subprocess returned error exit status 1 > [] nginx initial...
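The two ideas floated above (loading the overlay module at boot, and pinning the Docker storage driver instead of relying on defaults) could look roughly like the following. This is a hedged sketch and not the content of the Gerrit change linked at 12:55: the helper names are hypothetical, the targets are assumed to be a systemd `modules-load.d` file (one module name per line) and Docker's `daemon.json` with its `storage-driver` key, and the driver name `overlay2` used in the example is an assumption, since the chat only says "overlay".

```python
import json
from typing import Iterable


def render_modules_load(modules: Iterable[str]) -> str:
    """Render a modules-load.d style file: one kernel module name per line."""
    return "".join(f"{name}\n" for name in modules)


def render_daemon_json(storage_driver: str) -> str:
    """Render a Docker daemon.json that pins the storage driver explicitly."""
    return json.dumps({"storage-driver": storage_driver}, indent=2) + "\n"


# Example: content one might place in /etc/modules-load.d/overlay.conf
# and /etc/docker/daemon.json (both paths and the driver name are assumptions).
modules_conf = render_modules_load(["overlay"])
daemon_conf = render_daemon_json("overlay2")
```

Pinning the driver makes the intended configuration explicit; whether Docker then fails hard when only devicemapper is available is the point left unconfirmed in the conversation.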
[15:18:22] I reimaged ml-serve2005 from Buster to Bullseye, all good from the overlay+docker side
[15:18:31] (drain + reimage + uncordon)
[15:18:51] so now we have 4 nodes (ml-serve200[5-8]) on Bullseye with Overlay
[15:36:22] I think that the next step is somebody from ServiceOps checking these nodes, to validate what's missing/wrong/etc..
[15:36:45] after that we can think about a staging node of wikikube?
[16:20:48] serviceops, GitLab (Infrastructure), Patch-For-Review: Migrate gitlab-test instance to puppet - https://phabricator.wikimedia.org/T297411 (Jelto)
[16:22:25] serviceops, GitLab (Infrastructure), Patch-For-Review: Migrate gitlab-test instance to puppet - https://phabricator.wikimedia.org/T297411 (Jelto)
[16:36:47] <_joe_> elukey: +1
[16:37:55] perfect, next question - who's volunteering to do the review? :D
[16:51:00] <_joe_> jayme of course
[16:51:19] <_joe_> who's not here today, so he's the perfect pawn
[17:37:29] serviceops, MW-on-K8s, SRE-swift-storage, Shellbox, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (Joe) >>! In T292322#7710927, @akosiaris wrote: >>>! In T292322#7710140, @Joe wrote: >> After more digging: I have no idea why envoy would report the upstr...
[18:02:07] serviceops, MW-on-K8s, SRE-swift-storage, Shellbox, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (Joe) As I feared, no significant change is seen when using an host-mounted emptyDir in the container. I would assume the shellbox server spends most of it...
[18:40:30] serviceops, Release-Engineering-Team (Seen): contint hardware refresh - https://phabricator.wikimedia.org/T294276 (Papaul)
[18:40:35] serviceops, Gerrit, SRE: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (Papaul)
[20:43:43] serviceops, DC-Ops, SRE, ops-codfw, Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (Papaul)