[07:21:41] both tools and toolsbeta harbor's UIs are perpetually 'loading', is that 'normal' (given there were some issues yesterday)?
[07:40:53] also, failed pulling the volume-admission image from toolsbeta in lima-kilo
[07:41:00] https://www.irccloud.com/pastebin/9hpigvc2/
[07:41:47] there's this in the harbor-core logs
[07:42:02] https://www.irccloud.com/pastebin/dYlwndEv/
[08:03:24] blancadesal: that sounds like the tmp filling up on the proxy (happened yesterday)
[08:03:51] yep
[08:03:53] https://www.irccloud.com/pastebin/B9jD5cqD/
[08:04:00] du -hs shows 0 :/
[08:04:38] :-(
[08:04:50] try now
[08:04:53] blancadesal: ^
[08:06:05] I did a recursive ls, I see only directories :/
[08:06:12] dcaro: both UIs load fine now
[08:07:35] pulling images seems to work again
[08:08:04] what did you do?
[08:09:03] root@proxy-03:~# systemctl restart nginx.service
[08:09:23] https://phabricator.wikimedia.org/T354116
[08:12:03] so it's an nginx cache thing?
[08:15:48] I think so yes, somehow it fills up and does not clean up
[08:16:03] just added a couple more notes of what I saw during the 100% fullness
[08:16:19] ideas are welcome xd
[08:17:47] next time we should do an lsof, might be an open file that's not yet persisted
[08:20:59] what if we disable the nginx cache entirely, and if we need cache, then we deploy a proper cache server?
[08:22:04] that's what we thought we were doing :/
[08:22:20] but it seems we are not configuring the right bits somehow (also for the proxy buffering)
[08:22:54] the fact that it creates directories under /proxy makes me think that the proxy buffering is still happening somehow
[08:28:06] if we get a reproducible it would help a lot tweaking the configurations
[08:28:27] *reproducer? reproduction? ...
[08:35:51] dcaro: just updated the changelog in https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/44
[08:36:34] taavi: awesome thanks :)
[08:37:29] taavi: approved 👍
[08:41:47] thanks, pushed the tag and publishing the deb just now
[08:44:20] I'll merge this in the afternoon if nobody has any opinions https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/11
[08:47:50] dcaro: toolforge-weld v1.5.0 pushed to bookworm-toolsbeta, do you want to do any tests before I push it to -tools?
[08:51:31] let me just test on toolsbeta, is it installed in the bastions there?
[08:51:48] yep, I updated toolsbeta-bastion-6
[08:52:39] awesome, looks good 👍
[08:53:46] great, running the copy cookbook now
[08:55:19] dcaro: done
[08:58:05] \o/
[09:06:18] FYI, I've temporarily disabled puppet on cloudcephosd nodes to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1019063
[09:06:45] moritzm: ack
[09:11:39] merged and puppet re-enabled, looks all fine
[09:12:05] great, thanks!
[10:14:49] we have a k8s worker with stuck D processes
[10:14:52] https://usercontent.irccloud-cdn.com/file/1nM7AyWj/image.png
[10:14:59] (the alert will trigger in a bit)
[10:15:17] arturo: anything you want me to check? I'll try restarting one of the pods there at least
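(Editor's aside: a minimal sketch of one way to enumerate D-state (uninterruptible sleep) processes like the ones being discussed, by reading /proc directly. This is illustrative only and not the exact tooling behind the pastebins in the log.)

```python
#!/usr/bin/env python3
"""List processes stuck in D state (uninterruptible sleep) by scanning /proc.

Sketch only: roughly what `ps -eo pid,stat,comm` filtered on a 'D' state
would show, assuming a Linux procfs layout.
"""
import os


def d_state_processes():
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(f"/proc/{pid}/stat") as f:
                data = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # /proc/<pid>/stat is "pid (comm) state ..."; comm may contain
        # spaces or parentheses, so split on the last ')'.
        comm = data[data.index("(") + 1 : data.rindex(")")]
        state = data[data.rindex(")") + 2 :].split()[0]
        if state == "D":
            yield int(pid), comm


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(f"{pid}\t{comm}")
```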
[10:16:00] yeah, let me jump in and poke myself too, to verify the state
[10:17:25] ssh very slow to that node
[10:18:09] as root worked
[10:19:18] I see a bunch of containers were OOM-killed
[10:19:18] my ssh is still loading xd
[10:20:01] I wonder if the oom-killer is the one creating the D procs
[10:20:39] I would expect the other way around, D procs filling up the RAM and the OOM trying to kill them
[10:20:55] dmesg excerpt
[10:20:57] https://www.irccloud.com/pastebin/PEXH51Rx/
[10:21:27] loadavg is very high, 37 (expected because of the D procs)
[10:21:45] but mem usage is less than half of the available in the worker node
[10:21:49] 5G/15G
[10:22:43] is that a container process? if so it only needs to reach the container limit, not fill up the whole node's RAM
[10:23:53] there's usually a 'Memory cgroup out of memory' or similar when that's the case
[10:24:58] the list of D procs are all from toolforge tools, apparently
[10:25:03] https://www.irccloud.com/pastebin/eDVrkpKv/
[10:25:37] dcaro: I'm ready for magic restart to run now
[10:25:47] the two bash there are my ssh and taavi's I think
[10:26:15] https://www.irccloud.com/pastebin/iYFWWW1b/
[10:26:57] i wasn't able to log in as my normal user, that session is stale now
[10:27:44] mine too (I'm waiting for a shell)
[10:28:12] I was only able to log in using `ssh root@tools-...`
[10:28:50] me too
[10:29:02] my guess is that nfs is still misbehaving there
[10:29:09] (as in, any attempt to use it will block)
[10:29:27] (and in contrast with nfs having been gone and come back)
[10:29:59] I see no obvious nfs logs in journal/dmesg
[10:30:02] https://www.irccloud.com/pastebin/XPgOSIJS/
[10:30:25] maybe this wasn't the nfs server misbehaving, but the oomkiller acting
[10:30:48] and killing nfs?
[10:31:36] the D processes started piling up yesterday ~20:10 UTC
[10:31:38] killing the process just when doing some NFS io, therefore leaving it in D state
[10:31:49] that would not leave it in D state, no?
[10:32:01] you only enter D state when you request io to the kernel
[10:33:16] I don't know, but I don't see any reason why the NFS server would be involved in this at all
[10:33:39] I see the oomkiller acting, which I know can act in very cumbersome ways
[10:34:39] not only the oomkiller, but the cgroup running out of memory
[10:34:41] "Memory cgroup out of memory"
[10:35:07] why would then our ssh sessions still get stuck?
[10:35:31] try ls -l /data/project/ (might get your shell stuck)
[10:36:22] when the cgroup runs out of memory, the oom killer gets triggered to kill something inside it, so both of those go hand in hand
[10:37:17] root@tools-k8s-worker-nfs-1:~# ls /data/project/ | wc -l
[10:37:17] 3280
[10:39:02] this variant is not returning though
[10:39:03] root@tools-k8s-worker-nfs-1:~# ls -l /data/project/ | wc -l
[10:39:31] oh, yes, there's a difference there on the nfs side
[10:39:48] one only reads the directory, the other reads the attributes of every inode listed by the directory
[10:40:10] let me check the logs on the nfs server
[10:40:44] no D processes there
[10:41:06] nothing pops up from dmesg/journalctl
[10:41:55] is tools-nfs-2 the one it's using?
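(Editor's aside: the `ls` vs `ls -l` difference above — a single READDIR vs a GETATTR per entry — can be probed without wedging your own shell by running the stat-heavy variant in a child process and only polling it. The sketch below mirrors the commands used on the node but is an illustration, not what was actually run; the path and timeout are assumptions.)

```python
#!/usr/bin/env python3
"""Probe a possibly-wedged NFS mount without hanging the calling shell.

`ls <dir>` only needs READDIR, while `ls -l <dir>` additionally stats every
entry (one GETATTR per inode), which is what blocks when the mount is stuck.
The stat-heavy variant runs as a child process that we only poll, so the
parent never blocks on it even if the child ends up in D state.
"""
import os
import subprocess
import time

PATH = "/data/project"  # the mount discussed above
TIMEOUT = 10  # seconds; arbitrary

# READDIR only: in the case above this still returned even though the mount
# was unhappy (it can also block on a fully dead mount).
entries = os.listdir(PATH)
print(f"readdir ok, {len(entries)} entries")

# One stat per entry, equivalent to `ls -l`; hangs if the mount is wedged.
child = subprocess.Popen(
    ["ls", "-l", PATH],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
deadline = time.time() + TIMEOUT
while time.time() < deadline and child.poll() is None:
    time.sleep(0.5)

if child.poll() is None:
    # Don't wait() on it: a D-state child may be unkillable until the NFS
    # server answers, but at least this shell is not stuck.
    print(f"per-entry stat did not finish within {TIMEOUT}s -> mount looks wedged")
else:
    print("per-entry stat ok")
```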
[10:42:45] https://www.irccloud.com/pastebin/oaOnK6sX/
[10:44:46] yep
[10:44:49] https://www.irccloud.com/pastebin/iRpiA0b1/
[10:45:59] I wonder if a pod exceeding its memory quota getting oomkilled is the expected behavior, or if k8s should behave differently
[10:46:52] it's the expected behavior yes, or at least that was the case before
[10:47:18] the oom killer kills only processes inside the cgroup
[10:49:32] seems still valid https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/#exceed-a-container-s-memory-limit
[10:50:30] what is the status of the pods on this node?
[10:51:39] the ones running are still running, the ones that were killed I guess were killed (though they might have been restarted)
[10:54:46] I just checked one of them
[10:54:52] Last State: Terminated
[10:54:52] Reason: OOMKilled
[10:54:52] Exit Code: 137
[10:54:52] Started: Mon, 15 Apr 2024 20:01:39 +0000
[10:54:52] Finished: Mon, 15 Apr 2024 21:01:06 +0000
[10:55:06] this is from tools.toc
[10:55:07] kubectl describe pod k8s-20170915.topic-list.ja-56df69f86c-h27rf
[10:55:55] https://www.irccloud.com/pastebin/4b3RKBFh/
[10:56:48] how did you get the name of that pod?
[10:57:15] https://www.irccloud.com/pastebin/IaJFWFRr/
[10:57:19] https://www.irccloud.com/pastebin/2ZFP1gs0/
[10:58:26] they show running there
[10:59:36] oh wait, that's an old state, okok
[11:00:13] anyway, your plan was to try to restart them?
[11:00:26] without rebooting the node?
[11:01:16] yep, though let's make sure that one is one of the processes in D state
[11:01:41] looks like it
[11:02:00] but as long as NFS is still stuck, there's not much to do
[11:02:14] the old issues were that nfs had gone away, and then had gotten unstuck
[11:02:29] so we were able to ls all the tree, but old D processes would not go away
[11:02:36] this is different, nfs is still stuck for some reason
[11:02:46] so as long as that's not sorted out it will get stuck again
[11:03:15] (we would see the 'nfs server has gone away' and 'nfs is back' kind of messages in dmesg)
[11:03:41] I agree
[11:04:29] we can taint the node, to force the pods to go away, and try debugging it
[11:04:53] my theory is we cannot do anything with this, other than reboot the node
[11:05:01] but you can try with a drain
[11:05:37] as long as nfs is stuck, rebooting seems the only generic option yes
[11:06:27] we don't know why it's stuck though, it seems different than the last issues we had
[11:06:50] * arturo brb
[11:08:32] interestingly, draining the node got rid of all the D processes, so it might be that k8s is able to stop D processes
[11:08:59] (without rebooting)
[11:09:15] (well, except the shells xd)
[11:09:16] almost all of them!
[11:09:18] https://www.irccloud.com/pastebin/JkBjjcpY/
[11:09:29] everything that was in a pod
[11:09:35] so that's a good sign!
[11:09:47] gtg for lunch
[11:09:50] * dcaro lunch
[11:09:59] feel free to try to get nfs unstuck
[11:10:05] I'll be back in ~1h
[11:10:41] load avg trending down drastically
[11:10:42] load average: 8.34, 27.09, 34.91
[11:43:53] would it be ok to add a few things to lima-kilo like fzf and kubectl bash completion, or should I do "customization" independently?
[11:45:37] 100% fine for me
[11:46:22] anything to add to the "wishlist"?
[11:46:45] have my .bashrc and .bash_aliases linked inside the VM :-P
[11:46:55] and atuin
[11:47:52] for .bashrc/.bash_aliases, how did you do that?
[11:48:10] I haven't. It's in my wishlist :-P
[11:48:18] ah ok!
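(Editor's aside: for the dotfiles wishlist item, a throwaway sketch of pushing a few host dotfiles into the guest with `limactl copy`, the subcommand that comes up a bit further down in this log. The instance name `lima-kilo` and the file list are assumptions, not anything lima-kilo actually does.)

```python
#!/usr/bin/env python3
"""Copy a few personal dotfiles from the host into the lima guest.

Rough sketch only, built on the `limactl copy` subcommand confirmed later in
this log; names and paths here are illustrative assumptions.
"""
import pathlib
import subprocess

INSTANCE = "lima-kilo"  # assumed instance name
DOTFILES = [".bash_aliases", ".gitconfig"]  # whatever you want synced

for name in DOTFILES:
    src = pathlib.Path.home() / name
    if not src.exists():
        continue
    # limactl copy <host-path> <instance>:<guest-path>
    subprocess.run(
        ["limactl", "copy", str(src), f"{INSTANCE}:~/{name}"],
        check=True,
    )
    print(f"copied {src} -> {INSTANCE}:~/{name}")
```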
[11:50:42] tcpdump, htop I think they might not be installed
[11:51:12] also, some way to cache container images across rebuilds
[11:51:26] `limactl copy` can copy files between host and guest
[11:51:50] arturo: about that, what is it that you were trying to do?
[11:52:32] something like this (not working ATM)
[11:52:51] https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/116
[11:53:43] that's going to be very problematic for macos
[11:54:18] anyhow, I meant what were you doing to need the caching, not what the caching means
[11:55:46] rebuilding the lima VM
[11:55:53] btw. I think that we can re-think the NFS options we use now that the grid is not around
[11:56:30] arturo: just once to refresh the environment? or for any other reason?
[11:57:37] it takes between 15 to 20 minutes to rebuild the lima VM, mostly because of the amount of downloads, most importantly the container images
[11:57:44] (on my laptop)
[11:58:09] and it's even worse on a poor network connection
[12:01:12] I know yes, I'm wondering if there's any specific flow you are doing that can be worked around (given that you seemed to be doing this really often by your comments in the sync meeting)
[12:02:12] note that we also set 'ImagePull: always' for lima-kilo usually, so the first deployment will pull the image (so we should change that too even if we get a cache warmed up already)
[12:06:35] `limactl copy ~/.bash_aliases lima-kilo:~` works
[12:09:23] * arturo food time
[12:17:11] blancadesal: maybe that can be some flag in the `start_dev.sh` script or similar (I don't want my whole bashrc copied inside as is, it's kinda complex)
[12:17:47] what about adding our personal ones in the repo and choosing from there? (that way we can share/reuse/etc.)
[12:19:31] sounds good. so a flag then allowing optionally overriding the default .bashrc?
[12:20:53] something like that yes
[12:31:46] this one just adds the packages + default .bashrc customization: https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/117
[12:41:40] LGTM
[12:45:49] thx
[13:00:15] dcaro: @app.get("/harakiri") lol
[13:00:32] xd
[14:21:12] taavi: in your example in the task description https://phabricator.wikimedia.org/T362525 why does nginx let the request through despite not using a client-side TLS cert?
[14:22:13] arturo: you have spotted the exact issue! note how the request is made to :8000, which is handled by builds-api directly
[14:23:00] taavi: right ... ok
[14:25:12] what do you think about this templating approach https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/18/diffs vs embedding the dict in the code?
[14:25:34] i.e., using python's Template
[14:27:01] will it fail if the variable does not exist?
[14:27:20] (the syntax is ok, we should make sure we parse it as valid yaml too)
[14:27:46] hmm, that makes me think, is there a yaml-based version of that? (so we can use some static checker)
[14:27:51] I have no idea, it's my first time using Template
[14:28:34] so far, maintain-kubeusers has been embedding the resources as dicts in the code, which I find a bit more cumbersome to work with
[14:29:53] what about using a plain yaml and replacing the fields in the code? (not sure I like it much better, but at least you get the yaml checker+formatting on the template file directly)
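(Editor's aside: a small illustration of the two properties being discussed — `string.Template.substitute()` raises `KeyError` when an identifier has no value, and because the placeholders are plain strings, both the template and the rendered output can be run through a YAML parser as a sanity check. The manifest below is a made-up example, not the actual maintain-kubeusers resource.)

```python
#!/usr/bin/env python3
"""Illustration of string.Template + YAML validation for k8s manifests.

Made-up ConfigMap, not the real maintain-kubeusers resource; it just shows
that substitute() fails loudly on missing variables and that the result can
be checked with a YAML parser (PyYAML assumed).
"""
from string import Template

import yaml  # PyYAML

MANIFEST_TEMPLATE = Template(
    """\
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${tool_name}-config
  namespace: tool-${tool_name}
data:
  quota: "${quota}"
"""
)

# substitute() raises KeyError if an identifier has no value...
rendered = MANIFEST_TEMPLATE.substitute(tool_name="some-tool", quota="2Gi")
# ...while safe_substitute() would silently leave ${...} placeholders in place.

# The unrendered template is itself valid YAML (the vars are just strings),
# so the template file and the rendered output can both be linted/parsed.
obj = yaml.safe_load(rendered)
print(obj["metadata"]["name"])  # some-tool-config

try:
    MANIFEST_TEMPLATE.substitute(tool_name="some-tool")  # 'quota' missing
except KeyError as exc:
    print(f"missing template variable: {exc}")
```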
[14:30:13] I think that the current file is already valid yaml
[14:30:49] yes, the vars are just strings
[14:30:55] you can definitely yaml-validate the file
[14:31:00] (that was the intention anyway)
[14:34:46] I think it will raise if the identifier variables are not defined, we can also use this in case it does not https://docs.python.org/3/library/string.html#string.Template.is_valid
[14:35:14] not sure actually if that does what I think it does xd
[15:22:07] I merged https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/29, what's the current procedure to deploy jobs-cli?
[15:28:03] you have to make a release (you can use the utils/bump_version.sh script to create the release commit for you)
[15:28:37] example of MR https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/27
[15:29:52] https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli#building-the-debian-packages
[15:30:09] I think we might want to share/put those docs somewhere common
[15:30:19] (hmm... I think there was a task for it...)
[15:31:40] it does not mention there that the package will also be built in CI, and you can use that one, like https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/jobs/241067
[15:31:52] (that's the builds-cli)
[15:32:28] thanks!
[15:32:42] yep I noticed the CI was building the package
[15:35:52] hopefully soon 🤞 we will automate that somehow :)
[15:35:59] (the deployment I mean)
[15:38:00] bump_version.sh is still using buster by default, which no longer works
[15:38:24] oh, I thought that had been changed already :/
[15:39:01] https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/blob/main/utils/build_deb.sh?ref_type=heads#L16
[15:39:06] adding "bullseye" fixes it but it's still failing with "gbp:error: /src is not a git repository"
[15:39:07] ^it seems to be using bullseye
[15:39:28] wait no, wrong script
[15:39:35] yep :D
[15:39:57] we should fix it then
[15:40:12] I can create a patch
[15:40:23] that'd be great yes :)
[15:40:26] but first I want to understand the second error (/src is not a git repository)
[15:41:20] it worked for me though
[15:41:24] https://www.irccloud.com/pastebin/yLan69k1/
[15:41:31] so might be docker/macos related
[15:42:36] (I can do the branch/release if you don't want to debug right now)
[15:42:52] I'm ok debugging though, no rush on my side
[15:43:51] did you run it from the root of the repository or from the directory containing the script? (not sure it matters though)
[15:45:07] from the root, I'm debugging with "--entrypoint bash", /src/ contains the right files
[15:45:18] but "git status" yields "fatal: detected dubious ownership in repository at '/src'"
[15:45:44] oooohhhh
[15:46:28] I see, for me the container runs as my user (podman using --userns=keep-id)
[15:46:32] I guess you have the same UIDs inside Docker yeah
[15:46:37] but if Docker is in a VM it breaks :/
[15:47:40] hmm, iirc you can bypass that with `git config --global --add safe.directory /src`
[15:48:34] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-cli/-/blob/main/utils/debuilder-bullseye/generate_changelog.sh?ref_type=heads#L18
[15:49:16] hmmm if I mount the same dir in a simple debian container, it seems to work without any extra params
[15:51:37] ok and indeed it does work if I remove --user, but I guess it will break for you
[15:52:35] it might also create files as root in your working directory (that then you'll have to `sudo chown -R $USER:$USER`)
[15:55:20] it seems to be doing some smart user mapping, if I "touch" a file as root inside the container, it's created as my user in my host filesystem
[15:56:55] that's nice, it was a real mess when it would create it as root
[15:57:17] I wonder if it still does create it as root in Linux
[15:57:46] we can try inside lima-kilo
[15:57:50] (it uses docker iirc)
[16:02:42] there's one more layer, but let me try a vanilla debian VM in lima-kilo
[16:04:13] * dcaro in a meeting
[16:07:07] yep, docker will create files as root if you create them from inside a container in a Linux host
[16:09:14] I guess a solution could be to create the deb file in the container filesystem, then "docker cp" the file to the host
[16:24:27] * arturo offline
[16:30:58] dhinus: did you try using the `git config ...` command I passed before?
[16:31:05] yep, it does not work :/
[16:31:06] that should allow you to run as a different user without issues
[16:31:13] dammit xd
[16:31:19] what's the exact error this time?
[16:31:43] s/this time/with the `git config...` thingie applied/
[16:31:51] because "error: could not lock config file //.gitconfig: Permission denied"
[16:32:09] interesting
[16:32:28] I might just switch to podman :D
[16:33:54] xd, hmm, what user is it using inside the container for you?
[16:34:50] it's starting the container with --user=501, if I run "whoami" I get "whoami: cannot find name for user ID 501"
[16:35:13] can you use `--user=0` instead? (what the script does)
[16:35:40] yep that works
[16:36:08] but I guess it would create files owned by root if you start the script from Linux
[16:36:10] wait, that is the other script again xd (build_deb.sh)
[16:36:48] yep, it would I think, though you said that in macos they are owned by the user?
[16:36:54] *they end up being owned
[16:36:57] yep in macos there is some magic that fixes it
[16:37:11] maybe we can exploit that, and set the --user=0 only on macos
[16:37:47] yep it's a bit of a hack to check the OS, but it would probably work
[16:38:20] I have to log off but I can try creating a patch tomorrow, I was hoping to find a cleaner way that can work on both Linux and mac
[16:38:58] 👍 cya tomorrow
[16:39:57] * dcaro off too 👋
[18:11:41] fyi. I'm leaving tools-worker-nfs-1 out of the pool for now to test why NFS is not working as it should, if you see that tools k8s is getting busy, feel free to reboot it if needed, that should fix it
[18:22:26] * bd808 lunch
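(Editor's aside: for the --user hack discussed around 16:37, a hedged sketch of how a container invocation like the one in build_deb.sh could pick its --user flag per platform — root on macOS, where Docker Desktop remaps bind-mount file ownership back to the host user, and the caller's uid:gid on Linux so build artifacts are not root-owned. The function names, image name, and argument layout are assumptions, not what the script actually does.)

```python
#!/usr/bin/env python3
"""Sketch of the platform-dependent --user hack discussed above.

On macOS, files created by root inside a bind mount come back owned by the
host user, so --user=0 avoids the 'dubious ownership' / gitconfig permission
problems. On Linux, running as the caller's uid:gid avoids root-owned
artifacts. Everything here is illustrative, not the real build_deb.sh.
"""
import os
import platform
import subprocess


def container_user_args() -> list[str]:
    if platform.system() == "Darwin":
        return ["--user", "0"]  # macOS: ownership gets remapped anyway
    return ["--user", f"{os.getuid()}:{os.getgid()}"]  # Linux: keep host uid


def build_deb(image: str, src_dir: str) -> None:
    cmd = [
        "docker", "run", "--rm",
        *container_user_args(),
        "--volume", f"{src_dir}:/src",
        image,
    ]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical image name; the real one lives in the repo's utils/ dir.
    build_deb("debuilder-bullseye", os.getcwd())
```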