[07:48:36] morning
[07:49:04] o/
[07:50:34] o/
[07:59:19] goooood morning
[08:22:06] o/ happy friday :)
[08:42:47] quick review? (blocks ci for some projects) https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/33
[08:44:37] hm, I think I might be able to use 'include: local: ...' and avoid one indirection
[08:47:51] dcaro: LGTM
[08:48:41] thanks! I can use local it seems :), I'll merge this one, and send a new one using 'local' for all the includes (that helps a lot when testing things, as otherwise you have to add `ref: branchname` to every include to test a branch)
[08:51:22] https://www.irccloud.com/pastebin/LiT8uTW8/
[08:52:31] hmm, why is that using tools-harbor.wmcloud.org for the dev chart and not toolsbeta-harbor.wmcloud.org?
[08:52:48] that's my question :)
[08:53:04] in the second case, it does use toolsbeta
[08:53:52] that's defined in the toolforge-deploy repo
[08:54:38] probably has the wrong repo source defined there
[08:55:01] (it's a not-often-updated component, so we probably never noticed)
[08:55:27] i'll send a fix
[08:55:32] 👍
[08:59:38] dcaro: should the chart repository in all components' local.yaml be toolsbeta? (i'm seeing a few others where it's tools)
[09:00:25] I think so, yes
[09:00:33] hmm
[09:01:29] we should sit down and figure out the whole thing at some point (as in, do we want to deploy toolsbeta images on toolsbeta, or tools-harbor ones? as we are testing right before deploying to tools?)
[09:04:32] why would `toolforge-jobs logs contjob -f` stop following the stream and return for no apparent reason? I can reproduce in lima-kilo
[09:04:54] some timeout somewhere?
[09:05:12] maybe, the job is only producing a log entry every 30 seconds
[09:06:27] I think that there's a task for it somewhere
[09:07:17] my guess is that there's a timeout somewhere, yes
[09:08:09] I think https://phabricator.wikimedia.org/T359953 might be the same, but now we don't show the ugly errors
[09:08:21] (that was because of a version incompatibility iirc)
[09:08:24] quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/249
[09:09:11] blancadesal: LGTM
[09:09:22] dcaro: ACK
[09:09:30] arturo: thanks
[09:17:49] btw. this is ready to review https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/34 (have to remove the last commit before merging, but keeping it until then, otherwise the task fails)
[09:18:16] that's the pre-commit autoupdate pipeline thingie
[09:19:23] dcaro: LGTM
[09:20:15] I think I pasted the wrong one xd, the one I was talking about is https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/32 (thanks for the other review too though)
[09:21:54] sneaky way to get your stuff reviewed :P
[09:22:05] sorry :S
[09:22:54] was just kidding :))
[09:26:59] dcaro: lgtm
[09:28:12] 👍 next step is adding pre-commit caching to speed up the ci, hopefully chopping the time from hours (it timed out at 1h a few times) to less than 1 min (my goal xd)
[09:34:42] xD
[09:39:19] ex. https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-admission/-/jobs/240111 -- ~15 min running so far
[09:42:00] too many safety checks, maybe we should write less safe code :-P
[09:42:56] it would definitely have its advantages, yes :)
[09:50:32] oh, I think that the local includes don't work as I expected :/, https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-admission/-/pipelines/48508
[09:50:47] even though the tests I made using the custom branch seemed to work
[09:50:53] I'll revert
[09:54:56] hmpf... it seems it might not be possible :/ https://gitlab.com/gitlab-org/gitlab/-/issues/35180
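A minimal sketch of the pre-commit caching idea mentioned above (avoiding rebuilding hook environments on every CI run). It assumes a GitLab CI shell job; `PRE_COMMIT_HOME` is pre-commit's documented cache-location override and `CI_PROJECT_DIR` is a standard GitLab CI variable, but the actual cache wiring used in the gitlab-ci templates is not shown here and may differ.

```bash
# Point pre-commit's environment cache at a directory the runner can persist
# between pipelines (the corresponding cache: entry in .gitlab-ci.yml is
# assumed, not shown).
export PRE_COMMIT_HOME="${CI_PROJECT_DIR:-$PWD}/.cache/pre-commit"
pre-commit run --all-files
```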
[09:55:13] anyhow, will look at it another time, focusing on the pre-commit issues
[09:55:43] what was the way to curl the api-gateway from within lima-kilo?
[09:56:02] you can check the /etc/toolforge.yaml
[09:56:05] file for the endpoint
[09:56:52] then you need to pass the --cert /home/tf-test/.toolskube/....cert --key .key (don't remember the full path, might be private.cert instead of key or something like that)
[09:57:38] /data/project/tf-test/.toolskube/client.crt and client.key
[09:58:09] ah yes, it's /etc/toolforge/common.yaml
[09:58:10] and /etc/toolforge/common.yaml xd I'm the worst at remembering details
[09:58:32] (never had to, with autocomplete...)
[09:58:43] we still have minikube instructions in all the readmes
[09:59:02] oh, we might want to sort that out
[10:02:59] please review and approve: https://gitlab.wikimedia.org/repos/cloud/toolforge/foxtrot-ldap/-/merge_requests/6
[10:03:27] local.tf-test@lima-lima-kilo:~$ curl --cert /data/project/tf-test/.toolskube/client.crt --key /data/project/tf-test/.toolskube/client.key --insecure 'https://localhost:30003/oapi/'
[10:03:27] {"message":"Hello World"}
[10:03:30] yay
[10:03:33] \o/
[10:04:59] \o/
[10:05:07] 🎉
[10:06:33] I learned some plumbing skills, xd
[10:09:49] arturo: approved
[10:10:46] dcaro: thanks
[10:10:52] please review and approve please review and approve https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/41
[10:11:05] xd, pretty pretty please
[10:12:14] that one will take a bit more
[10:12:27] heh, copy paste error
[10:15:49] running this on tools-k8s-control-9 gets me "too many open files"
[10:15:50] root@tools-k8s-control-9:~# kubectl -n envvars-admission logs deployment/envvars-admission -f
[10:16:15] ulimit -a shows a 1024 open files limit only
[10:16:39] I think it might be related to the build issues people were seeing lately
[10:18:43] it's running on workers 102 and 103
[10:20:05] dcaro: I think that line is actually in the log file, no?
[10:20:17] running without -f does not show it
[10:20:41] right
[10:21:08] it seems our open file limit on all the nodes is 1024, that seems low
[10:22:16] this may be https://github.com/kubernetes/kubernetes/issues/64315
[10:23:38] https://www.irccloud.com/pastebin/JseCHxWg/
[10:23:43] it only gives 6 on each xd
[10:24:24] before and after the logs command on the control node
[10:25:41] maybe the limit is from within the container itself
[10:25:43] not the node?
[10:27:32] probably yes, on one of the workers, the number of open files per-pid
[10:27:43] https://www.irccloud.com/pastebin/4CG8k88j/
[10:27:55] (only the 14 on the top)
[10:28:04] there are processes using >1024 there
[10:28:56] that last one is containerd-shim
[10:31:57] https://www.irccloud.com/pastebin/Dyl68nEY/
[10:32:05] not even close to it
[10:35:58] interesting
[10:36:03] https://www.irccloud.com/pastebin/Xlib1kSA/
[10:36:11] I'll open a task/find the one already there
[10:38:27] weird!
[10:44:58] it's the worker only, tailing the pod on the other worker works well
[10:47:09] the other worker does not have errors in journalctl either
[10:59:57] * dcaro lunch
[12:27:03] got a patch for the open files issue: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1019277
[12:27:21] :eyes
[12:27:22] :
[12:27:55] dcaro: LGTM
[12:29:09] thanks :)
[12:29:35] ^ that wasn't me :/
[12:29:40] (did nothing yet)
[15:12:34] thanks for digging into that fsnotify bug dcaro. I will keep an eye open for more problems but hope that the limit bump fixes it. :)
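A rough sketch of the checks from the open-files investigation above, runnable on a worker node. The per-pid open-file count mirrors what the pastebins showed; since the thanks above mentions an fsnotify bug and a limit bump, the inotify sysctls are listed too, though which exact setting the puppet patch raises is an assumption here.

```bash
ulimit -n                          # per-process open-file soft limit (1024 in the logs above)
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches

# rough per-pid open file counts, highest first (similar to the pastebin output)
for p in /proc/[0-9]*; do
    printf '%s %s\n' "$(ls "$p/fd" 2>/dev/null | wc -l)" "$p"
done | sort -rn | head -n 14
```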
[15:51:09] * arturo offline
[17:40:04] * bd808 lunch
[23:16:50] * bd808 off