[07:07:43] πŸ‘‹ I don't seem to be able to access login.toolforge.org β†’ I've tried different angles on my ssh config but I've been rejected each time, was trying to restart stashbot and maybe sal's webui
[07:27:15] arnaudb: in general, you need to be a maintainer of a toolforge tool to be able to join the bastion
[07:31:01] arnaudb: should be all set now
[07:31:09] yay thanks
[08:14:00] hey dcaro good morning. This MR seems stuck in the pre-commit stage https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-admission/-/merge_requests/8
[08:14:27] I thought this commit was meant to avoid just that https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-admission/-/commit/774571d50c66e5c821269ffc774bb961afde0cf5
[08:15:02] do you have any hints about what could be going wrong here?
[08:15:24] looking
[08:16:01] it's not using the memory-optimized runner: `Running on runner--whwgubpe-project-1394-concurrent-0-vyxyao0s via gitlab-runner-566784c547-n2ws6...`
[08:18:12] did you retry an old pipeline, or hit the 'run pipeline' button above?
[08:20:02] I think I've tested every action the gitlab UI allows me to do
[08:20:03] hitting 'run pipeline' to create a new one from scratch picks the memory-optimized runners: `Running on runner-9e2abdumz-project-1394-concurrent-0-jt6bhiol via gitlab-runner-memopt-76b4bb5cdf-8mw8s...`
[08:20:33] finished already https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-admission/-/pipelines/64455
[08:20:38] -_-
[08:20:52] is this the top-right button?
[08:20:54] https://usercontent.irccloud-cdn.com/file/ikRSYEmb/image.png
[08:21:13] if you hit 'retry' or the retry icon it will reuse the same pipeline definition as the one you ran (even if the code is newer)
[08:21:22] yep, that one
[08:21:39] if I use it, I get this
[08:21:40] https://usercontent.irccloud-cdn.com/file/cywngce9/image.png
[08:22:28] I run it from the MR's pipelines tab
[08:22:29] https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-admission/-/merge_requests/8/pipelines
[08:22:38] it has to be associated with an MR
[08:22:41] not just the branch
[08:23:20] ok, so this top-right 'Run pipeline' button?
[08:23:21] https://usercontent.irccloud-cdn.com/file/iwOaFk8R/image.png
[08:23:56] yep that one
[08:24:11] works for me now!
[08:24:14] I think that otherwise you might have to set some 'special' variables so it thinks it's an MR event
[08:24:25] (or it might not even be possible from the general pipeline page)
[08:24:39] for sure there's an API url that can be triggered (we do in some flows)
[08:25:08] oh, interesting, irccloud failed to upload the image I sent xd
[08:25:14] (invalid_form_token)
[08:25:37] mm... let me refresh
[08:25:53] https://usercontent.irccloud-cdn.com/file/xap4r7cQ/image.png
[08:25:55] now
[08:26:18] 🀷
[08:26:41] ok
[08:38:34] rebuilding lima-kilo from scratch is currently failing for me
[08:38:38] https://www.irccloud.com/pastebin/RHqkqSqc/
[08:39:25] blancadesal: what are the logs of the registry-admission webhook?
[08:39:39] we might need a `wait` after it comes in
[08:40:26] * dcaro is trying to rebuild too
[08:41:37] https://www.irccloud.com/pastebin/dkixegJO/
[08:43:22] TIL if you click on the irccloud paste link, it opens full-page
[08:44:03] fancy xd
[08:45:03] interesting, so it did work before it started failing
[08:45:14] so it's not a startup issue (though we might want to wait anyhow)
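A minimal sketch of the kind of readiness `wait` being discussed, assuming the registry-admission webhook runs as a Deployment named `registry-admission` in a namespace of the same name (inferred from the service URL quoted later in this log); the actual lima-kilo change (MR 168 below) may do this differently:

```
# hypothetical sketch: block until the admission webhook's Deployment is
# available before creating workloads, so requests don't hit a refused
# connection while the pod is still starting
kubectl -n registry-admission rollout status deployment/registry-admission --timeout=120s

# equivalent check using `kubectl wait` on the Deployment's Available condition
kubectl -n registry-admission wait --for=condition=Available deployment/registry-admission --timeout=120s
```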
[08:45:57] unrelated: could I get a +1 here? T370010
[08:45:57] T370010: Request quota increase for huma project - https://phabricator.wikimedia.org/T370010
[08:53:03] done
[08:53:28] thanks
[08:58:34] hmm... a fresh lima-kilo fails for me, not finding the user certs
[08:58:36] https://www.irccloud.com/pastebin/YdPYwn3H/
[08:58:44] https://www.irccloud.com/pastebin/WSqAlrr2/
[09:00:15] maintain-kubeusers is up and running (`starting a run`) but doesn't seem to really be doing anything
[09:01:45] it seems stuck yep
[09:01:46] Warning Unhealthy 29s (x9 over 4m29s) kubelet Liveness probe failed: find: β€˜/tmp/run.check’: No such file or directory
[09:02:44] trying to get a debug container in it to check the actual status
[09:03:09] hmpf. it got restarted, and now it ran ok
[09:03:29] https://www.irccloud.com/pastebin/F75KGVx3/
[09:17:37] hmm... is gitlab down?
[09:21:04] hmm, that failed pipeline looks like it's a failure on gitlab's side
[09:21:42] it seems there was a restart or something
[09:34:49] argh, I used --gigabytes instead of --ram when running the cookbook for one of the quota bumps. how do I undo it?
[09:35:23] use the options again to set the desired values
[09:35:27] it should be able to handle a negative number
[09:35:32] (I think)
[09:37:27] it's not complaining, at least
[09:37:33] https://www.irccloud.com/pastebin/mWZwTIZL/
[09:38:42] you can check the quotas of the project just to make sure
[09:42:05] spacemedia has 80GB
[10:22:53] blancadesal: this is the fix to the problem you reported earlier: https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/168
[10:26:47] arturo: testing now
[10:27:54] I think something else was going on there, as the logs show that it did actually process requests, but it failed when it tried doing so for the metrics component, with what seems to be the k8s api failing to list stuff there
[10:28:08] E0715 08:40:04.143474 28321 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
[10:28:23] though the wait should be there too, so it does not hurt
[10:31:49] yeah I reproduced the problem on my system
[10:31:54] and the next log lines were
[10:31:58] E0715 11:50:41.522332 26267 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
[10:31:58] Error: Internal error occurred: failed calling webhook "registry-admission.tools.wmcloud.org": failed to call webhook: Post "https://registry-admission.registry-admission.svc:443/?timeout=10s": dial tcp 10.96.99.14:443: connect: connection refused
[10:32:18] and with the patch I was no longer able to reproduce the problem
[10:45:46] (still building – I'm going for lunch)
[10:45:50] oooohhh, the first error is from kubectl itself, right?
[11:48:12] * dcaro lunch running late :/
[13:53:16] quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/416
[14:06:55] https://www.irccloud.com/pastebin/jrmZQyxl/
[16:02:54] quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/410
[16:32:19] cteam: I have a use case for Kubernetes in deployment-prep (hosting mediawiki) and am researching options. Would Magnum + OpenTofu be something y'all see as reasonable to use in that project?
[16:35:26] I think so, we're using it elsewhere.
[16:35:54] sounds reasonable to me (though I'm not very familiar with magnum); iirc there are a couple of limitations on the magnum side, like you can't upgrade in-place, you have to use the fedora images that come with it, and you have to sort out the ingress with haproxy or similar if you want HA
[16:41:17] The "easy" idea I have heard so far is running minikube or similar on a single node as the whole cluster. It feels like having multiple nodes would be a better idea for a shared environment. I'm open to hearing that really is not a big deal though.
[16:43:01] Since Wikitech is the only wiki still powered by scap rsyncing files in production, it feels like we need to move the beta cluster to k8s somehow to keep from drifting radically from how things work for the train deploys.
[16:44:29] bifdh
[16:44:36] oops, shifted one key xd
[16:45:06] might be interesting to ping the catalyst group to see what they ended up doing and why (I see k3s in horizon, but that might be outdated)
[16:45:46] yeah, thanks for the reminder to check on that :)
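For context on the Magnum + OpenTofu option discussed above, a rough sketch of what driving Magnum looks like from the OpenStack CLI; the cluster name, template, and node counts here are made-up placeholders, and an OpenTofu configuration using the OpenStack provider would manage the same API objects declaratively:

```
# hypothetical example; the template name and counts are placeholders,
# not real deployment-prep resources
openstack coe cluster template list

openstack coe cluster create mediawiki-k8s \
  --cluster-template <some-k8s-template> \
  --master-count 1 \
  --node-count 3

# fetch a kubeconfig for the new cluster
openstack coe cluster config mediawiki-k8s --dir ~/.kube
```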
[16:54:40] dhinus: are you still working enough to help me understand a cumin/keyholder thing?
[16:54:43] deployment-cumin-3.deployment-prep.eqiad1.wikimedia.cloud
[17:06:59] I'm about to log off, I can have a quick look
[17:08:26] "keyholder status" is looking good
[17:08:31] andrewbogott: what is not working?
[17:08:55] The key is rejected, both when used by cumin and with ssh
[17:09:24] for instance SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh deployment-memc12.deployment-prep.eqiad1.wikimedia.cloud
[17:09:52] The reason I expect this to work is a) it was working on the older cumin server and b) I see the public key on the target host
[17:10:09] that's a reasonable expectation :)
[17:10:37] I assume I'm missing some hiera setting on the client config
[17:12:11] * andrewbogott waits for the 'works for me!'
[17:14:50] this seems suspicious: "sign_and_send_pubkey: signing failed for ED25519 "root@cumin" from agent: agent refused operation"
[17:14:56] (using ssh -vv)
[17:15:10] is this a new key?
[17:15:46] I believe it to be the same key that the old cumin server was using
[17:16:00] which is deployment-cumin.deployment-prep.eqiad.wmflabs
[17:16:30] (but that one doesn't work at the moment because it's firewalled out, because the puppet code that supports multiple cumin servers seems broken)
[17:20:42] some random stackoverflow answer suggests it might be a file permission issue
[17:22:11] I read that one too :) I don't think that can be literally true, because the key is being provided by keyholder rather than from a file (as I understand it)
[17:22:38] I'm reading it as the permissions on the file that is used when doing "keyholder add"
[17:23:13] as in, keyholder is adding it happily, but failing to use it later
[17:23:44] did you run "keyholder add" manually or does it happen on boot?
[17:24:02] Both :) But let's chmod that file, reboot, and see what happens.
[17:25:40] cloudcumin is also bullseye, right? So we should be able to compare
[17:25:59] 444 is still too wide I suspect
[17:26:12] cloudcumin is bullseye and has 440
[17:26:23] (/etc/keyholder.d)
[17:26:25] andrewbogott: I'm leaving the cookbook to add/remove osd 247 (/dev/sdc) from cloudcephosd1034 running in a loop to create load on that hard drive, it should have no issues, but if you see anything weird going on, page me :)
[17:26:26] now it works!
[17:26:33] ok dcaro
[17:26:36] * dcaro off
[17:26:38] cya tomorrow
[17:26:45] dhinus: I'm not sure what changed, maybe it was just the reboot?
[17:26:54] I'm going to force a puppet run, reboot again, and if it works declare victory
[17:27:00] sgtm :)
[17:27:09] have a good dinner!
[17:27:15] thanks for looking
[17:27:19] rebooting might have helped
[17:27:20] yw!
[17:27:28] * dhinus off
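For reference, the keyholder fix in the thread above boils down to something like the following on the cumin host; this is a recap sketch assuming the key material sits directly under /etc/keyholder.d (paths and permissions as mentioned in the conversation), not an exact record of the commands that were run:

```
# key files under /etc/keyholder.d should be tightly permissioned
# (cloudcumin uses 440; 444 was suspected to be too wide)
sudo chmod 440 /etc/keyholder.d/*

# check that keyholder still has the key loaded after re-arming or a reboot
sudo keyholder status

# then exercise the proxied agent the same way cumin does
SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh deployment-memc12.deployment-prep.eqiad1.wikimedia.cloud hostname
```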