[07:08:22] morning
[07:32:28] morning
[08:21:18] good morning
[08:21:30] dcaro: what do you think about this?
[08:21:30] https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/26
[08:21:43] I don't think we use anything similar for other components?
[08:23:11] not that I know of, that would require creating an alert of sorts so we don't overlook it being down, right?
[08:23:19] yes
[08:24:19] I think I will create a phab ticket
[08:24:20] will partial runs mess with configmap data?
[08:24:36] what I saw yesterday is:
[08:24:48] 1) initial run of maintain-kubeusers, it created some resources for some accounts
[08:25:18] 2) when it got to an account that needed a new PSP, it crashed, because somehow in toolsbeta there are some missing permissions for the daemon to be able to create PSP resources
[08:25:40] 3) the pod was restarted because of the crash, and looped over the same accounts again
[08:25:51] 4) until it found the same account with the missing PSP, then crashed again
[08:26:36] it did not do anything weird to the configmap data, but it felt out of control to me. And maybe we want a full stop instead, so we can investigate and fix
[08:29:16] what about using backoffLimit instead? so it will still retry a few times (in case of a temporary failure)
[08:29:53] is that available for deployments? I thought that was for jobs only
[08:30:23] maybe, let me check
[08:36:05] I think you are right, bummer
[08:36:36] what we do in maintain-dbusers is just skip that account and report an error
[08:36:44] so we don't completely block new users, just that one
[08:37:02] maybe we can do that instead (and add the alert for the failed accounts)
[08:37:03] mmm
[08:38:44] feels better than getting new user creation blocked completely until the bug is fixed
[08:38:56] from a certain point of view, I like the idea of the dameon crashing and forcing us to fix the bug ASAP
[08:39:18] daemon*
[08:41:36] but we can definitely surface errors via prometheus and then alert on them
[08:42:40] let me give it a spin
[08:42:53] I'll report back after some tests
[08:44:21] that's what we did with maintain-dbusers, but it was very common for new accounts to get stuck for hours when any error happened, so we started just reporting the errors instead
[08:49:41] ok
[09:34:16] hi, I'd merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/890001 in ~10 mins, unless it's currently a bad time, just let me know if that's the case
[09:37:44] moritzm: go ahead!
[09:40:08] dcaro: I saw your MR https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/288 , is this for T357977 ?
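The skip-and-report approach discussed above (skip the broken account, surface the failure via prometheus, alert on it) could look roughly like this; a minimal sketch using prometheus_client, where the function and metric names are made up and not the actual maintain-kubeusers code:

```python
from prometheus_client import Counter, start_http_server

# Hypothetical metric for accounts that failed to reconcile; an alert can
# then fire on this instead of the daemon crash-looping.
FAILED_ACCOUNTS = Counter(
    "maintain_kubeusers_account_failures_total",
    "Tool accounts that failed to reconcile",
    ["account"],
)

def reconcile_all(accounts, reconcile_user):
    """Reconcile every account, skipping (but recording) the ones that fail."""
    failed = []
    for account in accounts:
        try:
            reconcile_user(account)  # e.g. create PSP, kubeconfig, configmap entry
        except Exception:
            # Skip this account instead of crashing the whole loop, so one
            # broken account does not block every other new user.
            FAILED_ACCOUNTS.labels(account=account).inc()
            failed.append(account)
    return failed

if __name__ == "__main__":
    start_http_server(9000)  # expose /metrics for scraping (port is arbitrary here)
```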
[09:40:09] T357977: [toolforge.infra] create fullstack tests - https://phabricator.wikimedia.org/T357977
[09:44:53] arturo: kinda yes, though it started as me being tired of running the same thing many times and copying it into a script
[09:46:55] I'm interested in something we can run locally, like that MR, and something that can report via prometheus
[09:47:40] Agree, that's why I split the 'lima-kilo' script from the tests themselves
[09:47:49] https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/133/diffs#8ec9a00bfd09b3190ac6b22251dbb1aa95a0579d
[09:48:00] the tests in toolforge-deploy should be runnable from within any tool in toolforge
[09:48:15] while the script above does it for lima-kilo (re-using the tests from toolforge-deploy)
[09:49:53] ok
[09:57:27] ack, now merged
[10:33:01] arturo: quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/86
[10:33:14] 👀
[10:34:01] dcaro: +1'd
[10:34:12] thanks!
[10:45:38] we might have overlooked that when we changed it everywhere else
[12:05:08] dcaro: could you please help me make sense of the toolforge-tb-psp podsecuritypolicy? what is it used for?
[12:06:17] toolforge-tfb-psp*
[12:06:29] https://www.irccloud.com/pastebin/HstS7mwY/
[12:11:26] 👀
[12:12:52] mmm I discovered google gemini using our repos as sources for the code examples
[12:12:54] https://usercontent.irccloud-cdn.com/file/y1Alk6NV/image.png
[12:13:22] xd
[12:13:29] https://g.co/gemini/share/b37cf50c8e96
[12:17:29] it seems to be the one specifying what tool pods should be restricted to, no?
[12:18:23] dcaro: there is another PSP for that. This one was kind of "buildpacks-related", which is what is confusing me
[12:19:28] it's mentioned here: operations-puppet/modules/toolforge/files/k8s/toolforge-tool-roles.yaml
[12:20:55] I think it's not used anywhere anymore
[12:21:02] it's from 2020 (before I was around)
[12:21:17] ok
[12:21:27] I will drop it as part of the refactor, then
[12:22:19] ack, it was added with https://phabricator.wikimedia.org/rLTMK844f21d480b65e1d58db3529e84576955ddccda6
[12:22:23] that we don't do anymore
[12:22:38] (we don't have a specific user for buildpack pods, it runs as the tool user)
[12:23:21] ok
[12:25:00] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1036640
[12:51:10] do we have k9s for toolforge k8s?
[12:51:14] I guess we need to package it
[12:51:25] T366061
[12:51:26] T366061: toolforge: package k9s for use in kubernetes - https://phabricator.wikimedia.org/T366061
[13:01:36] * arturo food
[13:05:28] I think there's a copy of it somewhere on one of the control nodes that I used at some point
[14:27:21] andrewbogott: are you working today? how do you disable a tool account in toolsbeta for testing purposes? I tried to modify the LDAP entry but I'm hitting stupid problems with LDIF and the tooling
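One way to double-check that toolforge-tfb-psp really is unused before dropping it is to look for any RBAC rule that still grants `use` on it; a rough sketch with the kubernetes Python client (the PSP name is from the chat above, everything else is generic and uses whatever kubeconfig is active):

```python
from kubernetes import client, config

PSP_NAME = "toolforge-tfb-psp"

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

def references_psp(rules):
    """True if any rule grants access to this PSP by name."""
    for rule in rules or []:
        if "podsecuritypolicies" in (rule.resources or []) and PSP_NAME in (
            rule.resource_names or []
        ):
            return True
    return False

for role in rbac.list_cluster_role().items:
    if references_psp(role.rules):
        print(f"ClusterRole {role.metadata.name} still references {PSP_NAME}")

for role in rbac.list_role_for_all_namespaces().items:
    if references_psp(role.rules):
        print(f"Role {role.metadata.namespace}/{role.metadata.name} still references {PSP_NAME}")
```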
[14:31:31] I am working and will see you in the check-in in a moment
[14:52:56] arturo, this might be an easy answer:
[14:53:01] andrew@cloudcontrol1003:~$ sudo mark_tool
[14:53:01] usage: mark_tool [-h] [--ldap-user LDAP_USER] [--ldap-password LDAP_PASSWORD]
[14:53:01] [--ldap-base-dn LDAP_BASE_DN] [--project PROJECT] [--disable]
[14:53:01] [--delete] [--enable]
[14:53:01] tool
[14:53:18] hmmm it doesn't let you specify a different ldap server though, does it
[14:53:31] toolsbeta is the same LDAP server as tools
[14:53:59] it's just that accounts are in the form of `toolsbeta.mytool` instead of `tools.mytool`
[14:55:09] hmmm ok
[14:55:21] sudo mark_tool --project toolsbeta --disable test8
[14:55:26] this did not report an error ^^^
[14:55:36] ok so it is easy, I'm just confused
[14:55:37] but also, I don't see the LDAP entry being modified at all
[14:56:36] yeah, it's an extended flag which doesn't show up by default...
[14:56:43] what are you using to query?
[14:57:00] aborrero@mwmaint1002:~ $ ldapsearch -x uid=toolsbeta.test8
[14:57:40] maintain-kubeusers also doesn't detect the account to be disabled
[14:58:08] hm
[15:02:59] I'm still digging
[15:04:20] arturo: I see two easy ways to check if mark_tool --disable did its business.
[15:04:28] ok
[15:04:56] here's one:
[15:04:59] andrew@mwmaint1002:~$ ldapsearch -x uid=tools.andrewtesttoolfour | grep loginShell
[15:05:00] loginShell: /bin/disabledtoolshell
[15:05:02] here's the other:
[15:05:28] andrew@mwmaint1002:~$ ldapsearch -x uid=tools.andrewtesttoolfour + | grep disabled
[15:05:28] pwdPolicySubentry: cn=disabled,ou=ppolicies,dc=wikimedia,dc=org
[15:05:31] oh! I see loginShell: /bin/disabledtoolshell
[15:05:47] That '+' shows the hidden things
[15:06:09] great, I see stuff now
[15:06:11] there's also
[15:06:13] thanks!
[15:06:14] andrew@mwmaint1002:~$ ldapsearch -x uid=tools.andrewtesttoolfour + | grep pwdAccountLockedTime
[15:06:14] pwdAccountLockedTime: 20240528150217.539103Z
[15:06:17] yep
[15:06:32] That dangling + is such weird syntax that I have to re-learn it every time
[15:06:46] ok, so I think the news here is that maintain-kubeusers is not correctly detecting disabled accounts
[15:09:42] That's possible, although I'm sure that the subsequent steps were happening properly last time I checked
[15:09:45] (which was a while ago)
[15:10:04] arturo: I think it also checks a lockfile before doing anything
[15:10:21] yeah
[15:10:26] all that was refactored
[15:10:46] but this function
[15:10:48] https://www.irccloud.com/pastebin/dpeohy1f/
[15:10:56] is what I suspect is not working as expected
[15:11:23] hm, that should be the easy part
[15:11:24] I'll double check all that logic soon
[15:11:33] thanks for the assistance!
[15:12:21] np
[15:32:10] rook: how much effort do you think it would take to integrate this with PAWS, if at all feasible?: https://quarto.org/ (asking on behalf of a volunteer)
[15:32:30] Can answer in an hour, in a meeting
[15:32:48] thanks, no rush!
[16:11:53] * arturo offline
[16:21:36] Rook: do you have time/patience to walk me through finishing the paws deploy that you just started? Or want to ping me after you eat lunch?
[16:23:06] Sure, we can walk through it
[16:23:59] ok. this will be very tedious as I don't spend much time in github. it looks to me like the current issue is fixing the pip issue in https://github.com/toolforge/paws/actions/runs/9272318205/job/25509784853
[16:24:55] Yeah that looks like the problem. I suspect it is because it went past python 3.11 and now wants a virtualenv or system packages
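The ldapsearch checks above (the "+" pulls in operational attributes such as pwdAccountLockedTime) translate into something like the following with ldap3; this is just a sketch of what a "is this tool disabled?" helper could look like, not the actual maintain-kubeusers function, and the server hostname and base DN are placeholders:

```python
import ldap3

# Placeholder server/base DN; the attribute names and values come from the
# ldapsearch output in the conversation above.
server = ldap3.Server("ldap.example.org")
conn = ldap3.Connection(server, auto_bind=True)

conn.search(
    "dc=wikimedia,dc=org",
    "(uid=toolsbeta.test8)",
    # ALL_OPERATIONAL_ATTRIBUTES is the equivalent of the dangling '+' in ldapsearch
    attributes=[ldap3.ALL_ATTRIBUTES, ldap3.ALL_OPERATIONAL_ATTRIBUTES],
)
entry = conn.entries[0]
is_disabled = (
    str(entry.loginShell) == "/bin/disabledtoolshell"
    or "pwdAccountLockedTime" in entry.entry_attributes
)
print("disabled:", is_disabled)
```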
[16:25:00] Want to go through it on a call?
[16:25:48] yep, sure
[16:26:15] oh, that's pip refusing to install stuff system-wide, I guess you are not using a venv?
[16:36:59] * dcaro off
[16:37:00] cya
[19:17:05] d.caro, for tomorrow or other good working times, if you see this: do you have an opinion on using the unappealing-looking --break-system-packages for containers running pip? I feel like it doesn't matter, as I'm fine with containers having pip-installed packages over system packages. But is there something I'm missing?
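For context on why pip started refusing here (my understanding of PEP 668, not anything PAWS-specific): newer distro Python packages ship an EXTERNALLY-MANAGED marker file, and pip then declines system-wide installs unless you use a virtualenv, pass --break-system-packages, or build an image without the marker. A quick way to check whether a given interpreter is affected:

```python
import sysconfig
from pathlib import Path

# PEP 668 marker: if this file exists, pip treats the environment as
# externally managed and refuses plain `pip install` outside a venv.
marker = Path(sysconfig.get_path("stdlib", sysconfig.get_default_scheme())) / "EXTERNALLY-MANAGED"
print("externally managed:", marker.exists())
```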