[10:16:25] what's up with all the "PuppetAgentNoResources" alerts?
[10:16:53] I have no idea
[10:17:02] some kind of puppet failure, I guess
[11:04:57] anyone looked at it?
[11:05:08] not yet, I'm chasing some other things
[11:05:24] okok, let me check
[11:05:30] thanks :)
[11:07:27] there was some change to the puppet git module that is missing a variable in the cloud hiera
[11:38:56] the alerts should be vanishing shortly
[13:51:31] * arturo food time
[14:25:33] andrewbogott: is it true that we no longer have cloudvirt-wdqs servers?
[14:27:34] correct
[14:27:52] thanks
[14:59:29] dcaro: I think I'm done with meetings, did you want to talk more about auth things?
[15:00:36] sure
[15:00:51] voice or text?
[15:03:39] andrewbogott: both work for me :), but give me 10 min?
[15:03:46] sure
[15:04:11] thanks, brb
[15:13:25] andrewbogott: I'm back
[15:15:07] hello!
[15:16:39] The keystoneify script I wrote stopped at making username/password but I will see about expanding it to spit out ec2 creds and an API token, then we can see if those work for what we need.
[15:16:57] Does that sound like an OK next step, or are you thinking of other things?
[15:17:49] I have mixed feelings about having a separate daemon, the reconciliation loop overlaps with what maintain-kubeusers already does
[15:18:10] we don't need a reconciliation loop
[15:18:17] that's the magic of on-demand
[15:18:59] you need logic to set up and tear down. That's the same thing maintain-kubeusers does
[15:19:07] Isn't generating an on-demand token the same as just authing with username/password?
[15:19:14] or app creds?
[15:19:19] too many parallel questions xd
[15:19:37] * andrewbogott steps back
[15:20:09] I guess my topic is not important at this stage
[15:20:10] arturo: I don't know what you mean by setup and teardown, all it needs is an endpoint that toolforge users call to create the ec2 creds, and another one to show them
[15:20:22] (and delete I guess)
[15:21:05] then users that need private buckets can call that endpoint (through the cli probably) to generate them, and regenerate them whenever needed
[15:21:26] how do you tear down for disabled accounts?
[15:21:31] (that I estimate would be like ~1% of the tools for the next year)
[15:21:47] arturo: you call the api to delete them
[15:21:55] who calls the api?
[15:22:06] the script to deactivate an account
[15:22:57] dcaro: what about the token to access other toolforge APIs? Can that be on-demand too?
[15:23:12] andrewbogott: if we use our own token service yes
[15:23:26] same, simple create/delete endpoints
[15:24:04] ok, but we just had a conversation (in the meeting) about you not wanting to do app credential auth and instead having a ready-made keystone token available.
[15:24:08] I thought
[15:24:26] Whereas 'app credential auth yields token' is the same as 'on-demand token issue' I think
[15:24:37] which certainly works for me :)
[15:25:17] afaik, the app credentials are to request a short-lived token from keystone (kind of like a session token), you'll have to do that for every interaction as they are very short-lived
[15:25:18] how do we protect the delete API endpoint from unwanted access?
[15:25:30] arturo: we have toolforge auth :)
[15:26:04] I believe all openstack tokens last 7 days, that's how fernet is set up.
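The "app credential auth yields token" exchange discussed just above could be a minimal keystoneauth1 sketch like the following; the auth_url and credential values are placeholders, and keystoneauth actually caches and renews the fernet token behind the scenes:

```python
# Minimal sketch (assumed values): exchange an application credential for a
# keystone (fernet) token, which is what every subsequent API call presents.
from keystoneauth1 import session
from keystoneauth1.identity import v3

auth = v3.ApplicationCredential(
    auth_url="https://keystone.example.org:5000/v3",  # placeholder
    application_credential_id="abc123",               # placeholder
    application_credential_secret="s3cret",           # placeholder
)
sess = session.Session(auth=auth)
# keystoneauth requests the token lazily and re-requests it on expiry;
# get_token() just forces the exchange so we can look at the result.
print(sess.get_token())
```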
[15:26:06] I mean, the script to deactivate an account runs outside toolforge
[15:26:34] so if this script needs to contact an API endpoint that runs inside toolforge, like the other component APIs we have
[15:26:44] then we would need a new way to protect this endpoint
[15:26:54] (so we can enable it outside the cluster)
[15:27:06] that's one of the points of adding authentication to toolforge apis no?
[15:27:13] so we can open them to the world?
[15:27:43] that's not what I'm trying to ask
[15:27:54] okok, can you rephrase?
[15:28:10] you open the APIs to the world, you have users accessing the toolforge APIs from their laptop
[15:28:30] then you need a special access, "admin"-like, for the mark_tool to trigger the cleanup
[15:28:38] separate from normal user access
[15:29:38] like replica_cnf does?
[15:29:53] I'm not familiar
[15:30:08] it's the webservice that creates the envvars+replica.cnf file in the users' homes
[15:30:30] it's called by the maintain-dbusers script from cloudcontrol1006 (iirc, maybe 1007 now)
[15:32:50] I think I'm imagining using reconciliation (as part of the existing tool disabling code) to clean up/remove keystone projects after a tool is deleted. That strikes me as somewhat separate from the question of whether creation of creds is on-demand or not.
[15:33:10] It does still mean that those scripts will acquire scary keystone admin creds, but they already contain some pretty scary rights.
[15:34:04] for the toolforge on-demand ec2 creds, if we have to remove them (if removing the openstack project is not enough), that can also be executed by maintain-kubeusers, as in, maintain-kubeusers can call the 'delete all ec2 creds' endpoint
[15:34:46] following the same example, I believe db creds are provisioned regardless of the demand for them, no?
[15:34:52] (something that we might want to do for builds eventually, as right now we don't clean those up afaik)
[15:35:06] arturo: that's true, although it wouldn't have to be
[15:35:26] But it sounds like we're agreeing that we'll use a reconciliation-style solution for removing access when a tool is deleted.
[15:35:30] arturo: currently yes, though we are almost at the point we might not need to (and that's good)
[15:36:33] so, the workflow would be something like
[15:36:41] 1) user gets a new tool account for a new tool
[15:36:55] 2) everything is provisioned by maintain-kubeusers BUT not storage creds
[15:37:27] 3) user calls some API endpoint, likely using a CLI `toolforge storage whatever`. This creates a keystone project, generates creds, etc
[15:37:37] everything as in (home + k8s cert + k8s namespace)
[15:37:45] not harbor projects for example
[15:37:47] or the db cred
[15:38:38] ok
[15:38:41] In step 3, where do the creds go after creation?
[15:38:50] I assume they aren't regenerated on every call
[15:39:33] they are pulled from openstack when requested
[15:39:50] then the user does whatever they want with them
[15:40:20] (we could auto-populate an envvar or similar if we wanted to, though not really needed)
[15:40:44] what happens if the creds expire?
[15:40:58] they stop working, the user has to renew them
[15:41:38] andrewbogott: the ec2 creds give access to all the buckets right? they don't have fine-grained access per-bucket?
[15:41:57] (as in, every tool will have only 1 set of ec2 creds? or many?)
[15:42:05] I've basically never used ec2/s3 so I'm not sure. But I think that's correct, just one set.
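The create/show/delete endpoints sketched in this discussion could be quite small. The following is a sketch only, assuming FastAPI and python-keystoneclient; the route paths, the one-keystone-user-and-project-per-tool naming, and the hardcoded admin credentials are all hypothetical (a real service would plug into toolforge auth, as discussed above, rather than expose this unauthenticated):

```python
# Hypothetical on-demand ec2-credential endpoints backed by keystone.
from fastapi import FastAPI
from keystoneauth1 import session
from keystoneauth1.identity import v3
from keystoneclient.v3 import client as ks_client

app = FastAPI()

def keystone_admin() -> ks_client.Client:
    # Assumption: the service reads admin creds from its own config.
    auth = v3.Password(
        auth_url="https://keystone.example.org:5000/v3",  # placeholder
        username="toolforge-api", password="s3cret",      # placeholder
        project_name="admin",
        user_domain_id="default", project_domain_id="default",
    )
    return ks_client.Client(session=session.Session(auth=auth))

@app.post("/storage/{tool}/credentials")
def create_creds(tool: str):
    ks = keystone_admin()
    # Assumption: one keystone user and project per tool, named after it.
    user = ks.users.find(name=tool)
    project = ks.projects.find(name=tool)
    cred = ks.ec2.create(user_id=user.id, project_id=project.id)
    return {"access": cred.access, "secret": cred.secret}

@app.delete("/storage/{tool}/credentials")
def delete_all_creds(tool: str):
    # The 'delete all ec2 creds' endpoint that maintain-kubeusers could call.
    ks = keystone_admin()
    user = ks.users.find(name=tool)
    for cred in ks.ec2.list(user_id=user.id):
        # The ec2 credential is addressed by its access key.
        ks.ec2.delete(user_id=user.id, credential_id=cred.access)
    return {"deleted": True}
```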
[15:42:42] I think one thing I'm not following is: are you thinking that users will use s3 apis directly, or will they be using some toolforge api that wraps s3?
[15:42:55] s3 directly
[15:43:12] ok
[15:43:25] that's the rados gateways right? (not swift)
[15:43:31] yeah
[15:43:32] dcaro: if the creds have expired, the user has to manually renew them, would that affect normal operations, like deploying a tool? How long would the creds be valid for?
[15:43:39] radosgw supports both swift and s3 apis
[15:44:18] arturo: it should not affect deploying a tool, as that flow does not use buckets for anything, it might break a tool that is running and uses those creds.
[15:44:30] the expiration time should not be very long, but not very short either
[15:45:03] mostly to avoid forgetting that you have to rotate (ex. someone leaves the project and you forget to rotate them, then in 6m you will have to rotate again anyhow)
[15:45:30] all that can be adapted to whatever is more useful (expiration times)
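"s3 directly" here means a tool's own code talks straight to the radosgw S3 API with its ec2 creds. A minimal boto3 sketch, where the endpoint URL, bucket name, and credential values are placeholders:

```python
# Talk to the radosgw S3 API directly with a tool's ec2 creds (sketch).
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://object.example.org",  # placeholder radosgw endpoint
    aws_access_key_id="EC2_ACCESS",             # from the creds endpoint
    aws_secret_access_key="EC2_SECRET",
)
s3.create_bucket(Bucket="mytool-private")       # hypothetical bucket name
s3.put_object(Bucket="mytool-private", Key="hello.txt", Body=b"hello")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```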
[15:46:05] moreover, we were thinking of creds that are also valid for trove. Is this flow/schema valid for trove too?
[15:46:21] I don't think I have more questions at the moment :-)
[15:46:58] not directly, but the keystone tokens (that the users don't see, and toolforge uses to manage ec2 buckets/credentials) could be used also for trove management
[15:47:15] We're /also/ thinking that these creds are valid for all other toolforge apis. That's the bit that feels chicken/egg to me
[15:47:17] so when we add a database-as-a-service api, it can reuse those
[15:47:31] How can we have on-demand creds created when we need those creds to ask for them in the first place?
[15:47:35] andrewbogott: no, those are only for private buckets on ec2
[15:47:47] ugh, we keep going in circles about this
[15:47:53] toolforge would authenticate against idp, k8s cert, or its token service
[15:48:55] I'm sure that only 60 minutes ago we were talking about using keystone tokens for toolforge api auth
[15:49:24] for the toolforge api, to authenticate itself against openstack to manage the stuff, not for the users to authenticate against toolforge api
[15:49:44] (that's what I understood)
[15:49:51] OK, but why not everything? I mean, once you have a keystone account and creds...
[15:50:10] I need to go offline now o/
[15:50:12] * arturo out
[15:50:19] you can do everything via keystone. Generate ec2 creds, talk to trove, issue tokens which other toolforge services can validate against keystone
[15:50:22] because users should not have to know/use keystone login flows to be able to use toolforge, if they don't need to
[15:50:36] ...why do users know/care that it involves keystone?
[15:50:44] I saw something in backscroll about provisioning toolforge db creds which reminded me of this old ticket: T140832
[15:50:44] T140832: Investigate moving labsdb (replicas) user credential management to 'Striker' (codename) - https://phabricator.wikimedia.org/T140832
[15:51:15] because if you want to use toolforge apis, and you are using keystone auth, you'll have to go first to keystone to get the token
[15:51:45] they have to go someplace to get a token, right?
[15:51:55] Well, wait, what do you mean by 'go someplace'?
[15:51:58] Do you mean 'make an api call'?
[15:52:21] yep, your code has to go to keystone api, do a call to request a token, then with that token, call the toolforge api
[15:52:29] that's how keystone works afaik
[15:52:43] Right.
[15:52:53] So, with an alternative bespoke solution, how is the workflow different?
[15:54:27] the user gets a token from the toolforge cli (ssh login.toolforge.org; toolforge token create;), then the user authenticates directly to the toolforge-api with that token as a bearer token
[15:54:50] at some point (month?) the token expires, and you have to log in and regenerate it again
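That non-interactive flow amounts to a single authenticated HTTP call per request. A minimal sketch, where the API URL, path, and token value are placeholders rather than real toolforge endpoints:

```python
# Present a long-lived token from `toolforge token create` as a bearer token.
import requests

TOKEN = "paste-the-output-of-toolforge-token-create"  # placeholder
resp = requests.get(
    "https://api.toolforge.example.org/v1/jobs",      # placeholder URL
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```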
[15:55:27] ok, I have two questions :)
[15:55:49] 1) The user doesn't know what 'toolforge token create' does, so it could as easily call keystone as anything else
[15:55:55] for the UI flow, the user points the browser to ui.toolforge.org (or wherever), gets redirected to idp.wmcloud.org where a form is shown to enter user/password, and gets a short-lived token back (similar to keystone), then with that goes to the toolforge-api endpoint
[15:56:04] 2) I thought we were trying to eliminate the need for ssh from our workflows
[15:56:21] the token is only needed for non-interactive flows
[15:56:29] for interactive ones you use idp
[15:57:26] OK. So, for non-interactive flows the user doesn't know/care what the token backend is.
[15:57:42] Now, for interactive flows. We're talking about a web UI here, right?
[15:57:44] as long as it does not have to go to any other api than toolforge that's ok yes
[15:58:16] it could be cli also if we wanted (as long as it's running on the user's laptop)
[15:58:29] sure, ok.
[15:58:35] but I would start with ui only
[15:58:48] So, what is the user experience there? They give username/password?
[15:59:04] same as if you go to idp.wmcloud.org
[15:59:04] In order to log into the site?
[15:59:24] (ex. try https://grafana-rw.wmcloud.org/)
[15:59:35] Does that work for cli too?
[15:59:40] or https://prometheus-alerts.wmcloud.org/
[16:00:22] not as-is, the client will have to do the redirects itself (ex. get the user's username and pass, call idp, then extract the token, then call toolforge apis, as the browser does essentially)
[16:01:02] bd808: yep, that task is what I meant yes :)
[16:01:20] the client being the python script you are running on your laptop
[16:01:35] yep, ok
[16:01:41] dcaro: I think it might be possible to redirect the user to idp.wmcloud.org and get the web page to POST the token back to a local hook (that's what several commercial CLIs do)
[16:02:06] it depends on whether this flow is supported by apereocas, but I think it is, maybe with some additional config
[16:02:13] dhinus: I'm interested in that, I was not able to figure out how to do it, if you have more info please forward
[16:02:20] I'll see what I can find
[16:02:25] thanks!
[16:02:37] * andrewbogott lives in fear of clis that launch web interstitials
[16:02:43] xd
[16:03:19] we could make the cli request the 'user/pass' only when the token is not there or it expired, and automatically regenerate it (or send you to the ui to do the login+regenerate token)
[16:03:37] I remember some clis did that too, like "Go to this page, click generate, and paste here: "
[16:03:47] andrewbogott: LOL. I think opening a web browser from the CLI is safer than handling the password in the CLI and posting it to an endpoint, but I'm not sure that's actually true
[16:04:11] I think it's safer too, it's just a terrible/hard-to-support user experience.
[16:04:38] we could also try to regenerate the token using ssh + toolforge token regenerate or similar when it's not there
[16:04:55] (not sure that would work for most people though)
[16:05:05] So it sounds like the only real blocker to using keystone for everything is not wanting to handle username/password with any UI other than IDP.
[16:05:24] So if we can get keystone to generate creds via idp we'll be back to having a unitary solution.
[16:05:26] with anything inside toolforge yep
[16:06:54] not sure, idp does not store api tokens, and we would need some kind of oauth (supported by idp) to authenticate toolforge to act on openstack on behalf of the user, should be doable I think (not sure how nice)
[16:07:20] In theory we can make keystone consume idp instead of ldap. I don't know what that looks like exactly but it would be nice.
[16:07:35] that would allow doing single-sign-on on the browser
[16:07:52] That raises all the exact same UI questions that I just asked about toolforge cli :)
[16:08:29] idp.wmcloud.org would become the login page for toolforge, horizon, grafana and prometheus-alerts
[16:08:29] I'll try to find some time to experiment with that, it would be nice to have everything backed the same way.
[16:08:40] * andrewbogott nods
[16:09:20] I assume that in that case keystone cli would get username/password the same way it does now (clouds.yaml) and auth against idp but I don't know if that happens on client or server...
[16:09:30] too many questions, I'll have to try it in codfw1dev and see what happens
[16:10:29] usually, idp flows force you to go to the idp directly to authenticate, so it should be on the client side
[16:10:43] as in, the goal is to prevent your user/pass from being sent to the service you are logging into
[16:10:51] and use the trusted idp instead
[16:10:51] yeah, I agree that that's how it /should/ work but I have my doubts :)
[16:10:54] xd
[16:11:09] fair enough :)
[16:11:57] It's also totally possible that keystone/idp integration just doesn't work at all
[16:11:57] Will have to test it
[16:12:48] I think I want to pivot to breaking elasticsearch if bd808 is available for that.
[16:12:50] dcaro: for CLI auth, could something like this work with idp.wm.org? https://stackoverflow.com/questions/72981325/can-i-run-a-oidc-flow-from-a-command-line-cli-tool
[16:15:16] interesting, would be nice to try, probably the custom app uri scheme might be more portable/easier to make it work everywhere
[16:16:19] you'll still need some generic way to open the browser no? hmm, might be enough to dump the url on the terminal xd
[16:16:30] (as opposed to launching firefox directly or whatever)
[16:16:40] yes dumping the URL seems good to me
[16:18:01] a similar example with more details https://medium.com/@balaajanthan/openid-flow-from-a-cli-ac45de876ead
[16:19:26] yep, listening on a local port might be the one that requires the least user configuration (adding a custom app uri will require you to add it)
[16:20:16] is there a standard to register those kinds of private uris?
[16:22:18] * dhinus is reading https://datatracker.ietf.org/doc/html/rfc8252#section-7.1
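The loopback flow from RFC 8252 section 7.1 being read here is roughly: listen on a localhost port, send the user to the IdP with that port as the redirect_uri, and catch the authorization code on the way back. A minimal sketch, where the IdP URL, client_id, and port are placeholders, and a real client would also use PKCE and then exchange the code for a token:

```python
# RFC 8252 loopback-redirect sketch (placeholder IdP and client values).
import http.server
import secrets
import urllib.parse
import webbrowser

PORT = 8400
STATE = secrets.token_urlsafe(16)
auth_url = "https://idp.example.org/oidc/authorize?" + urllib.parse.urlencode({
    "response_type": "code",
    "client_id": "toolforge-cli",                        # placeholder
    "redirect_uri": f"http://127.0.0.1:{PORT}/callback",
    "scope": "openid profile",
    "state": STATE,
})

class Callback(http.server.BaseHTTPRequestHandler):
    auth_code = None

    def do_GET(self):
        qs = urllib.parse.parse_qs(urllib.parse.urlparse(self.path).query)
        if qs.get("state", [None])[0] == STATE:          # ignore stray requests
            Callback.auth_code = qs.get("code", [None])[0]
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"You can close this tab now.")

# Dump the URL on the terminal as a fallback, then try the browser anyway.
print(f"To log in, open this URL in your browser: {auth_url}")
webbrowser.open(auth_url)
with http.server.HTTPServer(("127.0.0.1", PORT), Callback) as srv:
    while Callback.auth_code is None:
        srv.handle_request()                             # one request at a time
print("authorization code:", Callback.auth_code)
```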
[16:27:07] my understanding is that custom-uri:/callback would only work with a separate app (like a Linux/Mac GUI), not a CLI inside a terminal app
[16:27:59] probably you need to register it in the desktop file or similar (for linux, no idea for mac/windows)
[16:28:14] that's also my understanding, but I might be wrong
[16:28:34] I think a local port could be enough, do you see any limitations?
[16:29:30] dhinus: dcaro: I think the thing to do would be to call `xdg-open`
[16:30:49] that opens a file using mime-types + registered desktop apps right?
[16:32:12] it also accepts a URL
[16:32:29] is it installed on mac?
[16:32:39] hmm I don't think so
[16:33:03] I think on mac you can just exec `open` though
[16:33:32] nice to know :), I guess windows would have something similar
[16:35:51] yes "open http://something" on mac will open that URL with the default browser
[16:36:06] start seems to be the windows version of it
[16:36:43] I maybe prefer a CLI that tells you to do it: "To log in, open this URL in your browser: http:////"
[16:37:01] for using a local port, we just have to make sure it's available, and it might have some issues with firewalls, though being localhost it's usually allowed by default (not so sure on mac/windows)
[16:37:14] dhinus: a bit less intrusive xd
[16:37:30] for the callback, I definitely used the localhost:port flow on macOS without opening up anything
[16:37:57] nice, so that might be the best option then
[16:38:05] but it's possible some people could have issues with firewalls
[16:38:42] maybe it's worth checking one client that works that way, and see what they say in their docs
[16:39:15] a local port for doing an oauth2 flow definitely works out of the box on windows
[16:39:40] yep, it's on my pile of things to do xd, though there's all the basic "how to make auth work at all" first
[16:39:55] good to know :), I'll add a note to the task with all this before I forget
[16:44:27] nice
[16:53:04] can I get a +1 on https://phabricator.wikimedia.org/T368669 ?
[16:53:46] +1d
[16:53:49] thanks!
[17:01:57] * dhinus off
[17:15:27] * dcaro off
[17:16:26] andrewbogott: as the other day, I've left cloudcephosd1008 joining the cluster bit by bit, it should finish in a couple of hours but if anything goes sideways feel free to page me
[17:16:49] ok! So far it seems to work fine without you
[17:48:29] andrewbogott: I'm out of meetings for the day, but also hungry. Where are you in the things you wanted my help with?
[17:49:19] draining a second node, I think all is well so far
[17:49:25] the scary bit is the ip failover
[17:49:30] but I'm about to eat lunch now too :)
[17:51:15] cool. I see that elastic-1 is out of the cluster now. I don't want to jinx you, but if that worked it should be just a repeat process from then on :)
[19:26:48] bd808: I have elasticsearch drained and stopped on all of the old nodes. Traffic is still routing through haproxy on -2 though, which is not failing over quite as I'd hoped.
[19:28:37] hmmm... that's a bit that I really don't know much about. Does that mean that -2 is holding the service IP and not letting it float elsewhere?
[19:29:13] maybe. I see a host priority setting here, going to try adjusting that.
[19:31:07] ok, changing that setting revealed that keepalived is failing on the new hosts. That would do it!
[19:32:48] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#Granting_a_tool_write_access_to_Elasticsearch seems to be the only admin doc we have for the whole feature. :/
[19:36:48] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1051444
[19:36:58] ^ at least part of the problem
[19:45:54] ok bd808, all the buster nodes are turned off and things look right to me. You agree?
[19:46:47] andrewbogott: running some quick checks
[19:48:21] andrewbogott: stashbot, sal, and bash are all working as expected with https://bd808-test.toolforge.org/elastic7.php only reporting the new nodes. LGTM
[19:49:03] thanks for checking!
[19:49:10] I won't delete the old nodes for a few days just in case
[19:49:17] Now I have to run off and deal with car nonsense
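Circling back to the xdg-open/open/start exchange from earlier in the afternoon: Python's stdlib webbrowser module already wraps those same per-platform commands, but a hand-rolled version with the "just print the URL" fallback dhinus preferred could look like this sketch (the target URL is only an example):

```python
# Open a URL with the platform's default-browser command, falling back to
# printing it for the user to open by hand.
import subprocess
import sys

def open_url(url: str) -> None:
    if sys.platform.startswith("linux"):
        cmd = ["xdg-open", url]
    elif sys.platform == "darwin":
        cmd = ["open", url]
    elif sys.platform == "win32":
        # "start" is a cmd.exe builtin; the empty string is the window title.
        cmd = ["cmd", "/c", "start", "", url]
    else:
        cmd = None
    try:
        if cmd is None:
            raise FileNotFoundError("no known opener for this platform")
        subprocess.run(cmd, check=True)
    except (OSError, subprocess.CalledProcessError):
        print(f"To log in, open this URL in your browser: {url}")

open_url("https://idp.wmcloud.org/")  # example target from the discussion
```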