[06:54:14] morning
[07:07:12] morning
[07:12:18] (hopefully) quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/90
[08:07:31] blancadesal: sorry for the back and forth, it might take me a moment to test and fix the other MR, so feel free to merge that one without rebasing if you want, otherwise I can merge once I finish fixing it
[08:10:56] dcaro: already did a manual rebase (but messed something up that I'm trying to fix now)
[08:12:11] ack
[08:15:44] dcaro: how can I test this patch? https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/306
[08:16:03] pulling it locally and running the helper script doesn't seem to trigger them
[08:16:21] it should :/
[08:16:44] :-(
[08:17:12] in lima-kilo, you can also check out the branch inside the tf-test/toolforge-deploy repository once cloned, and then from outside run the toolforge_run_functional.sh script, it will not change the branch
[08:17:24] what errors do you get? (or output if no errors)
[08:21:26] https://usercontent.irccloud-cdn.com/file/EkrzoPYM/image.png
[08:21:34] I have no issues running the script directly from within lima-kilo
[08:21:48] gotcha
[08:22:02] updating the repo from within the user is the step I was missing
[08:22:41] there's some duplication there, yes, as the tool user has a copy of toolforge-deploy too
[08:23:11] now the first test fails because the buildservice is down, let me re-run ansible
[08:23:52] try `toolforge_harbor_compose restart`, sometimes after a reboot harbor does not come up correctly
[08:24:15] (even though docker compose might see it as UP :/)
[08:24:28] arturo@lima-kilo:~/toolforge-deploy$ toolforge_harbor_compose.sh restart
[08:24:28] ERROR: Couldn't find env file: /srv/ops/harbor/harbor/common/config/registryctl/env
[08:24:45] that's weird
[08:24:54] (as in I've never seen that error before)
[08:25:21] maybe for the few things that need to be triggered after a reboot we could have a systemd unit created via
ansible
[08:25:35] (like ldap users)
[08:26:16] that file `/srv/ops/harbor/harbor/common/config/registryctl/env` does exist for you? (it exists for meu)
[08:26:17] *me
[08:26:58] I think it's created on harbor installation (by the prepare script), maybe your ansible run did not finish completely?
[08:27:09] it doesn't exist
[08:27:20] yeah, let me make sure the ansible run is clean
[08:30:43] it failed setting up harbor :-(
[08:30:50] I may need to re-create the VM
[08:30:59] this should allow you to run the functional tests from any directory (issue) https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/309
[08:32:12] +1'd
[08:33:08] thanks!
[08:34:25] side topic -- I merged the new maintain-kubeusers alerts at https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts
[08:34:41] is any additional step beyond the merge required to get the alerts working?
[08:34:52] no, they'll get deployed with the next puppet run
[08:35:02] excellent, thanks
[08:35:40] and you should eventually see them appear on https://tools-prometheus.wmflabs.org/tools/alerts
[08:36:05] great, they are already there!
[08:40:42] can I get a quick review for https://gerrit.wikimedia.org/r/c/cloud/metricsinfra/prometheus-manager/+/1039175?
[08:43:56] +1d
[08:44:04] thanks
[08:48:24] https://www.irccloud.com/pastebin/RBZpEH42/
[08:48:50] hmm, buster-backports repository no longer available at the specified URL?
[08:49:51] buster-backports is gone
[08:50:21] the script might need updating
[08:54:12] trying with `deb http://archive.debian.org/debian buster-backports main` in buster-backports.list
[08:57:03] yeah that worked
[08:59:38] dcaro: the functional tests on my laptop:
[08:59:39] 24 tests, 0 failures in 541 seconds
[09:00:01] I think T362967 is more relevant today, otherwise they will become impractical for development
[09:00:02] T362967: lima-kilo: container image caching - https://phabricator.wikimedia.org/T362967
[09:00:18] I don't think that will help much there though
[09:00:46] how long do they take if you run them a second time?
[09:01:16] let me check
[09:02:18] we have quite a few imagePullPolicy: Always
[09:03:09] https://usercontent.irccloud-cdn.com/file/kqkxY1WN/image.png
[09:04:39] the second testsuite run failed :-(
[09:05:00] interesting
[09:05:03] what's the error?
[09:05:03] https://www.irccloud.com/pastebin/EGTUhTew/
[09:05:30] can you run `toolforge build list`? it does not clean up if it fails
[09:05:34] (for debugging)
[09:05:48] https://www.irccloud.com/pastebin/ZPzY2biG/
[09:06:03] Could not resolve host: gitlab.wikimedia.org
[09:06:09] weird
[09:06:17] can you ping gitlab.wikimedia.org?
[09:06:57] I can't from inside the lima-vm at the moment
[09:07:06] oh, something got borked then
[09:09:14] :-(
[09:23:02] i did a few openstack-browser fixes to better support cases where project_id != project_name, still needs some work unfortunately
[09:28:28] thanks for that, is there a task to follow up on?
[09:50:19] T366679
[09:50:20] T366679: openstack-browser support for projects where id != name - https://phabricator.wikimedia.org/T366679
[09:51:14] 👍
[10:03:31] cloudcephosd1031.eqiad.wmnet failed to reboot, the ceph cluster is rebalancing
[10:05:09] (heads up)
[10:14:20] hmpf...
I think it got borked with the new hard drive not being in the raid yet: `No boot device available.`
[10:15:09] the cluster has almost rebalanced though
[10:23:04] :-(
[10:23:15] I discovered why I can't ping from inside the lima-vm
[10:23:49] it is because the qemu instance uses user-mode networking, which can't ping by default
[10:23:50] https://www.qemu.org/docs/master/system/devices/net.html#using-the-user-mode-network-stack
[10:24:16] I'll go and reimage that host
[10:24:39] arturo: that's weird, it worked before, no?
[10:24:55] or is it just 'ping' that's forbidden?
[10:25:06] I have never tried `ping` before, in the VM
[10:25:21] 'Note that ICMP traffic in general does not work with user mode networking.'
[10:25:48] I don't remember changing anything and it works for me though
[10:26:00] you can ping 8.8.8.8 from inside the VM?
[10:26:14] https://www.irccloud.com/pastebin/9JRHQJGt/
[10:26:17] no problem
[10:26:44] what do you have in /proc/sys/net/ipv4/ping_group_range ?
[10:26:44] https://www.irccloud.com/pastebin/15OcY18A/
[10:26:48] and it resolves ok
[10:27:01] https://www.irccloud.com/pastebin/z7ETlGZl/
[10:28:18] that's why :-)
[10:28:27] no, I mean, on your laptop
[10:28:49] same
[10:28:50] https://www.irccloud.com/pastebin/0TbYw8AZ/
[10:29:00] ok, so that's the reason you can use ping
[10:29:11] does that affect name resolution?
[10:29:12] you have basically all the GID ranges allowed to use it
[10:29:23] I don't think so
[10:29:39] maybe I had a brief actual hiccup in name resolution
[10:29:56] and then the failed ping was actually a red herring
[10:30:37] functional tests are back into a working state on my laptop :-S
[10:32:29] back to the original question, the second run cuts the runtime in half
[10:32:32] https://www.irccloud.com/pastebin/rxV8aS3o/
[10:34:03] do you have the output of the first run?
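An aside on the ping_group_range finding above: unprivileged ICMP sockets (what `ping` uses when it has neither root nor CAP_NET_RAW) are only permitted to groups whose GID falls inside the `/proc/sys/net/ipv4/ping_group_range` interval, and the kernel default of `1 0` is an empty range, so no group qualifies. A minimal sketch of that check (the helper name is made up for illustration, not something from the discussion):

```python
def gid_may_ping(range_text: str, gid: int) -> bool:
    """Return True if `gid` falls inside a ping_group_range sysctl value.

    The sysctl holds two integers, low and high; the kernel default
    "1 0" is an empty range, so no group may open unprivileged ICMP
    sockets, which is why ping fails inside a default lima VM.
    """
    low, high = (int(part) for part in range_text.split())
    return low <= gid <= high


# The kernel default denies every group:
assert not gid_may_ping("1 0", 1000)
# A host configured with the full GID range allows any group:
assert gid_may_ping("0 2147483647", 1000)
```

On a live system the range text would come from reading `/proc/sys/net/ipv4/ping_group_range`; the literals above mirror the two cases compared in the chat (a default-deny VM vs a laptop with the whole GID range allowed).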
(with per-test times)
[10:41:54] full run (not the first) on my laptop takes 212 with all the tests including webservice
[10:41:57] https://www.irccloud.com/pastebin/5PmTDiwr/
[10:43:13] no, I lost it :-(
[10:43:19] but it was a fresh VM install
[10:43:33] ack, I'll do some tries
[10:44:21] you have some of them in the comments here
[10:44:22] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/306#note_86786
[10:44:40] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/307#note_86783
[10:44:55] you can compare for example
[10:44:56] `tail logs and wait (slow) [302516]`
[10:45:25] vs `tail logs and wait (slow) [87176]` in your paste
[10:45:34] and `tail logs and wait (slow) [106516]` in my second run
[10:46:03] interestingly enough, on your second run `envvars are set inside jobs [10961]` was slower than on your first
[10:46:25] wait no, `envvars are set inside jobs [56464]`
[10:46:30] actually 5x faster
[10:47:18] * dcaro lunch
[10:49:43] I wonder if we can use the harbor inside lima-kilo as a caching proxy for images, and save the docker volume outside of the VM
[11:22:25] arturo: there is a maintain-kubeusers alert
[11:24:30] also I'm looking at the clouddumps alerts
[11:32:33] * dcaro back
[11:32:46] anything I can help with?
[11:35:23] looking
[11:37:35] taavi: did you see if the alert showed up in alertmanager?
alerts.w.o
[11:37:50] it did before it cleared
[11:38:03] ok
[11:38:21] would still be nice to know what triggered it, since it's a new alert and I don't want a new repeatedly flapping alert
[11:39:06] I don't think it was a real alert, likely an alert rule misconfiguration, I'm investigating
[11:41:02] mmmm
[11:41:13] it may have been an actual alert
[11:41:22] the pod failed a liveness probe, and that's unexpected at this point
[11:41:35] maybe the NFS connection hung briefly
[11:41:50] ceph was rebalancing stuff around
[11:41:51] was there a way to show container logs for _previous_ runs?
[11:42:07] dcaro: oh! that may explain some additional latency, no?
[11:42:15] --previous
[11:42:18] arturo: `kubectl logs --previous`
[11:42:50] arturo: it might, though I did not see any 'slow operation' during the reshuffling
[11:42:57] also, it'd be very useful if the alert had an actual duration instead of just saying 'very old'
[11:43:20] ^ I thought the same xd
[11:43:37] check others on how to use the value/label, I think we do in other places
[11:43:45] ok
[11:45:17] the ceph cluster rebalanced ~12:15 UTC+2
[11:47:20] the alerts are 1h later... the value for the metric becomes 0 at some point
[11:47:29] https://prometheus.svc.toolforge.org/tools/graph?g0.expr=maintain_kubeusers_run_finished&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h
[11:48:43] I'd expect no value for a bit and then a big jump in value instead :/
[11:50:48] better? https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/13
[11:50:48] oh... this is empty https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown
[11:51:20] of course it's empty :-) it was just created and there is no information on how to handle any of this
[11:52:15] we have some basic info though, no?
at least how to check the logs
[11:52:19] so I guess when the container is restarted, the metric becomes 0 until it has looped once
[11:52:20] and where the code is
[11:53:05] can it be initialized to the current time to avoid that bump?
[11:53:15] let me check
[11:55:05] yes, I think we can do that
[11:55:06] good idea!
[11:59:38] btw. I just re-added cloudcephosd1031 after the reimage, the cluster is re-balancing again
[12:00:05] ack
[12:00:45] I don't like the number of times I've had to read the source code of the prometheus python library in the last few days
[12:01:25] the function in use here is not in the docs:
[12:01:26] https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/35
[12:01:41] * arturo brb
[13:49:17] blancadesal: are you working on refactoring the envvars-api openapi yaml? (I have a patch that changes it, so if you are working on that I can wait, otherwise I can change it myself)
[13:50:00] dacro: all good, I'm working on builds-api right now, haven't started envvars-api yet
[13:50:27] dcaro
[13:50:50] ack
[13:54:28] not sure if "quick", but a review would be appreciated here so I can move on with the other repos that need this done: https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/73
[13:55:02] 👀
[14:17:17] taavi, arturo, Raymond_Ndibe – do any of you still use the `build_deb.sh` script to build the toolforge clis (vs using the ones built by CI)?
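Circling back to the maintain_kubeusers_run_finished discussion above (11:47–11:55): the proposed fix is to initialize the last-run timestamp gauge to the current time at startup, so a freshly restarted container doesn't briefly export 0 and make an alert that computes `time() - metric` see an enormous age. A rough stdlib-only sketch of the idea — `LastRunGauge` is a stand-in for the real `prometheus_client` gauge, not the actual maintain-kubeusers code:

```python
import time


class LastRunGauge:
    """Stand-in for a prometheus gauge that holds a unix timestamp."""

    def __init__(self) -> None:
        # Initialize to "now" instead of the library default of 0, so an
        # alert comparing against the current time doesn't fire right
        # after a container restart, before the first loop completes.
        self.value = time.time()

    def mark_run_finished(self) -> None:
        """Record that a reconciliation loop just finished."""
        self.value = time.time()

    def seconds_since_last_run(self) -> float:
        return time.time() - self.value


gauge = LastRunGauge()
# Right after startup the apparent age is near zero, not time() - 0:
assert gauge.seconds_since_last_run() < 5
```

With the default-zero behaviour, the same `seconds_since_last_run()` computation would report decades of staleness until the first loop finished, which matches the hour-later false alert seen in the graph linked above.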
wondering if we can remove that script altogether
[14:17:57] i've always either used `sbuild` directly or the CI-generated debs
[14:18:19] 👍
[14:21:08] same here
[14:23:06] cool
[15:36:02] hmm, andrewbogott all the requests that my browser makes when browsing horizon go directly to horizon, so all the API requests will be proxied through horizon
[15:36:19] as in, there's no keystone request from my browser either
[15:37:07] no token around, just a cookie session
[15:39:19] it does not help at all that there's a js framework called keystone xd
[15:41:37] this means to me that, to authenticate on the UI side, we will need a backend service that does the proxying of all the API calls if we use keystone as authentication
[15:54:08] the diagram here on "how the identity service authenticates" is not true xd, there's no direct call from my browser to any API (keystone or otherwise), all goes through horizon
[15:54:10] https://www.redhat.com/sysadmin/keystone-identity-openstack
[15:56:47] dcaro: isn't that the flow for something like the cli tools though?
[15:57:03] yes, cli tools will work like that
[15:57:12] *would
[15:57:40] "Whether you're using OpenStack's Horizon dashboard or the command-line interface (CLI), requests to the Identity service are made via an API call."
[15:57:48] I'm pretty sure Horizon doesn't send you direct to the keystone ident api because normally that api is not exposed directly to the internet.
[15:58:10] good point yes
[15:58:57] dcaro: emergency revert: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/36
[16:00:51] thanks
[16:01:44] bd808: that means though that all the token dance is not done by the browser, but by the horizon backend (something I was not sure about)
[16:01:54] arturo: is there anything else needed? was this deployed in prod?
[16:02:03] no, only toolsbeta
[16:02:10] ack, nice
[16:02:11] I detected the problem in toolsbeta
[16:06:35] * arturo offline
[16:07:06] dcaro: yeah, in the case of Horizon everything happens directly from horizon -- password access to keystone is typically forbidden except from the cloudweb servers. We can adjust that to allow e.g. usernames with a particular format or in a particular domain to do password auth from wherever.
[16:08:59] andrewbogott: that's ok, just clicking things in my head. That would need changing if we use keystone for auth though then, as at least the toolforge ui backend would need to be able to authenticate with a password to keystone
[16:09:32] yeah. The password safelist is an extension written by me so we can change it with abandon :)
[16:10:42] about scopes and such, using keystone means that we would have to add toolforge apis to the list of endpoints/apis there right? otherwise we don't get scoped tokens?
[16:14:34] taavi would know more. We definitely could use keystone middleware in front of the toolforge apis and add them to the catalog, that might be the easiest thing. I imagine we could also add arbitrary keystone calls to validate things within the toolforge code.
[16:15:08] I don't think tokens are typically scoped to particular services, rather things like user/project/domain/global
[16:15:10] what toolforge needs (at least for now) is to know if a user is a member of a tool
[16:15:26] that's in LDAP
[16:15:41] yep, is it exposed by keystone in any way?
nope
[16:15:57] that's a pity :/
[16:15:58] tools only exist in LDAP
[16:16:03] ok, so -- you are now asking about the entire topic of our discussion today :)
[16:16:09] for now yes, sorry, there's some context missing xd
[16:16:17] Which is that my goal is to map tools onto keystone projects within a 'tools' domain
[16:16:29] at which point we can assign keystone roles on users on tools
[16:16:36] *to users on tools
[16:16:41] But none of that is currently extant
[16:17:36] So basically the first two bullet points at https://phabricator.wikimedia.org/T358496#9861310
[16:24:27] I'm trying to understand, in that context, how would a toolforge API verify that the given token is valid to act on the given tool? I would have to do something like https://docs.openstack.org/api-ref/identity/v3/index.html#id33 ? with the tool-specific project?
[16:24:50] Hmm... does that mean that we have to store the openstack project id for each tool? (given that names might be weird/changed/repeated eventually)
[16:29:26] wait no, it would be application credentials I guess, so scope is implied, that'd be https://docs.openstack.org/api-ref/identity/v3/index.html#authenticating-with-an-application-credential, and the bits I'd be interested in in the response would be the project there?
[16:31:00] We could do it either way, with app creds or with tokens. The most straightforward way is to use the same middleware plugin that we use for e.g. the novaproxy API
[16:31:25] which checks creds (be they token, password or app creds) against the requested endpoint and project
[16:31:35] before relaying the request to the actual api server
[16:31:54] interesting, so every request to the APIs ends up doing a request to keystone?
[16:32:14] I think so, yes
[16:36:19] you got the code around?
[16:36:37] probably!
* andrewbogott looks
[16:39:53] I think if you look at puppet/modules/dynamicproxy/files/api/invisible-unicorn.py and search for flask_keystone you'll see one way to do it
[16:40:47] thanks!
[16:43:06] another example is puppet/modules/openstack/files/puppet/master/encapi/puppet-enc.py
[16:43:34] those are using the flask plugin, which is moderately different from the middleware approach which I was mentioning before, but the client experience is the same
[16:43:57] I remember that one yes
[16:45:01] upstream openstack services mostly use https://github.com/openstack/keystonemiddleware
[22:28:59] bd808: if you're already poking at the wikitech user disable code, may I interest you in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1038749?
[22:38:26] taavi: heh. yeah, I can do the needful with that one
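A closing note on the keystone validation flow discussed above: whether the credential is a token or an application credential, the Identity v3 response body carries a `token` object with the scoped project and the role assignments, which is the piece a Toolforge API would need to map a request to a tool's project. A hedged sketch of extracting that scope — the helper name and sample values are illustrative, only the payload shape follows the Identity v3 API:

```python
def token_scope(token_body):
    """Extract (project id, role names) from a keystone Identity v3
    token body, e.g. the response of GET /v3/auth/tokens.

    Returns the project id (None for an unscoped token) and the list
    of role names granted on that scope.
    """
    token = token_body["token"]
    project_id = token.get("project", {}).get("id")
    roles = [role["name"] for role in token.get("roles", [])]
    return project_id, roles


# Trimmed-down example of a project-scoped token body (made-up values):
sample = {
    "token": {
        "project": {"id": "abc123", "name": "some-tool"},
        "roles": [{"name": "member"}],
    }
}
assert token_scope(sample) == ("abc123", ["member"])
```

In the middleware approach mentioned in the chat, this extraction happens inside keystonemiddleware before the request reaches the API server; the sketch only shows what information the validated token makes available to the service behind it.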