[07:33:39] * arturo online
[07:35:37] morning
[07:37:20] quick review? https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/24
[07:46:57] dcaro: LGTM. Did you use the helper to re-generate the cassettes?
[07:47:53] I did not, I just manually changed the number
[07:48:07] ok
[07:53:42] this might be why it's so slow running?
[07:53:44] https://www.irccloud.com/pastebin/u57hAy4t/
[07:53:58] (it's slow also when not recording though)
[07:54:18] https://www.irccloud.com/pastebin/IOyN8xAD/
[07:54:25] something went awry
[07:55:10] it did change stuff though
[07:55:19] https://www.irccloud.com/pastebin/yKY2fi9V/
[08:03:13] oh, no, there's a bunch of `time.sleep` calls around the code, that's what makes it slow :/
[08:03:29] https://www.irccloud.com/pastebin/ynfoBGCZ/
[08:03:32] for example
[08:10:09] not sure what you mean by slow?
[08:10:21] in provisioning users?
[08:11:12] in lima-kilo?
[08:16:21] running pytest locally (and in CI)
[08:16:25] it takes ~3s per test
[08:18:45] I think it might be related to vcr: I mocked `time.sleep` and re-ran the test, and the body of the test went from 1s to ~0.01s, but the whole test still takes ~3s
[08:18:55] https://www.irccloud.com/pastebin/VDKZ1exy/
[08:20:40] hmm...
pytest-vcr has gone 4 years without commits
[08:21:59] vcr themselves suggest using https://github.com/kiwicom/pytest-recording
[08:32:06] hmm, part of it might be the generation of the key
[08:32:08] 3 0.000 0.000 1.994 0.665 rsa.py:131(generate_private_key)
[08:32:38] I see
[08:33:04] maybe we can mock that one to return a short random string as the pk
[08:33:53] looking
[08:49:53] okok, got it down to <3s for all the tests (I generate one first key that is then reused)
[08:49:54] ======== 21 passed, 28 warnings in 2.69s ========
[08:50:32] nice
[08:51:00] I'll rebase my refactor on your change
[08:58:56] I think it breaks the run on lima-kilo, looking
[09:07:32] weird, it says the certificate failed to approve (as it's not returned in the API call)
[09:07:42] but the response actually says it is approved (even though it does not return it)
[09:07:44] https://www.irccloud.com/pastebin/7cdHdlZl/
[09:11:06] I think it might be because I'm faking time.sleep
[09:16:09] yep, that was it
[09:21:21] hmm, interesting, now tox fails to run locally after updating the vcr recordings :/
[09:23:32] oh, it works if I pass `--pdb` to pytest
[09:33:01] I think there might be some weirdness happening with threading or similar, the tests fail only sometimes (other times they pass) without changing anything
[09:34:04] when it fails, it's the same test though: test_process_updated_quotas
[09:37:09] it's also very weird that it's able to find a suitable match, even though it does not use it
[09:37:10] https://www.irccloud.com/pastebin/OG8RLc37/
[09:38:27] https://github.com/kevin1024/vcrpy/issues/516 <- looks quite similar
[09:49:08] I think I found it: we were recording with the default 'once' mode, which does not record all the HTTP requests; I re-ran the test, recording
with '--vcr-mode=all', and tests seem to pass all the time
[10:04:56] given I'm refactoring all that, maybe don't invest too much time on it
[10:08:27] that was it
[10:08:48] tests have not failed once since I re-recorded with '--vcr-mode=all'
[10:10:32] * arturo brb
[10:47:04] nice finding!
[10:55:07] quick +1 here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031836
[11:09:13] taavi: how do I know if a puppet role has been migrated to Puppet 7?
[11:09:55] arturo: there's a hiera key (profile::puppet::agent::force_puppet7) for everything that's been migrated. but you can also assume that everything we manage except ceph is on Puppet 7
[11:10:52] ok thanks!
[11:57:58] dcaro: re tasks, thank you for the suggestions. I just remembered that I'm out Friday to Monday (both included), so I'll probably want to avoid starting something larger/collaborative with raymond and risk stalling him. I will start with code reviews and go from there
[11:59:00] 👍
[12:07:52] hi, on integration-agent-docker-1042 I have caught Puppet complaining about an invalid apt repo:
[12:07:52] An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: http://mirrors.wikimedia.org/osbpo bullseye-zed-backports-nochange InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 56056AB2FEE4EECB
[12:08:13] I have no idea what that osbpo and bullseye-zed-backports-nochange are :)
[12:09:20] mmmm
[12:10:27] hashar: would you mind opening a phab ticket? I can look into the problem
[12:10:56] nope, sorry, too many tasks and issues going on; I just witnessed that log while investigating something else :)
[12:12:53] wait, zed-backports?
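The two test speedups discussed above (no-op'ing the `time.sleep` retry waits, and generating one expensive key that every test reuses) can be sketched roughly like this. This is a hypothetical stand-alone illustration, not the actual maintain-kubeusers code: `shared_test_key` and `provision_user` are made-up names.

```python
# Sketch of the two speedups: cache the expensive key generation, and patch
# out time.sleep during tests.  All names here are hypothetical stand-ins.
import time
from functools import lru_cache
from unittest import mock


@lru_cache(maxsize=1)
def shared_test_key():
    # Stand-in for rsa.generate_private_key(), which the profile above showed
    # costing ~0.7s per call; the cache makes it run once per test session.
    return "-----BEGIN FAKE TEST KEY-----"


def provision_user(name):
    # Stand-in for code under test that sleeps between API retries.
    time.sleep(5)
    return (name, shared_test_key())


def test_provision_user_is_fast():
    # Patching time.sleep turns the retry waits into no-ops.
    with mock.patch("time.sleep") as fake_sleep:
        assert provision_user("tool-a") == ("tool-a", "-----BEGIN FAKE TEST KEY-----")
        assert fake_sleep.called
```

As the log itself found later, faking `time.sleep` globally can also mask waiting the code genuinely depends on (the certificate-approval polling broke under it), so such a patch has to be scoped to tests that tolerate it.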
[12:13:02] we only have support in puppet for bobcat and antelope
[12:21:05] hashar: feel free to `rm /etc/apt/sources.list.d/openstack*`
[12:27:54] arturo: ah, so I guess it is a leftover file from a while ago and Puppet never got taught to ensure => absent? :)
[12:28:06] maybe
[12:28:14] we no longer deploy those files via puppet
[12:28:21] -r--r--r-- 1 root root 123 Jun 8 2023 openstack-zed-bullseye-nochange.list
[12:28:21] -r--r--r-- 1 root root 114 Jun 8 2023 openstack-zed-bullseye.list
[12:28:28] or they were part of the base image, maybe
[12:29:53] done, thank you arturo!
[12:29:59] np
[12:40:41] root@cloudcontrol2004-dev:~# neutron
[12:40:41] -bash: neutron: command not found
[12:40:49] is the neutron CLI gone in bobcat?
[12:55:06] I have no idea, they have been warning for years about that command being deprecated
[13:03:29] * arturo food time
[13:03:31] i created T365000 and am looking at updating cookbooks now
[13:03:37] T365000: replace use of 'neutron' cli in wmcs-cookbooks - https://phabricator.wikimedia.org/T365000
[13:18:09] why would `wmcs-openstack network agent show` for an L3 agent have `null` in the ha_state field?
[13:32:54] hmm, I would think either that there's no HA, or that there's some DB-upgrade mishap
[13:38:47] can I get a +1 on T361946?
[13:38:48] T361946: Request temporary storage quota increase for project iiab for migration to bookworm image - https://phabricator.wikimedia.org/T361946
[13:41:00] dhinus: done
[13:43:15] thanks!
[13:55:49] Just created T365014 to consolidate API paths, feel free to add anything I missed or comment on it
[13:55:50] T365014: [jobs-api,builds-api,envvars-api] consolidate api paths - https://phabricator.wikimedia.org/T365014
[14:07:04] taavi: maybe that's related to what we were talking about yesterday, the HA instrumentation of the l3 agent being different with OVS compared to linuxbridge?
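The leftover-repo cleanup above amounts to deleting unmanaged `openstack-zed*` files from `/etc/apt/sources.list.d/`. A minimal sketch of spotting such strays before removing them; the directory path and filename pattern come from the chat, while the helper itself is hypothetical:

```python
# Hypothetical helper to list leftover OpenStack apt source files, matching
# the `rm /etc/apt/sources.list.d/openstack*` cleanup suggested above.
from pathlib import Path


def stale_openstack_sources(apt_dir="/etc/apt/sources.list.d"):
    # Return the names of openstack-* leftovers, sorted for stable output.
    return sorted(p.name for p in Path(apt_dir).glob("openstack*"))
```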
[14:07:22] arturo: no, I'm seeing the same thing in eqiad1 too
[14:08:13] let me verify that at the very least there is a couple of keepaliveds running across the cluster
[14:08:58] yes, they seem to be running at least
[14:12:44] i wonder if it's a bug with the new openstack cli tool or whether somehow the openstack api itself doesn't have the data on which l3 agent is active and which is not
[14:13:41] given the error reported by h.ashar earlier today
[14:13:52] could you double-check that the package versions are the right ones?
[14:14:14] maybe a newer CLI slipped into the repo somehow
[14:14:16] oh, interesting, trying to reboot tools-k8s-worker-nfs-9 I got an error
[14:14:24] https://www.irccloud.com/pastebin/qDkcawW8/
[14:14:29] looking
[14:14:30] 500 http://mirrors.wikimedia.org/osbpo bookworm-bobcat-backports/main amd64 Packages
[14:15:26] dcaro: more openstack CLI shenanigans?
[14:15:51] taavi: and the packages in that repo are in the expected version?
[14:16:15] for example python3-openstackclient and friends
[14:16:23] are they in the bobcat version?
[14:16:25] arturo, I haven't finished reading the backscroll but https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031060 is probably related to what Hashar is seeing
[14:16:46] hm, and maybe what dcaro is seeing?
[14:16:47] * andrewbogott reads
[14:16:56] andrewbogott: yeah, it could all be related
[14:17:26] it is definitely the case that the 'neutron' cli tool isn't present in B (or in A, I think)
[14:17:38] do we have automated workflows that depend on it?
[14:18:23] the last thing I remember about the neutron cli is that at least the `show-l3-hosting-router` function did not have an equivalent in the main openstack cli
[14:18:45] the same for a nova command that we used to use in disaster recovery, that now I can't remember
[14:19:15] I think I figured out (and documented) the show-l3-hosting-router thing
[14:19:21] oh...
arturo: yep, the `--os-cloud novaadmin` seems to break it now :/, it worked last week
[14:19:25] oh, great!
[14:19:41] dcaro: :-S
[14:20:13] so currently most of our cookbooks that use openstack might not work xd
[14:21:26] I suspect that we don't use the standalone neutron tool in many places
[14:21:37] andrewbogott: the one thing I'm missing from the 'openstack' CLI tool is the HA status for l3 agents
[14:21:49] (but if we do, that's thanks to willfully ignoring the deprecation notice)
[14:22:04] taavi: ok, let me see if I can dig that up. I'm pretty sure I found a way to get that out of the consolidated cli
[14:22:12] andrewbogott: should the novaadmin setting be in /etc/openstack/clouds.yaml? or should it pull it from some other clouds.yaml file? do you know?
[14:22:33] (sorry for the many messages, I can wait)
[14:22:44] dcaro: I would expect it to be there, but not on every host
[14:23:13] cloudcontrols don't have it (at least not 1005)
[14:24:05] hm, what about in ~root/.config/openstack/clouds.yaml?
[14:24:07] andrewbogott: arturo: I think the underlying issue is that the `/etc/apt/sources.list.d/` directory is not fully managed by Puppet. Otherwise it would delete leftover files that have no resource defined in the catalogue
[14:24:19] but I imagine it can be challenging to implement, or maybe impossible
[14:24:23] andrewbogott: yep, it's there :)
[14:24:44] dcaro: ok, that's likely how it's always been, one file for mortals and one for root
[14:24:58] hashar: I'll probably do a manual cleanup with cumin when I get a minute
[14:25:13] https://www.irccloud.com/pastebin/CbrIkZ7M/
[14:25:14] hashar: fair point, indeed
[14:25:14] xd
[14:28:39] andrewbogott: I think that the project_id from the clouds.yaml is overriding the envvar :/ (I tried changing it in the yaml and it worked)
[14:29:34] dcaro: I'm still looking at taavi's thing, but that sounds like https://review.opendev.org/c/openstack/openstacksdk/+/893283
[14:31:36] andrewbogott: +1, sounds good.
I did that for the integration project already and I have just done it for deployment-prep (using: `sudo cumin --force '*' 'rm -f /etc/apt/sources.list.d/openstack-zed*'`)
[15:08:48] taavi: uh-oh: https://bugs.launchpad.net/python-openstackclient/+bug/2052933
[15:09:55] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031953
[15:10:04] ^re-adding the patching
[15:10:22] wait, no, got to change the paths there too
[15:11:31] it might not apply anymore, let me check
[15:11:54] taavi: ok, wait, here we go:
[15:11:56] openstack network agent list --agent-type l3 --router d93771ba-2711-4f88-804a-8df6fd03978a --long
[15:13:14] * andrewbogott unimpressed by --long
[15:14:38] maybe you can shorten it and use the router by name instead of uuid
[15:17:28] yep, --router cloudinstances2b-gw works too
[15:33:32] I've done my best to remove use of the neutron cli from wikitech but have surely missed some
[15:38:04] thanks for keeping the docs up to date :-)
[15:38:19] well, in one case I just deleted a big section
[15:47:54] andrewbogott: bah. thanks
[15:48:18] I'm going to start adding --long to everything to see what other problems it solves.
[15:48:29] * andrewbogott buys lottery ticket --long
[15:48:40] andrewbogott: I'm having issues trying to make the puppet tests pass :/ https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031957 if I have not merged it in 1h feel free to take over the patches
[15:48:49] dcaro: ok!
[15:48:53] * hashar neutron --fix-network --long
[15:49:12] you are messing with WMCS firewalling / networking, aren't you?
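The replacement invocation worked out above (`openstack network agent list --agent-type l3 --router … --long`) could be wrapped for cookbook-style use roughly like this. A hypothetical sketch: `l3_agents_for_router` is not a real wmcs-cookbooks helper, and it assumes the client's `-f json` output formatter is available:

```python
# Hypothetical wrapper around the neutron-CLI replacement found above.
# The command and flags come from the chat; the helper itself is made up.
import json
import subprocess


def l3_agents_for_router(router):
    # --long is what makes the client include the HA State column;
    # --router accepts either a name (cloudinstances2b-gw) or a UUID.
    cmd = [
        "openstack", "network", "agent", "list",
        "--agent-type", "l3",
        "--router", router,
        "--long",
        "-f", "json",
    ]
    return json.loads(subprocess.check_output(cmd))
```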
[15:49:20] 15:47:40 stderr: fatal: unable to access 'https://gerrit.wikimedia.org/r/p/mediawiki/core.git/': Failed to connect to gerrit.wikimedia.org port 443: Connection timed out
[15:49:23] but maybe that is Gerrit
[15:49:43] hashar: nothing active is happening in eqiad1 today, so that's unlikely to be an openstack thing
[15:50:11] yeah, I asked because I had seen a command above mentioning network / router :)
[15:50:45] the WMCS network is super stable, or at least I never encounter issues with it
[15:51:45] it won't be on Tuesday :)
[15:57:58] I feel like I asked this question before (and forgot the answer), but is there any reason not to redirect admin.toolforge.org to toolsadmin.wikimedia.org?
[15:59:58] dhinus: email looks good, thanks for tolerating the process :)
[16:00:09] andrewbogott: gtg, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031953 passed the tests, there's a question about the `--strip=2` that did not work for me when testing locally, but maybe I'm doing something wrong, I'll be back in a few hours if needed
[16:00:24] dcaro: I'll take a look
[16:00:52] andrewbogott: thanks, it's a bit tedious but we don't add new admins frequently, so I'm fine with that
[16:08:05] * arturo offline
[16:09:13] dhinus: no-one has gotten around to turning the admin tool into a redirect yet
[16:14:13] taavi: ack, I vaguely remembered there were some blockers, possibly some monitoring against that URL?
[16:51:43] ok, here's a stack of patches to migrate wmcs-cookbooks off the now-nonexistent neutron cli: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1031934
[16:54:11] boy, we really used that a lot
[16:56:17] taavi: you don't need --long in https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1031934/1/wmcs_libs/openstack/common.py#393 ?
[16:57:48] andrewbogott: --long is only useful for us when used with --router, so it's added later in the series
[16:58:35] ah, I see, ok
[18:22:46] dcaro: I'm not merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031953 because I need to go, but IMO you can merge it as soon as you want.