[07:33:39] * arturo online
[07:35:37] <dcaro>	 morning
[07:37:20] <dcaro>	 quick review? https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/24
[07:46:57] <arturo>	 dcaro: LGTM. Did you use the helper to re-generate the cassettes?
[07:47:53] <dcaro>	 I did not, I just manually changed the number
[07:48:07] <arturo>	 ok
[07:53:42] <dcaro>	 this might be why it's so slow running?
[07:53:44] <dcaro>	 https://www.irccloud.com/pastebin/u57hAy4t/
[07:53:58] <dcaro>	 (it's slow also when not recording though)
[07:54:18] <dcaro>	 https://www.irccloud.com/pastebin/IOyN8xAD/
[07:54:25] <dcaro>	 something went awry
[07:55:10] <dcaro>	 it did change stuff though
[07:55:19] <dcaro>	 https://www.irccloud.com/pastebin/yKY2fi9V/
[08:03:13] <dcaro>	 oh, no, there's a bunch of `time.sleep` around the code, that's what makes it slow :/
[08:03:29] <dcaro>	 https://www.irccloud.com/pastebin/ynfoBGCZ/
[08:03:32] <dcaro>	 for example
[08:10:09] <arturo>	 not sure what you mean slow?
[08:10:21] <arturo>	 in provisioning users?
[08:11:12] <arturo>	 in lima-kilo?
[08:16:21] <dcaro>	 runing pytest locally (and in ci)
[08:16:25] <dcaro>	 it takes ~3s per test
[08:18:45] <dcaro>	 I think it might be related to vcr, I mocked `time.sleep`, and rerun the test, and got the body of the test running from 1s to ~0.01s, but the whole test still takes ~3s
[08:18:55] <dcaro>	 https://www.irccloud.com/pastebin/VDKZ1exy/
[08:20:40] <dcaro>	 hmm... pytest-vcr has gone 4 years without commits
[08:21:59] <dcaro>	 vcr themselves suggest using https://github.com/kiwicom/pytest-recording
[08:32:06] <dcaro>	 hmm, part of it might be the generation of the key
[08:32:08] <dcaro>	         3    0.000    0.000    1.994    0.665 rsa.py:131(generate_private_key)
[08:32:38] <arturo>	 I see
[08:33:04] <arturo>	 maybe we can mock that one to return a short random string as pk
[08:33:53] <dcaro>	 looking
[08:49:53] <dcaro>	 okok, got it down to <3s for all the tests (I generate one first key that then is reused)
[08:49:54] <dcaro>	 ======================================================================================================================= 21 passed, 28 warnings in 2.69s ========================================================================================================================
[08:50:32] <arturo>	 nice
[08:51:00] <arturo>	 I'll rebase my refactor on your change
[08:58:56] <dcaro>	 I think it breaks the run on lima-kilo, looking
[09:07:32] <dcaro>	 weird, it says the certificate failed to approve (as it's not returned in the api call)
[09:07:42] <dcaro>	 but the response actually says it is (even though it does not return it)
[09:07:44] <dcaro>	 https://www.irccloud.com/pastebin/7cdHdlZl/
[09:11:06] <dcaro>	 I think that it might be that I'm faking time.sleep
[09:16:09] <dcaro>	 yep, that was it
[09:21:21] <dcaro>	 hmm, interesting, now it fails to run tox locally after updating the vcr recordings :/
[09:23:32] <dcaro>	 oh, it works if I pass `--pdb` to pytest
[09:33:01] <dcaro>	 I think there  might be some weirdness happening with threading or similar, the tests fail only sometimes (other times they pass) without changing anything
[09:34:04] <dcaro>	 when it fails, it's the same test though test_process_updated_quotas
[09:37:09] <dcaro>	 it's also very weird that it's able to find a suitable match, even though it does not use it
[09:37:10] <dcaro>	 https://www.irccloud.com/pastebin/OG8RLc37/
[09:38:27] <dcaro>	 https://github.com/kevin1024/vcrpy/issues/516  <- looks quite similar
[09:49:08] <dcaro>	 I think I found it, we are recording with the default 'once' setting, that does not record all the http requests, rerun the test gathering with '--vcr-mode=all' and tests seem to pass all the time
[10:04:56] <arturo>	 given I'm refactoring all that, maybe don't invest too much time on it
[10:08:27] <dcaro>	 that was it
[10:08:48] <dcaro>	 tests have not failed once since I re-recorded with '--vcr-mode=all'
[10:10:32] * arturo brb
[10:47:04] <arturo>	 nice finding!
[10:55:07] <arturo>	 quick +1 here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031836
[11:09:13] <arturo>	 taavi: how do I know if a puppet role has been migrated to puppet 7?
[11:09:55] <taavi>	 arturo: there's a hiera key (profile::puppet::agent::force_puppet7) for everything that's been migrated. but you can also assume that everything we manage except ceph is on puppet 7
[11:10:52] <arturo>	 ok thanks!
[11:57:58] <blancadesal>	 dcaro: re tasks, thank you for the suggestions. I just remembered that I'm out friday to monday (both included), so I'll probably want to avoid starting something larger/collab with raymond and risk stalling him. I will start with code reviews and go from there
[11:59:00] <dcaro>	 👍
[12:07:52] <hashar>	 hi, on integration-agent-docker-1042  I have caught Puppet complaining about an invalid apt repo:
[12:07:52] <hashar>	 An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: http://mirrors.wikimedia.org/osbpo bullseye-zed-backports-nochange InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 56056AB2FEE4EECB
[12:08:13] <hashar>	 I have no idea what is that osbpo  and bullseye-zed-backports-nochange :)
[12:09:20] <arturo>	 mmmm
[12:10:27] <arturo>	 hashar: would you mind opening a phab ticket? I can look into the problem
[12:10:56] <hashar>	 nop sorry too many tasks and issues going on, I just witnessed that log while investigating something else :)
[12:12:53] <arturo>	 wait, zed-backports?
[12:13:02] <arturo>	 we only have support in puppet for bobcat and antelope
[12:21:05] <arturo>	 hashar: feel free to `rm /etc/apt/sources.list.d/openstack*`
[12:27:54] <hashar>	 arturo: ah so I guess it is a left over file from a while ago and Puppet never got taught to ensure => absent? :)
[12:28:06] <arturo>	 maybe
[12:28:14] <arturo>	 we no longer deploy those files via puppet
[12:28:21] <hashar>	 -r--r--r-- 1 root root 123 Jun  8  2023 openstack-zed-bullseye-nochange.list
[12:28:21] <hashar>	 -r--r--r-- 1 root root 114 Jun  8  2023 openstack-zed-bullseye.list
[12:28:28] <hashar>	 or they were part of the base image maybe
[12:29:53] <hashar>	 done thank you arturo !
[12:29:59] <arturo>	 np
[12:40:41] <taavi>	 root@cloudcontrol2004-dev:~# neutron
[12:40:41] <taavi>	 -bash: neutron: command not found
[12:40:49] <taavi>	 is the neutron CLI gone in bobcat?
[12:55:06] <arturo>	 I have no idea, they have been warning for years about that command being deprecated
[13:03:29] * arturo food time
[13:03:31] <taavi>	 i created T365000 and am looking at updating cookbooks now
[13:03:37] <stashbot>	 T365000: replace use of 'neutron' cli in wmcs-cookbooks - https://phabricator.wikimedia.org/T365000
[13:18:09] <taavi>	 why would `wmcs-openstack network agent show` for an L3 agent have `null` in the ha_state field?
[13:32:54] <dcaro>	 hmm, I would either think that there's no HA, or that there's some DB-upgrade mishap
[13:38:47] <dhinus>	 can I get a +1 on T361946?
[13:38:48] <stashbot>	 T361946: Request temporary storage quota increase for project iiab for migration to bookworm image - https://phabricator.wikimedia.org/T361946
[13:41:00] <dcaro>	 dhinus: done
[13:43:15] <dhinus>	 thanks!
[13:55:49] <dcaro>	 Just created T365014 to consolidate API paths, feel free to add anything I missed or comment on it
[13:55:50] <stashbot>	 T365014: [jobs-api,builds-api,envvars-api] consolidate api paths - https://phabricator.wikimedia.org/T365014
[14:07:04] <arturo>	 taavi: maybe that's related to what we were talking yesterday, about the HA instrumentation of the l3 agent being different with OVS compared to linuxbridge?
[14:07:22] <taavi>	 arturo: no, I'm seeing the same thing in eqiad1 too
[14:08:13] <arturo>	 let me verify that at very least there is a couple of keepalived running across the cluster
[14:08:58] <arturo>	 yes, they seem to be running at least
[14:12:44] <taavi>	 i wonder if it's a bug with the new openstack cli tool or whether somehow the openstack api itself doesn't have the data on which l3 agent is active and which is not
[14:13:41] <arturo>	 given the error reported by h.ashar earlier today
[14:13:52] <arturo>	 could you double check that package versions are the right package versions?
[14:14:14] <arturo>	 maybe a newer CLI slipped in the repo somehow
[14:14:16] <dcaro>	 oh, interesting, trying to reboot tools-k8s-worker-nfs-9 got error
[14:14:24] <dcaro>	 https://www.irccloud.com/pastebin/qDkcawW8/
[14:14:29] <dcaro>	 looking
[14:14:30] <taavi>	 500 http://mirrors.wikimedia.org/osbpo bookworm-bobcat-backports/main amd64 Packages
[14:15:26] <arturo>	 dcaro: more openstack CLI shenanigans ?
[14:15:51] <arturo>	 taavi: and the packages in that repo are in the expected version?
[14:16:15] <arturo>	 for example python3-openstackclients and friends
[14:16:23] <arturo>	 are they in the bobcat version?
[14:16:25] <andrewbogott>	 arturo, I haven't finished reading the backscroll but https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031060 is probably related to what Hashar is seeing
[14:16:46] <andrewbogott>	 hm, and maybe what dcaro is seeing?
[14:16:47] * andrewbogott reads
[14:16:56] <arturo>	 andrewbogott: yeah it could be all related
[14:17:26] <andrewbogott>	 it is definitely the case that the 'neutron' cli tool isn't present in B (or in A I think)
[14:17:38] <andrewbogott>	 do we have automated workflows that depend on it?
[14:18:23] <arturo>	 last thing I remember about the neutron cli, is that at least the `show-l3-hosting-router` function did not have an equivalent in the main openstack cli
[14:18:45] <arturo>	 the same for a nova command that we used to use in disaster recovery, that now I can't remember
[14:19:15] <andrewbogott>	 I think I figured out (and documented) the show-l3-hosting-router thing
[14:19:21] <dcaro>	 oh... arturo yep, the `--os-cloud novaadmin` seems to break it now :/, it worked last week
[14:19:25] <arturo>	 oh, great!
[14:19:41] <arturo>	 dcaro: :-S
[14:20:13] <dcaro>	 so currently most of our cookbooks that use openstack might not work xd
[14:21:26] <andrewbogott>	 I suspect that we don't use the standalone neutron tool in many places
[14:21:37] <taavi>	 andrewbogott: the one thing I'm missing from the 'openstack' CLI tool is the HA status for l3 agents
[14:21:49] <andrewbogott>	 (but if we do, that's thanks to willfully ignoring the deprecationgnotice)
[14:22:04] <andrewbogott>	 taavi: ok, let me see if I can dig that up. I'm pretty sure I found a way to get that out of the consolidated cli
[14:22:12] <dcaro>	 andrewbogott:  should the novaadmin setting be in /etc/openstack/clouds.yaml? or should it pull them from some other clouds.yaml file? do you know?
[14:22:33] <dcaro>	 (sorry for the many messages, I can wait)
[14:22:44] <andrewbogott>	 dcaro: I would expect it to be there but not on every host
[14:23:13] <dcaro>	 cloudcontrols don't have it (at least not 1005)
[14:24:05] <andrewbogott>	 hm what about in ~root/.config/openstack/clouds.yaml?
[14:24:07] <hashar>	 andrewbogott: arturo: I think the underlying issue is the `/etc/apt/sources.list.d/` directory is not fully managed by Puppet. Else it would be deleting left over files that have no resource defined in the catalogue
[14:24:19] <hashar>	 but I imagine it can be challenging to implement or maybe impossible
[14:24:23] <dcaro>	 andrewbogott: yep, it's there :)
[14:24:44] <andrewbogott>	 dcaro: ok, that's likely how it's always been, one file for mortals and one for root
[14:24:58] <andrewbogott>	 hashar: I'll probably do a manual cleanup with cumin when I get a minute
[14:25:13] <dcaro>	 https://www.irccloud.com/pastebin/CbrIkZ7M/
[14:25:14] <arturo>	 hashar: fair point, indeed
[14:25:14] <dcaro>	 xd
[14:28:39] <dcaro>	 andrewbogott: I think that the project_id from the clouds.yaml is overriding the envvar :/ (tried changing it in the yaml and it worked)
[14:29:34] <andrewbogott>	 dcaro: I'm still looking at taavi's thing, but that sounds like https://review.opendev.org/c/openstack/openstacksdk/+/893283
[14:31:36] <hashar>	 andrewbogott: +1 sounds good. I did that for integration project already and I have did it right now for deployment-prep (using: `sudo cumin --force '*' 'rm -f /etc/apt/sources.list.d/openstack-zed*'`)
[15:08:48] <andrewbogott>	 taavi: uh-oh:  https://bugs.launchpad.net/python-openstackclient/+bug/2052933
[15:09:55] <dcaro>	 andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031953
[15:10:04] <dcaro>	 ^re-adding the patching
[15:10:22] <dcaro>	 wait, no, got to change the paths there too
[15:11:31] <dcaro>	 it might not apply anymore, let me check
[15:11:54] <andrewbogott>	 taavi: ok, wait, here we go:
[15:11:56] <andrewbogott>	 openstack network agent list --agent-type l3 --router  d93771ba-2711-4f88-804a-8df6fd03978a --long
[15:13:14] * andrewbogott unimpressed by --long
[15:14:38] <arturo>	 maybe you can shorten and use the router by name instead of uuid
[15:17:28] <andrewbogott>	 yep, --router cloudinstances2b-gw works too
[15:33:32] <andrewbogott>	 I've done my best to remove use of the neutron cli from wikitech but have surely missed some
[15:38:04] <arturo>	 thanks for keeping the docs up to date :-)
[15:38:19] <andrewbogott>	 well, in one case I just deleted a big section
[15:47:54] <taavi>	 andrewbogott: bah. thanks
[15:48:18] <andrewbogott>	 I'm going to start adding --long to everything to see what other problems it solves.
[15:48:29] * andrewbogott buys lottery ticket --long
[15:48:40] <dcaro>	 andrewbogott: I'm having issues trying to make the puppet tests pass :/ https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031957 if I have not merged it in 1h feel free to take over the patches
[15:48:49] <andrewbogott>	 dcaro: ok!
[15:48:53] * hashar neutron --fix-network --long
[15:49:12] <hashar>	 you are messing up with WMCS firewalling / networking aren't you?
[15:49:20] <hashar>	 15:47:40 stderr: fatal: unable to access 'https://gerrit.wikimedia.org/r/p/mediawiki/core.git/': Failed to connect to gerrit.wikimedia.org port 443: Connection timed out
[15:49:23] <hashar>	 but maybe that is Gerrit
[15:49:43] <andrewbogott>	 hashar: nothing active is happening in eqiad1 today, so that's unlikely an openstack thing
[15:50:11] <hashar>	 yeah I asked because I have seen a command above mentioning network / router :)
[15:50:45] <hashar>	 WMCS network is super stable, or at least I never encounter issues with it
[15:51:45] <andrewbogott>	 it wont be on Tuesday :)
[15:57:58] <dhinus>	 I feel like I asked this question before (and forgot the answer), but is there any reason not to redirect admin.toolforge.org to toolsadmin.wikimedia.org?
[15:59:58] <andrewbogott>	 dhinus: email looks good, thanks for tolerating the process :)
[16:00:09] <dcaro>	 andrewbogott: gtg, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031953 passed the tests, there's a question about the `--strip=2` that did not work for me when testing locally, but maybe I'm doing something wrong, I'll be back in a few hours if needed
[16:00:24] <andrewbogott>	 dcaro: I'll take a look
[16:00:52] <dhinus>	 andrewbogott: thanks, it's a bit tedious but we don't add new admins frequently, so I'm fine with that
[16:08:05] * arturo offline
[16:09:13] <taavi>	 dhinus: no-one has gotten into turning the admin tool into a redirect yet
[16:14:13] <dhinus>	 taavi: ack, I vaguely remembered there were some blockers, possibly some monitoring against that URL?
[16:51:43] <taavi>	 ok, here's a stack of patches to migrate wmcs-cookbooks off the now-nonexistent neutron cli: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1031934
[16:54:11] <andrewbogott>	 boy we really used that a lot
[16:56:17] <andrewbogott>	 taavi: you don't need --long in https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1031934/1/wmcs_libs/openstack/common.py#393 ?
[16:57:48] <taavi>	 andrewbogott: --long is only useful for us when used with --router, so it's added later in the series
[16:58:35] <andrewbogott>	 ah, I see, ok
[18:22:46] <andrewbogott>	 dcaro: I'm not merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031953 because I need to go but IMO you can merge it as soon as you want.