[08:06:12] I'm going to be taking the inactive cloudnet (cloudnet1005) out of service and then reimaging it as a spare in a bit to prepare for the OVS maintenance window in a few hours
[08:10:01] ack
[08:52:16] arturo: is the wait condition for the api-gateway to be deployed in lima-kilo k8s strictly necessary? after the latest changes, deployment of this component systematically times out for me. Inside the lima-vm, api-gateway logs/events look fine though, and the component works as expected. Removing the wait flag solves the issue, but maybe there's a way to extend the wait time?
[08:52:40] mmm
[08:53:07] I think it was introduced because some later component depended on it, and would fail to deploy if the api-gateway wasn't fully up and running
[08:53:10] _however_
[08:53:23] I kind of remember david solved that cross-dependency somehow
[08:55:29] ok, I can ask him tomorrow if it's safe to remove
[08:56:01] if you deleted it and it works fine, just send an MR and I'll test here too
[08:56:13] (deleted the wait flag)
[09:02:10] arturo: https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/131
[09:16:50] mmm so it is deployed last
[09:23:51] blancadesal: approved
[09:24:11] thanks
[09:48:05] arturo: do you want to review https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1034089 before I merge it?
[09:48:28] sure
[09:55:59] taavi: LGTM, please merge
[09:56:08] thanks!
[11:00:25] arturo: topranks: the cloudnet OVS maintenance window is now, any last-minute concerns or should I just get started?
[11:01:13] taavi: nope, I'm aware and around, if you want to drop brief updates here I can check the status on cloudsw at the appropriate times
[11:02:08] taavi: 🚢 🇮🇹
[11:02:53] let's use the google meet room?
[11:03:01] oh yeah we said that
[11:03:11] and rofl again at "ship it" :)
[11:03:25] xD
[11:56:41] taavi: will use the reimage time to go grab some food, be back later
[11:56:47] ok!
[12:07:23] now applying the role to cloudnet1006
[12:18:56] with that done I think we're done. we still should test a failover, but I don't want to do that at this exact moment
[12:19:55] cloudnet1005 is still the active one right?
[12:22:04] yes
[12:28:58] do we not need to failover to 1006, then re-image 1005 and change it to the role with OVS? (or maybe I'm completely getting the steps wrong)
[12:31:20] no? before today, 1006 was active. during the migration ~an hour ago we manually failed over from 1006 to 1005 (which had before that been reimaged to OVS), and now that 1006 is standby I just reimaged that to OVS. but I still want to test the automatic failover between two OVS hosts
[12:32:00] ah ok
[12:32:08] so 1005 was already done
[12:32:09] great!
[12:43:49] arturo: I notice the cloudgw vrrp conf is probably not ideal
[12:43:57] https://phabricator.wikimedia.org/P62774
[12:44:15] it's only running VRRP on the "outside" interface vlan1120, the one between it and the cloudsw
[12:44:36] it should probably have two groups, one for vlan1120, and another for vlan1107 (which is the transport vlan between it and cloudnet)
[12:45:04] given both are on the same physical port this is not hugely important - the ways one can break and not the other are rare
[12:45:20] if we're moving to BGP in the short/medium term we can probably leave it alone
[12:47:16] taavi: between the cloud nets I see no VRRP keepalives like they used to do with the linuxbridge agent
[12:47:35] which is cool I guess... do they have another mechanism to detect a failure?
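A minimal sketch of how those keepalives could be looked for on a cloudnet host; the interface name eno1 is an assumption. With Neutron L3 HA, keepalived runs inside the router network namespaces and its VRRP is carried over a VXLAN-backed HA network, so plain VRRP may not show up on the physical interface:

    # plain VRRP (IP protocol 112), as the linuxbridge setup used to show on the wire
    sudo tcpdump -ni eno1 vrrp
    # with Neutron L3 HA the VRRP is encapsulated in VXLAN (UDP 4789) between the nodes
    sudo tcpdump -ni eno1 'udp port 4789'
    # keepalived itself lives inside the Neutron router namespaces
    sudo ip netns list | grep qrouter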
[12:52:45] topranks: keepalived seems to be running and I'm seeing VRRP but Neutron seems to be wrapping it in VXLAN
[12:53:44] ok yeah that's the way it was (as in T319539)
[12:53:45] T319539: neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539
[12:54:03] I didn't see that but perhaps I messed up the tcpdump
[12:54:06] https://phabricator.wikimedia.org/P62776
[12:55:33] ah cool
[12:56:04] so the nice thing here is the outer vxlan packet is a unicast one
[12:56:13] previously it was sending multicasts (i.e. spraying to everyone in the vlan)
[12:56:45] wasn't a massive issue as that was on the transport vlan with only 4 nodes in it
[12:57:28] that's actually pretty cool, it's doing VRRP across different cloud-private vlans
[12:57:35] using the vxlan to transport between the networks
[12:58:17] there is a small risk of a problem on the -transport- or instances vlan, which doesn't affect cloud-private, and failover doesn't happen
[12:58:38] but it's similar to above comment about cloudgw, as they are all on the same physical port it's a very rare chance
[12:58:43] (like a misconfig of the vlans on the switch etc)
[15:19:59] taavi: slyngs wants to know if there is any special care needed in creating new users in the LDAP directory that labtestwikitech uses. I couldn't think of any, but thought you might remember if there is something special needed there. Context is slyngs testing the new Bitu deploy there before handing things over.
[15:26:40] bd808: over the years I've tried to collect all the info related to account creation in codfw1dev here: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Testing_deployment
[15:27:16] bd808: the normal mediawiki username normalization rules apply but that's the same as the main cluster. so other than using different servers there should not be any major differences
[15:29:46] I tried restarting the Quarry Trove db to see if it would fix an intermittent db error, but it looks like I broke Quarry completely :/
[15:29:54] and it's back
[15:30:04] :-)
[15:30:19] but I'm still getting the error, I sshed to the trove instance and it's logged there as
[15:30:30] (Got timeout reading communication packets)
[15:31:51] "Aborted connection X to db: 'quarry' user: 'quarry'"
[15:57:16] arturo, taavi: thanks.
[17:04:53] Are broken logins at labtestwikitech a known issue? I get `Fatal exception of type "MediaWiki\Extension\LdapAuthentication\LdapAuthenticationException"` when I try to login there. The same credentials get me into labtesthorizon with no issues.
[19:19:56] bd808: I have no idea
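For a login failure like that, a first sanity check is usually whether the account exists and looks sane in the LDAP tree that labtestwikitech is pointed at. A minimal sketch, where the server URI, base DN, and username are placeholders rather than the real codfw1dev values:

    # query the codfw1dev LDAP directly for the wikitech account entry
    ldapsearch -x -H ldap://<codfw1dev-ldap-server> \
        -b 'ou=people,<base-dn>' '(cn=<wikitech username>)' cn uid mail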