[08:06:12] I'm going to be taking the inactive cloudnet (cloudnet1005) out of service and then reimaging it as a spare in a bit to prepare for the OVS maintenance window in a few hours
[08:10:01] ack
[08:52:16] arturo: is the wait condition for the api-gateway to be deployed in lima-kilo k8s strictly necessary? after the latest changes, deployment of this component systematically times out for me. Inside the lima-vm, api-gateway logs/events look fine though, and the component works as expected. Removing the wait flag solves the issue, but maybe there's a way to extend the wait time?
[08:52:40] mmm
[08:53:07] I think it was introduced because some later component depended on it, and would fail to deploy if the api-gateway wasn't fully up and running
[08:53:10] _however_
[08:53:23] I kind of remember david solved that cross-dependency somehow
[08:55:29] ok, I can ask him tomorrow if it's safe to remove
[08:56:01] if you deleted it and it works fine, just send an MR and I'll test here too
[08:56:13] (deleted the wait flag)
[09:02:10] arturo: https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/131
[09:16:50] mmm so it is deployed last
[09:23:51] blancadesal: approved
[09:24:11] thanks
[09:48:05] arturo: do you want to review https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1034089 before I merge it?
[09:48:28] sure
[09:55:59] taavi: LGTM, please merge
[09:56:08] thanks!
[11:00:25] arturo: topranks: the cloudnet OVS maintenance window is now, any last-minute concerns or should I just get started?
[11:01:13] taavi: nope, I'm aware and around, if you want to drop brief updates here I can check the status on cloudsw at the appropriate times
[11:02:08] taavi: 🚢 🇮🇹
[11:02:53] let's use the google meet room?
[11:03:01] oh yeah we said that
[11:03:11] and rofl again at "ship it" :)
[11:03:25] xD
[11:56:41] taavi: will use the reimage time to go grab some food, be back later
[11:56:47] ok!
[12:07:23] now applying the role to cloudnet1006
[12:18:56] with that done I think we're done. we still should test a failover, but I don't want to do that at this exact moment
[12:19:55] cloudnet1005 is still the active one right?
[12:22:04] yes
[12:28:58] do we not need to failover to 1006, then re-image 1005 and change it to the role with OVS? (or maybe I'm completely getting the steps wrong)
[12:31:20] no? before today, 1006 was active. during the migration ~an hour ago we manually failed over from 1006 to 1005 (which had before that been reimaged to OVS), and now that 1006 is standby I just reimaged that to OVS. but I still want to test the automatic failover between two OVS hosts
[12:32:00] ah ok
[12:32:08] so 1005 was already done
[12:32:09] great!
[12:43:49] arturo: I notice the cloudgw vrrp conf is probably not ideal
[12:43:57] https://phabricator.wikimedia.org/P62774
[12:44:15] it's only running VRRP on the "outside" interface vlan1120, the one between it and the cloudsw
[12:44:36] it should probably have two groups, one for vlan1120, and another for vlan1107 (which is the transport vlan between it and cloudnet)
[12:45:04] given both are on the same physical port this is not hugely important - the ways one can break and not the other are rare
[12:45:20] if we're moving to BGP in the short/medium term we can probably leave it alone
[12:47:16] taavi: between the cloud nets I see no VRRP keepalives like they used to do with the linuxbridge agent
[12:47:35] which is cool I guess... do they have another mechanism to detect a failure?
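A minimal sketch of how those keepalives could be looked for on a cloudnet host; the interface name eno1 is an assumption. With Neutron L3 HA, keepalived runs inside the router network namespaces and its VRRP is carried over a VXLAN-backed HA network, so plain VRRP may not show up on the physical interface:

    # plain VRRP (IP protocol 112), as the linuxbridge setup used to show on the wire
    sudo tcpdump -ni eno1 vrrp
    # with Neutron L3 HA the VRRP is encapsulated in VXLAN (UDP 4789) between the nodes
    sudo tcpdump -ni eno1 'udp port 4789'
    # keepalived itself lives inside the Neutron router namespaces
    sudo ip netns list | grep qrouter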
[12:52:45] topranks: keepalived seems to be running and I'm seeing VRRP but Neutron seems to be wrapping it in VXLAN
[12:53:44] ok yeah that's the way it was (as in T319539)
[12:53:45] T319539: neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539
[12:54:03] I didn't see that but perhaps I messed up the tcpdump
[12:54:06] https://phabricator.wikimedia.org/P62776
[12:55:33] ah cool
[12:56:04] so the nice thing here is the outer vxlan packet is a unicast one
[12:56:13] previously it was sending multicasts (i.e. spraying to everyone in the vlan)
[12:56:45] wasn't a massive issue as that was on the transport vlan with only 4 nodes in it
[12:57:28] that's actually pretty cool, it's doing VRRP across different cloud-private vlans
[12:57:35] using the vxlan to transport between the networks
[12:58:17] there is a small risk of a problem on the -transport- or instances vlan, which doesn't affect cloud-private, and failover doesn't happen
[12:58:38] but it's similar to above comment about cloudgw, as they are all on the same physical port it's a very rare chance
[12:58:43] (like a misconfig of the vlans on the switch etc)
[15:19:59] taavi: slyngs wants to know if there is any special care needed in creating new users in the LDAP directory that labtestwikitech uses. I couldn't think of any, but thought you might remember if there is something special needed there. Context is slyngs testing the new Bitu deploy there before handing things over.
[15:26:40] bd808: over the years I've tried to collect all the info related to account creation in codfw1dev here: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Testing_deployment
[15:27:16] bd808: the normal mediawiki username normalization rules apply but that's the same as the main cluster. so other than using different servers there should not be any major differences
[15:29:46] I tried restarting the Quarry Trove db to see if it would fix an intermittent db error, but it looks like I broke Quarry completely :/
[15:29:54] and it's back
[15:30:04] :-)
[15:30:19] but I'm still getting the error, I sshed to the trove instance and it's logged there as
[15:30:30] (Got timeout reading communication packets)
[15:31:51] "Aborted connection X to db: 'quarry' user: 'quarry'"
[15:57:16] arturo, taavi: thanks.
[17:04:53] Are broken logins at labtestwikitech a known issue? I get `Fatal exception of type "MediaWiki\Extension\LdapAuthentication\LdapAuthenticationException"` when I try to login there. The same credentials get me into labtesthorizon with no issues.
[19:19:56] bd808: I have no idea
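For a login failure like that, a first sanity check is usually whether the account exists and looks sane in the LDAP tree that labtestwikitech is pointed at. A minimal sketch, where the server URI, base DN, and username are placeholders rather than the real codfw1dev values:

    # query the codfw1dev LDAP directly for the wikitech account entry
    ldapsearch -x -H ldap://<codfw1dev-ldap-server> \
        -b 'ou=people,<base-dn>' '(cn=<wikitech username>)' cn uid mail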