[07:13:55] morning!
[07:55:29] morning :)
[07:56:28] morning!!
[08:12:39] I have a couple of puppet patches for review: https://gerrit.wikimedia.org/r/961092 https://gerrit.wikimedia.org/r/961334 https://gerrit.wikimedia.org/r/960163 https://gerrit.wikimedia.org/r/960164
[08:16:03] morning
[09:26:59] arturo: can you review https://gerrit.wikimedia.org/r/c/operations/homer/public/+/961336 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/961345?
[09:27:59] yes
[09:30:10] done
[09:58:37] what is the current status of codfw1dev? I see the bastion VM is down?
[09:59:32] wow I get
[09:59:34] https://www.irccloud.com/pastebin/IPbSEwNm/
[10:09:54] hmmm, the thing I know is that 1 out of 2 cloudservices hosts is broken, and me and a.ndrew are trying to sort out LDAP after the reimage
[10:10:45] so I expect the cluster to be in a bad state, but I'm not sure that explains the errors you are encountering
[10:11:07] if LDAP is affected, we can consider the deployment to be definitely in bad shape
[10:11:22] all auth is keystone-based, which in turn is backed by LDAP
[10:11:26] maybe we need a third cluster to experiment with upgrades and such, while we keep the dev cluster in a slightly more stable state :)
[10:11:39] so effectively the whole control plane has a strong and direct dependency on LDAP
[10:12:29] this happens all the time :-) I am definitely happy that we have codfw1dev as a dev environment
[11:04:02] I'd like to replace the current restricted bastion (bastion-restricted-eqiad1-02) with a new instance I created with more CPUs (that should help with T347428)
[11:04:03] T347428: cumin and cloud-vps instances not working - https://phabricator.wikimedia.org/T347428
[11:04:23] can I simply edit the DNS record set for restricted.bastion.wmcloud.org in Horizon->DNS?
[11:04:55] or should I follow a different procedure?
[11:07:25] dhinus: yes, that'd be basically it.
And reattaching the floating IP
[11:07:44] well, reattaching the floating IP is all you need actually, since the DNS entry points to the floating IP, not the VM
[11:07:45] no?
[11:08:02] probably yes, I didn't remember it was using a floating IP
[11:09:32] first I want to make sure the new instance is working correctly as a ssh proxy, and it looks like it isn't
[11:10:10] you may need some special security groups
[11:10:13] or even hiera changes
[11:10:26] I added some hiera from horizon
[11:10:40] and I could ssh to the new one, but now it's not working anymore :/
[11:11:02] it worked briefly and then stopped working?
[11:11:03] maybe I just messed up my ssh-config
[11:11:10] trying to use it as the bastion
[11:11:21] you won't be able to use it as a bastion without a floating IP
[11:11:31] no?
[11:11:51] yes you're right that's the issue :)
[11:11:57] I need the other bastion to reach it atm
[11:12:16] so there's no way to test it really, I'll try moving the floating IP and if that doesn't work I'll revert it
[11:12:48] you can create a temporary floating IP with a temporary DNS record pointing to it
[11:13:00] then update your ssh_config to use this temporary DNS record
[11:13:14] something like `bastion-next.whatever.wmcloud.org`
[11:13:40] makes sense
[11:15:23] and maybe `!log` something to leave some traces for others
[11:18:07] yes, I will !log when I switch the DNS. using a temporary floating IP does not seem to work, maybe because only the other one has the right firewall config?
[11:18:27] I cannot ping or ssh to the temporary floating IP from my laptop
[11:18:57] yes, you likely need a very specific security group for that VM
[11:19:41] the current bastion does not have any security group, so maybe it's configured somewhere else?
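(editor's note) The suggestion at 11:12:48 to 11:13:14, pointing ssh_config at a temporary DNS record for the new bastion, could look roughly like the fragment below. It is only a sketch: `bastion-next.whatever.wmcloud.org` is the placeholder name used in the chat, while the user name `your-shell-name` and the instance pattern `*.example-project.eqiad1.wikimedia.cloud` are assumptions made up for illustration. As the following messages show, it only works once the temporary floating IP is reachable through the right security group or firewall config.

```
# Hypothetical ssh_config entry for testing a new bastion before the switch.
# The host name is the placeholder from the chat; the user name and the
# instance pattern are assumptions, not real values.
Host bastion-next
    HostName bastion-next.whatever.wmcloud.org
    User your-shell-name

# Route connections to the (assumed) test instances through the new bastion.
Host *.example-project.eqiad1.wikimedia.cloud
    User your-shell-name
    ProxyJump bastion-next
```

Removing both `Host` blocks again reverts to whatever bastion the existing configuration uses.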
[11:20:42] current bastion: bastion-restricted-eqiad1-02, new bastion: bastion-restricted-eqiad1-3
[11:21:31] there's a hiera key that lists all bastions for VMs with ferm firewall
[11:22:21] also note that everyone using bastion-restricted will see host key warnings when the new bastion is put in use
[11:23:16] ouch, very good point. what did we do the last time that the bastion changed? maybe I should send an email to cloud a few days before?
[11:24:05] it's just the "restricted" bastion that should be used by cloud-roots only
[11:25:08] maybe an email to cloud-admin@ is enough
[11:27:27] or ops@. I think the docs recommend using restricted. for everyone with prod root
[11:27:47] * arturo brb
[11:32:13] taavi: correct https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances#Accessing_Cloud_VPS_instances
[11:33:23] found the hiera that needs a change, I'll prepare a patch after lunch
[11:33:45] do we have any preference/convention on using -3 vs -03 as the suffix in the hostname?
[11:34:11] I created the instance with the cookbook wmcs.vps.create_instance_with_prefix that used "-3"
[11:34:32] I'm fine with either. -3 is more in line with what we have elsewhere I think
[11:55:32] turns out maintain-dbusers has been doing totally unnecessary work to convert PAWS user ids to user names, just to not do anything useful with them: https://gerrit.wikimedia.org/r/c/operations/puppet/+/961369/
[11:56:45] not sure I understand, but good catch :-)
[11:57:11] I think there's a couple patches to handle that exact issue
[11:58:11] https://gerrit.wikimedia.org/r/c/operations/puppet/+/905243 probably
[12:01:36] that's still doing much more than needed, imho
[12:04:03] taavi: the cookbooks repo needs a refresh of the cloudcontrol FQDNs
[12:04:35] good catch.
one moment
[12:05:40] arturo: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/961370/
[12:06:59] taavi: there are a few more hardcodes, see
[12:07:00] git grep cloudcontrol | grep wikimedia
[12:15:23] arturo: thanks, updated
[12:17:10] taavi: I don't see the update. Maybe you forgot to hit ? :-P
[12:17:33] indeed :/
[12:20:07] also, I was running git grep on an old checkout of the repo
[12:20:26] so, way less old references than I thought
[12:26:57] taavi: I just sent you an invite for a network meeting later today
[12:34:05] ok!
[12:46:23] arturo: ^ I assume those disconnects were a side effect of the cloudgw refactoring?
[13:00:53] taavi: most likely, keepalived was restarted
[13:01:05] I really want to move them to the BGP setup
[13:54:19] fyi I won't be able to attend the "network sync" meeting today (cc topranks)
[14:04:47] XioNoX: ack
[14:56:09] I am about to reimage cloudcumin1001 (T324986) so it will be unavailable for a few minutes. you can use cloudcumin2001 in the meantime, that I have already reimaged.
[14:58:09] thank you for working on cumin things dhinus, and sorry about the sudden change in priorities. I panicked a bit yesterday when I realized I couldn't maintain VMs :)
[14:58:28] I think it's important to get it fixed :)
[14:59:14] and it's related to work that was already on my radar anyway (T343330)
[14:59:15] T343330: WMCS cookbooks: provide shared hosts for people without global root privileges - https://phabricator.wikimedia.org/T343330
[15:47:51] dhinus: have time for a quick catch-up?
[15:48:03] sure
[17:11:39] dcaro: thanks for working on lima-kilo, I will test your patch first thing tomorrow