[13:21:27] dhinus: I'm going to scale up the pdns recursors just in case the actual issue is as obvious as that. Don't want to change more than one thing at a time so for starters: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1105332
[13:22:09] andrewbogott: ack
[13:22:52] That service was first deployed on a truly tiny misc server but now they're on config B so we should have a lot more headroom
[13:27:53] hm, well now I can confirm that restarting one of the recursors doesn't cause a service interruption :)
[13:28:06] wait, no! There was a delay but it did interrupt
[13:28:19] huh
[13:29:06] I can't decide if that's expected or unexpected
[14:53:06] andrewbogott: I guess that depends on whether the recursor being restarted is the one holding the VIP
[14:53:39] arturo: yeah, it's probably as simple as that. I checked the service restart history; restarts aren't happening often enough to be the primary cause of the issue.
[14:53:57] although tbh the CI client should really be able to survive a single dns failure
[14:54:17] I agree
[14:54:30] I'll grab lunch and join the videocall
[14:54:35] thanks!
[14:54:58] guys I'm gonna skip the call if it's ok, in the middle of something here
[14:55:09] I'm around though, just ping me if there are any issues and I'll jump on
[14:55:16] topranks: that's fine, we'll ping you if disaster results
[14:57:31] ok, it'll be fine I'm sure <3
[15:05:35] sorry I was distracted, joining now!
[15:45:17] topranks: the failover was extremely trivial, Francesco just switched off keepalived on 1002 and it was perfectly smooth
[15:45:36] thanks for the update
[15:45:43] indeed yes that's as it should be
[15:45:47] yep I will write the procedure on wikitech, but it was basically "disable-puppet && systemctl stop keepalived"
[15:46:11] the only danger I would say there is keepalived needs to be up & working on the backup
[15:46:38] good point, I'll add that to the procedure
[15:46:44] changing the priority but leaving it running protects against that, but we're unlikely to be trying to flip it unless we know the backup is ok
[15:48:42] yep and "systemctl stop" avoids touching the config file, which is another thing that puppet will revert
[15:50:44] Right now I'm just watching CI tests to see if they stop failing with more dns recursor threads. It's really not a productive use of my time :)
[16:17:42] is clouddb-codfw1dev gone for good or only temporarily?
[16:18:12] (what I mean is that the Cumin alias currently matches 0 hosts)
[16:19:54] gone for good! T328079
[16:19:55] T328079: decommission clouddb2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T328079
[16:20:49] or rather T369308
[16:20:50] T369308: Decommission clouddb2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T369308
[17:59:51] bd808: how much do you remember about labs-ip-aliaser? I need a sanity check
[18:01:10] Is that the thing that Krenair helped work out for the PTR records?
[18:01:12] I have a trivial webserver running on 185.15.56.77:80 and I've made that port public. I can curl there from elsewhere (and other projects) in cloud-vps. That makes me think that the routing issue that labs-ip-aliaser was created for no longer exists and we can remove most of that component.
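(For reference, a minimal sketch of the keepalived VIP failover procedure mentioned at 15:45 above, before it lands on wikitech. The hostnames and VIP address are placeholders, not the real ones; the stop command mirrors what dhinus pasted.)

    # Check the backup first: as topranks notes, the VIP has nowhere to go
    # if keepalived is not up and working there.
    ssh recursor-backup.example.wmnet 'systemctl is-active keepalived'

    # On the host currently holding the VIP, stop puppet so it cannot restart
    # keepalived, then stop keepalived so the backup takes over the VIP.
    ssh recursor-primary.example.wmnet 'sudo disable-puppet && sudo systemctl stop keepalived'

    # Confirm the VIP still answers recursive queries once it has moved.
    dig @<vip-address> +short wikitech.wikimedia.org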
[18:01:43] heh, no, it's a different thing that Krenair helped work out, to work around the issue that neutron wouldn't route to floating IPs from within cloud-vps
[18:02:19] you are probably thinking of dns-floating-ip-updater which I constantly confuse with labs-ip-aliaser
[18:04:04] * bd808 looks at some pdns lua as a result of this conversation
[18:04:34] andrewbogott: can you reach the floating IP from the instance it's mapped to?
[18:05:12] taavi: yes!
[18:05:44] fqdn is abogott-T374129.testlabs.eqiad1.wikimedia.cloud, IPs are 172.16.0.47 and 185.15.56.77
[18:05:45] T374129: openstack: consider removing labs-ip-aliaser - https://phabricator.wikimedia.org/T374129
[18:06:07] huh
[18:06:20] sounds like we could drop it then
[18:08:21] The network has been totally redone since that was added so it doesn't shock me that routing works better now.
[18:08:49] (that aliaser also injects the 'puppet' hostname for instance bootstrap so I need to figure out another way to do that probably)
[18:11:36] `traceroute -T 185.15.56.77` works from abogott-T374129.testlabs.eqiad1.wikimedia.cloud back to itself, so yeah I think that means that the current Neutron setup works without explicitly needing the split horizon remapping to private IPs.
[18:11:37] T374129: openstack: consider removing labs-ip-aliaser - https://phabricator.wikimedia.org/T374129
[18:12:19] ok, thank you for checking. I was getting timeouts with /some/ specific telnet commands but I think I was just hitting security groups in those cases.
[18:13:05] My first test was an ICMP traceroute and that definitely failed, but also likely due to security groups
[18:13:37] with the host routing to itself? Seems like security groups shouldn't block that
[18:15:05] bd808: I added a permissive icmp rule to that VM, try now?
[18:17:06] andrewbogott: hmm... the traceroute is failing harder now. The first time it showed hop 1 to itself and then a stream of ICMP no responses (e.g. "2 * * *"). Now it is all ICMP no response lines.
[18:18:14] isn't ping icmp? ping works...
[18:20:05] oh hey, I opened up udp and now the traceroute is sensible
[18:20:07] ping as an ICMP protocol yes. traceroute uses a different ICMP type.
[18:22:41] andrewbogott: yeah, looks the same in both tcp and icmp mode now.
[18:23:05] cool! I will work on a patch
[18:29:57] I finally found what little doc we have on officewiki -- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/DNS/Designate#labs-ip-alias-dump
[18:31:24] Removing the split horizon DNS response should make managing security groups a bit less confusing I would think.
[18:32:30] yeah
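(To close the loop on the reachability tests above, a hedged sketch of the checks run from abogott-T374129.testlabs.eqiad1.wikimedia.cloud against its own floating IP. The IP and FQDN come from the log; the openstack commands are an assumption about how the permissive ICMP/UDP rules might have been added, not the exact commands that were run.)

    # Hairpin test: the instance reaching its own floating IP.
    curl -sI http://185.15.56.77/      # port 80 had already been made public
    traceroute -T 185.15.56.77         # TCP probes: worked from the start
    traceroute 185.15.56.77            # default UDP probes: only sensible after a UDP rule was added
    traceroute -I 185.15.56.77         # ICMP probes: needed the permissive ICMP rule

    # Hypothetical security-group rules matching what was opened during the test.
    openstack security group rule create --protocol icmp <security-group>
    openstack security group rule create --protocol udp --dst-port 33434:33534 <security-group>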