[07:21:28] * Emperor also on-call today
[07:23:21] (looks like 08:00 - 16:00 UTC)
[11:27:20] running schema migrations on s1
[13:51:41] XioNoX: hi! I am ready to roll out the ns0 change but happy to wait if other things are happening
[13:51:55] should be easier and more straightforward than yesterday
[14:25:09] sukhe: yep
[14:25:26] sukhe: go for it and I can double check once you're done
[14:25:40] sukhe: everything is good in codfw?
[14:27:40] XioNoX: thanks, and yes, codfw seems fine. just doing a pdns-rec restart and will check if stopping gdnsd withdraws the adverts and then proceed with eqiad
[14:27:50] cool
[14:36:35] sukhe: remind me: you ended up decoupling them at the systemd level, right? (so pdns-rec stays up if gdnsd is restarted?)
[14:37:18] I remember having some debates about that, I just don't recall the outcome :)
[14:38:43] hmmm looked at a live host, they're still linked by After=+BindsTo=, so yeah
[14:38:56] stopping gdnsd will stop pdns-rec, which will withdraw the route I assume.
[14:39:26] (the recdns route I mean)
[14:39:35] I guess you're probably looking at the new authdns route anyways
[14:39:37] bblack: yeah, stopping gdnsd will stop pdns-rec as the deps are still in place
[14:40:08] well, if the healthcheck fails then the route is withdrawn, so I guess while related, we are trying to emulate that
[14:40:44] healthchecks driving it will lose some reqs
[14:41:16] (by the time health fails from a predictable operator action like some daemon restart, other in-flight reqs have already failed)
[14:42:47] looking again, though, the whole dep chain is in place at the systemd level still
[14:43:06] bird bindsto anycast-hc bindsto pdns-rec bindsto gdnsd
[14:43:23] yeah but if you meant what we are doing in this case, the plan was to stop gdnsd to see if the advert is withdrawn.
which then would happen if for some reason gdnsd would have stopped, the ns1 IP should not be advertised
[14:43:28] so, yeah, daemon restarts work right (the route is withdrawn by bird shutting down before any of the rest can stop)
[14:44:06] bblack: yeah that seems to be the chain, correct
[14:44:44] one of the undesirable properties of the current setup is that a stop or restart of pdns-rec also has the same effect on the ns1 advert
[14:45:03] (even though gdnsd is still running, pdns-rec service-level stop/restart will stop anycast-hc+bird for the ns1)
[14:45:14] yeah, I think we discussed that and didn't come to any consensus :]
[14:45:21] yeah
[14:45:43] it would be nice to have each advert controlled separately, in theory, but I doubt it's worth the effort.
[14:45:56] there wouldn't be many other cases like this anyways, with two distinct services birding from one host.
[14:48:46] I think the challenge is that there is value to having the dependencies. however, it also takes away control from the operator
[14:49:16] but then should we really be advertising 10.3.0.1 if pdns-rec is not running? that's where it becomes a bit more clear
[14:49:49] ideally, yeah, for example I just restarted pdns-rec and in theory it should have not affected the other routes
[14:50:07] but well because of the dep chain we lose control there
[14:51:18] yeah
[14:52:03] the alternative would be to not bind anycast-hc to pdns-rec and gdnsd, but instead to have some other mechanism that ties individual route adverts to those daemons' status in realtime (via bird or anycast-hc), but that's really challenging to engineer right.
[14:52:33] probably the way it's set up right now is a decent pragmatic tradeoff
[14:53:29] we have 14 such hosts anyways, and multiple per site. temporarily losing authdns advert during a recdns restart on a host shouldn't be a huge issue.
[14:53:33] it's acceptable in the sense at least that most of the restarts are pretty rare and infrequent.
at least the ones that traverse up the chain
[14:53:56] (hopefully 16 soon!)
[14:54:10] migrating more to anycast will help with the redundancy and load spread, too
[15:11:39] bblack: there was another dependency here, haproxy depending on gdnsd, which again makes sense
[15:11:49] just that I really dislike the limited visibility into all of this
[15:12:22] limited in what sense?
[15:13:06] well, there is no easy way to get a good bird's eye view of all these deps -- or at least I don't know of any
[15:13:19] yeah, that would be nice to have!
[15:14:25] there is a way, but it's muddled up a lot by the system-level deps that we don't really care as much about
[15:16:22] still, the top snippet here:
[15:16:25] root@dns1005:~# systemctl list-dependencies --after --all bird.service|grep '\.service$'|head -5
[15:16:28] bird.service
[15:16:30] ● ├─anycast-healthchecker.service
[15:16:33] ● │ ├─pdns-recursor.service
[15:16:35] ● │ │ ├─gdnsd.service
[15:16:38] ● │ │ │ ├─systemd-tmpfiles-setup.service
[15:16:48] the metadata is all there, there's just not a simple, existing, reliable way to show exactly what we want to see
[15:17:24] the biggest missing thing in the tooling is the ability to differentiate which services we care about
[15:17:37] yeah, unless you know or set those up, otherwise you just have to discover it
[15:17:42] because way further down the output there's stuff we don't care about like:
[15:17:45] ● │ │ │ │ │ ├─systemd-update-utmp.service
[15:17:48] ● │ │ │ │ │ │ ├─auditd.service
[15:17:50] ● │ │ │ │ │ │ ├─systemd-remount-fs.service
[15:17:53] ● │ │ │ │ │ │ │ ├─systemd-fsck-root.service
[15:19:45] I wonder if there's some metadata flag we could set (a systemd unit k=v) and filter on somehow, that we set just in the systemd units of real production services that we consider distinct from all the system-level noise
[15:21:23] I bet one could write a script for it anyways, maybe using the dbus API
[15:22:05] that would be ideal yeah.
because for our purpose, we care about the dependencies we set via Puppet
[15:52:18] XioNoX: sorry, got busy with other stuff. let's do it on Monday now then, not sure how long you are around. (I mean I can remove the statics but I'd rather you confirm stuff, do it, etc.)
[15:52:22] (ns0)
[15:54:59] I'm still around for a bit, but up to you
[16:00:50] XioNoX: meeting now sorry but happy to do it in ten :)
[16:14:54] XioNoX: let's do it
[16:16:10] sukhe: alright, let me know when to verify then remove the statics
[16:16:44] starting
[16:28:31] XioNoX: 208.80.154.6 208.80.154.153 208.80.154.77 all look OK to me, please check (I checked show route receive-protocol bgp and the hosts)
[16:30:25] sukhe: yep, lgtm!
[16:30:33] sukhe: good to remove the statics?
[16:30:36] nice!
[16:30:38] yep, let's do it
[16:31:50] sukhe: done and still replies to ping at least
[16:31:59] :P
[16:32:26] looks ok!
[16:33:02] sukhe: nice, which static do we remove next? :)
[16:33:11] haha
[16:33:24] XioNoX: I think we will need to recover from this a bit before the next one :)
[16:33:33] thanks for the help! glad we could get this done
[16:33:34] yeah of course
[16:33:49] everything is going according to your plan!
[16:33:59] I mean https://phabricator.wikimedia.org/T347054
[16:34:32] task description needs a list of checkboxes btw :)
[16:34:41] yeah, long time coming. I am just glad I don't have to deal with statics anymore
[16:34:48] yep, will add. thanks!
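
The filtering idea raised between 15:19 and 15:22 (show only the services we care about, drop the system-level noise) could be sketched without the dbus API by post-processing the text output of `systemctl list-dependencies`. This is only a minimal sketch and not something taken from the discussion: the allowlist `PRODUCTION_UNITS` and the helper `filter_units` are hypothetical names, the allowlist contents are assumed from the unit names pasted at 15:16, and the sample input is that same pasted snippet. The result is a flat list, so the nesting depth of the tree is lost.

```python
import re

# Hypothetical allowlist (an assumption): the production units that appear
# in the snippet pasted at 15:16. In practice this would have to come from
# wherever the deps are defined, e.g. Puppet.
PRODUCTION_UNITS = {
    "bird.service",
    "anycast-healthchecker.service",
    "pdns-recursor.service",
    "gdnsd.service",
}

# Matches a *.service unit name at the end of a line, after any
# tree-drawing characters that systemctl prints in front of it.
UNIT_RE = re.compile(r"([A-Za-z0-9@_.:\\-]+\.service)\s*$")


def filter_units(tree_output, wanted=PRODUCTION_UNITS):
    """Return the wanted unit names in order of first appearance."""
    seen = []
    for line in tree_output.splitlines():
        match = UNIT_RE.search(line)
        if match is None:
            continue
        unit = match.group(1)
        if unit in wanted and unit not in seen:
            seen.append(unit)
    return seen


# Sample input: the output pasted in the chat at 15:16.
SAMPLE = """\
bird.service
● ├─anycast-healthchecker.service
● │ ├─pdns-recursor.service
● │ │ ├─gdnsd.service
● │ │ │ ├─systemd-tmpfiles-setup.service
"""

if __name__ == "__main__":
    for unit in filter_units(SAMPLE):
        print(unit)
```

On a live host the input would be the output of `systemctl list-dependencies --after --all bird.service` instead of `SAMPLE`. A hand-maintained allowlist is the weak point of this approach, which is why the per-unit metadata flag suggested at 15:19 would be the cleaner long-term answer.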