[00:50:27] * bd808 off
[14:30:06] dhinus: I woke up a bit early today, want to try rebuilding a cloudservices node today? Or are you in the middle of doing virts?
[14:31:38] oh you're in a meeting, sorry
[14:49:09] yep but the meeting's over now. I was thinking of doing cloudservices before cloudvirts, so I can start with one and see how it goes... any concerns before I kick off?
[15:01:51] that will take one of the ns-auth IPs down but I think that's fine for a short period of time
[15:05:32] dhinus: I don't think I have concerns other than it being a complicated mess to bring it back up (which you already know)
[15:11:13] ok I will start the reimage of cloudservices1005 then, and try to follow the guide here when it's reimaged https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/DNS/Designate#Initial_designate/pdns_node_setup
[15:12:10] I think I should've updated those docs when I reimaged codfw, but by now I've already forgotten what was different
[15:12:30] let me know how it goes!
[15:12:42] I'll post updates here
[15:13:45] reimage cookbook started
[15:25:17] I noticed some alerts are duplicated in alerts.wm.org, one having receiver: wmcs-ircmail and one having receiver: default
[17:27:31] there are some icinga alerts on cloudservices1005 after the reimage
[17:27:36] "Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org"
[17:28:17] maybe because of some schema issue in the pdns db that andrewbogott is currently fixing
[17:30:09] I have to log off for a bit but I'll check back later
[19:16:59] * bd808 lunch
[19:36:44] andrewbogott any luck with fixing cloudservices1005? can I help?
[19:37:25] I think it is mostly working except for a slowness issue with xfr which I've seen before but I'm trying to decide on the one true solution
[19:37:31] are you seeing other issues?
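[Editor's note: the "Check DNS auth via TCP" alert quoted above can be reproduced by hand. A minimal sketch, assuming the check amounts to a SOA lookup over TCP against the authoritative server named in the alert; this is an illustration using only the Python stdlib, not the actual icinga plugin:]

```python
# Sketch: query a DNS server over TCP (RFC 1035), the transport the failing
# icinga check uses. The name and server in the example comment below are the
# ones quoted in the alert; everything else is illustrative.
import socket
import struct

def build_query(name: str, qtype: int = 6, txn_id: int = 0x1234) -> bytes:
    """Encode a DNS query packet; qtype 6 = SOA, a typical auth-server probe."""
    header = struct.pack(">HHHHHH", txn_id, 0, 1, 0, 0, 0)  # flags 0, QDCOUNT=1
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack(">HH", qtype, 1)  # QCLASS 1 = IN

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("short read from DNS server")
        buf += chunk
    return buf

def tcp_query(server: str, name: str, timeout: float = 5.0) -> bytes:
    """Send the query with the 2-byte length prefix DNS-over-TCP requires."""
    msg = build_query(name)
    with socket.create_connection((server, 53), timeout=timeout) as s:
        s.sendall(struct.pack(">H", len(msg)) + msg)
        rlen = struct.unpack(">H", _recv_exact(s, 2))[0]
        return _recv_exact(s, rlen)

# e.g. tcp_query("ns0.openstack.eqiad1.wikimediacloud.org",
#                "k8s.svc.tools.eqiad1.wikimedia.cloud")
```

[The same probe can be run with `dig +tcp soa k8s.svc.tools.eqiad1.wikimedia.cloud @ns0.openstack.eqiad1.wikimediacloud.org`, which is usually quicker when debugging interactively.]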
[19:37:54] alerts are still not happy (both icinga and alertmanager)
[19:38:30] they're all about "DNS auth"
[19:38:34] oh, I got logged out of alertmanager :/ looking now
[19:38:58] I think they're actually the very same alerts that are generated in icinga and also shown in am
[19:39:42] this is probably the same issue as the one I'm looking at although I would've expected it to sync by now. Let me try a few more things...
[19:39:56] I think it has to do with the 'master' field in the domains table
[19:42:56] I see some recoveries now
[19:44:47] I'm still not sure this is exactly right
[19:45:23] Let's see... iirc there's some lack of symmetry where local traffic comes in on the public ip and external on the private, let's see if I can confirm that
[19:50:17] ok, now I think this might be actually right
[19:50:46] dhinus: all recovered, right?
[19:51:15] yep, confirmed
[19:51:49] great. I'll continue to test but you should go back to not working. Thanks for stopping by!
[19:52:20] if it remains stable I will try to reimage the second host tomorrow... anything I should do apart from the schema and dump restore?
[19:52:27] * andrewbogott tries to think how to document this weirdness
[19:53:03] I guess it should be easier now that you fixed this one, and I can dump from bookworm to bookworm
[19:53:17] Well... there's no perfect solution now. If you dump and restore 1006 then you'll get the right schema but the wrong master records
[19:53:25] ah ouch
[19:53:33] shall I wait before reimaging then?
[19:53:38] oh, sorry, I mean the other way around
[19:54:05] No, the right thing is to dump 1005, restore to 1006 and then fix the masters
[19:54:15] I'll document that so that it hopefully makes sense.
[19:54:25] ok thanks, maybe drop a message later in the phab task, and I'll take it from there :)
[19:54:30] sure
[19:54:44] thanks!
[19:54:57] * dhinus off
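[Editor's note: the "fix the masters" step discussed above amounts to rewriting the `master` column of the pdns `domains` table after the dump from 1005 is restored on 1006, since the restored rows still reference the old master. A minimal illustration of that rewrite; the real database is MySQL/MariaDB, and both IP addresses and zone names below are placeholders, not the actual deployment values:]

```python
# Hypothetical sketch of repointing slave zones after a dump/restore.
# Uses in-memory sqlite purely for illustration of the SQL involved.
import sqlite3

OLD_MASTER = "203.0.113.5"  # placeholder: whatever the restored dump contains
NEW_MASTER = "203.0.113.6"  # placeholder: the master slaves should now pull from

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE domains (id INTEGER PRIMARY KEY, name TEXT, master TEXT, type TEXT)"
)
conn.executemany(
    "INSERT INTO domains (name, master, type) VALUES (?, ?, ?)",
    [
        ("tools.eqiad1.wikimedia.cloud", OLD_MASTER, "SLAVE"),
        ("example.test", None, "NATIVE"),  # non-slave zones have no master to fix
    ],
)
# Rewrite only the rows that actually reference the old master.
cur = conn.execute(
    "UPDATE domains SET master = ? WHERE master = ?", (NEW_MASTER, OLD_MASTER)
)
conn.commit()
print(cur.rowcount)  # number of zones repointed
```

[Against the real pdns database the UPDATE is the same shape; the part that cannot be sketched from this log is which address is "right", which is exactly the asymmetry (public vs private IP) being debugged above.]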