[07:36:33] <_joe_> if anyone here has experience writing parsing rules for rsyslog, your help would be really appreciated
[07:39:58] <_joe_> as in - I have no idea how to write one :P
[08:54:46] anything ongoing in eqsin?
[08:54:47] Could not evaluate: Could not retrieve file metadata for puppet:///pluginfacts: Failed to open TCP connection to puppet:8140 (getaddrinfo: Name or service not known)
[08:55:02] it seems that all cp5* nodes are having issues
[08:56:38] started around 8:11 UTC https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?viewPanel=6&orgId=1
[08:56:57] I've been wondering the same
[08:57:03] cp5 nodes seem to be depooled
[08:57:05] https://gerrit.wikimedia.org/r/c/operations/dns/+/732380/
[08:57:07] and I wasn't able to find anything on the SAL
[08:57:15] (I've been out a few days)
[08:57:18] drmrs records are added before ulsfo services
[08:58:05] majavah: what do you mean "before ulsfo services" ?
[08:58:47] XioNoX: if you look directly under "$INCLUDE netbox/drmrs.wmnet" in templates/wmnet, you can see some ulsfo records that are now in the wrong $ORIGIN zone
[08:59:07] * jbond looking at the puppet issues
[08:59:40] so "puppet 5M IN CNAME puppetmaster1001.eqiad.wmnet." was previously turned into puppet.eqsin.wmnet, it's now puppet.drmrs.wmnet
[08:59:59] uh
[09:00:09] yeah, we should revert that ASAP :)
[09:00:17] wow nice majavah
[09:00:24] ok, I see
[09:00:24] indeed puppet is not resolving in eqsin
[09:00:34] nice catch majavah
[09:01:02] indeed, good catch
[09:01:29] actually those should be duplicated
[09:01:35] as we'll need them in drmrs too
[09:02:00] https://gerrit.wikimedia.org/r/c/operations/dns/+/733914
[09:02:11] +1
[09:02:18] just add the records there
[09:02:27] moving forward seems quicker, but as you want
[09:02:46] i think revert this will also be breaking other things
[09:02:57] I'm in a moving train, so I'd rather not merge/deploy it in case I lose signal
[09:03:04] ack I'll merge
[09:03:30] let me send the right fix
[09:03:40] to be merged after this has been solved
[09:03:45] ack
[09:04:56] * jbond revert merged
[09:06:02] ack
[09:10:58] i have also cleared the cache for eqsin.wmnet domains so things should be resolving now (my tests confirm)
[09:12:59] thanks
[09:15:00] I've sent https://gerrit.wikimedia.org/r/c/operations/dns/+/734208
[09:15:11] * jbond looking
[09:15:20] I've picked the 'eqiad' endpoints for those that are spread between eqiad/codfw
[09:15:33] we should probably re-evaluate those
[09:15:41] as some seem suboptimal
[09:19:38] yes agreed
[09:20:04] jbond: for the prometheus comment, would it also work putting the one in esams?
[09:20:26] I'm assuming esams being the closest to drmrs but to be tested ofc :)
[09:21:07] volans: actually ignore that, looks like they expect a site-local prometheus, i guess that still needs to be built?
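Side note on the $ORIGIN problem discussed at 08:58:47 to 08:59:40: a relative owner name in a zone file is completed with whatever $ORIGIN is in effect at that point, so placing a block that changes the origin above existing relative records silently moves them. The sketch below is a toy model only, not the real zone templates or the DNS server's parser; the helper names `qualify` and `parse_zone` and the three-line zone fragments are made up for illustration, and the effect of the new include is modelled as a plain `$ORIGIN` line.

```python
def qualify(name: str, origin: str) -> str:
    """Complete a zone-file name against the current origin."""
    if name.endswith("."):
        return name          # already absolute
    if name == "@":
        return origin        # "@" stands for the origin itself
    return f"{name}.{origin}"


def parse_zone(lines, default_origin):
    """Return (fully qualified owner, rest of record) pairs.

    Toy parser: handles only $ORIGIN directives and single-line
    records that start with an owner name.
    """
    origin = default_origin
    records = []
    for raw in lines:
        line = raw.split(";", 1)[0].strip()   # drop comments
        if not line:
            continue
        fields = line.split()
        if fields[0] == "$ORIGIN":
            origin = qualify(fields[1], origin)
            continue
        records.append((qualify(fields[0], origin), " ".join(fields[1:])))
    return records


# Hypothetical fragment before the change: the record lands in eqsin.
before = [
    "$ORIGIN eqsin.wmnet.",
    "puppet 5M IN CNAME puppetmaster1001.eqiad.wmnet.",
]

# Hypothetical fragment after the change: a new origin now sits
# between the eqsin origin and the record that relied on it.
after = [
    "$ORIGIN eqsin.wmnet.",
    "$ORIGIN drmrs.wmnet.",
    "puppet 5M IN CNAME puppetmaster1001.eqiad.wmnet.",
]

print(parse_zone(before, "wmnet.")[0][0])
print(parse_zone(after, "wmnet.")[0][0])
```

The record text is identical in both fragments; only the origin in effect differs, which is why the breakage shows up as a name that stops resolving rather than as an obviously wrong line in the diff.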
[09:21:24] there is nothing yet over there
[09:21:45] then probably fine to leave it until there is stuff, up to you
[09:22:20] yeah I guess so
[09:22:24] same for the ganeti cluster
[09:22:32] yes
[09:25:14] XioNoX: I'll wait for you to have a look too
[09:25:17] no hurry
[09:34:32] volans: I'll be back on my laptop in an hour or so
[09:34:39] ack
[13:38:57] Fyi, I detailed what happened Friday evening with the Equinix IXP port causing a partial eqiad outage on https://phabricator.wikimedia.org/T293726#7454820
[16:10:42] volans: if you're around - we've tried re-naming a (not-live) host in netbox, but the cookbook tool to push the change said there was no diff to commit.
[16:11:07] this host: https://netbox.wikimedia.org/dcim/devices/3561/ (it was "bast6001", we've renamed it in netbox Edit to ganeti6004)
[16:11:14] mmandere: ^
[16:12:18] https://netbox.wikimedia.org/extras/changelog/68328/ is the changelog diff in the gui
[16:15:05] bblack: need to update DNS name from the IPs (v4 and v6)
[16:15:06] https://netbox.wikimedia.org/ipam/ip-addresses/9392/
[16:15:17] https://netbox.wikimedia.org/ipam/ip-addresses/9393/
[16:15:59] and update the host asset tag as well to a temporary ganeti6004a, this will add a diff too
[16:16:17] mmandere: ^
[16:16:20] heh
[16:16:33] I assumed the names were relational and changing it in one place would also change the linked IPs
[16:16:36] I guess not! :)
[16:16:42] and https://netbox.wikimedia.org/ipam/ip-addresses/9394/
[16:17:24] in theory it's possible, but there are too many edge cases
[16:18:11] mmandere: ok, so... gotta edit all those 3 links above to change the names in them
[16:18:22] and yeah, the asset tag too
[16:18:32] can just do them all, and then run the cookbook once at the end
[16:18:50] bblack: got it
[16:19:23] thanks XioNoX :)
[16:25:25] bblack: reading backlog
[16:25:39] volans: X already cleared it up for us, user error on our part :)
[16:26:24] no prob, if in doubt this is useful bblack
[16:26:24] https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging
[16:26:34] ofc skip the parts of the reimage as this host doesn't exist yet
[16:27:25] actually that might even skip the part you needed because of the decom+reimage that removes the IPs, so yeah not very helpful in this case I guess
[16:27:50] I assume for this case (data only, host doesn't yet exist or need reimaging), we can skip the cable/switch-port stuff?
[16:28:22] yeah
[16:28:41] dunno if they physically labelled the host though
[16:28:56] they = remote hands
[16:29:10] yeah we talked about it in the meeting earlier, rob says they probably did, worst case you can write over it or replace it on-site
[16:30:05] it's nice that the netbox report got the issue by itself too
[16:30:05] Invalid management interface DNS (bast6001.mgmt.drmrs.wmnet != ganeti6004.mgmt.drmrs.wmnet)
[16:30:25] added to my todo
[16:30:31] volans: yeah we've already fixed that and re-running
[16:30:41] we meaning marc, not me :)
[16:31:40] lol, yep saw the fix in netbox, report happy
[21:29:22] rsync::quickdatacopy now has a new parameter "exclude" which passes through --exclude to rsync, if you always wanted to ignore that one file that makes no sense to sync but causes races and alerts like https://phabricator.wikimedia.org/T294080 or whatever
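Side note on the "exclude" parameter mentioned at 21:29:22: the message only says the value is passed through to rsync's `--exclude` option. The sketch below shows what such a pass-through could look like when an rsync command line is assembled. It is not the Puppet code of rsync::quickdatacopy; the function name `rsync_command`, the `-a` flag, the paths, and the host name are all assumptions picked for illustration.

```python
import shlex


def rsync_command(source, dest, exclude=None):
    """Build an rsync argument list, adding one --exclude per pattern.

    Hypothetical helper: only the --exclude pass-through mirrors the
    chat message, every other flag here is an assumption.
    """
    cmd = ["rsync", "-a"]
    for pattern in exclude or []:
        cmd.append(f"--exclude={pattern}")
    cmd += [source, dest]
    return cmd


# Made-up example: skip a lock file that changes during the copy.
cmd = rsync_command(
    "/srv/data/",
    "backup.example.org::data/",
    exclude=["*.lock"],
)
print(shlex.join(cmd))
```

Building the command as a list and only joining it for display keeps patterns such as `*.lock` from being expanded by a shell before rsync sees them.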