[14:34:09] effie: my rename cookbook wants to remove DNS records for mc-gp1001. Is that ok for you? it seems you decommissioned it but the cookbook failed
[14:34:37] jelto: I was looking at the failure trying to understand what to do next
[14:34:53] mc-gp1001 1002 and 1003 are up for decon
[14:34:56] mc-gp1001 1002 and 1003 are up for decom
[14:35:47] jelto: do what it says, and I will figure out the rest afterwards, it failed during sre.dns.netbox, maybe we stepped on each others toes
[14:36:27] jelto: ping me when you are done so I can take a look at what is left
[14:36:41] okay then I'll proceed, thank you
[14:40:20] effie: my run also fails :(
[14:40:20] FileNotFoundError: [Errno 2] No such file or directory: '/tmp/dns-check.62z88mtu/zones/netbox/8.65.10.in-addr.arpa'
[14:41:17] hmmm
[14:41:35] ok lets compare notes
[14:42:28] https://etherpad.wikimedia.org/p/netbox-fail
[14:42:33] was a host decommissioned?
[14:42:43] this means that it's trying to include $INCLUDE netbox/8.65.10.in-addr.arpa
[14:42:59] but it can't find any IP addresses in there and hence the zone file is not created and therefore it fails
[14:43:07] yes effie was decoming a node
[14:43:18] sukhe: I was decomming hosts, jelto was renaming
[14:44:14] sukhe: how can I help sorting this ?
[14:44:52] effie: no worries, looking
[14:45:18] effie: mc-gp1003 is also decommissioned?
[14:45:25] ok I see it
[14:45:26] sukhe: yes, 1001-1002-103
[14:45:28] -mc-gp1003 1H IN A 10.65.8.18
[14:45:29] ok
[14:45:32] I released my netbox cookbook lock
[14:45:51] there doesn't seem to be any active IP addresses in that so that's why it fails
[14:46:04] as in, this was the last active one there?
[14:46:23] yeah. I am fixing it and then will run the netbox cookbook again (otherwise this breaks all DNS updates)
[14:46:40] cool, sorry for that, no idea it was possible
[14:46:47] no, not your fault at all
[14:47:01] this is because the include system is kinda messy and we have talked about fixing it (including topranks who has actually worked on it)
[14:48:58] how is there no IP on the subnet anymore though?
[14:49:30] https://netbox.wikimedia.org/search/?q=10.65.8
[14:49:38] ok yeah mgmt IP, plenty of IPs on the subnet but not that particular /24
[14:50:03] hopefully we can make some progress on T362985
[14:50:03] T362985: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985
[14:52:51] running netbox dns cookbook now
[14:53:07] great thanks
[14:55:52] ok you should be unblocked. please let me know if that's not the case
[14:56:16] nice, thank you. Effie do you want to proceed or should I?
[14:57:16] jelto: have a go please
[14:57:51] sukhe: is there anything else I should do, or I am all done?
[14:58:22] effie: should be all cleaned up
[14:58:35] excellent, thank you !
[14:58:36] the longer fix is in the ticket topranks linked above and so we will pick that up but nothing from your side
[15:01:05] thank you for the help
[15:01:29] effie: thanks, I'll let you know when the rename is finished. I think I'm already beyond the netbox step
[15:01:49] effie: and I'm done
[15:02:32] jelto: netbox dns was the last part of the decom cookbook I believe
[15:02:42] ok :)
[15:09:40] UEFI/partman question: if I wanted to make an UEFI partman recipe for a 4-disk software RAID-10, would it just be a minor tweak to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/install_server/files/autoinstall/partman/raid1-2dev-efi.cfg ? Or are there other considerations
[15:30:46] inflatador: the definitive partman reference is trial-and-error in production
[15:32:58] cdanis good to hear partman hasn't changed a bit over the years ;P
[15:33:53] oh it's changed lots of bits actually
[15:34:00] I have a host that's failing to reimage (wdqs1025) with "media test failure"...seriously doubt UEFI will change anything but I thought I might kick that can
[15:34:39] that sounds more like a boot settings issue
[15:35:03] either there is not entry for the efi boot disk, or the efi partition was not setup correctly
[15:35:33] I haven't tried EFI at all yet. I agree that it's more likely a layer 1 or boot settings issue
[15:35:44] ah, sorry didn't realize
[15:36:14] Happy to review the partman patch, or craft one
[15:36:30] I'll keep troubleshooting, but I was just wondering the level of effort required to reimage as UEFI. Totally fine if it's not quite ready yet
[15:37:09] * inflatador wonder if we could use cloud-init for partitioning once UEFI is ready
[15:38:06] thanks jhathaway , will CC you when I get something up
[15:38:18] we have an outstanding bug on the supermicro side, that makes the initial imaging cumbersome, but there are no known issues on the dell side. So we have made it available for initial use, but there may be some dragons still lurking
[15:38:28] happy to help debug issues, if you want to give it a try
[15:40:33] SGTM. Will hit you up once I've done my due diligence re: Layer1/boot settings
[18:08:50] sirenbot: !incidents
[18:09:00] !incidents
[18:09:00] 5498 (ACKED) kafka-main1003/Kafka Broker Server (paged)
[18:09:00] 5500 (UNACKED) ProbeDown sre (185.15.58.225 ip4 text-https:443 probes/service http_text-https_ip4 drmrs)
[18:09:05] !ack 5500
[18:09:05] 5500 (ACKED) ProbeDown sre (185.15.58.225 ip4 text-https:443 probes/service http_text-https_ip4 drmrs)
[20:48:28] jhathaway here's the partman patch...low priority fer sure https://gerrit.wikimedia.org/r/c/operations/puppet/+/1099740
[20:49:15] inflatador: looks good to me for testing, +1'd
[20:51:16] Thanks...the host mentioned above does appear to be a layer 1 issue, so might be a sec before I can test
[20:56:50] !incidents
[20:56:50] 5498 (ACKED) kafka-main1003/Kafka Broker Server (paged)
[20:56:50] 5501 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (asw2-c-eqiad.mgmt.eqiad.wmnet)
[20:56:50] 5502 (ACKED) Primary inbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[20:56:51] 5500 (RESOLVED) ProbeDown sre (185.15.58.225 ip4 text-https:443 probes/service http_text-https_ip4 drmrs)