[03:19:39] can someone check https://phabricator.wikimedia.org/T349671
[06:43:08] <_joe_> marostegui: tbh it seems like a specific upload failure and I guess it's not SRE realm
[06:57:12] yeah don't know, I got pinged last night at like 2am about it
[07:54:10] swift looks fine; I'm afraid I get a steady slow drip of odd upload failures (most of which aren't traceable to anything in the swift logs)
[07:55:00] I'll go log-diving for Tammy_2023_path.png on the swift frontends in a bit, but don't expect it'll produce anything
[07:58:27] <_joe_> Emperor: tbh I think the failure during uploads is more probable on the mediawiki side
[08:16:51] the logs agree :)
[08:17:05] ^- swift proxy
[12:47:28] taavi: vgutierrez: are you able to give another pass to https://gerrit.wikimedia.org/r/c/operations/puppet/+/968269
[14:52:08] jbond: done, looks good to me
[15:00:26] vgutierrez: cheers
[18:23:07] hrmm... the decommission cookbook is failing on me while trying to update dns. It fails on "Running zone_validator to check WMF rules" with a FileNotFoundError (can't find a temp file). Does this ring any bells for anyone?
[18:32:16] sukhe: any ideas? ^^
[18:32:40] urandom: can you paste the full error?
[18:33:00] https://www.irccloud.com/pastebin/XeECKH5G/
[18:33:17] oh
[18:33:47] which host were you decommissioning?
[18:33:53] restbase1016
[18:35:54] do you recognize this?
[18:36:46] yeah I know what's happening but not why. looking
[18:40:21] hmm
[18:40:27] $INCLUDE netbox/159.64.10.in-addr.arpa
[18:41:12] urandom: is the cookbook stalling for input now (if yes, what is it saying) or has it failed?
[18:41:46] it failed with that error for all of the dns servers, and it's prompting me to retry (which didn't help), to skip, or abort
[18:42:12] What do you want to do? "retry" the last command, manually fix the issue and "skip" the last command to continue the execution or completely "abort" the execution.
[18:42:44] I think the only thing that comes after this, is updating of the phab
[18:46:49] I do see what's wrong, just not sure if I should delete this
[18:47:01] essentially, there isn't any 159.64.10.in-addr.arpa file in netbox
[18:47:22] which is what the error says, and that makes sense, because I don't see any assignments under 10.64.159.0/24
[18:48:09] but
[18:48:18] in templates/10.in-addr.arpa, we include $INCLUDE netbox/159.64.10.in-addr.arpa
[18:48:23] which is why you get the error above
[18:48:50] uh.
[18:49:36] the fix would be to essentially remove this line, run the netbox dns cookbook and that should unblock us
[18:50:18] though I don't think this will be the only error, there is the v6 equivalent as well
[18:53:08] I am tracing back to see what could have caused this so that we can remove it
[18:53:13] and unblock, otherwise all authdns-updates will fail
[18:54:13] the $INCLUDE was added in July by toprank.s, surely that's not the direct cause
[18:55:59] depends. if you include the PTR file but no IP is assigned, there's nothing to include
[18:59:03] I just don't see anything in 10.64.159
[18:59:19] in https://netbox.wikimedia.org/extras/changelog/?per_page=1000 for example
[19:00:09] oh I do see it
[19:00:13] -vlan1054.lsw1-e8-eqiad 1H IN A 10.64.159.1
[19:00:20] this is in 10.64.159.0/24
[19:01:32] -1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR vlan1053.lsw1-e8-eqiad.eqiad.wmnet.
[19:01:44] sorry, 1054:
[19:01:44] -1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR vlan1054.lsw1-e8-eqiad.eqiad.wmnet.
[19:02:01] ok, I am going to remove the include and see what change netbox triggers
[19:06:00] topranks: around?
[19:06:18] hey
[19:06:21] ^ can I get a confirmation for the above please? it sounds right in my head but I want to check
[19:06:32] sukhe: hold off one second if you can
[19:06:40] yep not going to merge till I get your +1
[19:07:04] that's a switch Arzhel has been experimenting with - it's a Dell one in that rack
[19:07:20] I assume he removed the IP assignment, and thus netbox is no longer generating the file
[19:07:29] yep, that's my assessment
[19:07:44] but this currently borks authdns-update, so I thought we can remove the PTR include for now
[19:08:24] we either do that or add the missing records in netbox
[19:08:36] https://gerrit.wikimedia.org/r/c/operations/dns/+/968743/ this removes the PTRs for both v4 and v6
[19:08:40] whatever you think makes sense
[19:08:42] if it can't wait a few minutes go ahead, otherwise I'll double check if they are supposed to be there and removed in error
[19:08:59] we are not in a rush yeah but just wanted to unblock it
[19:09:09] so go ahead and check. thanks! I know it's late for you but I didn't feel comfortable removing them
[19:10:45] ultimately not a huge deal, these are planned subnets for those racks, so the includes will be needed eventually
[19:10:53] I don't see why netbox wouldn't be generating those files though:
[19:10:54] https://netbox.wikimedia.org/search/?q=vlan1054.lsw1-e8-eqiad.eqiad.wmnet&obj_type=
[19:11:18] yeah, I tried a dummy update for the netbox DNS cookbook, nothing there as well
[19:12:17] topranks: on the netbox DNS repo, HEAD, is eevans@cumin1001: restbase1016.eqiad.wmnet decommissioned, removing all IPs except the asset tag one
[19:12:22] --- a/159.64.10.in-addr.arpa
[19:12:22] +++ /dev/null
[19:12:22] @@ -1 +0,0 @@
[19:12:22] -1 1H IN PTR vlan1054.lsw1-e8-eqiad.eqiad.wmnet.
[19:12:50] one more thing:
[19:12:51] 2023-10-25 19:12:38,140 [WARNING] Device lsw1-e8-eqiad of IP 10.64.159.1/24 with DNS name vlan1054.lsw1-e8-eqiad.eqiad.wmnet not in devices, skipping.
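The file name in the diff above follows from the address in the warning: the reverse zone for 10.64.159.0/24 is 159.64.10.in-addr.arpa, and the host 10.64.159.1 is the record `1` inside it. A minimal Python sketch of that mapping, using only the standard `ipaddress` module. The helper name `reverse_zone_and_label` is hypothetical, and it assumes an IPv4 /24 as in this case; it is not taken from the netbox DNS tooling.

```python
import ipaddress
from typing import Tuple


def reverse_zone_and_label(interface: str) -> Tuple[str, str]:
    """Split an IPv4 /24 interface address into (reverse zone, record label).

    Hypothetical helper for illustration; only handles IPv4 /24 networks.
    """
    iface = ipaddress.ip_interface(interface)
    if iface.version != 4 or iface.network.prefixlen != 24:
        raise ValueError("this sketch only handles IPv4 /24 networks")
    # reverse_pointer gives the full PTR name of the host address,
    # i.e. the octets reversed followed by 'in-addr.arpa'.
    label, zone = iface.ip.reverse_pointer.split(".", 1)
    return zone, label


# The address from the warning in the log above.
print(reverse_zone_and_label("10.64.159.1/24"))
```

With the last IP in that /24 gone from the generated data, there is no record left to write, which matches the diff deleting the whole file.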
[19:13:51] I think that's it - gimme a sec
[19:14:25] status was changed: https://netbox.wikimedia.org/extras/changelog/149113/
[19:15:48] :D
[19:16:22] ok they are back now after changing back to status planned
[19:16:33] authdns-update should be ok now if you want to give it a shot
[19:16:35] awesome, so the netbox dns cookbook should pick it up
[19:16:37] trying
[19:16:50] I already ran the dns cookbook yeah
[19:17:40] yep! thanks! we are good now
[19:18:00] so should I use the "retry" option again?
[19:18:04] topranks: I owe you a brewery because I would have just uncommented it :)
[19:18:07] urandom: please do
[19:18:19] ah tbh that would have been fine too
[19:18:30] a brewery....lol
[19:18:31] yeah, if I couldn't reach you or arzhel, that was my plan :)
[19:18:37] I'll chat to Arzhel tomorrow - if the decision is made to change the device back to 'staged' status we'll do just that!
[19:18:58] I'm past that step: \o/
[19:19:02] urandom: I once bothered him at 3AM when we had an incident, so a beer just wouldn't cut it
[19:19:13] awesome :) thanks all!
[19:19:25] yes, much thanks!
[19:21:04] no probs, the 'include' niggles are annoying, hopefully we'll get some time to improve things soon
[19:32:11] yeah not the first time but also not that frequent
[19:54:21] lol
[19:54:25] https://www.irccloud.com/pastebin/noPCJYiP/
[19:54:30] I was so close
[19:56:01] nm, a retry succeeded there... I guess it was transient
[20:07:16] this one is expected as brett is reimaging
[20:07:41] but yeah transient
[20:09:33] yeah, I figured that was it
[20:15:30] topranks, sukhe, my bad! totally didn't think that changing the device's status would cause that
[20:26:29] XioNoX: all good, happens!
[22:04:50] you
[22:05:01] (ignore, wrong window)
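
The failure discussed above came down to a zone template carrying an `$INCLUDE` for a generated file that no longer existed. A minimal Python sketch of a pre-flight check for that condition follows. It is not how zone_validator or the cookbooks work; the function name `missing_includes`, the directory layout, and the record contents in the demo are assumptions for illustration, and it only looks at the first argument of each `$INCLUDE` line.

```python
import re
import tempfile
from pathlib import Path
from typing import List

# Matches '$INCLUDE <path>' at the start of a line; any origin argument is ignored.
INCLUDE_RE = re.compile(r"^\$INCLUDE\s+(\S+)", re.MULTILINE)


def missing_includes(template: Path, base_dir: Path) -> List[str]:
    """Return the $INCLUDE targets in `template` that are not files under `base_dir`."""
    targets = INCLUDE_RE.findall(template.read_text())
    return [t for t in targets if not (base_dir / t).is_file()]


# Self-contained demo in a throwaway directory (hypothetical layout and records).
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "netbox").mkdir()
    # One generated file exists, the other (159.64.10) is deliberately absent.
    (base / "netbox" / "0.64.10.in-addr.arpa").write_text("1 1H IN PTR host.example.\n")
    template = base / "10.in-addr.arpa"
    template.write_text(
        "$INCLUDE netbox/0.64.10.in-addr.arpa\n"
        "$INCLUDE netbox/159.64.10.in-addr.arpa\n"
    )
    demo_missing = missing_includes(template, base)
    print(demo_missing)
```

Run before an update, a check along these lines would name the dangling include directly instead of surfacing it as a FileNotFoundError during validation.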