[07:46:57] Hi everyone! I have requested access to a given group/permission set (https://phabricator.wikimedia.org/T345633) but it turns out I messed up the SSH key generation step, as it seems both my wmf-cloud and wmf-prod are in use in WMCS. Indeed, I have added both to https://toolsadmin.wikimedia.org/profile/settings/ssh-keys/. I just wanted to make sure
[07:46:57] of what the next steps were. I was thinking to a) remove the wmf-prod key from toolforge, b) delete the current wmf-prod keypair, c) regenerate one and d) add the new public key to the ticket. Am I missing anything? Thanks!
[07:59:07] brouberol: that sounds about right. Also make sure to answer the question in the last comment in the task
[07:59:20] Thanks, and will do
[08:46:58] brouberol: do you need any help with that? :)
[08:48:34] I should be all good! I've just reached out to btullis to make sure that ssh+kerberos is indeed what I need, after which I'll post a response in phabricator. Sorry about the whole ssh key shenanigan
[08:48:45] Thanks!
[09:05:27] ack :)
[13:43:30] XioNoX: o/ around?
[13:44:21] elukey: s'up?
[13:44:48] do you mind joining #wikimedia-serviceops? We are discussing https://phabricator.wikimedia.org/T345738
[16:31:05] does anyone know what it means when dhcpd logs 'no free leases' when trying to pxe boot for a reimage?
[16:31:28] https://www.irccloud.com/pastebin/Fgd02Ry9/
[16:35:57] urandom: are you pxe booting a host manually?
[16:36:55] volans: oh, interesting. I am not, right now, but I may have done so from the virtual console when I was troubleshooting.
[16:37:03] that doesn't work
[16:37:11] dhcp is set ephemerally by the cookbooks
[16:37:19] when needed
[16:37:34] so either by the reimage cookbook or by the sre.hosts.dhcp cookbook if you need to manually debug
[16:37:45] ok, that explains it.
[16:38:02] https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#DHCP_Automation for more details
[16:38:15] volans: and it may shed some light on why I'm having problems
[16:38:42] urandom: it should work if you're running the reimage cookbook
[16:38:47] I can't reimage, when I attempt to via the cookbook it fails and I see no logged output on install1004
[16:39:12] when I manually did, I got the 'no free leases' error
[16:39:18] https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#What_to_do_if...
[16:39:22] hmm... is this the same server we spoke about last week? we discussed it briefly including about possibly a bad SFP?
[16:39:22] we've been treating it as a hardware issue, but now I wonder...
[16:39:30] topranks: right
[16:39:33] ok
[16:40:12] what's the server name again? I can have a little look
[16:40:49] restbase1030
[16:40:57] last week the switch-side was hard DOWN while it was trying to do DHCP as part of PXEboot
[16:41:11] ok
[16:41:30] the above paste suggests that DHCP DISCOVERs are making it, which requires the port to be UP obviously
[16:41:37] right.
[16:43:39] urandom: is there a reimage cookbook currently running against restbase1030?
[16:43:57] topranks: yeah, I just killed it
[16:44:25] you just killed the reimage cookbook?
[16:45:37] yes
[16:46:01] ok, if it's alright I'll kick it off again here? easier to compare the state of cookbook vs. server/switch for me
[16:46:12] sure
[16:46:16] were there any special flags to the cookbook command you were running?
[16:46:46] --new --os bullseye -t T331713 restbase1030
[16:46:47] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713
[16:47:10] urandom: cool thanks let me give it a shot see if I can figure out what's broken
[16:56:42] urandom: exact same symptoms as before, I'm gonna double check a few things
[16:57:37] topranks: meaning that the link state is down?
[16:57:45] yep
[16:58:42] the port stats show it was up yesterday though, around the time matching your logs from above of the DHCPDISCOVERs hitting install1004
[17:01:58] that's...awesome
[17:30:50] urandom: were the physicals checked out for restbase1030?
[17:31:18] everything I'm seeing points to a bad cable or possibly SFP still
[17:33:45] topranks: sounds like the optic and cable were replaced? https://phabricator.wikimedia.org/T344259#9142803
[17:34:24] yeah seems like it
[17:34:26] hmm
[17:34:37] I'll update the task, I'm kind of stumped tbh
[17:35:44] topranks: cool; thanks for having a look
[17:36:26] sorry I was in a meeting, quick check, are you changing OS? did you upgrade the firmware of the nic?
[17:36:42] volans: yes, and yes.
[17:37:03] to the correct version or the latest? for some combination of nic/OS we do have a specific version that works
[17:37:06] and not the latest
[17:37:18] sorry for double checking if this was already discussed
[17:37:37] right, I think that's for the 10G nics though, right?
[17:38:42] possibly yes
[17:38:58] that recommended version won't even update to these onboard broadcoms.... unless there is another, separate version for the 1G broadcoms
[17:39:20] https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Urgent_Firmware_Revision_Notices:
[17:39:42] but I'll defer to pap.aul/dcops on those
[17:40:18] urandom: yes I believe we've only ever had problems with NICs/ports staying down on the 10G PCIe cards
[17:43:36] I updated the task there.. TL;DR I was able to get the interface up by forcing the switch side to 100Mb
[17:44:00] The first thing that would suggest to me is bad cables (only certain pairs working), but that's already been checked so who knows
[17:46:07] topranks: is it still set to 100mb?
[17:46:22] no I set the config back to normal
[17:46:27] Ok
[17:46:47] I can force it back down to 100Mb, but tbh there is some problem there that's not a fix
[17:46:59] it seems like it can be made to transition to up, but then drops when it tries to pxe
[17:48:36] which is something we have seen before: https://phabricator.wikimedia.org/T340055
[17:49:04] it took replacing the SFP with a specific brand to get it working there
[17:49:21] and it happened again in the same cluster shortly after, same thing
[17:50:17] we don't have any of that brand in eqiad though
[17:51:47] yeah can't rule it out, but given there are so many working ports on the same switch with the same model fiberstore SFP it seems unlikely there is a wider incompatibility between NIC/SFP/Switch
[17:58:27] urandom: I'm gonna try a firmware downgrade on the NIC
[17:58:34] clutching at straws but worth a shot
[17:58:52] topranks: ok
[17:59:16] if I ever start a band, that'll be the name: Clutching At Straws
[17:59:54] haha
[18:00:10] I love that :)
[18:00:48] ;)