[09:18:53] <klausman>	 So I have run into an interesting oddity while trying to build an image on build2001: during some debian pkg install --- well, before it --- apt-get update fails:
[09:19:01] <klausman>	 Failed to fetch http://security.debian.org/debian-security/dists/bookworm-security/InRelease  Could not connect to security.debian.org:80 (151.101.194.132), connection timed out Could not connect to security.debian.org:80 (151.101.130.132), connection timed out Could not connect to security.debian.org:80 (151.101.2.132), connection timed out Could not connect to security.debian.org:80
[09:19:03] <klausman>	 (151.101.66.132), connection timed out
[09:20:03] <klausman>	 This then likely also is the root cause for unmet deps: libc6-dev : Depends: libc6 (= 2.36-9+deb12u4) but 2.36-9+deb12u7 is to be installed
[09:27:10] <klausman>	 Oh, I'm a dum-dum, I used `docker-pkg` instead of `build-production-images`
[10:27:33] <claime>	 moritzm: ok to merge your build hook change?
[10:29:33] <moritzm>	 claime: yes, please
[10:30:38] <claime>	 moritzm: done
[10:30:40] <moritzm>	 thx
[14:00:07] <taavi>	 my turn to ask this channel for help with troubleshooting reimages! I have a pair of servers, cloudvirt-wdqs1001/2, that just don't seem to want to network boot to the debian installer. when attached to the console via mgmt, I see the "PXE boot requested by iDRAC" text, but then it just continues to boot to the existing OS, which is not what I want
[14:00:50] <taavi>	 and a semi-related question: is it expected to see a constant stream of "no free leases" on various subnets when looking at the dhcpd logs on install1004?
[14:01:14] <sukhe>	 yes that one is fine
[14:01:30] <sukhe>	 so where exactly is the reimage stuck? "PXE boot requested by iDRAC" and it doesn't actually do that?
[14:02:28] <taavi>	 the cookbook reboots the server, and then instead of booting to debian-installer the server boots to the old OS instead
[14:06:27] <sukhe>	 can you try selecting PXE boot manually when it reboots and see what happens? (attaching to the console beforehand so that you can do so)
[14:08:53] <taavi>	 you mean trying https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Reboot_and_boot_into_BIOS_then_console? can do
[14:10:09] <sukhe>	 no, simply when it is rebooting, selecting PXE boot manually (I think it is F12)
[14:13:48] <inflatador>	 Make sure the BIOS is set to PXE boot off the correct NIC and the NIC firmware matches what's in the linked article
[14:14:03] <sukhe>	 that's a good check as well
[14:14:10] <inflatador>	 The "correct" NIC should be whatever is connected when it's in the OS
[14:14:24] <sukhe>	 (the NIC firmware looks fine for the R440)
[14:21:27] <volans>	 you can run the provision cookbook with specific options to check that btw, sorry I'm in a meeting
[14:23:21] <kamila_>	 taavi: 1G or 10G NICs?
[14:23:29] <taavi>	 kamila_: 10G
[14:25:09] <kamila_>	 see https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Troubleshooting , in particular the legacy boot option in the NIC FW
[14:25:25] <kamila_>	 (and FW versions)
[14:26:31] <kamila_>	 or if it PXE booted exactly once, come back tomorrow :D 
[14:27:36] <kamila_>	 (there's a weird bug with some specific NICs that I haven't quite understood where it won't do another PXE boot after the first successful one for ~12h)
[14:29:48] <kamila_>	 for the NIC FW options, you'll want F2 on the console and then device settings
[14:30:34] <kamila_>	 and then NIC settings IIRC in each device's menu
[14:31:22] <inflatador>	 kamila_ interesting. Pretty sure I ran into that before, but did not realize the issue
[14:31:41] <kamila_>	 inflatador: which one, the exactly once PXE boot? 
[14:31:57] <taavi>	 I need to jump to a meeting, but I'll keep looking afterwards
[14:31:57] <inflatador>	 Y
[14:32:03] <kamila_>	 taavi: lmk if you get stuck, I have it in fresh memory for now :D 
[14:32:12] <kamila_>	 inflatador: so I'm not crazy? neat :D 
[14:33:05] <kamila_>	 (saw it on 2 different hosts, so pretty sure it's deterministic, but I was doubting my sanity for a while :D) 
[15:15:58] <fabfur>	 btullis: can I merge your puppet change together with mine? 
[15:16:34] <btullis>	 fabfur: Oh yes please. Sorry. Was it a labs/private one?
[15:17:25] <fabfur>	 ack, tnx!
[16:39:57] <inflatador>	 do we have any alerts that read from the Dell SEL? Just wondering as I've had a few hosts go down from backplane errors
[16:40:05] <inflatador>	 re T367598 and others
[16:40:05] <stashbot>	 T367598: Elastic2099 unresponsive - https://phabricator.wikimedia.org/T367598
[16:40:54] <mutante>	 inflatador: I don't think we do, but I think that's a good idea
[16:42:28] <mutante>	 maybe automatic ticket creation. can likely be copied from the ones we get for RAID failure
[16:45:43] <mutante>	 if you search phabricator for "racadm getsel" it's quite the selection of hardware tickets of the past
[16:47:24] <mutante>	 but not automated/templated
[16:55:03] <inflatador>	 mutante excellent, will take a look
[16:58:55] <mutante>	 inflatador: cool! example ticket that I mean that is autocreated: https://phabricator.wikimedia.org/T367678  
[16:59:01] <mutante>	 operations/puppet:   grep -r "TASK AUTO-GENERATED"
[16:59:30] <mutante>	 yea, that's icinga, but it works fine
[17:00:24] <inflatador>	 I guess it depends on how badly the host is borked. With these backplane failures, they usually end up unbootable
[17:01:02] <mutante>	 ack, RAM failure is kind of common too though
[17:02:43] <inflatador>	 maybe the pollers could run a cmd like this and report back? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/ipmi/files/ipmi_mgmt.sh#29 . That command probably needs auth though
[17:06:39] <mutante>	 yea, that should be the same thing as "racadm getsel" locally after SSH to mgmt with password
[17:08:11] <mutante>	 [cumin1002:~] $ sudo ipmi_mgmt log
[17:09:01] <mutante>	 maybe the easiest way is puppetizing a bash script and timer that runs ipmi_mgmt on cumin* and sends email, not even on alert*
[17:09:22] <mutante>	 since it's already there
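
A minimal sketch of the filtering step such a script might perform, assuming the log output is in the pipe-separated layout that `ipmitool sel list` prints (record id | date | time | sensor | event | state). Whether `ipmi_mgmt log` emits exactly this layout is an assumption, and the `SelEntry`, `parse_sel`, `alertable` names and the keyword list are hypothetical. The timer and the email delivery are not shown.

```python
from typing import Iterable, List, NamedTuple, Sequence


class SelEntry(NamedTuple):
    record_id: str
    date: str
    time: str
    sensor: str
    event: str


# Hypothetical keyword list; tune it to the failures worth a ticket or email.
ALERT_KEYWORDS = ("backplane", "ecc", "failure", "critical")


def parse_sel(lines: Iterable[str]) -> List[SelEntry]:
    """Parse pipe-separated SEL lines, skipping anything that does not fit."""
    entries = []
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 5:
            continue  # banner, blank, or otherwise unparseable line
        # Everything after the sensor column is kept together as the event text.
        entries.append(
            SelEntry(fields[0], fields[1], fields[2], fields[3], " | ".join(fields[4:]))
        )
    return entries


def alertable(
    entries: Iterable[SelEntry], keywords: Sequence[str] = ALERT_KEYWORDS
) -> List[SelEntry]:
    """Return the entries whose sensor or event text mentions a keyword."""
    matches = []
    for entry in entries:
        text = f"{entry.sensor} {entry.event}".lower()
        if any(keyword in text for keyword in keywords):
            matches.append(entry)
    return matches
```

The caller would feed it the captured output of the log command, one line per element, and mail or ticket whatever `alertable` returns. Since badly broken hosts often cannot report on themselves, running this from a separate host against the management interface matches the suggestion above.
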
[17:11:37] <inflatador>	 OK, created T367790 to look at our options
[17:11:38] <stashbot>	 T367790: Hardware failures: consider alerting via SEL messages - https://phabricator.wikimedia.org/T367790