[09:18:53] So I have run into an interesting oddity while trying to build an image on build2001: during some debian pkg install --- well, before it --- apt-get update fails:
[09:19:01] Failed to fetch http://security.debian.org/debian-security/dists/bookworm-security/InRelease Could not connect to security.debian.org:80 (151.101.194.132), connection timed out Could not connect to security.debian.org:80 (151.101.130.132), connection timed out Could not connect to security.debian.org:80 (151.101.2.132), connection timed out Could not connect to security.debian.org:80
[09:19:03] (151.101.66.132), connection timed out
[09:20:03] This then likely also is the root cause for unmet deps: libc6-dev : Depends: libc6 (= 2.36-9+deb12u4) but 2.36-9+deb12u7 is to be installed
[09:27:10] Oh, I'm a dum-dum, I used `docker-pkg` instead of `build-production-images`
[10:27:33] moritzm: ok to merge your build hook change?
[10:29:33] claime: yes, please
[10:30:38] moritzm: done
[10:30:40] thx
[14:00:07] my turn to ask this channel for help with troubleshooting reimages! I have a pair of servers, cloudvirt-wdqs1001/2, that just don't seem to want to network boot to the debian installer. when attached to the console via mgmt, I see the "PXE boot requested by iDRAC" text, but then it just continues to boot to the existing OS, which is not what I want
[14:00:50] and a semi-related question: is it expected to see a constant stream of "no free leases" on various subnets when looking at the dhcpd logs on install1004?
[14:01:14] yes that one is fine
[14:01:30] so where exactly is the reimage stuck? "PXE boot requested by iDRAC" and it doesn't actually do that?
[14:02:28] the cookbook reboots the server, and then instead of booting to debian-installer the server boots to the old OS instead
[14:06:27] can you try selecting PXE boot manually when it reboots and see what happens?
(attaching to the console beforehand so that you can do so)
[14:08:53] you mean trying https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Reboot_and_boot_into_BIOS_then_console? can do
[14:10:09] no, simply when it is rebooting, selecting PXE boot manually (I think it is F12)
[14:13:48] Make sure the BIOS is set to PXE boot off the correct NIC and the NIC firmware matches what's in the linked article
[14:14:03] that's a good check as well
[14:14:10] The "correct" NIC should be whatever is connected when it's in the OS
[14:14:24] (the NIC firmware looks fine for the R440)
[14:21:27] you can run the provision cookbook with specific options to check that btw, sorry I'm in a meeting
[14:23:21] taavi: 1G or 10G NICs?
[14:23:29] kamila_: 10G
[14:25:09] see https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Troubleshooting , in particular the legacy boot option in the NIC FW
[14:25:25] (and FW versions)
[14:26:31] or if it PXE booted exactly once, come back tomorrow :D
[14:27:36] (there's a weird bug with some specific NICs that I haven't quite understood where it won't do another PXE boot after the first successful one for ~12h)
[14:29:48] for the NIC FW options, you'll want F2 on the console and then device settings
[14:30:34] and then NIC settings IIRC in each device's menu
[14:31:22] kamila_ interesting. Pretty sure I ran into that before, but did not realize the issue
[14:31:41] inflatador: which one, the exactly once PXE boot?
[14:31:57] I need to jump to a meeting, but I'll keep looking afterwards
[14:31:57] Y
[14:32:03] taavi: lmk if you get stuck, I have it in fresh memory for now :D
[14:32:12] inflatador: so I'm not crazy? neat :D
[14:33:05] (saw it on 2 different hosts, so pretty sure it's deterministic, but I was doubting my sanity for a while :D)
[15:15:58] btullis: can I merge your puppet change together with mine?
[15:16:34] fabfur: Oh yes please.
Sorry. Was it a labs/private one?
[15:17:25] ack, tnx!
[16:39:57] do we have any alerts that read from the Dell SEL? Just wondering as I've had a few hosts go down from backplane errors
[16:40:05] re T367598 and others
[16:40:05] T367598: Elastic2099 unresponsive - https://phabricator.wikimedia.org/T367598
[16:40:54] inflatador: I don't think we do, but I think that's a good idea
[16:42:28] maybe automatic ticket creation. can likely be copied from the ones we get for RAID failure
[16:45:43] if you search phabricator for "racadm getsel" it's quite the selection of hardware tickets of the past
[16:47:24] but not automated/templated
[16:55:03] mutante excellent, will take a look
[16:58:55] inflatador: cool! example ticket that I mean that is autocreated: https://phabricator.wikimedia.org/T367678
[16:59:01] operations/puppet: grep -r "TASK AUTO-GENERATED"
[16:59:30] yea, that's icinga, but it works fine
[17:00:24] I guess it depends on how badly the host is borked. With these backplane failures, they usually end up unbootable
[17:01:02] ack, RAM failure is kind of common too though
[17:02:43] maybe the pollers could run a cmd like this and report back? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/ipmi/files/ipmi_mgmt.sh#29 . That command probably needs auth though
[17:06:39] yea, that should be the same thing as "racadm getsel" locally after SSH to mgmt with password
[17:08:11] [cumin1002:~] $ sudo ipmi_mgmt log
[17:09:01] maybe the easiest way is puppetizing a bash script and timer that runs ipmi_mgmt on cumin* and sends email, not even on alert*
[17:09:22] since it's already there
[17:11:37] OK, created T367790 to look at our options
[17:11:38] T367790: Hardware failures: consider alerting via SEL messages - https://phabricator.wikimedia.org/T367790
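
As a rough illustration of the SEL alerting idea discussed above, here is a minimal Python sketch of the filtering step only. It assumes the SEL text is pipe-delimited in roughly the style of `ipmitool sel elist` (id | date | time | sensor | event); the actual output of `ipmi_mgmt log` may differ. The keyword list, the function names and the sample entries are all hypothetical, made up for illustration, and fetching the log or sending mail is deliberately left out.

```python
import re

# Hypothetical keyword list; tune it to the failures that should raise an alert.
ALERT_PATTERN = re.compile(r"backplane|ecc|drive fault", re.IGNORECASE)


def parse_sel_line(line):
    """Split one pipe-delimited SEL line into a dict, or return None if it does not fit.

    Assumed layout (not confirmed for ipmi_mgmt): id | date | time | sensor | event...
    """
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 5:
        return None
    return {
        "id": fields[0],
        "date": fields[1],
        "time": fields[2],
        "sensor": fields[3],
        # Anything after the sensor column is kept together as the event text.
        "event": " | ".join(fields[4:]),
    }


def find_alertable(sel_text):
    """Return the parsed SEL entries whose sensor or event text matches ALERT_PATTERN."""
    entries = []
    for line in sel_text.splitlines():
        entry = parse_sel_line(line)
        if entry is None:
            continue  # skip blank or unparseable lines
        if ALERT_PATTERN.search(entry["sensor"]) or ALERT_PATTERN.search(entry["event"]):
            entries.append(entry)
    return entries


# Made-up sample entries, not taken from any real host.
SAMPLE = """\
1 | 06/14/2024 | 02:11:40 | Power Supply #0x63 | Presence detected | Asserted
2 | 06/17/2024 | 09:03:12 | Cable SAS A | Backplane cable not connected | Asserted
3 | 06/17/2024 | 09:05:00 | Memory #0x02 | Correctable ECC | Asserted
4 | 06/17/2024 | 09:06:00 | Fan #0x30 | Lower Critical going low | Deasserted
"""

if __name__ == "__main__":
    for entry in find_alertable(SAMPLE):
        print(entry["date"], entry["time"], entry["sensor"], "-", entry["event"])
```

A timer-driven wrapper on a cumin host could feed the real log text into `find_alertable` and mail or ticket only when the result is non-empty; keeping the parsing separate from the fetch makes the filter easy to test against saved SEL dumps.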