[09:33:13] o/ do we support booting/imaging/etc servers in UEFI mode rather than BIOS mode? it seems like the RAID cards that the config-J servers come with are only configurable via UEFI boot, cf https://phabricator.wikimedia.org/T378584#10282304
[09:39:13] Emperor: jhathaway and XioNoX have been working on it. I think there's some software RAID issue, but "real" RAID might work.
[09:41:31] it seems that currently the new SM config-J nodes can only set 12/24 drives to JBOD unless booted in UEFI mode
[09:42:13] (so I'm wondering if "just set up these nodes to boot via UEFI rather than BIOS" will actually work, or if all sorts of other things will then break)
[10:01:06] Probably not "all sorts of things". See: https://phabricator.wikimedia.org/T373519 but I'd ping jhathaway and check
[10:22:50] ta
[11:22:37] moscovium and logstash1024 are the only VMs under ganeti1025, and both are currently unreachable, even though they are reported as running by ganeti
[11:42:41] It seems that on ganeti1025 there is a check operation running on /dev/md2 (stuck at 39%). Also, the load average increased around the same time that the logstash1024/moscovium issue was reported.
[11:46:34] there are also some messages in dmesg like "task md2_raid5:566 blocked for more than 120 seconds" and "task qemu-system-x86:57148 blocked for more than 120 seconds."
[11:56:50] hmm yeah... load average spiked and flatlined since ~10:45 UTC
[11:59:49] I don't know if it's safe to "migrate" ganeti1025 (https://wikitech.wikimedia.org/wiki/Ganeti#Reboot/Shutdown_for_maintenance_a_node) to recover the 2 VMs
[12:00:33] without shutting it down ... just to move the VMs around
[12:00:48] yeah that would be the ideal solution
[12:01:30] however I fear it might not work here, as ganeti1025 may not be healthy enough to actually read them so they can be copied to another node
[12:02:09] the logstash VMs are also sometimes tricky to move; they ingest logs so quickly that the system has trouble getting a "snapshot"
[12:02:31] Yes, that was also my concern
[12:04:20] I think it's worth trying though, we can start with moscovium perhaps (though I'm not sure what it does)
[12:04:58] moritz.m is out today which is a shame, I'm wondering if we'd be better off telling ganeti to power them off before we try the move
[12:07:14] on the Logstash side, we have redundancy and no signs of problems as of now...
[12:07:38] moscovium is owned by collaboration-services
[12:07:59] ok yep
[12:08:30] I'll try the force shutdown and move in that case, I think that's better than live migration
[12:09:54] furthermore 1024 is a collector so just a frontend node
[12:10:01] with no data
[12:10:20] on board
[12:11:59] ok that's good to know
[12:12:29] virtual console to it failed, issued a shutdown from the ganeti side now but it's not responding :(
[12:12:43] which all kind of makes sense as ganeti1025 is definitely sick
[12:13:40] ganeti1025 is having its moment of melancholy
[12:17:28] yeah it's failing to shut down either VM, which makes it almost certain it'll fail to migrate them
[12:23:02] I'll talk with my team this afternoon. Maybe we can try to reach out to someone from collab-services to understand whether we can wait to recover moscovium or not.
[12:23:11] One more thing ...
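A minimal sketch of the checks being discussed above, assuming shell access on ganeti1025; /dev/md2 and the "blocked for more than 120 seconds" messages come from the conversation itself, while the commands are standard Linux tooling rather than anything pasted in the log:

```
# Progress and kind of the running MD operation on md2 (check vs. resync) -- seen stuck at 39% above
cat /proc/mdstat
cat /sys/block/md2/md/sync_action

# The kernel reports kernel threads blocked for longer than this timeout (default 120s),
# which is what produced the md2_raid5 / qemu-system-x86 messages in dmesg
cat /proc/sys/kernel/hung_task_timeout_secs
dmesg -T | grep -E 'blocked for more than|md2_raid5|drbd'

# (A running check can be stopped with `echo idle > /sys/block/md2/md/sync_action`;
#  not something that was done in the log above.)
```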
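And roughly what "force shutdown and move" can look like with the stock Ganeti CLI run from the cluster master — a sketch, not the exact cookbook used at WMF; the .eqiad.wmnet suffixes are an assumption, only the short instance names appear in the log:

```
# Confirm primary/secondary node and the state Ganeti thinks each instance is in
gnt-instance list -o name,pnode,snodes,status moscovium.eqiad.wmnet logstash1024.eqiad.wmnet

# Hard-stop an instance rather than live-migrating it (no ACPI grace period)
gnt-instance shutdown --timeout=0 moscovium.eqiad.wmnet

# Bring it up on its DRBD secondary when the primary node is too sick to cooperate
gnt-instance failover --ignore-consistency moscovium.eqiad.wmnet
```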
[12:24:18] I was considering just attempting a reboot of ganeti1025 to see if it presents the same problem on reboot
[12:24:32] given both VMs are effectively stalled I don't think it can make things worse
[12:24:45] We don't have metrics from ganeti in eqiad because prometheus takes too long to retrieve metrics from ganeti1028 (about 13 seconds), while the maximum allowed is 10 seconds
[12:25:47] yeah topranks that could be an option ... I think we have only those two machines on the host ...
[12:25:58] yeah they are the only two
[12:26:21] huh that sounds like another issue... the scrape timeout is exceeded?
[12:26:32] yes topranks
[12:26:39] that may be related to it trying to 'gather' stats when scraped, and timing out talking to ganeti1025
[12:26:52] we should check afterwards either way
[12:27:37] but maybe it's because the master takes too long to get information from ganeti1025
[12:28:14] Either that, or there are "too many" hosts and VMs
[12:29:09] In this case, the timing suggests the first option :)
[12:31:59] If we curl the endpoint it does eventually complete... There might be a point to adding a bit of async, caching, background-worker code to the exporter.
[12:32:57] Then again: `curl localhost:8080` gives "A server error occurred. Please contact the administrator."
[12:40:22] it seemed to stall when shutting down so eventually I did a power cycle via idrac
[12:40:27] back up now and seems healthier
[12:41:14] not sure the VMs are though
[12:41:32] The Prometheus endpoint has sped up as well.
[12:42:16] right, it was probably delayed trying to get stats from 1025
[12:42:52] VMs are still reported as "admin down"
[12:45:14] Yes, sure, you're moving them onto another node (I think ... draining the node with the cookbook)
[12:47:05] 1024 is up
[12:47:12] yep, they are reporting as up now on ganeti1029
[12:47:25] tappof: are you able to check that logstash1024 seems healthy?
[12:47:51] topranks: I'm going to take a look
[12:48:17] seems to be doing stuff, processes running, IP traffic etc
[12:48:21] so hopefully
[12:49:13] no units failed
[12:51:18] moscovium also looks good but I am unsure of specific things to check
[12:52:06] topranks: I think everything is okay with logstash1024
[12:52:13] \o/
[12:52:18] that's great
[12:52:28] thank you for the help
[12:52:36] I'll open a task to describe what happened and we can look at it in more detail with mor.itz on Monday
[12:52:40] np, thanks for the heads up!
[12:52:53] :)
[12:53:21] Time for lunch now! :)
[13:09:21] Emperor: re EFI and Config J - IIUC EFI is needed only to access a utility to configure the disks as JBOD; after that we should be able to revert to BIOS. It is mostly annoying for DCops, but shouldn't impact anything else!
[13:16:13] elukey: my concerns are i) that's running the RAID card in an unsupported setup ii) disk swaps, where typically you need the controller to JBOD a newly-inserted disk
[13:23:34] tappof: topranks: was there anything about DRBD in the kernel logs on the ganeti host? it was probably https://phabricator.wikimedia.org/T348730 if so
[13:24:46] ah!
[13:25:01] I'd not spotted it but yep, and it's correlated in time; this is the first one
[13:25:02] Nov 1 10:45:05 ganeti1025 kernel: [12360956.583238] block drbd4: We did not send a P_BARRIER for 42072ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
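For reference, a sketch of how that DRBD state can be inspected on the node — standard procfs/journal interfaces, not commands taken from the incident itself:

```
# Kernel-side view of every DRBD minor: connection state, disk state, any out-of-sync counts
cat /proc/drbd

# The P_BARRIER warning and anything else DRBD logged around the same time
dmesg -T | grep -iE 'drbd|p_barrier'
journalctl -k --since "10:30" | grep -i drbd
```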
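On the scrape-timeout thread from earlier: a quick way to confirm the exporter is the slow side is to time the endpoint against Prometheus's 10-second limit. The :8080 port is what was curled in the log; the /metrics path is an assumption:

```
# Wall-clock time and HTTP status for one scrape of the ganeti exporter
# (/metrics path assumed; the log only shows `curl localhost:8080`)
curl -sS -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' http://localhost:8080/metrics
```

Anything consistently over the scrape timeout shows up in Prometheus as a failed scrape, which matches the missing eqiad Ganeti metrics described above.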
[13:25:50] Emperor: We'll need to test it, yes; not ideal
[13:26:05] we have support for EFI in our automation but it is not ready for prime time
[13:41:20] topranks: yeah that's it :) is that a bullseye host?
[13:41:43] cdanis: yep it is
[13:42:01] I added that info to the task, thanks for the extra info
[13:42:57] didn't seem to cause a kernel panic on the hypervisor itself, was that the case the other times?
[13:54:20] topranks: that's not a panic -- the kernel also prints thread stacks for kernel threads that are blocked for longer than /proc/sys/kernel/hung_task_timeout_secs seconds
[13:55:54] yeah, just wasn't sure from the original task description how to understand "kernel hung"
[13:56:06] ah yeah
[13:56:07] but I guess it did the exact same thing as we've seen today on those other occasions
[13:56:20] the P_BARRIER message is enough to identify this failure mode IMO
[13:56:40] yeah seems to be it
[13:57:17] I guess then the node is probably ok after reboot? I had feared some physical disk problem or something but that doesn't look to be it
[13:57:43] there might be some app-level data loss, but nothing to be done about it
[14:06:51] yeah, think we were lucky here
[14:07:22] the logstash VM was a frontend-only one, and eoghan checked moscovium and it seems ok too
[15:20:27] Hm, anyone seen xfs_admin -l hang ~indefinitely~ (including a kernel message about xfs_db blocked for more than 120s) where the underlying fs seems OK? I've had a disk swapped in thanos-be2003 and the fs seems to be behaving, other than that `xfs_db -x -p xfs_admin -r -c label /dev/sde1` seems to wedge forever
[15:20:50] Not sure if this is "underlying fs is v. busy, just wait" or "replacement disk is iffy"
[15:28:29] Emperor: I'm quite out of my depth wrt xfs but I think `biosnoop-bpfcc` from the bpfcc-tools package might help you, some usage examples at https://github.com/iovisor/bcc/blob/master/tools/biosnoop_example.txt
[15:29:30] (and yes IMO it's fine to temporarily install that package by hand on a host you're debugging)
[15:32:54] ta
[16:00:55] You might want to disable DRBD on VMs that don't need it. More stability and better I/O performance.
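A sketch of the biosnoop-bpfcc suggestion for the thanos-be2003 question, assuming a temporary by-hand install is acceptable as stated above; /dev/sde1 is the device from the log:

```
# Temporary install on the host being debugged; remove the package again afterwards
sudo apt-get install -y bpfcc-tools

# Trace block-layer I/O completions, filtered to the suspect disk
sudo biosnoop-bpfcc | grep sde

# In another terminal, reproduce the hanging label read
sudo xfs_admin -l /dev/sde1
```

Long or absent completions for sde would point at the replacement disk, while a steady stream of fast I/O from other processes would back the "fs is just busy" theory.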
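On the closing suggestion: in stock Ganeti, taking a VM off DRBD means converting its disk template to plain (local storage only), which drops the replication overhead but also the ability to fail over to a secondary node. A sketch with a hypothetical instance name; the conversion requires the instance to be stopped:

```
# Convert a DRBD-backed instance to plain local storage (example-vm is a hypothetical name)
gnt-instance shutdown example-vm.eqiad.wmnet
gnt-instance modify -t plain example-vm.eqiad.wmnet
gnt-instance start example-vm.eqiad.wmnet
```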