[08:48:42] weird email just arrived to abuse@wikimedia.org about the cloudgw NAT IP
[08:48:55] probably spurious bullsh-- but fyi
[13:52:15] dhinus: what do you think we should do about T402475? I can work on a new recipe that adds lvm volumes on top of the BOSS drive, or we can ignore it, or...
[13:52:15] T402475: KernelErrors - https://phabricator.wikimedia.org/T402475
[13:52:34] I don't understand why there are lvm errors on a host that's not (yet) using lvm at all.
[13:53:26] andrewbogott: I'm taking a deep dive...
[13:53:42] ok :)
[13:53:49] lvm2 is required by the ceph packages
[13:54:00] I'm not sure if ceph will try to create LVM volumes on the data disks
[13:54:16] ceph definitely requires lvm on the osds.
[13:54:29] that isn't done by puppet though, it's a cookbook stage that comes later.
[13:54:31] I mean, it /is/ weird to just have all of the OS in one partition, so maybe I should treat that as a problem anyway.
[13:54:32] ok. that will probably work, because the error is only in this init process
[13:55:06] the lvm2 package includes a binary that tries to set up some systemd units on boot
[13:55:53] that binary in turn calls "lvmconfig", which should work, but for some reason is failing
[13:55:59] ah, I see.
[13:56:10] and those issues show up as kernel messages for some reason, rather than service-specific ones?
[13:56:12] this dynamic boot process was removed completely in the latest lvm2 releases
[13:56:26] kernel messages because it's the systemd boot process, I believe
[13:57:52] ah, ok
[13:57:58] I guess we could probably just ignore them and they will disappear when we upgrade to bookworm
[13:58:03] but now I'm curious :)
[13:59:38] Meanwhile I think I'm going to devote an hour or two to partman and see if I can get a more normal partition layout for those hosts (although I suspect I'll just learn why Luca didn't)
[14:10:38] dhinus: see -sre for a not-very-deep but convincing argument for why there's no lvm on the os drive
[14:11:37] thanks, I guess we can try to follow suit, but Ceph will still use LVM on the data drives
[14:12:27] So I guess the next step is to try giving ceph a chance, and then reboot and see if it's happy once lvm is in use
[14:12:57] yes please do and let me know, my guess is that it might still complain at boot, but then work fine
[14:13:09] ok
[14:13:18] I guess this is just another reason to be impatient for bookworm
[14:20:19] did you just reboot cloudcephosd1048? it came back and did not log the error
[14:22:23] yes, the cookbook reboots it before doing anything else
[14:22:37] does that suggest that the kernel error only happens before the initial puppet run?
[14:22:42] so only during reimage?
[14:22:54] (although why would it be monitored at all then? So that doesn't make sense)
[14:27:24] maybe a race condition
[14:28:52] /var/log/apt/history.log suggests lvm2 was installed at 2:13:26, and the error was logged at 2:13:30, so NOT at boot, but after the package was installed
[14:30:38] probably some silly race condition in the .deb package install process
[14:30:49] I guess we can just ignore it
[14:31:45] as always, I should've rebooted first, debugged later :)
[14:32:12] I want to try one more reboot, but I'm waiting for ceph to settle down first
[14:32:22] ack, you can also try rebooting one of the other nodes
[14:32:47] oh, true, I'll try 1049
[14:36:08] done. no alert so far...
[14:44:52] ok, in the meantime I verified that the script that failed... is a no-op anyway :D
[14:52:21] nice, I will close that bug then and ignore those errors.
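(For reference, a rough sketch of the checks discussed above: correlating the lvm2 install time in the apt history with the kernel-facility log, and locating the boot-time helper that calls lvmconfig, likely the lvm2-activation-generator. Timestamps are illustrative, not copied from the host.)

    # when was lvm2 installed? Start-Date appears a few lines before the matching Install entry
    grep -B3 lvm2 /var/log/apt/history.log

    # kernel-facility messages around that time
    journalctl -k --since "02:13:00" --until "02:14:00" | grep -i lvm

    # the lvm2 package ships a systemd generator that runs lvmconfig at boot
    dpkg -L lvm2 | grep -i generator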
[14:52:46] In other news, some potentially bad things seemed to happen when I pooled that host, so it's out again; I'm going to try just one drive next and see if that's calmer
[14:53:10] actually, why don't you summarize on that task and close it, if you're patient with that
[15:04:10] I was already writing my summary there, just posted and resolved the task :)
[15:13:44] great
[15:15:13] found a smoking gun that confirms my theory, added another comment there
[15:16:12] this error probably didn't warrant this full investigation, but it was a fun one :)
[15:26:38] I am now going to undrain exactly one osd on cloudcephosd1048. We'll see if that swamps the network...
[15:27:49] seems ok so far
[16:59:30] andrewbogott: hey, I take it you were adding/removing some of those ceph hosts from the active cluster today?
[17:00:00] I only noticed it now, but some alerts fired for utilisation a little over two hours back
[17:00:36] topranks: yeah, that was the 1048 pooling, I think. The cookbook may be a bit overly ambitious about adding drives; I'm just doing one right now and that doesn't seem to swamp anything.
[17:00:54] Are there the same QoS limits on the new 25G hosts? Or different limits, or no limits?
[17:02:31] the same qos profile is applied; it works as a percentage of the bandwidth regardless of the interface speed
[17:02:58] we definitely maxed a few interfaces, but the qos seemed to do what it should have (I see zero drops in the high profile - which is the mon traffic - which is good, as those dropping caused cascading problems before)
[17:03:26] so I think qos kicked in to keep the lights on, and then normal TCP behaviour made the initial burst / maxing of interfaces turn into "heavy use" on those interfaces fairly quickly
[17:04:24] to me it just looks like a burst of heavy use, but no obvious signs of "network collapse" type saturation
[17:04:58] there weren't any service failures, user reports or other major things you were alerted to around then?
[17:17:13] oops, sorry, switched windows
[17:17:38] topranks: a few pgs presented as unavailable, briefly. So ceph was a little bit upset; I don't know if that caused any user-facing blips, but it's still slightly concerning.
[17:18:20] dhinus: (unrelated) when y'all were doing paws troubleshooting last month, did anyone try/succeed to get a shell on the k8s worker VMs?
[17:18:52] topranks: I'm interested in doing another stress test, but it'll probably be a couple of hours until the current OSD finishes balancing and we have a baseline.
[17:19:45] sure, I mean I can't guarantee I'll be available, but there's also probably not a whole lot I can do to remedy things if it does cause problems (i.e. the high traffic rate would need to be stopped)
[17:20:22] but things seem to be generally ok, so I don't see why you wouldn't do another one
[17:21:33] ok. I'll see if you're around when I'm ready for the next round; if not, I'll save it for later.
[17:22:16] These drives are huge; it takes forever for them to populate
[17:25:10] To access a k8s node directly I would use
[17:25:11] https://kubernetes.io/docs/tasks/debug/debug-application/debug-running-pod/#node-shell-session
[17:25:26] Followed by chroot /host
[17:32:29] andrewbogott: no, I never managed to ssh into a PAWS node. Rook's approach of using kubectl debug looks promising.
[17:33:00] having a key like we do for trove instances would be even better
[17:33:19] yep, ok!
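(For reference, the node shell session from the page Rook linked looks roughly like this; the node name and debug image are placeholders.)

    kubectl get nodes
    # start a privileged debugging pod on the chosen node; the node's root filesystem is mounted at /host
    kubectl debug node/<node-name> -it --image=ubuntu
    # inside the debug pod, switch into the host's filesystem
    chroot /host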
I just wanted to make sure that what I wrote on https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/VM_access#Magnum_workers is true
[17:33:45] Because users can inject a key, I'm not sure if there's a straightforward/polite way to override or supplement that.
[17:33:48] But it would be handy
[17:34:30] Rook: in the context of your link, a 'node' is a pod and not a k8s host, correct?
[17:34:46] No, it's a host
[17:34:57] Kubectl get nodes
[17:35:05] Should give a list
[17:35:18] oh, huh. ok
[17:35:23] If true ssh is wanted, a key can be added to the magnum deploy
[17:35:30] and that's not a huge security risk because it only works if you have API keys anyway
[17:35:47] That's the thought I assume
[17:35:52] seems right
[17:36:01] ok, I'll add that doc link to my page
[17:36:02] thank you!
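(A sketch of the "add a key to the magnum deploy" option mentioned above. Keypair, template and cluster names are hypothetical, and whether this is allowed may depend on the project's Magnum setup; the cluster template's own keypair setting would otherwise apply.)

    # create (or reuse) a nova keypair, then reference it when creating the cluster
    openstack keypair create --public-key ~/.ssh/id_ed25519.pub my-magnum-key
    openstack coe cluster create --cluster-template my-template --keypair my-magnum-key my-cluster
    # the ssh login user then depends on the node image (e.g. 'core' on Fedora CoreOS)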