[07:31:30] <wikibugs>	 10netops, 06Infrastructure-Foundations: Different BFD settings on direct connected links - https://phabricator.wikimedia.org/T387773#10599433 (10ayounsi) Not a strong feeling, but I usually try to steer towards the leaner option. So in that case it's to remove BFD between cr1/2-codfw. Looking at https://github...
[09:52:48] <wikibugs>	 10netops, 06Infrastructure-Foundations: Different BFD settings on direct connected links - https://phabricator.wikimedia.org/T387773#10599988 (10cmooney) >>! In T387773#10599433, @ayounsi wrote: > Automation wise, we could probably automate "no `metric` = no BFD".  Not sure I get the logic here, the suggestion...
[10:15:09] <wikibugs>	 10netops, 06Infrastructure-Foundations, 10Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839 (10JAllemandou) 03NEW
[10:15:21] <wikibugs>	 10netops, 06Infrastructure-Foundations, 10Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10600037 (10JAllemandou) p:05Triage→03High
[10:58:29] <wikibugs>	 10netops, 06Infrastructure-Foundations, 10Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10600147 (10ayounsi) Let's see what other people think, but I think it would be fine to : * Keep only 1 month...
[13:53:10] <wikibugs>	 10netops, 06Infrastructure-Foundations, 10Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10600761 (10cmooney) +1 I've no objection to any of these.  30 days for the full data is probably enough.  In...
[14:58:37] <klausman>	 Hey, does the move-vlan option of the re-image cookbook do _all_ the necessary BGP steps for k8s workers? Or do I need to push changes manually?
[15:03:42] <claime>	 klausman: look at the sre.k8s.renumber-node cookbook
[15:06:53] <klausman>	 `- Run homer on the core-router and leaf switch to update BGP` <- looks like it does the bit I was wondering about
[15:27:36] <wikibugs>	 10netops, 06Infrastructure-Foundations, 10Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10601220 (10ayounsi) Mostly to be able to see long term trends, for example per destination AS.
[15:28:28] <klausman>	 https://phabricator.wikimedia.org/P74051 I just got a broken reimage
[15:28:36] <klausman>	 error: disk `lvmid/sRwxJO-NzaT-6d4G-fUJI-tGfp-1TD6-m3Cmws/fjJWRv-cBEQ-amjr-Oejo-
[15:28:38] <klausman>	 eftf-qGAH-m9Ugaj' not found.
[15:28:50] <klausman>	 is this a known issue/something I messed up?
[15:31:42] <klausman>	 I see a phab ticket from 2023 mentioning Puppet 5, but I doubt that's it?
[15:38:30] <elukey>	 this is the first boot after d-i?
[15:38:35] <klausman>	 yes
[15:38:58] <klausman>	 basically the second reboot of the reimage cookbook
[15:39:27] <volans>	 doesn't ring any bell to me
[15:39:38] <volans>	 but could be some partman-related issue?
[15:40:02] <klausman>	 Potentially. I did change the partman recipe used
[15:40:16] <elukey>	 it smells like the partman recipe messed up something
[15:41:21] <elukey>	 so ml-staging2* moved from standard+raid1-2dev+kubernetes-node-overlay-large-kubelet.cfg to standard+raid1-2dev+kubernetes-node-containerd.cfg
[15:43:22] <klausman>	 The diff shows nothing obvious. 
[15:43:29] <elukey>	 the difference between the last two is minimal, so unless it was broken before for some reason, it shouldn't be it
[15:43:57] <klausman>	 The last machine imaged with this was probably ml-staging2003
[15:44:02] <klausman>	 so ~months ago
[15:44:24] <klausman>	 But that's also a Supermicro machine, the one that failed here is a Dell
[15:44:59] <elukey>	 is it using UEFI?
[15:45:32] <klausman>	 So if I try to re-run the cookbook (with --new, because the machine is cleaned from Puppet), the cookbook still gives me the "Force P7 in hiera" message. That's not really necessary anymore, is it?
[15:45:38] <klausman>	 Not that I know
[15:46:18] <klausman>	 Unless the cookbook tries EFI automagically, of course. I didn't specify anything wild on the cmdline aside from --move-vlan
[15:49:00] <klausman>	 grub rescue> ls
[15:49:02] <klausman>	 (hd0) (hd0,gpt2) (hd0,gpt1) (hd1) (hd1,gpt2) (hd1,gpt1) (hd2) (hd2,gpt2) (hd2,gp
[15:49:04] <klausman>	 t1) (hd3) (lvm/vg0-containerd) (lvm/vg0-kubelet) (lvm/vg0-root) (md/0)
[15:49:26] <klausman>	 So the partitions etc. are there, but the kernel cmdline is bad?
[15:49:30] <klausman>	 er, grub config
[15:50:33] <elukey>	 it is ml-staging2002 right? So a dell node
[15:50:42] <elukey>	 for UEFI we specifically need to set it via provisioning
[15:51:00] <klausman>	 So it would not be the case here
[15:51:47] <klausman>	 We could try re-running the install cookbook
[15:51:59] <elukey>	 can you print the cmdpath in the grub shell?
[15:52:13] <klausman>	 I wouldn't know how
[15:52:38] <elukey>	 me neither, I think it should be possible, but needs to be googled
[15:53:04] <elukey>	 it smells like EFI vs Bios though, lemme check the bios settings
[15:53:16] <elukey>	 I am not familiar with Dell, lemme find how to check if it is UEFI or BIOS
[15:55:19] <elukey>	 klausman: the "set" command could help in checking the defaults
[15:55:33] <klausman>	 yeah, I found it just now, let me pastebin its output
[15:56:09] <klausman>	 updated: https://phabricator.wikimedia.org/P74051
[16:01:00] <elukey>	 klausman: ok if I try provision just to be sure? It may reboot the host
[16:01:49] <klausman>	 I think I may have just manually configured grub to boot, gimme a sec to confirm
[16:02:09] <klausman>	 we could then try grub-install to fix the machine (but yes, we should retry provisioning, either way)
[16:02:53] <klausman>	 yeah no, that didn't work
[16:03:03] <klausman>	 elukey: feel free to proceed with provisioning
[16:05:35] <elukey>	 it ran fine, a couple of values to fix but nothing big
[16:05:47] <elukey>	 at this point I'd retry reimage klausman, to see if we get to the same point
[16:06:11] <klausman>	 ack. I presume I can ignore the P7 hiera message from the reimage cookbook?
[16:06:21] <klausman>	 (since I have to use --new)
[16:06:39] <elukey>	 yes yes
[16:06:45] <klausman>	 alright, proceeding
[16:22:49] <klausman>	 elukey: install went fine
[16:25:48] <klausman>	 So no idea what happened there.
[16:29:16] <wikibugs>	 10netops, 06Infrastructure-Foundations, 10Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10601646 (10Ottomata) FWIW, netflow is ingested into the Data Lake, so is queryable using SQL and/or [[ https...
[16:29:35] <elukey>	 the only diff seems to be move-vlan, but it shouldn't play any role in this. We'll see if your next reimages get to the same point, or if it was a weird one
[16:29:54] <klausman>	 yeah, cosmic rays or sth...
[16:58:13] <wikibugs>	 10netops, 06Infrastructure-Foundations, 10Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10601848 (10JAllemandou) >>! In T387839#10601646, @Ottomata wrote: > FWIW, netflow is ingested into the Data...
[17:12:55] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 81.68% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[17:17:55] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 83.35% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[17:20:32] <klausman>	 elukey: I just got a kernel panic on the reinstall, looks like during an attempted reboot
[17:21:26] <elukey>	 klausman: mmm so you started a reboot and ended up in a kernel panic?
[17:21:39] <klausman>	 No, it was the reboot at the end of the installer
[17:21:42] <klausman>	 https://phabricator.wikimedia.org/P74063 messages
[17:21:54] <elukey>	 ah I assumed it ended an hour ago
[17:22:05] <klausman>	 I guess the installer reboots automagically after a panic (it just did, no input from me)
[17:22:13] <klausman>	 This is the second machine
[17:22:38] <klausman>	 The first machine seems to be working fine
[17:22:53] <klausman>	 aaaand the grub boot failed again, similar message as with the first machine
[17:22:53] <elukey>	 okok you need to give me more details then when reporting :D
[17:23:14] <klausman>	 I thought I had mentioned that I was doing machine #2, sorry
[17:23:31] <elukey>	 np :)
[17:23:40] <elukey>	 and this time with move-vlan right?
[17:23:44] <klausman>	 yep
[17:23:54] <klausman>	 basically same cmdline as the very first one with 2002
[17:23:59] <klausman>	 (this one is 2001
[17:24:02] <klausman>	 )
[17:24:09] <elukey>	 no idea, it is probably worth opening a task at this point
[17:24:29] <klausman>	 I'll try re-running the imaging cookbook.
[17:25:23] <elukey>	 okok
[17:25:29] <elukey>	 (need to go afk, will read later!)
[17:59:04] <klausman>	 re-running the imaging on 2001 seems to work (so the provisioning you did for 2002 was not what made a difference). Also, even the second install had a panic at the end, but the boot worked fine. So something else is going on. I'll write up a ticket with details. 2003 is an SMC machine, so if it is hw-dependent, we'll see.
[18:01:09] <klausman>	 and even the first reboot after install is a panic (and autoboot)
[18:01:23] <klausman>	 as in the first reboot after the first full puppet run
[18:02:14] <klausman>	 (though the panic is not what breaks subsequent grub, according to my current hypothesis)
[18:06:55] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 80.15% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[18:11:55] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 80.15% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[18:21:12] <klausman>	 The SMC machine reinstalled without a hitch, so there's some hw-specificity
[18:47:37] <klausman>	 elukey: summary: the GRUB thing is still unclear, will do a writeup in a task tomorrow. All the staging machines are now on containerd and a _lot_ of stuff is crash looping. I currently don't have the brainpower to figure it out. I'll likely need your help tomorrow.
[23:02:44] <jinxer-wm>	 FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[23:07:44] <jinxer-wm>	 FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[23:12:44] <jinxer-wm>	 RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[23:37:55] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 81.6% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[23:42:56] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 81.6% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack