[07:31:30] netops, Infrastructure-Foundations: Different BFD settings on direct connected links - https://phabricator.wikimedia.org/T387773#10599433 (ayounsi) Not a strong feeling, but I usually try to steer towards the leaner option. So in that case it's to remove BFD between cr1/2-codfw. Looking at https://github...
[09:52:48] netops, Infrastructure-Foundations: Different BFD settings on direct connected links - https://phabricator.wikimedia.org/T387773#10599988 (cmooney) >>! In T387773#10599433, @ayounsi wrote: > Automation wise, we could probably automate "no `metric` = no BFD". Not sure I get the logic here, the suggestion...
[10:15:09] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839 (JAllemandou) NEW
[10:15:21] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10600037 (JAllemandou) p:Triage→High
[10:58:29] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10600147 (ayounsi) Let's see what other people think, but I think it would be fine to : * Keep only 1 month...
[13:53:10] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10600761 (cmooney) +1 I've no objection to any of these. 30 days for the full data is probably enough. In...
[14:58:37] Hey, does the move-vlan option of the re-image cookbook do _all_ the necessary BGP steps for k8s workers? Or do I need to push changes manually?
[15:03:42] klausman: look at the sre.k8s.renumber-node cookbook
[15:06:53] `- Run homer on the core-router and leaf switch to update BGP` <- looks like it does the bit I was wondering about
[15:27:36] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10601220 (ayounsi) Mostly to be able to see long term trends, for example per destination AS.
[15:28:28] https://phabricator.wikimedia.org/P74051 I just got a broken reimage
[15:28:36] error: disk `lvmid/sRwxJO-NzaT-6d4G-fUJI-tGfp-1TD6-m3Cmws/fjJWRv-cBEQ-amjr-Oejo-
[15:28:38] eftf-qGAH-m9Ugaj' not found.
[15:28:50] is this a known issue/something I messed up?
[15:31:42] I see a phab ticket from 2023 mentioning Puppet 5, but I doubt that's it?
[15:38:30] this is the first boot after d-i?
[15:38:35] yes
[15:38:58] basically the second reboot of the reimage cookbook
[15:39:27] doesn't ring any bell to me
[15:39:38] but could be some partman-related issue?
[15:40:02] Potentially. I did change the partman recipe used
[15:40:16] it smells like the partman recipe messed up something
[15:41:21] so ml-staging2* moved from standard+raid1-2dev+kubernetes-node-overlay-large-kubelet.cfg to standard+raid1-2dev+kubernetes-node-containerd.cfg
[15:43:22] The diff shows nothing obvious.
[15:43:29] the difference between the last two is minimal, so unless it was broken before for some reason, it shouldn't be it
[15:43:57] The last machine imaged with this was probably ml-staging2003
[15:44:02] so ~months ago
[15:44:24] But that's also a Supermicro machine, the one that failed here is a Dell
[15:44:59] is it using UEFI?
[15:45:32] So if I try to re-run the cookbook (with --new, because the machine is cleaned from Puppet), the cookbook still gives me the "Force P7 in hiera" message. That's not really necessary anymore, is it?
[15:45:38] Not that I know
[15:46:18] Unless the cookbook tries EFI automagically, of course. I didn't specify anything wild on the cmdline aside from --move-vlan
[15:49:00] grub rescue> ls
[15:49:02] (hd0) (hd0,gpt2) (hd0,gpt1) (hd1) (hd1,gpt2) (hd1,gpt1) (hd2) (hd2,gpt2) (hd2,gp
[15:49:04] t1) (hd3) (lvm/vg0-containerd) (lvm/vg0-kubelet) (lvm/vg0-root) (md/0)
[15:49:26] So the partitions etc. are there, but the kernel cmdline is bad?
[15:49:30] er, grub config
[15:50:33] it is ml-staging2002 right? So a dell node
[15:50:42] for UEFI we specifically need to set it via provisioning
[15:51:00] So it would not be the case here
[15:51:47] We could try re-running the install cookbook
[15:51:59] can you print the cmdpath in the grub shell?
[15:52:13] I wouldn't know how
[15:52:38] me too, I think it should be possible, but needs to be googled
[15:53:04] it smells like EFI vs Bios though, lemme check the bios settings
[15:53:16] I am not familiar with Dell, lemme find how to check if it is UEFI or BIOS
[15:55:19] klausman: the "set" command could help in checking the defaults
[15:55:33] yeah, I found it just now, let me pastebin its output
[15:56:09] updated: https://phabricator.wikimedia.org/P74051
[16:01:00] klausman: ok if I try provision just to be sure? It may reboot the host
[16:01:49] I think I may have just manually configured grub to boot, gimme a sec to confirm
[16:02:09] we could then try grub-install to fix the machine (but yes, we should retry provisioning, either way)
[16:02:53] yeah no, that didn't work
[16:03:03] elukey: feel free to proceed with provisioning
[16:05:35] it ran fine, a couple of values to fix but nothing big
[16:05:47] at this point I'd retry reimage klausman, to see if we get to the same point
[16:06:11] ack. I presume I can ignore the P7 hiera message from the reimage cookbook?
[16:06:21] (since I have to use --new)
[16:06:39] yes yes
[16:06:45] alright, proceeding
[16:22:49] elukey: install went fine
[16:25:48] So no idea what happened there.
[16:29:16] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10601646 (Ottomata) FWIW, netflow is ingested into the Data Lake, so is queryable using SQL and/or [[ https...
[16:29:35] the only diff seems to be move-vlan, but it shouldn't play any role in this. We'll see if your next reimages get to the same point, or if it was a weird one
[16:29:54] yeah, cosmic rays or sth...
[16:58:13] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10601848 (JAllemandou) >>! In T387839#10601646, @Ottomata wrote: > FWIW, netflow is ingested into the Data...
[17:12:55] FIRING: MaxConntrack: Max conntrack at 81.68% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[17:17:55] RESOLVED: MaxConntrack: Max conntrack at 83.35% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[17:20:32] elukey: I just got a kernel panic on the reinstall, looks like during an attempted reboot
[17:21:26] klausman: mmm so you started a reboot and ended up in a kernel panic?
[17:21:39] No, it was the reboot at the end of the installer
[17:21:42] https://phabricator.wikimedia.org/P74063 messages
[17:21:54] ah I assumed it ended an hour ago
[17:22:05] I guess the installer reboots automagically after a panic (it just did, no input from me)
[17:22:13] This is the second machine
[17:22:38] The first machine seems to be working fine
[17:22:53] aaaand the grub boot failed again, similar message as with the first machine
[17:22:53] okok you need to give me more details then when reporting :D
[17:23:14] I thought I had mentioned that I was doing machine #2, sorry
[17:23:31] np :)
[17:23:40] and this time with move-vlan right?
[17:23:44] yep
[17:23:54] basically same cmdline as the very first one with 2002
[17:23:59] (this one is 2001
[17:24:02] )
[17:24:09] no idea, probably it is worth to open a task at this point
[17:24:29] I'll try re-running the imaging cookbook.
[17:25:23] okok
[17:25:29] (need to go afk, will read later!)
[17:59:04] re-running the imaging on 2001 seems to work (so the provisioning you did for 2002 was not what made a difference). Also, even the second install had a panic at the end, but the boot worked fine. So something else is going on. I'll write up a ticket with details. 2003 is an SMC machine, so if it is hw-dependent, we'll see.
[18:01:09] and even the first reboot after install is a panic (and autoboot)
[18:01:23] as in the first reboot after the first full puppet run
[18:02:14] (though the panic is not what breaks subsequent grub, according to my current hypothesis)
[18:06:55] FIRING: MaxConntrack: Max conntrack at 80.15% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[18:11:55] RESOLVED: MaxConntrack: Max conntrack at 80.15% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[18:21:12] The SMC machine reinstalled without a hitch, so there's some hw-specificity
[18:47:37] elukey: summary: the GRUB thing is still unclear, will do a writeup in a task tomorrow. All the staging machines are now on containerd and a _lot_ of stuff is crash looping. I currently don't have the brainpower to figure it out. I'll likely need your help tomorrow.
[23:02:44] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[23:07:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[23:12:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[23:37:55] FIRING: MaxConntrack: Max conntrack at 81.6% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[23:42:56] RESOLVED: MaxConntrack: Max conntrack at 81.6% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack