[00:19:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:00:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:05:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:09:55] FIRING: MaxConntrack: Max conntrack at 83.11% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[03:14:56] RESOLVED: MaxConntrack: Max conntrack at 80.91% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[04:19:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:19:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:20:34] FIRING: DiskSpace: Disk space seaborgium:9100:/ 5.59% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=seaborgium - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:45:34] RESOLVED: DiskSpace: Disk space seaborgium:9100:/ 3.871% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=seaborgium - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[10:22:45] volans: thanks for the quick review
[10:23:05] anytime
[10:23:08] on that: using interface.device after doing interface.delete()... for some reason it works??
[10:23:17] It doesn't sit well with me though
[10:23:41] so netbox would not unset the content of the variable, so you can still access what you gathered previously
[10:23:41] it's like the delete doesn't properly happen right away or something?
[10:23:53] but if you'd try to access something that requires an API call, it would fail
[10:23:57] there was another odd thing when I tested
[10:23:59] ok
[10:24:08] it's just the object still in scope and hence cached in memory
[10:24:41] but pynetbox could just as well unset everything on delete so interface.name could fail, it's implementation-dependent :D
[10:24:56] ok gotcha
[10:25:03] what's the odd thing?
[10:25:08] is it bad practice to reference it after the delete() though?
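[editor's note: a minimal sketch of the caching behaviour described above, assuming pynetbox; the endpoint, token, device and interface names are made up for illustration and are not from the script being discussed:]

    import pynetbox

    nb = pynetbox.api("https://netbox.example.org", token="...")  # hypothetical endpoint/token

    iface = nb.dcim.interfaces.get(device="mirror-host", name="eth0")  # made-up names
    parent_device = iface.device      # fetched now, so it is cached on the local object
    iface.delete()                    # record is removed server-side

    # the local object is still in scope, so previously fetched fields usually
    # remain readable...
    print(iface.name, parent_device)
    # ...but anything that needs a fresh API call (e.g. refreshing the record or
    # resolving fields that were not already loaded) will fail, and pynetbox makes
    # no promise about post-delete attribute access -- it's implementation-dependent.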
[10:25:26] perhaps I should move the log message before and declare the intention to remove it instead
[10:25:28] you could log right after saying "deleting"
[10:25:33] *before
[10:26:20] yeah, it might be a more logical approach
[10:26:33] on the weird thing - I'm just testing something here, gimme 2 mins
[10:27:12] ok yeah so the weird thing is weird
[10:27:31] lol
[10:27:32] in nb_shell if I delete the interfaces like this they are gone when I check the netbox ui
[10:27:47] >>> for child_int in myint.child_interfaces.all():
[10:27:47] ...     child_int.delete()
[10:27:47] ...
[10:27:47] (2, {'dcim.Interface': 1, 'ipam.IPAddress': 1})
[10:27:47] (2, {'dcim.Interface': 1, 'ipam.IPAddress': 1})
[10:28:33] For some reason in the puppetdb import script, when I checked, it ran delete() for the 'vlan1234' interface as part of this loop when it went to delete the parent
[10:28:38] as expected
[10:29:15] but then a few lines later it logged "deleting interface vlan1234", processing that as a regular interface that was on the box (same as physical) and deleting it cos it wasn't in puppetdb
[10:29:50] deleting the child worked as expected and allowed the parent to be deleted, avoiding the error condition
[10:30:04] ah......
[10:30:06] hostname/
[10:30:06] nevermind
[10:30:08] ?
[10:30:15] if they are listed in interfaces()
[10:30:17] it is the same phenomenon you mentioned I think
[10:30:19] you'll get there eventually
[10:30:39] so firstly the script does a "for interface in interfaces.all()":
[10:30:45] it gets to the parent
[10:30:54] - deletes the child cos it has to
[10:31:14] but then that 'for' loop is still going, and the child int is one of the objects it's iterating over
[10:31:23] so it later gets to that same child int and processes it
[10:31:25] if that makes sense?
[10:31:47] that's what I'm saying, vlanXXX is also returned by interfaces.all() so if you process them in the right order there's no need for the child loop
[10:31:53] ok ok
[10:31:57] order by parent for example
[10:32:05] yeah it is not an issue, things are working ok
[10:32:17] I mean you don't need your child loop at all
[10:32:34] just alter the order of all the interfaces in the first loop and just delete
[10:33:08] yeah, we'd need to iterate over things twice and re-order
[10:33:33] we're careful about that for processing the list of interfaces read from PuppetDB, so things are added in the correct order to allow relations to be set
[10:35:04] topranks: the cleaner way to avoid pre-caching issues is to filter the first interfaces.all() to instead be .filter(parent is none)
[10:35:22] so you iterate only through the parents in the main loop and then do the child deletion in the child loop
[10:35:38] doing it in one go could still have weird behaviours due to pre-fetched state
[10:58:52] volans: ok I reworked it to look over the child ints first and remove them if not in PuppetDB
[10:58:55] *loop
[10:59:08] safer/cleaner I think?
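[editor's note: a rough sketch of the two-pass ordering discussed above -- not the actual puppetdb import script. pynetbox is assumed, and the endpoint, device name and PuppetDB interface set are hypothetical stand-ins:]

    import pynetbox

    nb = pynetbox.api("https://netbox.example.org", token="...")  # assumed endpoint/token
    device_name = "lvs6001"                                       # hypothetical device
    puppetdb_ifaces = {"ens3f1np1", "lo"}                         # interfaces PuppetDB reports (hypothetical)

    all_ifaces = list(nb.dcim.interfaces.filter(device=device_name))

    # split locally rather than assuming a server-side "parent is null" filter
    children = [i for i in all_ifaces if i.parent is not None]
    the_rest = [i for i in all_ifaces if i.parent is None]

    # pass 1: child interfaces (e.g. vlanNNNN) that PuppetDB no longer reports
    for iface in children:
        if iface.name not in puppetdb_ifaces:
            print(f"deleting child interface {iface.name}")  # log intent *before* deleting
            iface.delete()

    # pass 2: everything else; the children are already gone, so a parent's delete
    # can't trip over them and no stale, already-deleted record gets revisited
    for iface in the_rest:
        if iface.name not in puppetdb_ifaces:
            print(f"deleting interface {iface.name}")
            iface.delete()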
[11:01:22] great, I'll have a look but right now I'm in a meeting for a while
[11:03:31] thanks :)
[11:48:50] vgutierrez, topranks: I think that your changes need a run of the dns netbox cookbook
[11:49:07] (the uncommitted changes alerted and I see things like vl612-ens3f1np1.lvs6001.drmrs.wmnet being removed)
[11:49:26] yep you are right, I'll do that now
[11:49:49] so this is a bit more convoluted than expected, those interfaces had IPs with DNS and hence need a run of the cookbook, not sure if it's worth adding the logic to the reimage to trigger the cookbook or not
[11:49:57] if it's only for lvs hosts it might not be worth it, up to you
[11:50:42] I'm running the cookbook now
[11:50:57] thx
[11:51:20] for vgutierrez FYI you'll need to run it too for the next LVSes
[11:51:22] normally those IPs don't get dns names
[11:51:40] why those had them
[11:51:40] I think I added it for the eqsin ones manually waaaay back thinking it was a nice little flourish
[11:51:41] ?
[11:51:44] back when I was young and foolish
[11:51:46] ahhhh
[11:51:52] then ignore what I said valentin :D
[12:56:55] 10netops, 06Infrastructure-Foundations, 10ops-magru: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10632528 (10cmooney) FWIW should also mention here the info in the below slide deck: https://www.lacnic.net/innovaportal/file/3207/1/lacnog2018-douglasfischer_anal...
[14:40:36] Are we still having the issues with Supermicro hosts reimaging twice? Just wondering since I got `The puppet server has no CSR for cloudelastic1012.eqiad.wmnet` during my last reimage
[14:41:07] I think https://phabricator.wikimedia.org/T381919 is the ticket for the issue but LMK if I missed something
[14:41:09] elukey: ^^^ it might be the puppet5 issue
[14:42:16] it seems to be intermittent, so not a huge deal. Just wanted to let y'all know
[14:44:45] inflatador: o/ we have the double d-i issue but only with supermicros set to UEFI
[14:44:55] is that the case for cloudelastic?
[14:45:06] if so a new reimage should work, if it is not UEFI lemme know and I'll check
[14:45:12] sadly the bug is still there :(
[14:46:05] elukey confirming, it is UEFI
[14:46:24] also confirming that running the reimage again worked ;)
[14:53:23] okok perfect, then it is the bug :(
[14:53:36] thanks a lot for testing UEFI btw
[16:21:41] 07Puppet, 06SRE: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629#10633728 (10BCornwall) > In my limited testing, I saw this error mostly in codfw...
[18:04:55] FIRING: MaxConntrack: Max conntrack at 83.19% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[18:09:55] RESOLVED: MaxConntrack: Max conntrack at 83.19% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[18:32:55] FIRING: MaxConntrack: Max conntrack at 84.21% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[18:37:55] RESOLVED: MaxConntrack: Max conntrack at 82.61% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[18:44:55] 10netops, 06Infrastructure-Foundations, 10ops-magru, 13Patch-For-Review: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10634362 (10cmooney) FWIW the router CPU is still fine with the arp policer set to 2MB size, which is how high I had to go before it stopped i...
[19:32:55] FIRING: MaxConntrack: Max conntrack at 83.95% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[19:37:55] RESOLVED: MaxConntrack: Max conntrack at 82.63% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[19:39:49] 10netops, 06Infrastructure-Foundations, 10Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10634521 (10dr0ptp4kt) For keyword search later: This is manifesting in alerts with subject:"DiskSpace druid_...
[20:47:55] FIRING: MaxConntrack: Max conntrack at 84.17% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[20:52:55] RESOLVED: MaxConntrack: Max conntrack at 84.17% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack