[00:19:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:00:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:05:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:09:55] FIRING: MaxConntrack: Max conntrack at 83.11% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[03:14:56] RESOLVED: MaxConntrack: Max conntrack at 80.91% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[04:19:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:19:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:20:34] FIRING: DiskSpace: Disk space seaborgium:9100:/ 5.59% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=seaborgium - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:45:34] RESOLVED: DiskSpace: Disk space seaborgium:9100:/ 3.871% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=seaborgium - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[10:22:45] volans: thanks for the quick review
[10:23:05] anytime
[10:23:08] on that: using interface.device after doing interface.delete()... for some reason it works??
[10:23:17] It doesn't sit well with me though
[10:23:41] so netbox would not unset the content of the variable, so you can still access what you gathered previously
[10:23:41] it's like the delete doesn't properly happen right away or something?
[10:23:53] but if you'd try to access something that requires an API call, it would fail
[10:23:57] there was another odd thing when I tested
[10:23:59] ok
[10:24:08] it's just the object still in scope and hence cached in memory
[10:24:41] but pynetbox could just as well unset everything on delete so interface.name could fail, it's implementation-dependent :D
[10:24:56] ok gotcha
[10:25:03] what's the odd thing?
[10:25:08] is it bad practice to reference it after the delete() though?
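[editor's note: a minimal sketch of the caching behaviour described above, assuming pynetbox; the endpoint, token, device and interface names are made up for illustration and are not from the script being discussed:]

    import pynetbox

    nb = pynetbox.api("https://netbox.example.org", token="...")  # hypothetical endpoint/token

    iface = nb.dcim.interfaces.get(device="mirror-host", name="eth0")  # made-up names
    parent_device = iface.device      # fetched now, so it is cached on the local object
    iface.delete()                    # record is removed server-side

    # the local object is still in scope, so previously fetched fields usually
    # remain readable...
    print(iface.name, parent_device)
    # ...but anything that needs a fresh API call (e.g. refreshing the record or
    # resolving fields that were not already loaded) will fail, and pynetbox makes
    # no promise about post-delete attribute access -- it's implementation-dependent.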
[10:25:26] perhaps I should move the log message before and declare the intention to remove it instead
[10:25:28] you could log right after saying "deleting"
[10:25:33] *before
[10:26:20] yeah, it might be a more logical approach
[10:26:33] on the weird thing - I'm just testing something here, gimme 2 mins
[10:27:12] ok yeah so the weird thing is weird
[10:27:31] lol
[10:27:32] in nb_shell if I delete the interfaces like this they are gone when I check the netbox ui
[10:27:47] >>> for child_int in myint.child_interfaces.all():
[10:27:47] ...     child_int.delete()
[10:27:47] ...
[10:27:47] (2, {'dcim.Interface': 1, 'ipam.IPAddress': 1})
[10:27:47] (2, {'dcim.Interface': 1, 'ipam.IPAddress': 1})
[10:28:33] For some reason in the puppetdb import script, when I checked, it ran delete() for the 'vlan1234' interface as part of this loop when it went to delete the parent
[10:28:38] as expected
[10:29:15] but then a few lines later it logged "deleting interface vlan1234", processing that as a regular interface that was on the box (same as physical) and deleting it cos it wasn't in puppetdb
[10:29:50] deleting the child worked as expected and allowed the parent to be deleted, avoiding the error condition
[10:30:04] ah......
[10:30:06] hostname/
[10:30:06] nevermind
[10:30:08] ?
[10:30:15] if they are listed in interfaces()
[10:30:17] it is the same phenomenon you mentioned I think
[10:30:19] you'll get there eventually
[10:30:39] so firstly the script does a "for interface in interfaces.all()":
[10:30:45] it gets to the parent
[10:30:54] - deletes the child cos it has to
[10:31:14] but then that 'for' loop is still going, and the child int is one of the objects it's iterating over
[10:31:23] so it later gets to that same child int and processes it
[10:31:25] if that makes sense?
[10:31:47] that's what I'm saying, vlanXXX is also returned by interfaces.all() so if you process them in the right order there's no need for the child loop
[10:31:53] ok ok
[10:31:57] order by parent for example
[10:32:05] yeah it is not an issue, things are working ok
[10:32:17] I mean you don't need your child loop at all
[10:32:34] just alter the order of all the interfaces in the first loop and just delete
[10:33:08] yeah, we'd need to iterate over things twice and re-order
[10:33:33] we're careful about that for processing the list of interfaces read from PuppetDB, so things are added in the correct order to allow relations to be set
[10:35:04] topranks: the cleaner way to avoid pre-caching issues is to filter the first interfaces.all() to instead be .filter(parent is none)
[10:35:22] so you iterate only through the parents in the main loop and then do the child deletion in the child loop
[10:35:38] doing it in one go could still have weird behaviours due to pre-fetched state
[10:58:52] volans: ok I reworked it to look over the child ints first and remove them if not in PuppetDB
[10:58:55] *loop
[10:59:08] safer/cleaner I think?
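[editor's note: a rough sketch of the two-pass ordering discussed above -- not the actual puppetdb import script. pynetbox is assumed, and the endpoint, device name and PuppetDB interface set are hypothetical stand-ins:]

    import pynetbox

    nb = pynetbox.api("https://netbox.example.org", token="...")  # assumed endpoint/token
    device_name = "lvs6001"                                       # hypothetical device
    puppetdb_ifaces = {"ens3f1np1", "lo"}                         # interfaces PuppetDB reports (hypothetical)

    all_ifaces = list(nb.dcim.interfaces.filter(device=device_name))

    # split locally rather than assuming a server-side "parent is null" filter
    children = [i for i in all_ifaces if i.parent is not None]
    the_rest = [i for i in all_ifaces if i.parent is None]

    # pass 1: child interfaces (e.g. vlanNNNN) that PuppetDB no longer reports
    for iface in children:
        if iface.name not in puppetdb_ifaces:
            print(f"deleting child interface {iface.name}")  # log intent *before* deleting
            iface.delete()

    # pass 2: everything else; the children are already gone, so a parent's delete
    # can't trip over them and no stale, already-deleted record gets revisited
    for iface in the_rest:
        if iface.name not in puppetdb_ifaces:
            print(f"deleting interface {iface.name}")
            iface.delete()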
[11:01:22] great, I'll have a look but right now I'm in a meeting for a while
[11:03:31] thanks :)
[11:48:50] vgutierrez, topranks: I think that your changes need a run of the dns netbox cookbook
[11:49:07] (the uncommitted changes alerted and I see things like vl612-ens3f1np1.lvs6001.drmrs.wmnet being removed)
[11:49:26] yep you are right, I'll do that now
[11:49:49] so this is a bit more convoluted than expected, those interfaces had IPs with DNS and hence need a run of the cookbook, not sure if it's worth adding the logic to the reimage to trigger the cookbook or not
[11:49:57] if it's only for lvs hosts it might not be worth it, up to you
[11:50:42] I'm running the cookbook now
[11:50:57] thx
[11:51:20] for vgutierrez FYI you'll need to run it too for the next LVSes
[11:51:22] normally those IPs don't get dns names
[11:51:40] why those had them
[11:51:40] I think I added it for the eqsin ones manually waaaay back thinking it was a nice little flourish
[11:51:41] ?
[11:51:44] back when I was young and foolish
[11:51:46] ahhhh
[11:51:52] then ignore what I said valentin :D
[12:56:55] 10netops, 06Infrastructure-Foundations, 10ops-magru: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10632528 (10cmooney) FWIW should also mention here the info in the below slide deck: https://www.lacnic.net/innovaportal/file/3207/1/lacnog2018-douglasfischer_anal...
[14:40:36] Are we still having the issues with Supermicro hosts reimaging twice? Just wondering since I got `The puppet server has no CSR for cloudelastic1012.eqiad.wmnet` during my last reimage
[14:41:07] I think https://phabricator.wikimedia.org/T381919 is the ticket for the issue but LMK if I missed something
[14:41:09] elukey: ^^^ it might be the puppet5 issue
[14:42:16] it seems to be intermittent, so not a huge deal. Just wanted to let y'all know
[14:44:45] inflatador: o/ we have the double d-i issue but only with supermicros set to UEFI
[14:44:55] is that the case for cloudelastic?
[14:45:06] if so a new reimage should work, if it is not UEFI lemme know and I'll check
[14:45:12] sadly the bug is still there :(
[14:46:05] elukey confirming, it is UEFI
[14:46:24] also confirming that running the reimage again worked ;)
[14:53:23] okok perfect, then it is the bug :(
[14:53:36] thanks a lot for testing UEFI btw
[16:21:41] 07Puppet, 06SRE: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629#10633728 (10BCornwall) > In my limited testing, I saw this error mostly in codfw...
[18:04:55] FIRING: MaxConntrack: Max conntrack at 83.19% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[18:09:55] RESOLVED: MaxConntrack: Max conntrack at 83.19% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[18:32:55] FIRING: MaxConntrack: Max conntrack at 84.21% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[18:37:55] RESOLVED: MaxConntrack: Max conntrack at 82.61% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[18:44:55] 10netops, 06Infrastructure-Foundations, 10ops-magru, 13Patch-For-Review: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10634362 (10cmooney) FWIW the router CPU is still fine with the arp policer set to 2MB size, which is how high I had to go before it stopped i...
[19:32:55] FIRING: MaxConntrack: Max conntrack at 83.95% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[19:37:55] RESOLVED: MaxConntrack: Max conntrack at 82.63% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[19:39:49] 10netops, 06Infrastructure-Foundations, 10Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10634521 (10dr0ptp4kt) For keyword search later: This is manifesting in alerts with subject:"DiskSpace druid_...
[20:47:55] FIRING: MaxConntrack: Max conntrack at 84.17% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[20:52:55] RESOLVED: MaxConntrack: Max conntrack at 84.17% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack