[09:07:51] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10677359 (ayounsi)
[09:07:53] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10677360 (ayounsi)
[09:17:47] netops, Infrastructure-Foundations, SRE: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10677390 (aborrero) >>! In T389958#10674585, @cmooney wrote: > @aborrero as discussed we can possibly arrange a window for Thurs Mar 27th to carry out the remaining st...
[09:23:48] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10677414 (ayounsi) At least the BFD metrics are not exposed in Junos 21.2, possibly only starting in 22.3 (https://apps.junipe...
[09:25:12] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10677417 (ayounsi)
[09:25:43] netops, Infrastructure-Foundations, SRE, IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10677419 (taavi)
[09:27:02] netops, Infrastructure-Foundations, SRE, IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10677421 (taavi)
[09:35:33] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10677444 (ayounsi)
[09:50:38] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10677498 (ayounsi) For alarms: https://apps.juniper.net/telemetry-explorer/select-software?software=Junos%20OS&release=21.2R3&...
[10:09:32] netops, Infrastructure-Foundations, SRE, IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10677602 (aborrero) announcement: https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/LX6KDZMQHEL3NZ3DMWQERI2O3YVSDDKM/
[11:00:53] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10677770 (cmooney) >>! In T388641#10677498, @ayounsi wrote: > For alarms: https://apps.juniper.net/telemetry-explorer/select-s...
[11:10:27] netops, Infrastructure-Foundations, SRE, IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10677790 (cmooney)
[11:16:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[11:17:12] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10677814 (fgiunchedi) In this case we could indeed alert on `gnmi_system_alarms_alarm_state_id` and Prometheus will issue aler...
[11:26:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[12:00:31] XioNoX, topranks - I see only option 82 for sretest2001 with dhcpdump :(
[12:00:50] oh well :(
[12:00:56] I guess it's not unexpected
[12:01:01] elukey: thanks for checking!
[12:01:26] np! dhcpdump is really awesome
[12:02:25] netops, Infrastructure-Foundations, SRE: Classify ceph traffic flows for network prioritization - https://phabricator.wikimedia.org/T390044 (cmooney) NEW p:Triage→Low
[12:03:30] netops, Infrastructure-Foundations, SRE: Classify ceph traffic flows for network prioritization - https://phabricator.wikimedia.org/T390044#10677969 (cmooney)
[12:05:23] elukey: thanks! at least that helps decide where to focus our efforts :) can you share it in a pastebin?
[12:24:34] I'm just merging in a few more Netbox alerts to AlertManager. They're all warning, but just letting everyone know in case too many alerts start triggering.
[12:43:42] SRE-tools, collaboration-services, Infrastructure-Foundations, serviceops, SRE: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666#10678176 (thcipriani) →Duplicate dup:T387833
[12:49:44] FIRING: [2x] NetboxLibreNMS: Netbox - All checks related to LibreNMS. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/3/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxLibreNMS
[12:56:37] netops, Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052 (ayounsi) NEW
[12:56:51] netops, Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10678232 (ayounsi)
[12:56:54] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10678231 (ayounsi)
[12:59:44] RESOLVED: [2x] NetboxLibreNMS: Netbox - All checks related to LibreNMS. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/3/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxLibreNMS
[13:09:21] XioNoX: Does this feel correct? The LibreNMS script runs, fails, Alertmanager alerts, but only for 10 minutes. Shouldn't the alert be triggering as long as the last run is unsuccessful?
[13:09:42] slyngs: yeah
[13:09:53] it should alert as long as the problem exists
[13:10:16] And the same for all of the alerts really
[13:10:24] yup
[13:19:44] FIRING: [2x] NetboxLibreNMS: Netbox - All checks related to LibreNMS. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/3/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxLibreNMS
[13:29:44] RESOLVED: [2x] NetboxLibreNMS: Netbox - All checks related to LibreNMS. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/3/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxLibreNMS
[13:46:36] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10678444 (ayounsi) @cmooney another question is that if the service is not present on the device, for example BFD where BFD is...
[13:47:33] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10678447 (ayounsi)
[13:48:10] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10678463 (ayounsi)
[13:49:44] FIRING: [2x] NetboxLibreNMS: Netbox - All checks related to LibreNMS. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/3/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxLibreNMS
[13:51:13] Yeah, we're just rolling that back for now, this is annoying
[13:53:59] XioNoX: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1131326 <- Quick +1
[13:55:44] slyngs: +1
[13:57:11] I think we might need to export the data differently, or do the Prometheus query from hell.
[13:59:44] RESOLVED: [2x] NetboxLibreNMS: Netbox - All checks related to LibreNMS. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/3/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxLibreNMS
[14:04:45] Hello. We have a server which seems to have been installed into the wrong vlan (private instead of analytics) (T390048)
[14:04:46] T390048: an-worker1202 Hadoop NodeManager service keeps flapping - https://phabricator.wikimedia.org/T390048
[14:05:51] What is the best procedure to use at the moment? I have read T350152 and I can see the `--move-vlan` argument to the reimage cookbook. Should I use this, or is it for a different type of move? Thanks.
[14:05:51] T350152: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152
[14:06:38] btullis: different kind of move, that won't work
[14:07:05] btullis: the "easiest" is to decom then re-provision the host. Dcops should be able to help
[14:07:11] Cool. Suspected it might be.
[14:08:29] OK, thanks. I will take it to that channel.
[14:12:08] sigh redfish... reporting 2 NICs with linkup on hosts that AFAIK have just one
[14:12:28] volans: ah? link up or interface up?
[14:12:41] volans: or does it include the idrac nic?
[14:14:39] linkup, but AFAIK it has only one cable, so far only for an-druid100[1-2]
[14:15:31] getting more while progressing the check, so I need to figure it out, was hoping this way would be quicker, sigh
[14:16:22] volans: let me know if I can help
[14:17:12] make vendors implement the standards as they are defined :D
[14:18:49] volans: better hope to ask for world peace :)
[14:25:25] I need to go via scp_dump that is proprietary and slower but should work...
[14:49:14] FYI I will exclude the 237 hosts older than 5y and I kinda like the idea to get it from the host or puppetdb best effort despite the weirdness in the failure scenario (1st reimage, mac from puppet, fails, 2nd reimage, mac from redfish)
[14:53:25] that excludes also the HPs
[14:55:44] we still have HPs?
[14:57:56] 3 leftover
[14:57:59] still active
[14:58:04] https://netbox.wikimedia.org/dcim/devices/?status=active&manufacturer_id=6
[14:58:21] 24 in the dc
[14:58:30] all 3 older than 5 years
[14:58:38] yep
[14:58:48] we have 50 active servers older than 6 btw
[14:59:25] are people pushing for their decom? or are they just being forgotten?
[14:59:42] willy is usually extracting the list and pushing people
[14:59:44] dunno
[15:00:12] cool
[15:01:03] 25 of them are an-worker so it's clearly luca's fault :D
[15:03:32] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10679023 (BTullis) Please could someone expedite this, if possible? We still have some alerts that are flag...
[15:21:26] XioNoX: https://phabricator.wikimedia.org/P74441
[15:22:33] elukey: it's there, line 112: https://phabricator.wikimedia.org/P74441$112
[15:22:44] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[15:22:57] elukey: but looks like it's set to the mac-address
[15:23:02] not serial
[15:23:26] XioNoX: ah wow my bad! I may have checked only the ones with vendor class d-i, that don't have it (the last ones)
[15:23:39] I collected a few of them from the dump, I missed the earlier ones
[15:24:07] yeah D-I doesn't have 97, but there are other levers at this point
[15:24:27] topranks: --^
[15:24:40] my bad there is something but it doesn't seem useful either
[15:25:04] elukey: yeah, not useful...
[15:27:44] FIRING: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[15:30:55] FIRING: MaxConntrack: Max conntrack at 80.41% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[15:32:44] RESOLVED: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[15:35:55] RESOLVED: MaxConntrack: Max conntrack at 80.8% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[15:37:46] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10679226 (JAllemandou) a:JAllemandou
[15:45:48] FYI I'm running the "audit" of gathering the mac via redfish on active not-too-old dells. So far it seems mostly to work but the question is what to do when it doesn't work. I'll dig more into the failures once finished to see if they are fixable or not
[15:48:24] yeah it will depend on the kind of errors
[15:48:40] in general we need to have a plan b for the exceptions
[15:48:54] that is not sending dcops to read the mac address on the host :D
[15:50:31] puppetdb?
[15:50:41] that works only one-shot
[15:50:43] if the reimage fails
[15:50:44] no more
[15:51:39] we should copy the mac to netbox imo
[15:51:54] * volans shakes head
[15:52:16] * topranks gives volans the evil eye
[15:52:21] how do you manage motherboard/nic replacement properly without too much burden?
[15:53:26] that's a good point, I guess I'm more trying to argue for using ID_NET_NAME_MAC too and moving away from puppetdb import to netbox
[15:53:53] I'm not saying don't get it from redfish on a reimage either, just it could be useful to have a record of "what it was"
[16:05:43] yeah I think that's the long term good solution, but I'd be careful of doing too many changes now
[16:18:48] netops, Infrastructure-Foundations, ops-drmrs: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10679496 (RobH) > We'll be installing the new optics into the original ports, and removing the old optics and patch. > > So please remove the optic patch D0100B and the op...
[16:57:03] Puppet, SRE, Web-Team: Certain mobile devices including XiaoMi are not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10679754 (Jdlrobson-WMF) p:High→Medium
[16:57:54] /win 14
[17:02:51] netops, Infrastructure-Foundations, ops-drmrs: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10679800 (ayounsi) just got off the phone with the tech, I made a small mistake it was port 1 on cr1, so he called me to double check. He is going to do the patching, updat...
[17:03:23] Puppet, SRE, Web-Team: Certain mobile devices are (possibly) not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10679816 (Jdlrobson-WMF) In progress→Stalled
[17:04:35] Puppet, SRE, Web-Team: Certain mobile devices are (possibly) not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10679823 (Jdlrobson-WMF)
[17:04:52] Puppet, SRE, Web-Team: Certain mobile devices are (possibly) not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10679825 (Jdlrobson-WMF) p:Medium→Low Lowering priority and stalling as there is nothing actionable here at this time and the numbers we saw do not...
[21:54:32] !log disabling external Internet peers in BGP on cr1-drmrs T389071
[21:54:32] topranks: Not expecting to hear !log here
[21:54:33] T389071: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071
[22:14:13] netops, Infrastructure-Foundations, ops-drmrs: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10681006 (cmooney) p:High→Low Happy to say all looks good following the replacement patch and optics being installed this evening: ` cmooney@cr1-drmrs> show interfa...
[23:11:55] FIRING: MaxConntrack: Max conntrack at 83.28% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[23:16:55] RESOLVED: MaxConntrack: Max conntrack at 83.28% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack