[06:34:12] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10697203 (ayounsi) BFD is deployed, here is the full list of devices not able to expose those metrics (devices that don't have...
[08:20:59] Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10697436 (Joe) >>! In T389932#10694961, @jhathaway wrote: > One issue with using just the FQDN is that it breaks tools which rely on matching other hostnames, for instanc...
[08:28:46] Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10697450 (Joe) Alternatively, we can ofc remove the TLD from the matching expression in the pseudocode I posted. I don't think that having long regexes is really feasible...
[08:35:16] moritzm: FYI I'm about to upgrade spicerack on cumin2002
[08:35:25] (I've seen you were running some cookbooks there)
[08:35:37] I've just finished the last run, go ahead
[08:35:57] <3 thx
[08:48:00] can you leave a note when the update is done, I can also test things with the ganeti cookbooks I'll be running
[08:48:42] moritzm: on cumin2002 spicerack is the latest and the cookbook patch required is deployed, so feel free to test things
[08:54:56] Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10697487 (FCeratto-WMF) FWIW I would suggest prioritizing readability and safety for **prod** configuration, and git-diff friendliness. When converting the existing confi...
[08:55:26] on it
[08:55:55] FIRING: MaxConntrack: Max conntrack at 84.42% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[09:00:55] RESOLVED: MaxConntrack: Max conntrack at 80.4% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[09:03:10] volans: the drain cookbook works fine, if I run into any issues, I'll report back
[09:03:19] I think I'm the only user of cumin2002 anyway :-)
[09:10:56] ahaha yeah probably
[09:10:57] thanks a lot
[09:19:30] all good from my tests too, I'll upgrade cumin1002 shortly
[09:27:28] Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10697616 (fgiunchedi) With my pontoon hat on: what I did is basically the same as the suggested `node/data.yaml`, i.e. map roles to (cloud) hostnames and `pontoon-enc` ta...
[10:11:09] SRE-tools, collaboration-services, Infrastructure-Foundations, serviceops, SRE: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666#10697808 (LSobanski) Duplicate→Open This is separate from the activity in {T387833} so let's keep it open.
[10:14:37] that "shortly" above is still waiting for the clone cookbook to finish
[10:18:16] cumin2002 looks all good, ran various ganeti cookbooks and the reimage cookbook which also shells out to various other cookbooks
[10:19:18] great, thx
[10:38:13] Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10697904 (Joe) >>! In T389932#10697616, @fgiunchedi wrote: > With my pontoon hat on: what I did is basically the same as the suggested `node/data.yaml`, i.e. map roles to...
[11:12:28] and upgrade done on cumin1002 too
[11:51:39] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10698172 (cmooney)
[12:36:59] XioNoX: what's the current status of homer and the graphql stuff?
[12:37:18] from what I can tell we merged this patch below, but we didn't update the version on the cumin hosts?
[12:37:30] https://gerrit.wikimedia.org/r/c/operations/software/homer/+/1094291
[12:37:30] topranks: we need to make a release, deploy and test it.
[12:37:41] ok
[12:37:42] It's on my todo
[12:38:07] everything is working ok on cumin anyway no problems
[12:38:21] I pulled the latest to my laptop and the plugin breaks
[12:38:47] so do we need a patch for the plugin before merging? what's breaking?
[12:38:49] we need this patch too
[12:38:53] https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/1094284
[12:39:08] but even with that patch it seems broken a little, but basically we need that I think
[12:39:38] right now with current main branch of homer and deploy repo the plugin causes this:
[12:39:41] I don't understand why it's broken, we didn't remove the possibility to query netbox normally
[12:39:43] https://www.irccloud.com/pastebin/lBdIyHKR/
[12:39:53] ahhh that
[12:39:56] that's a one-liner fix
[12:40:03] will do when making the release
[12:48:24] so... we need to also include the 'base_paths' when we init the NetboxDeviceDataPlugin in wmf-plugin.py ?
[12:49:13] I'm confused about what we should be passing though
[12:50:21] topranks: you take it then send it back
[12:51:22] ah... you mean like this
[12:51:41] def __init__(self, netbox_api, base_paths, device):
[12:51:41]     super().__init__(netbox_api, base_paths, device)
[12:51:55] yeah exactly (Was trying to double check it)
[12:53:01] that fixed it yep
[12:53:21] will I make a quick patch to the deploy repo to do just that so things are in-sync?
[12:53:37] we can take a look at the larger patch to modify the plugin separately then?
[12:54:14] +1
[12:54:16] thx
[12:54:43] also what do you think of https://gerrit.wikimedia.org/r/c/operations/software/homer/+/1124437?
[12:55:58] I was looking at that
[12:56:09] seems like a good idea I think, let me do a proper pass
[13:18:30] netops, Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10698935 (ayounsi) Some updates on that front ! **Fundraising switches (fasw)** All good. **Management switches (msw)** After configuration, seems like only msw2-codfw have gNMI listeni...
[13:55:02] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731 (cmooney) NEW p:Triage→High
[14:24:22] topranks: I'm pondering the usefulness of a "dashboard" like this: https://grafana.wikimedia.org/d/b6d2222b-e761-4c8f-90d0-6f1b11392379/network-map-links-states?orgId=1 On one side it looks nice, on the other it might be a bit of a pain to maintain
[14:25:30] heh I was just looking at this yesterday:
[14:25:31] https://grafana-weathermap.seth.cx/
[14:26:01] I figured we could make one for the WAN (between CRs), each site (from CR -> Spines and between spines) and then for each pair of spines to all leafs
[14:26:20] I then concluded that would "be a bit of a pain to maintain" (and build!) so went back to my day
[14:27:44] definitely it would be good to have something like that, but question is how much effort it would be
[14:31:08] just need some AI to draw it
[14:53:53] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10699524 (xcollazo) Ok attempting the below query again now: >>! In T390623#10699223, @xcollazo wrote: >...
[15:15:09] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10699685 (xcollazo) I've successfully run the following query: >>! In T390623#10699616, @xcollazo wrote: >...
[15:16:07] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10699693 (xcollazo) >We only have these stats for some of the presto hosts, which are those in rows E and...
[15:27:08] XioNoX: I thought that was just a mock-up on that link
[15:27:20] they are actual interface states there??!?
[15:27:22] awesome
[15:29:15] it's rare enough we've multiple failures at once, but when it does happen this would make it quite quick to assess the likely impact
[15:39:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[15:43:54] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10699869 (RobH) Case 01043199 > Support, > > We recently rolled some OS upgrades to our routers and during that, one of the optics on our cross...
[15:44:02] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10699870 (RobH) a:cmooney→RobH
[15:49:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[15:57:44] FIRING: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[16:05:14] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10699995 (cmooney) > This effectively moved 308GB from HDFS Datanodes, thru the routers, to Presto server...
[16:07:44] RESOLVED: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[16:22:22] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10700068 (cmooney) FWIW the largest potential bottleneck in Ashburn are on the 10G interfaces (names star...
[16:31:55] FIRING: MaxConntrack: Max conntrack at 85.87% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[16:36:55] RESOLVED: MaxConntrack: Max conntrack at 85.52% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[17:02:55] FIRING: MaxConntrack: Max conntrack at 82.23% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[17:07:55] RESOLVED: MaxConntrack: Max conntrack at 82.23% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[17:27:57] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10700358 (cmooney) p:High→Low Looks like remote hands replaced the module. ` cmooney@cr4-ulsfo> show log messages | match qsfp Apr 1 17:1...
[17:35:08] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10700403 (xcollazo) Thanks for the pointers @cmooney. --------- Here are my heavy query results: First...
[17:42:31] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10700438 (cmooney) > No one is yelling on IRC so I think I am happy with this. I am done from my side. O...
[17:53:08] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10700483 (cmooney) Also to get a sense of total throughput this graph is good: https://grafana.wikimedia...
[17:58:03] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10700491 (xcollazo) >>! In T381389#10700438, @cmooney wrote: >> No one is yelling on IRC so I think I am...
[18:20:55] FIRING: MaxConntrack: Max conntrack at 82.14% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[18:30:55] RESOLVED: MaxConntrack: Max conntrack at 82.16% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[19:06:55] FIRING: MaxConntrack: Max conntrack at 82.68% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[19:11:55] RESOLVED: MaxConntrack: Max conntrack at 82.59% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[21:57:03] Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10701560 (jhathaway) >>! In T389932#10697436, @Joe wrote: >>>! In T389932#10694961, @jhathaway wrote: >> One issue with using just the FQDN is that it breaks tools which...
[22:00:43] Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10701585 (jhathaway) In proposing possible solutions, I would love to understand a bit more why our `site.pp` uses complex regexes. From looking through the git log it ap...