[00:14:16] FIRING: [2x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:24:16] FIRING: [2x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:25:34] RESOLVED: [2x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:04:13] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9955625 (Marostegui)
[06:50:42] for people on the Netbox slack, there is an interesting thread https://netdev-community.slack.com/archives/C01P0FRSXRV/p1720016599859599
[06:51:41] tldr; the branching feature will be in the OSS Netbox, but some extra features (PR style approval for Netbox changes) will only be in the commercial/cloud offering
[06:53:11] https://netboxlabs.com/blog/branching-and-change-management-is-coming-to-netbox-announcing-private-preview-of-netbox-branch-management/
[06:58:57] The branching feature, if well done, will allow us to stage big infrastructure changes, like a new POP
[06:59:07] We got a report that VRTS emails are failing Gmail's SPF checks (https://phabricator.wikimedia.org/T369341).
Considering this comment from j.hathaway this may be a broader problem affecting more than just VRTS: https://phabricator.wikimedia.org/T355764#9503385
[06:59:12] unrelated, new tool on the block: https://github.com/opsmill/infrahub
[07:02:42] sobanski: I guess the wikipedia.org spf should be similar to the wikimedia.org one ?
[07:03:05] https://mxtoolbox.com/SuperTool.aspx?action=spf%3awikipedia.org&run=toolpage vs. https://mxtoolbox.com/SuperTool.aspx?action=spf%3awikipedia.org&run=toolpage
[07:03:21] I'm just looking at the same thing :)
[07:03:33] I mean https://mxtoolbox.com/SuperTool.aspx?action=spf%3awikimedia.org&run=toolpage
[07:03:37] the url didn't update
[07:04:24] I don't know much about email and SPF, but maybe the `include _cidrs.wikimedia.org` would solve it?
[07:05:02] MAIL FROM domain [wikipedia.org]
[07:07:52] sobanski: any idea how critical it is? can it wait for jhathaway to come online?
[07:08:03] For now that is my plan
[07:08:38] Unless we get more reports of things failing.
[07:09:05] I'll take an intermediate disruption of VRTS over accidentally breaking all of email.
[07:09:34] sounds like a wise take :)
[07:09:34] Outbound email was switched to the mx-out servers on Wednesday, which could also be related
[07:09:50] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1051803
[07:10:20] Perhaps SPF was not updated to match the new servers?
[07:12:11] The include:_spf.google.com for wikimedia.org is related to sending email from Google Workspace so I think this can be ignored
[09:38:02] slyngs: you around?
:)
[09:38:11] Yes
[09:38:28] slyngs: we moved https://netbox-next.wikimedia.org/ to netbox-dev2003, but it's showing an `Your credentials aren't allowed`
[09:38:53] I think we need to say somewhere that netbox-dev2003 is not netbox-dev anymore but netbox-next for idp
[09:41:42] Yes, the client secret is wrong
[09:43:08] Just a sec
[09:43:14] no rush, thx
[09:46:14] I'm just going to remove /srv/private/hieradata/hosts/netbox-dev2003.yaml from the private repo. That has an override for the OIDC secret
[09:46:44] sukhe: perfect yeah, that matches https://gerrit.wikimedia.org/r/c/operations/puppet/+/1052267
[09:46:49] I should have remembered
[09:47:51] There is a lot of stuff that I should remember... but I don't :-)
[09:47:59] Just running puppet
[09:49:06] Works :-)
[09:49:11] slyngs: yaaa
[09:49:25] https://netbox-next.wikimedia.org/ (cc elukey, topranks, volans|off)
[09:50:13] my eyes!!!
[09:50:25] haha it's actually not that bad... nice work!!!!!
[09:50:41] No class, no sense of style :-)
[09:51:10] nice work folks :)
[09:51:15] at least it has this :)
[09:51:17] https://usercontent.irccloud-cdn.com/file/sZgEEJqL/image.png
[09:51:30] topranks: hahahaha
[09:51:31] Actually I don't like the light grey on dark grey menu
[09:52:35] slyngs: you can switch to dark mode if you want to make it worse with the lightbulb icon in the top right
[09:53:03] But I want light light mode
[09:54:18] XioNoX: That's very nice work, but are you sure we should hold out for Netbox 5?
[09:54:28] shouldn't
[09:54:42] slyngs: don't worry it's going to be released the day we upgrade prod to 4
[09:55:30] one big difference with netbox 3 is the addition of data sources: https://netbox-next.wikimedia.org/core/data-sources/
[09:55:56] and reports are merged with scripts: https://netbox-next.wikimedia.org/extras/scripts/
[09:56:39] Oh it just syncs with git, that's nice
[10:04:09] topranks: https://netbox-next.wikimedia.org/vpn/tunnels/1/
[10:04:22] heh nice!
[10:04:33] I was literally making a phab task right now to say we should investigate :)
[10:05:04] hahah
[10:05:31] topranks: I'm also wondering about https://netbox-next.wikimedia.org/vpn/l2vpns/ and VXLAN
[10:05:38] I'll finish it anyway so we can track, we'll need to adjust homer templates
[10:06:19] for the vxlan I'm not sure we need to use that l2vpn stuff really
[10:06:33] I expect it's aimed more at ISPs doing VPLS and pseudowires
[10:06:48] Checking all it seems to allow is tracking the route-targets
[10:06:56] but we kind of just build all that from the vlan ID
[10:07:29] seems simpler to retain the current model - with an "evpn" flag on a device indicating it uses vxlan and any defined vlans should be built like that
[10:08:44] sounds good
[10:11:39] on the tunnels, does it look like we can (or should we) also use it for the CF tunnels?
[10:12:00] obviously at the front of my mind is the qos stuff, and identifying the internal vs external tunnels
[10:12:29] I see there are "tunnel groups" there we can use too, although I suspect we can differentiate without those
[10:12:58] topranks: https://netbox-next.wikimedia.org/vpn/tunnels/2/ :)
[10:15:55] ok cool that looks good, we can use the group no problem
[10:16:02] We can't define CF's side IP though, the termination needs to be on a device or VM we own
[10:16:11] or indeed I see the single-ended tunnel endpoint is of type "spoke" so another way
[10:16:20] but it's a single IP so no big deal
[10:16:34] it's a single IP common to them all is it?
[10:16:38] yep
[10:17:07] ok that doesn't make it too bad, I guess we could have something in the YAML with a key matching the "tunnel group" name for that?
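The idea floated at 10:17:07, keeping the shared remote IP in YAML under a key that matches the Netbox tunnel group name, could look roughly like the sketch below. It is only an illustration: the group names, the `remote_ip` key, the address (taken from a documentation range) and the helper `remote_endpoint` are all hypothetical, and a plain dict stands in for the parsed YAML.

```python
from ipaddress import ip_address
from typing import Optional

# Stand-in for the parsed YAML file. Keys are assumed to match Netbox
# tunnel group names; every name and address here is hypothetical.
TUNNEL_GROUP_DEFAULTS = {
    "external-example": {"remote_ip": "198.51.100.1"},
}


def remote_endpoint(
    tunnel_group: Optional[str],
    defaults: dict = TUNNEL_GROUP_DEFAULTS,
) -> Optional[str]:
    """Return the remote IP shared by all tunnels in a group, if one is defined."""
    if tunnel_group is None:
        return None
    entry = defaults.get(tunnel_group)
    if entry is None:
        return None
    # ip_address() raises ValueError on a malformed entry, so a typo in the
    # YAML fails loudly instead of ending up in generated config.
    return str(ip_address(entry["remote_ip"]))
```

Tunnels whose group has no entry simply get `None` back, so templates could fall back to whatever is modelled in Netbox for tunnels that terminate on both sides.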
[10:17:10] or probably other ways
[10:17:23] yeah, it's just a detail at this point
[10:17:45] be nice to get the GRE config in general out of the YAML :)
[10:18:00] yeah for sure
[10:18:31] we're also going to be able to move https://docs.google.com/spreadsheets/d/19f8XkjqQIKZ66uCY8vcEvqOdooTFxR8guLY6m_5yzXM/edit?gid=2128466433#gid=2128466433 to netbox
[10:18:49] by adding custom fields and updating https://netbox-next.wikimedia.org/circuits/circuits/?export=cross-connects
[10:20:37] nice!
[10:20:56] I still think we should model the carrier patch panels in there as part of that but happy for dc-ops to manage how they want :)
[10:23:52] we can also merge all the duplicates in https://netbox-next.wikimedia.org/extras/custom-links/ (like I just did for debmonitor)
[10:26:20] anyway, lots of good stuff
[10:26:37] let's hope that it also solves the few timeout and weird bugs we're seeing :)
[10:28:13] it will surely exorcise all those demons yes :D
[10:28:35] what's that custom-links view?
[10:29:23] topranks: it's the links on the top right side of objects: https://netbox-next.wikimedia.org/dcim/devices/4107/
[10:29:37] dell config, ticket, debmon, etc
[10:30:41] ah ok yeah I never delved into how they were done
[10:30:42] nice
[10:31:39] feel free to add more if you can think of any
[10:32:30] ah ok cool, I see we can now attach the VMs and physical devices to a single element there
[10:32:31] nice
[10:32:41] yeah will do
[10:36:52] netops, Infrastructure-Foundations, SRE: Model GRE tunnels in Netbox - https://phabricator.wikimedia.org/T369351 (cmooney) NEW p:Triage→Low
[10:49:50] it's also possible to explicitly define the OOB IP, see for example on https://netbox-next.wikimedia.org/dcim/devices/4107/
[10:50:36] and we can get it with like `device.oob_ip.address`
[12:43:41] congrats on netbox-next and netbox 4!
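To illustrate the `device.oob_ip.address` access mentioned at 10:50:36, here is a minimal sketch. It assumes the address is stored in CIDR form, as Netbox does for IP addresses, and that the OOB IP may be unset on some devices. The `Device` and `IPAddress` classes are hypothetical stand-ins for whatever object a script or template actually receives, and `oob_host` is an illustrative helper, not an existing function.

```python
from dataclasses import dataclass
from ipaddress import ip_interface
from typing import Optional


@dataclass
class IPAddress:
    """Hypothetical stand-in for an IP address object; address is in CIDR form."""
    address: str


@dataclass
class Device:
    """Hypothetical stand-in for a device object with an optional OOB IP."""
    name: str
    oob_ip: Optional[IPAddress] = None


def oob_host(device: Device) -> Optional[str]:
    """Return the device's OOB address without the prefix length, or None if unset."""
    if device.oob_ip is None:
        return None
    # ip_interface() parses "address/prefix"; .ip drops the prefix length.
    return str(ip_interface(device.oob_ip.address).ip)
```

The explicit `None` check matters because `device.oob_ip.address` on a device without an OOB IP would otherwise raise an `AttributeError`.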
[12:43:53] * sukhe tries out the delete feature
[12:43:56] as is tradition
[12:50:34] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:28:23] it is very nice that the Supermicro Redfish api, by default, doesn't accept a password like 'calvin' for a new account :D
[13:49:16] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:19:16] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:45:34] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:49:30] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9956862 (Papaul) @ayounsi here are some errors when running some scripts ` Scripts Provision_server ProvisionServerNetwork Script Source Jobs Error loading...
[15:47:27] CAS-SSO, Infrastructure-Foundations: Login attempts from bd808 get 500 on Debmonitor - https://phabricator.wikimedia.org/T369205#9956993 (bd808) >>! In T369205#9952826, @SLyngshede-WMF wrote: > @bd808 I think we need to find 20 minutes where we can enable CAS Debugging on Debmonitor and see what is going...
[17:12:40] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384 (cmooney) NEW p:Triage→Medium
[17:12:42] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#9957289 (cmooney)
[17:16:18] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9957290 (cmooney) Open→Resolved I'm going to close this task now, the current gnmic collection is providing what we need i...
[17:17:59] netops, Infrastructure-Foundations, observability, Observability-Metrics, SRE: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210#9957316 (cmooney) Open→Resolved Seems like a great tool, but we are going to move forward with pulling these stats using...
[17:20:58] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#9957330 (cmooney)