[00:03:47] RESOLVED: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:38:47] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:35:49] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:10:49] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:53:48] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:08:47] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:15:49] RESOLVED: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:53:37] I'll just try to reboot pcc-worker1006. It seems to fail repeatably
[11:12:17] 10Mail, 06collaboration-services, 06Infrastructure-Foundations, 10vrts: generate_vrts_aliases failing on mx-in1001 - https://phabricator.wikimedia.org/T368257 (10LSobanski) 03NEW
[12:43:40] topranks: have we ever thought about putting core network link utilization data in prometheus somehow?
[12:53:32] cdanis: we do :)
[12:54:16] XioNoX: do we?
[12:54:28] 👀
[12:54:49] I know we have the per-queue stats from gnmic, but we're filtering the rest of the interface stats we collect, right?
[12:54:52] iirc WMCS uses that data
[12:55:12] they use network stats from graphite I believe
[12:55:39] which tbh can probably get us a lot of where we want to be, but I'm personally not really used to the syntax so find it hard to work with
[12:55:41] ahhh, yeah, I thought we had that in prometheus
[12:55:44] ahh okay
[12:55:47] nah :(
[12:56:09] but getting it - at least for the more modern kit we have gnmic stats from, it's a pretty quick change I think
[12:56:21] there's really gRPC running on routers now?
[12:56:41] yep - we have some stats from it
[12:56:42] https://grafana-rw.wikimedia.org/d/5p97dAASz/cathal-network-queue-stats
[12:56:52] very cool
[12:56:59] all thanks to Arzhel
[12:58:18] https://grafana-rw.wikimedia.org/d/5p97dAASz/cathal-network-queue-stats?orgId=1&var-site=eqiad%20prometheus%2Fops&var-device=cr2-eqiad&var-interface=et-1%2F1%2F0&from=1719090161813&to=1719102979395
[12:59:07] cr1 was the worst hit
[12:59:09] https://grafana-rw.wikimedia.org/d/5p97dAASz/cathal-network-queue-stats?forceLogin&from=1719093997942&orgId=1&to=1719096830384&var-device=cr1-eqiad&var-interface=et-1%2F1%2F0&var-interface=et-1%2F1%2F3&var-interface=et-1%2F1%2F2&var-interface=et-1%2F0%2F2&var-site=eqiad%20prometheus%2Fops
[12:59:22] eesh
[12:59:30] in outages like the one this weekend it'd be very useful to have an at-a-glance panel of like, the ~5 "most drop-py" and "most saturated" links
[12:59:41] is line rate available in the gnmic metrics?
[12:59:46] quick one if somebody has time https://gerrit.wikimedia.org/r/c/operations/software/debmonitor-client/+/1049154
[12:59:57] (yes I am releasing software without Riccardo)
[13:00:41] <3
[13:00:59] cdanis: right now through adding all the outbound queue stats you can get it in that direction
[13:01:08] but tbh we are collecting the data and just not sending it to prometheus
[13:01:10] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/gnmi_telemetry.yaml#39
[13:01:29] if we remove lines 39-41 here I think we'd send all the stats (basically same as IF-MIB)
[13:01:48] got it
[13:01:55] I'd been meaning to try and test what volume that would result in, for observability to get their ok to do it
[13:02:44] I've not had the time to lab it all up yet but I'm very anxious to do it if we can - as you say that kind of panel would be very useful and the type of thing we could probably do with prom. queries
[13:03:57] 10CAS-SSO, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9917456 (10MoritzMuehlenhoff) p:05Triage→03Medium
[13:03:58] cdanis: the downside is our older switches don't support it, or at least not without painful row-wide JunOS upgrades
[13:04:08] right
[13:04:29] topranks: one way to lab it up is to just set up a second instance of a gnmic exporter that we don't point prometheus at yet, right?
[13:07:16] yeah I guess so, I'd more been thinking I need to create a test prom. instance and look at what metrics are sent to it
[13:07:36] but you're probably right, you can view the data gnmic gets itself without sending to prom.
[13:10:18] you can just scrape its /metrics endpoint yourself
[13:10:34] and if all you care about is per-metric cardinality, that's a shell one-liner :)
[13:11:14] cdanis: I have no doubt it's actually that easy tbh
[13:12:02] I need to do a little research on exactly how to do that, my brain only knows snmpwalk as it stands
[13:12:15] I'd be happy to try it together with you
[13:12:20] if you had any pointers that might be good?
[13:12:33] yeah that'd be great, I've just not had time to dig into it really
[13:12:42] the first thing I was looking at was how to set up another instance
[13:13:11] through puppet?
[13:13:37] yeah, nothing's parameterized so it's "easiest" in puppet to just set up the role on some other machine, like an sretest host maybe
[13:13:57] yeah that was my first thought when you mentioned it
[13:14:09] not sure if you also need special firewall rules or something to have the gnmi api be accessible
[13:14:26] yeah I'm sure we would
[13:14:34] what's your level of knowledge wrt prometheus & our installation of it topranks?
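[editor's note] The "~5 most drop-py / most saturated links" panel discussed above could plausibly be a couple of topk() queries once the filtered interface counters are sent to Prometheus. A sketch only: the metric name, the `site` label, and the `interface_speed_bps` series are all guesses at what gnmic would export after removing the filter, not names confirmed anywhere in this log.

```promql
# Hypothetical: the ~5 "most drop-py" links over the last 5 minutes.
topk(5, rate(interfaces_interface_state_counters_out_discards{site="eqiad"}[5m]))

# Hypothetical: the ~5 most saturated links, as outbound bits/s; dividing by a
# per-interface line-rate series (if one exists) would give utilization instead.
topk(5, rate(interfaces_interface_state_counters_out_octets{site="eqiad"}[5m]) * 8)
```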
[13:15:14] but perhaps that would be useful too, we could make a manual firewall addition on only one router so we're not pulling stats from them all from the test host
[13:15:41] I know a little about prometheus, far from an expert though
[13:15:43] you can also run it manually
[13:15:58] for any testing
[13:16:03] I'm not that familiar with our setup specifically
[13:16:29] XioNoX: yeah I was wondering that too.... like is there a way to run gnmic from the cli and just dump the data it gets back?
[13:16:44] even from your laptop using socks proxy through netflow hosts
[13:17:27] I recall you mentioned that before, do you need anything special on the laptop?
[13:17:31] just gnmic itself?
[13:17:38] yeah, last time I checked it was not exactly the same data returned depending on whether I was outputting it through stdout or /metrics or file, but it's possible
[13:17:38] ahaha
[13:17:58] just gnmic and tinyproxy
[13:18:14] topranks: if gnmic is like most things it just runs an http server on some port, that answers /metrics with basically a text file
[13:18:22] ok... I have a prometheus instance on my LAN here (obviously every home needs one) so I could fire the stats there too maybe
[13:19:09] actually tinyproxy is not needed for gnmic, only pygnmi
[13:19:21] yep
[13:19:24] curl -v localhost:9804/metrics
[13:19:34] ^^ worked on netflow1002
[13:19:57] usually what prom/o11y cares about is the maximum cardinality (number of different labels in existence) for a given metric
[13:20:06] exactly
[13:22:02] topranks, one of my many local test files, dunno if it works, but you get the idea https://www.irccloud.com/pastebin/QE6QeimQ/
[13:22:04] ok actually this is shaping up to seem fairly simple, let me see if I can give it a whirl
[13:22:12] cool thanks!
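[editor's note] The per-metric cardinality check cdanis calls "a shell one-liner" really is one: strip comment lines from the exposition-format output, truncate each series to its metric name, and count. A sketch using an invented sample file in place of the real `curl -s localhost:9804/metrics` output (the metric and label names below are made up for illustration):

```shell
# Stand-in for `curl -s localhost:9804/metrics > /tmp/metrics.txt` on netflow1002;
# these sample series are invented, not real gnmic output.
cat > /tmp/metrics.txt <<'EOF'
# HELP interface_out_octets made-up counter for illustration
interface_out_octets{interface_name="et-1/1/0",source="cr1-eqiad"} 111
interface_out_octets{interface_name="et-1/1/2",source="cr1-eqiad"} 222
interface_in_octets{interface_name="et-1/1/0",source="cr1-eqiad"} 333
EOF

# Series count per metric name: drop '#' comment lines, keep everything before
# the first '{' or space (the metric name), then tally and sort by count.
grep -v '^#' /tmp/metrics.txt | sed 's/[{ ].*//' | sort | uniq -c | sort -rn
```

On the sample above this prints `2 interface_out_octets` and `1 interface_in_octets`; run against the real endpoint, the counts are the per-metric series cardinality that o11y would want to see before unfiltering the stats.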
[13:22:23] then just run `ssh cumin1002.eqiad.wmnet -D 8888` for the socks proxy
[13:23:34] asw2-ulsfo doesn't work with gnmi though, there is a weird VC bug jtac gave up on, I need to re-open it
[13:24:13] sigh
[13:24:29] my home internet died just as I was about to post https://phabricator.wikimedia.org/P65390 for you topranks
[13:25:08] cdanis: ah nice!
[13:25:28] I will at the very least steal your greps and seds :)
[13:26:38] if we had counters of drops for every switch port, we wouldn't need nic_saturation_exporter
[13:27:48] yeah, we have the drops now but only for the switches that support it :(
[13:27:51] yeah
[13:27:57] well. some day
[13:28:00] but making progress, soon codfw won't have any VCs
[13:28:08] eqiad will be another year or two
[13:28:14] that's really not that bad :)
[13:29:02] XioNoX: is the password needed for gnmic?
[13:29:14] topranks: yeah
[13:29:17] ok
[13:29:22] topranks: you can use rancid's password iirc
[13:29:34] yeah that's what it's doing in prod.
[13:29:42] ok I'll do that rather than adding one temporarily
[13:56:09] 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9917697 (10ayounsi) Opened https://github.com/netbox-community/netbox/issues/16698 for a Netbox regression on how it handles Scripts compared to... 3.2.9 As well as documented https:...
[13:58:05] 10Mail, 06collaboration-services, 06Infrastructure-Foundations, 10vrts: generate_vrts_aliases failing on mx-in1001 - https://phabricator.wikimedia.org/T368257#9917703 (10jhathaway) a:03jhathaway
[14:25:43] XioNoX: topranks: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1048066
[14:25:52] running homer showed me no diffs for eqiad/codfw
[14:26:18] sukhe: did you run puppet on the cumin host you're using?
[14:26:32] oh for some reason I assumed that that wasn't required anymore!
[14:26:41] has to be that, thanks, doing so now then :)
[14:30:57] 10Packaging, 06Infrastructure-Foundations, 06SRE Observability: upgrade prometheus-ipmi-exporter to 1.8.0 - https://phabricator.wikimedia.org/T368088#9917836 (10MoritzMuehlenhoff) p:05Triage→03Medium Given that this is a Go static ELF we can also simply build on bookworm and copy over the deb to bullseye...
[14:31:01] 10Packaging, 06Infrastructure-Foundations, 06SRE Observability: upgrade prometheus-ipmi-exporter to 1.8.0 - https://phabricator.wikimedia.org/T368088#9917841 (10MoritzMuehlenhoff)
[14:45:42] 10Mail, 06collaboration-services, 06Infrastructure-Foundations, 10vrts: generate_vrts_aliases failing on mx-in1001 - https://phabricator.wikimedia.org/T368257#9917913 (10jhathaway) p:05Triage→03Medium
[14:50:49] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:58:33] 10netops, 06cloud-services-team, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9917973 (10aborrero)
[15:48:48] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:28:58] * elukey 15
[16:29:02] err :)
[16:33:52] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9918514 (10cmooney) >>! In T326322#9615636, @fgiunchedi wrote: > Yeah having some ballpark numbers will be a great help @cmooney, unless...
[16:39:16] XioNoX: btw the latest version of gnmic seems to have broken the socks proxy functionality
[16:39:49] bizarrely the binary uses the socks proxy you tell it to, but makes a connect() request to it to connect to the address/port of the proxy itself
[16:40:16] that almost sounds like you're using a double socks wrapper haha
[16:40:38] in other words instead of using the local socks proxy to request a connection to cr1-eqiad.wikimedia.org:32767, it requests a connection to localhost:
[16:41:00] cdanis: yeah I was really confused for ages, I had a quick look at the source to see where it was added but it's tricky to track down
[16:41:30] but 100% I can see the proxy request in wireshark is asking for that, then cumin is trying to connect to itself and getting RST. very odd
[16:41:44] for testing a local TCP port forward worked fine though so no major issue
[16:42:17] topranks: maybe ignore its native functionality and just run it via something like `tsocks`
[16:44:50] cdanis: cool yeah good to have that option in the back pocket
[17:32:41] 10Packaging, 06Infrastructure-Foundations, 06SRE Observability: upgrade prometheus-ipmi-exporter to 1.8.0 - https://phabricator.wikimedia.org/T368088#9918789 (10herron) >>! In T368088#9917836, @MoritzMuehlenhoff wrote: > Given that this is a Go static ELF we can also simply build on bookworm and copy over th...
[22:54:14] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:50:32] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed