[00:03:47] RESOLVED: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:38:47] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:35:49] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:10:49] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:53:48] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:08:47] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:15:49] RESOLVED: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:53:37] I'll just try to reboot pcc-worker1006. It seems to fail repeatably
[11:12:17] 10Mail, 06collaboration-services, 06Infrastructure-Foundations, 10vrts: generate_vrts_aliases failing on mx-in1001 - https://phabricator.wikimedia.org/T368257 (10LSobanski) 03NEW
[12:43:40] topranks: have we ever thought about putting core network link utilization data in prometheus somehow?
[12:53:32] cdanis: we do :)
[12:54:16] XioNoX: do we?
[12:54:28] 👀
[12:54:49] I know we have the per-queue stats from gnmic, but we're filtering the rest of the interface stats we collect, right?
[12:54:52] iirc WMCS uses that data
[12:55:12] they use network stats from graphite I believe
[12:55:39] which tbh can probably get us a lot of where we want to be, but I'm personally not really used to the syntax so find it hard to work with
[12:55:41] ahhh, yeah, I thought we had that in prometheus
[12:55:44] ahh okay
[12:55:47] nah :(
[12:56:09] but getting it - at least for the more modern kit we have gnmic stats from, it's a pretty quick change I think
[12:56:21] there's really gRPC running on routers now?
[12:56:41] yep - we have some stats from it
[12:56:42] https://grafana-rw.wikimedia.org/d/5p97dAASz/cathal-network-queue-stats
[12:56:52] very cool
[12:56:59] all thanks to Arzhel
[12:58:18] https://grafana-rw.wikimedia.org/d/5p97dAASz/cathal-network-queue-stats?orgId=1&var-site=eqiad%20prometheus%2Fops&var-device=cr2-eqiad&var-interface=et-1%2F1%2F0&from=1719090161813&to=1719102979395
[12:59:07] cr1 was the worst hit
[12:59:09] https://grafana-rw.wikimedia.org/d/5p97dAASz/cathal-network-queue-stats?forceLogin&from=1719093997942&orgId=1&to=1719096830384&var-device=cr1-eqiad&var-interface=et-1%2F1%2F0&var-interface=et-1%2F1%2F3&var-interface=et-1%2F1%2F2&var-interface=et-1%2F0%2F2&var-site=eqiad%20prometheus%2Fops
[12:59:22] eesh
[12:59:30] in outages like the one this weekend it'd be very useful to have an at-a-glance panel of like, the ~5 "most drop-py" and "most saturated" links
[12:59:41] is line rate available in the gnmic metrics?
[12:59:46] quick one if somebody has time https://gerrit.wikimedia.org/r/c/operations/software/debmonitor-client/+/1049154
[12:59:57] (yes I am releasing software without Riccardo)
[13:00:41] <3
[13:00:59] cdanis: right now through adding all the outbound queue stats you can get it in that direction
[13:01:08] but tbh we are collecting the data and just not sending it to prometheus
[13:01:10] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/gnmi_telemetry.yaml#39
[13:01:29] if we remove lines 39-41 here I think we'd send all the stats (basically same as IF-MIB)
[13:01:48] got it
[13:01:55] I'd been meaning to try and test what volume that would result in, for observability to get their ok to do it
[13:02:44] I've not had the time to lab it all up yet but I'm very anxious to do it if we can - as you say that kind of panel would be very useful and the type of thing we could probably do with prom. queries
[13:03:57] 10CAS-SSO, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9917456 (10MoritzMuehlenhoff) p:05Triage→03Medium
[13:03:58] cdanis: the downside is our older switches don't support it, or at least not without painful row-wide JunOS upgrades
[13:04:08] right
[13:04:29] topranks: one way to lab it up is to just set up a second instance of a gnmic exporter that we don't point prometheus at yet, right?
[13:07:16] yeah I guess so, I'd more been thinking I need to create a test prom. instance and look at what metrics are sent to it
[13:07:36] but you're probably right, you can view the data gnmic gets itself without sending to prom.
[13:10:18] you can just scrape its /metrics endpoint yourself
[13:10:34] and if all you care about is per-metric cardinality, that's a shell one-liner :)
[13:11:14] cdanis: I have no doubt it's actually that easy tbh
[13:12:02] I need to do a little research on exactly how to do that, my brain only knows snmpwalk as it stands
[13:12:15] I'd be happy to try it together with you
[13:12:20] if you had any pointers that might be good?
[13:12:33] yeah that'd be great, I've just not had time to dig into it really
[13:12:42] the first thing I was looking at was how to set up another instance
[13:13:11] through puppet?
[13:13:37] yeah, nothing's parameterized so it's "easiest" in puppet to just set up the role on some other machine, like an sretest host maybe
[13:13:57] yeah that was my first thought when you mentioned it
[13:14:09] not sure if you also need special firewall rules or something to have the gnmi api be accessible
[13:14:26] yeah I'm sure we would
[13:14:34] what's your level of knowledge wrt prometheus & our installation of it topranks?
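[editor's note] The "~5 most drop-py / most saturated links" panel discussed above could plausibly be a couple of topk() queries once the filtered interface counters are sent to Prometheus. A sketch only: the metric name, the `site` label, and the `interface_speed_bps` series are all guesses at what gnmic would export after removing the filter, not names confirmed anywhere in this log.

```promql
# Hypothetical: the ~5 "most drop-py" links over the last 5 minutes.
topk(5, rate(interfaces_interface_state_counters_out_discards{site="eqiad"}[5m]))

# Hypothetical: the ~5 most saturated links, as outbound bits/s; dividing by a
# per-interface line-rate series (if one exists) would give utilization instead.
topk(5, rate(interfaces_interface_state_counters_out_octets{site="eqiad"}[5m]) * 8)
```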
[13:15:14] but perhaps that would be useful too, we could make a manual firewall addition on only one router so we're not pulling stats from them all from the test host
[13:15:41] I know a little about prometheus, far from an expert though
[13:15:43] you can also run it manually
[13:15:58] for any testing
[13:16:03] I'm not that familiar with our setup specifically
[13:16:29] XioNoX: yeah I was wondering that too.... like is there a way to run gnmic from the cli and just dump the data it gets back?
[13:16:44] even from your laptop using socks proxy through netflow hosts
[13:17:27] I recall you mentioned that before, do you need anything special on the laptop?
[13:17:31] just gnmic itself?
[13:17:38] yeah, last time I checked it was not exactly the same data returned depending on whether I was outputting it through stdout or /metrics or file, but it's possible
[13:17:38] ahaha
[13:17:58] just gnmic and tinyproxy
[13:18:14] topranks: if gnmic is like most things it just runs an http server on some port, that answers /metrics with basically a text file
[13:18:22] ok... I have a prometheus instance on my LAN here (obviously every home needs one) so I could fire the stats there too maybe
[13:19:09] actually tinyproxy is not needed for gnmic, only pygnmi
[13:19:21] yep
[13:19:24] curl -v localhost:9804/metrics
[13:19:34] ^^ worked on netflow1002
[13:19:57] usually what prom/o11y cares about is the maximum cardinality (number of different labels in existence) for a given metric
[13:20:06] exactly
[13:22:02] topranks, one of my many local test files, dunno if it works, but you get the idea https://www.irccloud.com/pastebin/QE6QeimQ/
[13:22:04] ok actually this is shaping up to seem fairly simple, let me see if I can give it a whirl
[13:22:12] cool thanks!
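[editor's note] The per-metric cardinality check cdanis calls "a shell one-liner" really is one: strip comment lines from the exposition-format output, truncate each series to its metric name, and count. A sketch using an invented sample file in place of the real `curl -s localhost:9804/metrics` output (the metric and label names below are made up for illustration):

```shell
# Stand-in for `curl -s localhost:9804/metrics > /tmp/metrics.txt` on netflow1002;
# these sample series are invented, not real gnmic output.
cat > /tmp/metrics.txt <<'EOF'
# HELP interface_out_octets made-up counter for illustration
interface_out_octets{interface_name="et-1/1/0",source="cr1-eqiad"} 111
interface_out_octets{interface_name="et-1/1/2",source="cr1-eqiad"} 222
interface_in_octets{interface_name="et-1/1/0",source="cr1-eqiad"} 333
EOF

# Series count per metric name: drop '#' comment lines, keep everything before
# the first '{' or space (the metric name), then tally and sort by count.
grep -v '^#' /tmp/metrics.txt | sed 's/[{ ].*//' | sort | uniq -c | sort -rn
```

On the sample above this prints `2 interface_out_octets` and `1 interface_in_octets`; run against the real endpoint, the counts are the per-metric series cardinality that o11y would want to see before unfiltering the stats.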
[13:22:23] then just run `ssh cumin1002.eqiad.wmnet -D 8888` for the socks proxy
[13:23:34] asw2-ulsfo doesn't work with gnmi though, there is a weird VC bug jtac gave up on, I need to re-open it
[13:24:13] sigh
[13:24:29] my home internet died just as I was about to post https://phabricator.wikimedia.org/P65390 for you topranks
[13:25:08] cdanis: ah nice!
[13:25:28] I will at the very least steal your greps and seds :)
[13:26:38] if we had counters of drops for every switch port, we wouldn't need nic_saturation_exporter
[13:27:48] yeah, we have the drops now but only for the switches that support it :(
[13:27:51] yeah
[13:27:57] well. some day
[13:28:00] but making progress, soon codfw won't have any VCs
[13:28:08] eqiad will be another year or two
[13:28:14] that's really not that bad :)
[13:29:02] XioNoX: is the password needed for gnmic?
[13:29:14] topranks: yeah
[13:29:17] ok
[13:29:22] topranks: you can use rancid's password iirc
[13:29:34] yeah that's what it's doing in prod.
[13:29:42] ok I'll do that rather than adding one temporarily
[13:56:09] 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9917697 (10ayounsi) Opened https://github.com/netbox-community/netbox/issues/16698 for a Netbox regression on how it handles Scripts compared to... 3.2.9 As well as documented https:...
[13:58:05] 10Mail, 06collaboration-services, 06Infrastructure-Foundations, 10vrts: generate_vrts_aliases failing on mx-in1001 - https://phabricator.wikimedia.org/T368257#9917703 (10jhathaway) a:03jhathaway
[14:25:43] XioNoX: topranks: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1048066
[14:25:52] running homer showed me no diffs for eqiad/codfw
[14:26:18] sukhe: did you run puppet on the cumin host you're using?
[14:26:32] oh for some reason I assumed that that wasn't required anymore!
[14:26:41] has to be that, thanks, doing so now then :)
[14:30:57] 10Packaging, 06Infrastructure-Foundations, 06SRE Observability: upgrade prometheus-ipmi-exporter to 1.8.0 - https://phabricator.wikimedia.org/T368088#9917836 (10MoritzMuehlenhoff) p:05Triage→03Medium Given that this is a Go static ELF we can also simply build on bookworm and copy over the deb to bullseye...
[14:31:01] 10Packaging, 06Infrastructure-Foundations, 06SRE Observability: upgrade prometheus-ipmi-exporter to 1.8.0 - https://phabricator.wikimedia.org/T368088#9917841 (10MoritzMuehlenhoff)
[14:45:42] 10Mail, 06collaboration-services, 06Infrastructure-Foundations, 10vrts: generate_vrts_aliases failing on mx-in1001 - https://phabricator.wikimedia.org/T368257#9917913 (10jhathaway) p:05Triage→03Medium
[14:50:49] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:58:33] 10netops, 06cloud-services-team, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9917973 (10aborrero)
[15:48:48] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:28:58] * elukey 15
[16:29:02] err :)
[16:33:52] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9918514 (10cmooney) >>! In T326322#9615636, @fgiunchedi wrote: > Yeah having some ballpark numbers will be a great help @cmooney, unless...
[16:39:16] XioNoX: btw the latest version of gnmic seems to have broken the socks proxy functionality
[16:39:49] bizarrely the binary uses the socks proxy you tell it to, but makes a connect() request to it to connect to the address/port of the proxy itself
[16:40:16] that almost sounds like you're using a double socks wrapper haha
[16:40:38] in other words instead of using the local socks proxy to request a connection to cr1-eqiad.wikimedia.org:32767, it requests a connection to localhost:
[16:41:00] cdanis: yeah I was really confused for ages, I had a quick look at the source to see where it was added but it's tricky to track down
[16:41:30] but 100% I can see the proxy request in wireshark is asking for that, then cumin is trying to connect to itself and getting RST. very odd
[16:41:44] for testing a local TCP port forward worked fine though so no major issue
[16:42:17] topranks: maybe ignore its native functionality and just run it via something like `tsocks`
[16:44:50] cdanis: cool yeah good to have that option in the back pocket
[17:32:41] 10Packaging, 06Infrastructure-Foundations, 06SRE Observability: upgrade prometheus-ipmi-exporter to 1.8.0 - https://phabricator.wikimedia.org/T368088#9918789 (10herron) >>! In T368088#9917836, @MoritzMuehlenhoff wrote: > Given that this is a Go static ELF we can also simply build on bookworm and copy over th...
[22:54:14] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:50:32] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed