[09:11:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[09:16:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (2) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[09:21:35] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: (2) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[15:49:52] hello o11y i think icinga is struggling to process events? https://grafana.wikimedia.org/d/rsCfQfuZz/icinga?orgId=1&var-datasource=codfw%20prometheus%2Fops&from=now-7d&to=now&viewPanel=4
[15:51:00] cdanis: Thanks for the heads-up. Taking a look.
[15:51:28] icinga also sent two pages to vops via email but nothing posted on IRC
[15:52:22] yeah I noticed that too, we should be doing better on that front cdanis
[15:52:28] i.e. ircecho restarted
[15:53:30] yeah I saw icinga-wm just posted sth
[16:05:29] I'm keeping an eye on that dashboard btw, maybe we can chalk this up to "icinga takes the same time to warm up as a container ship engine"
[16:10:06] the max latency check shot up yesterday on alert1001 too, defo unexpected though not failover-related
[16:11:28] icinga-wm is sending messages correctly to -ops
[16:15:00] The latency check seems to be consistent at 21.9s
[16:15:28] Tho I'm not sure about the possible root cause for that.
[16:37:06] Aside from the high latency (unrelated to the failover) everything looks good for the alert hosts.
[17:12:16] We also have a strange scenario where our Icinga BFD check is failing for past 2+ hours
[17:12:28] The Python script seems to be failing to load the MIB
[17:12:34] https://phabricator.wikimedia.org/T359198
[17:12:46] I don't have time to look further right now, but just FYI
[17:13:16] Thanks for filing the task, I'll take a look at it.
[17:17:03] denisse: thanks!
[17:19:12] looking too, thanks topranks
[17:21:47] hah I think we were missing a 'download-mibs' invocation
[17:25:15] ok yeah I was wondering
[17:25:35] ok should be better now topranks
[17:25:51] godog: ok great! was the system reimaged or something?
[17:26:22] topranks: Yes, we upgraded it from Buster to Bookworm, this also involved a Python 2 to Python 3 upgrade.
[17:26:28] ah ok thanks
[17:26:40] and I guess the download-mibs wasn't part of the automation flow
[17:26:56] showing green across the board now!
[17:26:57] thanks :)
[17:27:01] You can find more info of the upgrade in here: https://phabricator.wikimedia.org/T333615
[17:27:01] Apologies for the inconvenience caused!
[17:27:13] no probs, glad it wasn't anything too tricky :)
[17:30:46] see this package on the alert hosts:
[17:30:53] ii snmp-mibs-downloader 1.2
[17:31:09] my guess is this was supposed to download the missing MIB
[17:31:26] but either lacked a timer or it was just downloading in the wrong/different path
[17:31:33] since the fix seemed to be a symlink?
[17:33:01] I think the fix was to execute the downloader command that is installed with that package
[17:33:40] it's an odd package that one, it installs a tool that downloads mibs from the internet, but the tool isn't executed automatically after package installation
[17:34:17] topranks: One question regarding the MIBs, do you know how often are they updated? I'm wondering if a systemd timer would be ideal for it or if a 1 time execution is the best approach.
[17:34:30] they are never updated
[17:34:31] ever :)
[17:35:08] systemd timer every Feb 29 :)
[17:35:21] yeah the installation paths changed between buster and bookworm, I did the quick and dirty thing of the symlink, the more proper fix is likely to ship /etc/snmp-mibs-downloader/snmp-mibs-downloader.conf with BASEDIR=/var/lib/snmp/mibs or understand why snimpy isn't loading from where snmp-mibs-downloader is downloading
[17:35:24] I am open to correction, but I don't think those standard MIBs ever change, or have for many years
[17:35:32] Okay, I have another question. If they're never updated, would it be better for us to use quickdatacopy to sync them between the hosts?
[17:36:07] if they are small I think it's just easier to pull on both/all machines. but either works
[17:36:11] @godog Thanks for sharing your findings and for the fix. <3
[17:36:40] denisse: I guess that's up to you. They are quite small, just text files.
[17:36:43] that way you dont need to define an active/source host in Hiera
[17:36:54] probably feel free to do whatever the easiest thing is to get them on new hosts
[17:37:40] Okay, I'll work on a patch and keep you posted. Thanks.
[17:37:49] denisse: sure np
[17:41:40] ok I'm logging off for the day, ttyl
[17:42:43] topranks: denisse: re: "was supposed to run the download tool". the download tool has config like this:
[17:42:46] AUTOLOAD="rfc ianarfc iana"
[17:43:05] maybe it means stuff can be added that is "auto loaded"?
[17:43:27] mutante: I may have been mistaken tbh
[17:43:44] The BGP MIB is an IETF one, I believe covered by 'rfc'
[17:44:04] godog can probably confirm if the tool ran, but just downloaded to a new location in bookworm (hence symlink)
[17:44:16] or if he had to kick off the downloader and then also add the symlink
[17:45:09] ACK, you are right, if the symlink fixed it that is probably all
[18:56:54] I can confirm it was a file path change. I can see calls to Exec["download-mibs {title}"]
[22:08:32] :)
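
Note on the log above: the root cause was a MIB download path that changed between Buster and Bookworm. A rough sketch of how one might check where a MIB file actually ended up on a host is below. Only /var/lib/snmp/mibs appears in the conversation; the other directories, the file name BGP4-MIB, and the helper find_mib are assumptions made for illustration and are not taken from the Puppet code or from snimpy.

```python
from pathlib import Path

# Candidate base directories to search. Only the first one is mentioned in
# the log; the other two are assumed locations and may not exist on a host.
CANDIDATE_DIRS = [
    "/var/lib/snmp/mibs",
    "/var/lib/mibs",
    "/usr/share/snmp/mibs",
]


def find_mib(mib_name, base_dirs=CANDIDATE_DIRS):
    """Return a sorted list of files named mib_name under the given directories.

    Directories that do not exist are skipped silently, so an empty list
    means the file was not found in any of the places that were searched.
    """
    hits = []
    for base in base_dirs:
        root = Path(base)
        if not root.is_dir():
            continue
        # Search recursively, since MIBs may sit in subdirectories.
        for path in root.rglob(mib_name):
            if path.is_file():
                hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    # "BGP4-MIB" is a hypothetical file name used only as an example.
    found = find_mib("BGP4-MIB")
    if found:
        for path in found:
            print(path)
    else:
        print("MIB not found in any candidate directory")
```

If the file shows up under one base directory while the loader searches another, that would be consistent with the path change confirmed at 18:56:54, and either the symlink or a BASEDIR setting as suggested at 17:35:21 would be the place to look.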