[13:51:22] herron: o/
[13:51:36] if you have time, not sure if you saw https://phabricator.wikimedia.org/T359879#9669840
[13:51:41] but I am really puzzled
[13:56:26] elukey: hey, yes I did see and we had a chat about it at the last o11y meeting as well but no smoking gun yet
[13:56:38] I'll aim to spend some more time on it this week
[14:28:45] thanks!!! Lemme know if I can help
[14:28:55] I tried to check on titan nodes but couldn't find much
[14:29:49] also https://thanos.wikimedia.org/rules#istio_slos look good-ish atm, a total of 17s for each evaluation
[14:30:01] not great but it shouldn't cause a huge lag
[15:02:59] is it not a given that an event in icinga should also show on alerts.w.o?
[15:04:50] I'm trying to figure out how I could have a machine (aqs2001) down for more than 4 days, and only just get a notification for it this morning. But it also still doesn't show in alerts.w.o...
[15:08:25] urandom: only icinga service alerts show up on alerts.w.o (though we could change that), not sure about being down for 4d though, how/where did you get the notification(s)?
[15:09:04] I got an alert for CQL at 11:17 UTC today, for both instances
[15:09:25] but according to the icinga dashboard, those failures are 4 days old
[15:09:49] which matches the dmesg output, and librenms showing the link going down
[15:10:19] the notification I got was via email
[15:13:35] ok so yeah the host went down indeed on the 28, emails about the host down were issued
[15:13:56] to data platform though, since the host was down notifications for services were not sent
[15:13:59] [Thu Mar 28 15:51:04 2024] HOST NOTIFICATION: data-platform-alerts;aqs2001;DOWN;host-notify-by-email;PING CRITICAL - Packet loss = 100%
[15:14:03] [Thu Mar 28 15:51:04 2024] HOST NOTIFICATION: irc-data-platform;aqs2001;DOWN;notify-host-by-irc-team-data-platform;PING CRITICAL - Packet loss = 100%
[15:14:37] then today the host was UP according to icinga and services down, thus notifications were sent for the services
[15:15:00] for example this [Tue Apr 2 11:17:07 2024] SERVICE NOTIFICATION: team-services;aqs2001;cassandra-b CQL 10.192.0.215:9042;CRITICAL;notify-by-email;CRITICAL - Socket timeout after 10 seconds
[15:15:17] what I'm looking at is this
[15:15:20] alert1001:/srv/icinga-logs# zgrep -h aqs2001 icinga-03-27-2024-00.log icinga-03-28-2024-00.log icinga-03-29-2024-00.log icinga-03-30-2024-00.log icinga-03-31-2024-00.log icinga-04-01-2024-00.log icinga-04-02-2024-00.log /var/log/icinga/icinga.log | perl -pe 's/(\d+)/localtime($1)/e'
[15:21:22] godog: ok, so it's basically a split-brain of notification configuration 🤦‍♂️
[15:21:56] yeah sth like that
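
Editor's note: the `perl -pe 's/(\d+)/localtime($1)/e'` step in the command above replaces the first run of digits on each log line with a human-readable time. Below is a rough Python sketch of the same idea. It assumes each Icinga log line starts with a bracketed Unix epoch such as `[<epoch>] HOST NOTIFICATION: ...`, which is inferred from the converted output pasted in the chat. The function name `humanize` is hypothetical, and UTC is used instead of the host's local timezone so the result does not depend on where it runs.

```python
import re
import sys
from datetime import datetime, timezone

# Assumption: Icinga log lines begin with "[<unix epoch seconds>]".
EPOCH_PREFIX = re.compile(r"^\[(\d+)\]")


def humanize(line: str) -> str:
    """Replace a leading "[epoch]" with a readable UTC timestamp.

    Lines without a leading bracketed epoch are returned unchanged.
    """
    def repl(match: re.Match) -> str:
        ts = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
        # Similar layout to Perl's scalar localtime(), but always in UTC
        # and with a zero-padded day of month.
        return "[" + ts.strftime("%a %b %d %H:%M:%S %Y") + "]"

    return EPOCH_PREFIX.sub(repl, line, count=1)


if __name__ == "__main__":
    # Filter mode: read log lines on stdin, write converted lines to stdout.
    for raw in sys.stdin:
        sys.stdout.write(humanize(raw))
```

It could sit at the end of the same pipeline in place of the Perl one-liner, reading the `zgrep` output on stdin. Unlike the Perl substitution, it only touches a bracketed number at the very start of a line, so digits elsewhere in a line (such as IP addresses) are never rewritten.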