[13:51:22] <elukey>	 herron: o/
[13:51:36] <elukey>	 if you have time, not sure if you saw https://phabricator.wikimedia.org/T359879#9669840
[13:51:41] <elukey>	 but I am really puzzled
[13:56:26] <herron>	 elukey: hey, yes I did see and we had a chat about it at the last o11y meeting as well, but no smoking gun yet
[13:56:38] <herron>	 I'll aim to spend some more time on it this week
[14:28:45] <elukey>	 thanks!!! Lemme know if I can help
[14:28:55] <elukey>	 I tried to check on titan nodes but couldn't find much
[14:29:49] <elukey>	 also https://thanos.wikimedia.org/rules#istio_slos look good-ish atm, a total of 17s for each evaluation
[14:30:01] <elukey>	 not great but it shouldn't cause a huge lag
[15:02:59] <urandom>	 is it not a given that an event in icinga should also show on alerts.w.o?
[15:04:50] <urandom>	 I'm trying to figure out how I could have a machine (aqs2001) down for more than 4 days, and only just get a notification for it this morning.  But it also still doesn't show in alerts.w.o...
[15:08:25] <godog>	 urandom: only icinga service alerts show up on alerts.w.o (though we could change that), not sure about being down for 4d though, how/where did you get the notification(s) ?
[15:09:04] <urandom>	 I got an alert for CQL at 11:17 UTC today, for both instances
[15:09:25] <urandom>	 but according to the icinga dashboard, those failures are 4 days old
[15:09:49] <urandom>	 which matches the dmesg output, and librenms showing the link going down
[15:10:19] <urandom>	 the notification I got was via email
[15:13:35] <godog>	 ok so yeah the host did indeed go down on the 28th, emails about the host being down were issued
[15:13:56] <godog>	 to data platform though; since the host was down, notifications for services were not sent
[15:13:59] <godog>	 [Thu Mar 28 15:51:04 2024] HOST NOTIFICATION: data-platform-alerts;aqs2001;DOWN;host-notify-by-email;PING CRITICAL - Packet loss = 100%
[15:14:03] <godog>	 [Thu Mar 28 15:51:04 2024] HOST NOTIFICATION: irc-data-platform;aqs2001;DOWN;notify-host-by-irc-team-data-platform;PING CRITICAL - Packet loss = 100%
[15:14:37] <godog>	 then today the host was UP according to icinga and services down, thus notifications were sent for the services
[15:15:00] <godog>	 for example this [Tue Apr  2 11:17:07 2024] SERVICE NOTIFICATION: team-services;aqs2001;cassandra-b CQL 10.192.0.215:9042;CRITICAL;notify-by-email;CRITICAL - Socket timeout after 10 seconds
[15:15:17] <godog>	 what I'm looking at is this
[15:15:20] <godog>	 alert1001:/srv/icinga-logs# zgrep -h aqs2001 icinga-03-27-2024-00.log icinga-03-28-2024-00.log icinga-03-29-2024-00.log icinga-03-30-2024-00.log icinga-03-31-2024-00.log icinga-04-01-2024-00.log icinga-04-02-2024-00.log /var/log/icinga/icinga.log | perl -pe 's/(\d+)/localtime($1)/e'
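(Editor's note: the perl one-liner above rewrites the first run of digits on each log line into a readable timestamp. A minimal Python sketch of the same idea follows. It assumes each raw Icinga log line starts with a bracketed Unix epoch, and it formats in UTC rather than the host's local time so the result is reproducible; `humanize_line` is a hypothetical helper name, and the raw sample line is reconstructed from the converted output quoted above, not copied from the actual log.)

```python
import re
import time


def humanize_line(line: str) -> str:
    """Replace the first run of digits in a log line with a readable UTC timestamp.

    Mirrors perl's s/(\\d+)/localtime($1)/e, except that it uses UTC
    instead of the machine's local timezone.
    """
    return re.sub(
        r"\d+",
        lambda m: time.asctime(time.gmtime(int(m.group()))),
        line,
        count=1,  # only the leading epoch, like the perl substitution without /g
    )


if __name__ == "__main__":
    # Assumed raw form of the host notification quoted in the chat.
    raw = "[1711641064] HOST NOTIFICATION: data-platform-alerts;aqs2001;DOWN"
    print(humanize_line(raw))
```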
[15:21:22] <urandom>	 godog: ok, so it's basically a split-brain of notification configuration 🤦‍♂️
[15:21:56] <godog>	 yeah sth like that
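(Editor's note: the behaviour godog describes can be sketched as a toy model. While the host is DOWN, only the host's contacts are notified and service notifications are held back; once the host is UP again, failing services notify their own contacts. This is a deliberately simplified illustration, not Icinga's actual implementation: it ignores state changes, retries and renotification intervals, and `ServiceCheck` and `notifications` are hypothetical names. The contact group names are the ones quoted in the log lines above.)

```python
from dataclasses import dataclass


@dataclass
class ServiceCheck:
    name: str
    contacts: str  # contact group that receives this service's notifications
    ok: bool


def notifications(host, host_up, host_contacts, services):
    """Return (contact group, message) pairs for one evaluation round."""
    if not host_up:
        # Host is down: notify the host contacts only, suppress service alerts.
        return [(host_contacts, f"{host} DOWN")]
    # Host is up: each failing service notifies its own contact group.
    return [(s.contacts, f"{host} {s.name} CRITICAL") for s in services if not s.ok]


if __name__ == "__main__":
    checks = [ServiceCheck("cassandra-b CQL", "team-services", ok=False)]
    # Host down: only the host contacts hear about it.
    print(notifications("aqs2001", False, "data-platform-alerts", checks))
    # Host back up, service still failing: the service contacts are notified.
    print(notifications("aqs2001", True, "data-platform-alerts", checks))
```

With different contact groups on the host and on its services, the team that owns the services hears nothing until the host itself is reported UP again, which matches the four-day gap discussed above.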