[12:34:13] dhinus: can you tell me how to find this dashboard that you inserted? https://phabricator.wikimedia.org/T384118#10526491
[13:35:39] andrewbogott: https://logstash.wikimedia.org/app/dashboards#/view/8b1907c0-2062-11ec-85b7-9d1831ce7631
[13:37:16] oh, logstash! I forget that logstash can make charts :)
[13:37:17] ty
[13:40:13] that confirms that the flapping has stopped! for now at least
[13:56:47] awesome! thanks for fixing that
[14:52:01] I'm gonna reboot cloudnet1006 which is the last reboot for T384946
[14:52:08] I checked and it's the current standby cloudnet
[14:52:12] so no impact expected
[14:52:47] ok! If it's the standby then there must've been a quiet failover over the weekend
[14:53:05] that or I marked the wrong box on the task, can you doublecheck uptime before you reboot?
[14:53:20] uptime confirms it's not been rebooted in a while
[14:53:26] while 1005 was rebooted 3 days ago
[14:55:10] ok then
[14:55:15] thanks for checking
[14:55:45] I was also surprised by the fact the missing one was the standby... so I double checked the uptime :)
[14:56:06] * dhinus checks if a failover is visible in grafana
[15:00:57] failover confirmed on 2025-02-08 around 12:00 UTC
[15:02:13] Does that correspond with the reboot of 1005? Maybe it somehow grabbed the service when it came up?
[15:32:16] andrewbogott: nope, reboot was 2025-02-07
[15:51:27] cloudnet1006 logged kernel errors
[15:56:52] arturo: I rebooted it
[15:57:09] I see the alert is about warnings, the error message is correctly ignored
[15:57:14] yep, that's good!
[15:57:21] are we interested in alerting based on warning-level messages?
[15:57:34] maybe we could keep the metric, but remove the alerts?
[15:58:56] yeah