[13:57:48] Amir1: marostegui: https://phabricator.wikimedia.org/P58768 has been depooled given the page alert on #wikimedia-operations → it's a vslow/dump host fyi
[13:58:12] I depooled the wrong host
[13:58:43] haha
[13:58:44] arnaudb: thanks, please create a task so it can be tracked and take a look at the HW logs and see if there's something interesting there
[13:59:00] yep!
[13:59:14] arnaudb: downtime it too, give it like 24h or so
[14:00:18] done
[14:01:57] "The system board BP1 PG voltage is outside of range. Tue Mar 12 2024 14:52:10"
[14:04:34] that sounds like it deserves an ops-eqiad tag and a ping to dc-ops
[14:04:47] Start mariadb and replication, but leave it depooled for now
[14:04:53] trying a hard reboot atm, to see how it's dead
[14:05:06] how much*
[14:05:08] but it came back no?
[14:05:13] like it rebooted itself
[14:05:24] it was unreachable and nothing on console
[14:05:28] [14:56:59] <+icinga-wm> RECOVERY - Host db1246 #page is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[14:05:30] yep
[14:05:33] no ssh though
[14:05:39] aaah
[14:05:40] weird
[14:05:49] past grub
[14:07:03] same symptom, stuck on https://usercontent.irccloud-cdn.com/file/iqBkpxwM/image.png
[14:07:20] sometimes that takes a while
[14:07:25] let's give it a few mins
[14:11:26] I've seen sick raspberry pis boot faster at this stage :D
[14:13:40] XD
[14:13:49] so yeah, downtime it for a week I'd say
[14:14:05] And tag ops-eqiad with the pasted HW log you found and let's see what they say
[14:14:10] yep
[14:22:56] [{reqId}] {exception_url} Wikimedia\Rdbms\DBConnectionError: Cannot access the database: No route to host (db1246)
[14:23:00] I am late to the party :)
[14:23:29] it is no longer occurring
[14:25:40] hashar: yeah, arnaudb depooled it :)
[14:25:51] good arnaudb :)
[14:27:15] arnaudb: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1010530 can you merge that? :)
[14:28:24] sure!
[14:29:14] thanks
[14:48:38] https://en.wikipedia.org/wiki/Turtles_all_the_way_down#/media/File:PSM_V10_D562_The_hindoo_earth.jpg
[15:56:56] (SystemdUnitFailed) firing: (23) prometheus-mysqld-exporter.service on db2197:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:58:01] 🤔 checking
[16:03:55] it's weird, all the hosts have `profile::monitoring::notifications_enabled: false`
[16:04:30] anyway, it's only unprovisioned hosts so I'll ack the alert until tomorrow
[16:51:40] arnaudb: ^ the host is down so that patch isn't effective :-). I just sent it in case it comes back
[16:57:25] oh it was not one of the hosts impacted by this alert
[17:02:24] Ah!
[17:02:29] Then don't know :)