[07:59:47] ERROR: 10.192.0.88:6000/sdc1 is unmounted -- This will cause replicas designated for that device to be considered missing until resolved or the ring is updated.
[07:59:48]
[08:01:22] kernel log errors from Oct 21 on ms-be2028
[08:03:43] going to unmount and try xfs_repair
[08:06:33] oh, it's worse than that
[08:07:11] sdc reports hardware errors and the hpsa driver has removed the drive entirely
[08:10:55] I'll open a Phab ticket for a disk swap. Do we have a way of knowing if a host is still in warranty? netbox tells me this host was purchased 2017-02-10
[08:12:53] it is normally 3 years
[08:13:07] I think there's a field that's when the warranty ends
[08:13:09] let me check
[08:14:41] no we don't have that in netbox because it's the same for all hardware
[08:14:58] and was deemed redundant
[08:15:24] right :)
[08:15:36] so yes, no warranty for you Emperor!
[08:16:11] Emperor: it's 3y and you also have the quick link in netbox top-right side to go to the vendor page (dell, hp)
[08:16:44] https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Hardware_Troubleshooting_Runbook#All_Hardware has some related info
[08:24:57] Thanks. So I should still open a phab item for out-of-warranty drive failures?
[08:26:21] (at my last place, once something was out of warranty that was basically it - get a warranty extension or retire it from service)
[08:27:24] we do out of warranty repairs in the 3 to 5 year gap, usually on a case by case basis depending on the issue. Hard drives are usually replaced without problems
[08:27:55] this server should be on the list for replacement in the next quarter AFAICT
[08:28:09] because it will reach the 5y threshold
[08:28:42] so maybe not worth it for just a few months, not sure what the plans are for refresh in that cluster
[08:32:32] https://netbox.wikimedia.org/dcim/devices/?cf_ticket=T155659
[08:32:59] that's the batch it was bought in
[08:37:19] we've just added a bunch of Swift backend servers (e.g. ms-be206[2..5]), but I'm not sure when/if we are planning on getting rid of these. godog care to weigh in when you've a moment, please?
[08:41:10] hello, reading scrollback
[08:42:38] yeah the old servers will get retired, I don't have the list handy though
[08:43:06] see also T294001 for what I believe is the same failed disk?
[08:43:07] T294001: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T294001
[08:43:40] there's likely a spare onsite, if not we can remove the disk from the rings and that's it I think
[08:43:52] indeed not worth getting a replacement
[08:46:09] yes, that's the same disk, good find :) I guess wait and see if a spare can be found
[08:46:57] agreed
[08:49:38] Emperor: fyi, there's a subscribe button on the sidebar on the right that doesn't require you to leave a comment
[09:21:50] the other thing I'd suggest is to add yourself as a 'watcher' of sre-swift-storage, unless you are receiving notifications from that project already as a member, I'm not sure
[09:48:40] I'll give that a go and see if the spam is too much :)
[14:20:34] [16:20:16] <+icinga-wm> RECOVERY - HP RAID on ms-be2028 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[14:20:36] \o/
[14:31:42] Now I need to get the drive back into Swift...
[14:35:17] (thankfully there are destructions)
[14:37:42] PROBLEM - MariaDB sustained replica lag on s7 on db1155 is CRITICAL: 381 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1155&var-port=13317
[14:37:51] ^ that's me
[14:38:20] I thought I had downtimed it
[14:45:58] RECOVERY - MariaDB sustained replica lag on s7 on db1155 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1155&var-port=13317
[18:33:26] FYI db1112 just went down
[18:33:33] s3 sanitarium master
[18:33:40] If puppet is right
[18:57:31] marostegui, kormat: ^
[18:58:57] https://phabricator.wikimedia.org/T294295
[19:04:50] Spookreeeno: thanks!
[19:05:11] Np
[19:47:12] sobanski: I didn't know what tag was best
[19:47:17] So used them all
[19:47:42] And that was a perfectly fine thing to do, thanks :)
[19:49:04] Np :)
[19:49:28] Glad me watching far too many irc alerts has come in handy
[19:50:09] yea, on Phabricator the tags will sort themselves out :)