[05:16:34] I am going to failover es1, es2 and es3 master for the kernel reboots
[06:02:12] done also es4 codfw
[06:15:14] going for es5 codfw now
[07:30:00] jynus: let me know when I can reboot db2160 (misc backup source)
[07:31:56] swift error rates are high, I'm doing rolling-restarts
[08:24:16] marostegui: any time before 0 am
[08:24:28] thanks!
[09:03:24] any thoughts? I asked dc ops already - https://phabricator.wikimedia.org/T337174#8870237
[09:07:03] uff
[09:07:13] In a brand new host...that doesn't look healthy
[09:07:58] Maybe move the disk to a different slot and see whether the disk keeps failing or the same slot keeps failing
[09:10:56] jynus: was that right after being set up?
[09:11:13] Ah now, a month after from what I can see
[09:11:14] you mean now, or before Sunday?
[09:11:16] *no
[09:11:24] this is a Sunday event
[09:11:37] yeah, no, I meant if this happened right after the host was installed
[09:16:54] jynus: firmware all up-to-date on this host?
[09:18:57] Emperor: I would hope so, this is a just-setup a few weeks ago host
[09:19:16] that is why I asked dcops for their advice
[09:19:33] there's a cookbook that can check for you :)
[09:19:35] but mentioned it here in case you have ever experienced that or had additional suggestions
[09:19:48] oh, that I didn't know
[09:20:06] see, those are the kind of suggestions I would be loving
[09:20:37] the sre.hardware.upgrade-firmware prompts you before doing anything, so you can run it and see if it has anything it wants to upgrade
[09:22:16] did the drive recover itself automatically?
[09:22:32] I pasted the full log
[09:23:02] I also checked for media errors and got 0
[09:24:14] OK. I suspect it'll be hard to get a new drive in that start replaced.
I think, assuming firmware all up to date, I'd be tempted to run a stress test on the RAID device for a few days and see if the issue recurs
[09:25:25] I added that to the question to dc ops but will now wait for their feedback
[09:26:08] depending on their feedback (and possibly upstream) I will follow your suggestion
[09:28:38] hardware, eh?
[09:32:39] Emperor: https://i.imgur.com/NwvlxQy.jpg
[09:33:03] lolsob
[11:57:15] PROBLEM - MariaDB sustained replica lag on s6 on db2124 is CRITICAL: 276 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2124&var-port=9104
[11:58:51] checking
[14:01:04] I forgot, or I guess technically urandom did, about having an ONFIRE update section
[14:02:00] janis will do 2 reviews this afternoon
[14:03:08] recommended to attend, etc.
[14:05:26] jynus: If by forget you mean that I didn't know I should do that, you are correct :)
[14:05:49] though, I definitely should have known, having seen you do so every week!
[14:06:43] yeah, I was kidding, didn't mean it as a passive-aggressive comment
[14:06:48] I know
[14:07:02] I started writing and then I realized it was not something I do, but also was used to
[14:07:14] also I expect you have not even attended a meeting yet!
[14:07:22] I have
[14:07:24] oh
[14:07:49] I understood 18.2% of what was discussed
[14:08:02] that's more than me! :-P
[14:08:24] uh oh, then clearly I'm overestimating myself