[05:49:43] jynus: I got db2097:3311 and db2100:3318 fixed on tendril but I am unsure about db1139:3311, can I just drop that one?
[07:33:50] marostegui: I'm going to try to sleep, there are two bash scripts running in mwmaint (my screen) cleaning up the logging table in arwiki and ruwiki. If you see issues in s6 or s7, feel free to kill them
[07:34:19] sweet
[07:34:21] thanks
[07:34:24] get a good rest
[08:41:42] marostegui, fixed? why weren't they working?
[08:42:15] jynus: they needed to be recreated
[08:42:45] which is a bug when reimaging from 10.1 to 10.4
[08:42:45] they stop getting updated on tendril for some reason
[08:43:10] ah, you weren't in Monday's meeting - I intend to reimport db1139:s1 with a logical dump
[08:43:17] maybe others
[08:43:17] ah cool
[08:43:27] I will leave it there then :-)
[08:52:30] finally backups are fast again
[08:56:20] and I can finally use f-strings!
[09:06:26] swift is split py2/py3 for the different clusters, so the ring management thing I'm working on has to be bilingual
[09:07:02] :-(
[09:07:16] hopefully that can get upgraded soon
[09:07:34] 🤣
[09:23:04] PROBLEM - MariaDB sustained replica lag on s7 on db2121 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2121&var-port=9104
[09:26:18] something is doing heavy sequential reads on db2121, but I don't know what
[09:27:07] maybe it is just replication and it is cold
[09:27:24] ah, Amir mentioned the script on s7, could be that
[09:28:10] yes, it is "Slave_SQL" so not an issue
[09:28:18] it is that yes
[09:29:20] it is a bit weird only 1 host is alerting - maybe it is not in an optimal state data-wise or hw-wise
[09:31:04] other hosts have similar throughput but are lagging much less
[09:31:59] at least it is not 10.4.22 :-)
[09:32:12] I wouldn't spend much time on this I think
[09:32:41] no, but something to notice in case it repeats
[09:32:56] PROBLEM - MariaDB sustained replica lag on s7 on db2121 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2121&var-port=9104
[09:33:11] as in, in other circumstances
[09:33:37] I will just check the RAID to make sure disks are fine and leave it be
[09:37:37] ^ no disk errors, no learning cycle BTW
[09:41:10] So https://phabricator.wikimedia.org/T295563 (disk issue on a swift host) - there were enough medium errors for XFS to become sad and unmount the FS, but the hardware diagnostics haven't marked the drive as failed (so Dell don't want to fix it) - I could try xfs_repair and remounting it and seeing if it behaves better, or I could try taking the "XFS thinks it's broken" line... ?
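(A minimal sketch of the pre-repair check being weighed here: run xfs_repair in no-modify mode against the unmounted device and count "Medium Error" lines in the kernel log before deciding between a full repair and a disk swap. The device path is a placeholder, not the actual swift host's drive, and the exit-status interpretation assumes xfs_repair's documented -n behaviour.)

    #!/usr/bin/env python3
    """Hedged sketch: dry-run health check for a suspect XFS device.
    The device must be unmounted before xfs_repair is run against it."""
    import subprocess
    import sys

    DEVICE = "/dev/sdX1"  # placeholder for the affected swift drive partition

    def kernel_medium_errors():
        """Return kernel ring buffer lines mentioning SCSI Medium Errors."""
        dmesg = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
        return [line for line in dmesg.stdout.splitlines() if "Medium Error" in line]

    def xfs_dry_run(device):
        """xfs_repair -n only inspects the filesystem; a non-zero exit status
        generally means it found corruption it would want to fix."""
        return subprocess.run(["xfs_repair", "-n", device]).returncode

    if __name__ == "__main__":
        errors = kernel_medium_errors()
        print(f"{len(errors)} Medium Error lines in dmesg")
        rc = xfs_dry_run(DEVICE)
        print(f"xfs_repair -n {DEVICE} exited with {rc}")
        sys.exit(1 if rc or errors else 0)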
[09:41:55] Emperor, cannot say it will work for sure, but in the past I got disks replaced without them fully failing
[09:42:01] [I have limited experience with XFS; our vendor at my last place was happy with kernel errors]
[09:42:21] as long as I had a RAID output, but I'd say talk to dcops about what they can do
[09:42:23] jynus: well, Papaul seems to think there's not enough there
[09:42:28] :-(
[09:42:36] jynus: see his comment at the bottom of that phab item
[09:45:48] I guess I'll try and repair the fs and see if that works
[09:47:12] to be fair, with 15 media errors, it will likely be rejected by the raid soon
[09:48:28] I am guessing the controller is more lenient for a JBOD than for a RAID :-(
[09:51:37] marostegui, I think I found what is different about db2121 - not super important, but interesting: it is swapping way more than the other hosts, even if it is using much less memory
[09:51:53] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=18&orgId=1&from=1637056309436&to=1637142709436&var-server=db2121&var-datasource=thanos&var-cluster=mysql
[10:03:24] RECOVERY - MariaDB sustained replica lag on s7 on db2121 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2121&var-port=9104
[10:11:30] PROBLEM - MariaDB sustained replica lag on s7 on db2121 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2121&var-port=9104
[10:20:48] that's odd indeed
[10:25:48] xfs_repair is 🐌
[10:49:47] Bored now
[10:51:23] RECOVERY - MariaDB sustained replica lag on s7 on db2121 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2121&var-port=9104
[11:14:55] PROBLEM - MariaDB sustained replica lag on s7 on db2121 is CRITICAL: 3.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2121&var-port=9104
[11:26:24] thank you, btw, Manuel, for taking care of a lot of backup work while I was out, I was told on Monday
[11:26:34] did my best :)
[11:26:34] much appreciated :-)
[11:26:39] you did great!
[11:27:11] I think you ran into the "you have to prepare with the same version" error
[11:27:19] yeah
[11:27:38] I found some weird stuff
[11:27:43] this is T253959 but it was not given a lot of priority
[11:27:44] T253959: Check we are preparing (xtrabackup --prepare) with the same package version as the server version of which the backup was taken - https://phabricator.wikimedia.org/T253959
[11:27:46] like some errors on the transfer but then zarcillo was saying it was ok
[11:28:16] I think the logs may be misleading, as they are run in parallel
[11:29:03] and that should be fixed in the next release with volans' patch: https://gerrit.wikimedia.org/r/c/operations/software/wmfbackups/+/736652
[11:35:08] yeah, thanks for that volans :)
[11:35:24] yw :)
[11:42:13] Emperor: https://phabricator.wikimedia.org/T295118#7509294
[11:42:52] ah I see you are already there - nice!
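(A sketch of the version guard discussed just above, in the spirit of T253959 but not the actual wmfbackups code or volans' patch: compare the server version recorded in the backup's metadata with the locally installed mariabackup before running --prepare. The xtrabackup_info file name and the version formats are assumptions based on typical mariabackup output.)

    """Hedged sketch: refuse to prepare a backup with a mismatched tool version."""
    import re
    import subprocess
    import sys

    def backup_server_version(backup_dir):
        """Read e.g. 'server_version = 10.4.22-MariaDB-log' from xtrabackup_info."""
        with open(f"{backup_dir}/xtrabackup_info") as f:
            for line in f:
                m = re.match(r"server_version\s*=\s*(\d+\.\d+)", line)
                if m:
                    return m.group(1)
        raise RuntimeError("server_version not found in xtrabackup_info")

    def local_tool_version():
        """Extract major.minor from `mariabackup --version` output."""
        out = subprocess.run(["mariabackup", "--version"], capture_output=True, text=True)
        m = re.search(r"(\d+\.\d+)\.\d+", out.stdout + out.stderr)
        if not m:
            raise RuntimeError("could not parse mariabackup version")
        return m.group(1)

    def prepare(backup_dir):
        want, have = backup_server_version(backup_dir), local_tool_version()
        if want != have:
            sys.exit(f"refusing to prepare: backup from {want}.x but mariabackup is {have}.x")
        subprocess.run(["mariabackup", "--prepare", f"--target-dir={backup_dir}"], check=True)

    if __name__ == "__main__":
        prepare(sys.argv[1])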
[11:45:58] :)
[11:55:20] PROBLEM - MariaDB sustained replica lag on s7 on db2095 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2095&var-port=13317
[11:56:22] RECOVERY - MariaDB sustained replica lag on s7 on db2095 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2095&var-port=13317
[12:05:42] I made the s7 one twice as slow
[12:18:50] Amir1, it wasn't a script issue, but a host issue
[12:19:07] oh okay
[12:32:58] Amir1: I have merged https://phabricator.wikimedia.org/T291419 into the TransactionProfiler one
[12:42:31] Thanks
[12:42:58] I need someone to review the mediawiki patch
[12:44:18] * marostegui runs away
[13:11:28] 🤪🤪🤪
[13:19:16] PROBLEM - MariaDB sustained replica lag on s7 on db2121 is CRITICAL: 3.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2121&var-port=9104
[13:40:50] ugh, xfs_repair still going
[14:11:04] oh look, more Medium Errors
[14:13:17] same sector as last time, which smells very "actual hardware fault" to me
[14:14:03] Emperor: what you pasted on https://phabricator.wikimedia.org/T295563 should be enough to get a new disk shipped in
[14:15:14] here's hoping...
[14:15:54] I wouldn't have even started the xfs_repair
[14:16:17] I did so because Papaul thought I'd not provided enough evidence to get Dell to replace the disk
[14:16:48] Usually the kernel errors and all the controller info are enough, at least with DBs
[14:18:08] AFAICT (I hate hardware RAID so am probably not very good at reading it) the RAID controller hasn't failed the disk, just marked the Medium Error Count up
[14:18:51] which means it is about to fail (or failed already but the controller didn't really detect it yet), but the kernel errors are pretty clear
[14:19:19] sometimes the RAID controller cannot see that the disk has failed entirely and doesn't mark it as failed, but it is really broken and performance gets degraded a lot until you manually mark it as failed
[14:20:39] Oh, sure, _I_ think the disk is duff. You don't need to convince me :) Also, disks are a) cheap b) crap so I would always swap one at the first sign of trouble...
[14:21:34] Yeah, I think the xfs_repair is a waste of time and we should probably ask for a new one based on the kernel logs
[14:21:46] if it doesn't fail now, it will soon
[14:23:33] I rather thought so too, but I had kernel logs in the original ticket and that wasn't good enough
[17:18:02] db2121 is still producing some alerting; when the ongoing maintenance finishes it may be worth doing a reboot or upgrade there (no rush needed ofc, being codfw)
[18:38:24] good morning?
[18:41:59] Morning
[19:15:39] RECOVERY - MariaDB sustained replica lag on s7 on db2121 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2121&var-port=9104
[19:31:42] I added a timestamp condition to the delete to fix this ^
[19:32:04] it's quite a hassle tbh, now I need to delete year by year
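(A rough sketch of the year-by-year batched delete mentioned in the last two messages: constrain each DELETE to one year of log_timestamp values and work in small LIMITed batches with a pause in between so replicas can keep up. The connection details, the WHERE condition and the batch/sleep values are illustrative only, not the actual maintenance script that was running on mwmaint.)

    """Hedged sketch: chunked cleanup of the MediaWiki logging table, one year at a time."""
    import os
    import time
    import pymysql

    BATCH = 1000
    SLEEP = 2.0  # seconds between batches; double it to run "twice as slow"

    def delete_year(conn, year, condition="log_type = 'patrol'"):
        """Delete matching logging rows whose log_timestamp falls inside `year`.
        The condition above is an illustrative placeholder, not the real filter."""
        start, end = f"{year}0101000000", f"{year + 1}0101000000"
        with conn.cursor() as cur:
            while True:
                deleted = cur.execute(
                    "DELETE FROM logging WHERE " + condition +
                    " AND log_timestamp >= %s AND log_timestamp < %s LIMIT %s",
                    (start, end, BATCH),
                )
                conn.commit()
                if deleted < BATCH:   # this year's range is done
                    break
                time.sleep(SLEEP)     # give replication room to catch up

    if __name__ == "__main__":
        conn = pymysql.connect(host="localhost", db="ruwiki",
                               read_default_file=os.path.expanduser("~/.my.cnf"))
        for year in range(2005, 2022):
            delete_year(conn, year)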