[01:09:55] PROBLEM - MariaDB sustained replica lag on m1 on db1217 is CRITICAL: 7.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:11:31] RECOVERY - MariaDB sustained replica lag on m1 on db1217 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[07:26:09] jynus: I am wondering if something is wrong with db2139:3314...it's had replication stopped for almost 14h now
[07:26:25] (it is a backup source)
[07:28:25] Although dbbackups says db2099 is the source for s4 so I am unsure
[08:27:11] maybe it crashed
[08:27:51] or the backup process, that stops replication, crashed
[08:29:47] 230425 17:43:24 [ERROR] mysqld got signal 7 ;
[08:30:59] mariadb@s4.service: Main process exited, code=killed, status=7/BUS
[08:31:08] mariadb@s4.service: Failed with result 'signal'.
[08:33:15] T321147
[08:33:16] T321147: Degraded RAID on db2139 - https://phabricator.wikimedia.org/T321147
[08:36:15] I will create a task and recover it- it wasn't related to backups
[08:36:24] but I wonder if there is something weird with hw there
[08:38:04] anything on hw logs?
[08:38:28] let me create the task first, I will look at that next
[08:40:55] cool
[08:44:03] I will subscribe you, as you may be able to provide good pointers
[08:44:33] sounds good!
[08:44:40] T335396
[08:44:41] T335396: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396
[08:44:58] 10.4.25
[08:46:19] ha
[08:46:19] [16212571.545882] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 - err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658 retry_rd_err_log[0000a80f 00000000 10002000 0496c100 0000e44b] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
[08:46:37] [16212571.545800] {8}[Hardware Error]: node: 0 card: 0 module: 1 rank: 0 bank: 3 device: 0 row: 58443 column: 600
[08:46:38] [16212571.545801] {8}[Hardware Error]: error_type: 3, multi-bit ECC
[08:46:38] [16212571.545803] {8}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
[08:47:00] :-(
[08:48:40] So yeah, A7 is broken
[08:48:42] :( hw is cursed
[08:48:44] Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
[08:49:01] we would not have this issue if we were on blockchain
[08:49:11] or mongodb
[08:49:58] jynus: It doesn't look like I can do much now but let me know if you need anything
[08:53:06] Amir1: this is one of "my" (note the quotes) hosts, so nothing at the moment, but feel free to subscribe and look at it, as it is a common db issue you may face
[08:53:38] yeah I think I dealt with bad memory before
[08:55:48] but you're not sure? ;-)
[08:56:39] marostegui: the host has had power issues before- maybe those are unrelated and just due to power unit maintenance
[08:56:53] yeah, who knows
[08:56:56] is it under warranty?
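A minimal way to check what the 07:26 messages describe (whether a backup source's replication is stopped and how far behind it is) is to query SHOW SLAVE STATUS on the instance. This is only a sketch: the host/port match the db2139:3314 instance mentioned above, while the user and password are hypothetical monitoring credentials.

```python
#!/usr/bin/env python3
# Sketch: check replication state on a MariaDB backup source instance.
import pymysql
from pymysql.cursors import DictCursor

conn = pymysql.connect(
    host="db2139.codfw.wmnet",   # assumption: the s4 backup source discussed above
    port=3314,                   # assumption: the s4 instance port
    user="repl_check",           # hypothetical monitoring user
    password="...",
    cursorclass=DictCursor,
)
with conn.cursor() as cur:
    cur.execute("SHOW SLAVE STATUS")   # MariaDB 10.4 syntax
    status = cur.fetchone()
    if status is None:
        print("not configured as a replica")
    else:
        print("IO thread running:", status["Slave_IO_Running"])
        print("SQL thread running:", status["Slave_SQL_Running"])
        print("Seconds_Behind_Master:", status["Seconds_Behind_Master"])
conn.close()
```

If both threads show "No" with no lag value, replication has stopped outright (as here, because the whole instance had crashed) rather than merely lagging.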
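To confirm from the host itself whether the memory errors in the dmesg excerpts above are still accumulating, the kernel's EDAC counters can be read from sysfs. A minimal sketch, assuming the standard Linux EDAC sysfs layout (per-controller ce_count/ue_count and per-DIMM counters); adjust paths if the kernel on the host exposes a different layout.

```python
#!/usr/bin/env python3
# Sketch: dump EDAC correctable/uncorrectable error counters per memory
# controller and per DIMM, to see whether errors like the ones above persist.
from pathlib import Path

for mc in sorted(Path("/sys/devices/system/edac/mc").glob("mc*")):
    ce = (mc / "ce_count").read_text().strip()   # correctable errors
    ue = (mc / "ue_count").read_text().strip()   # uncorrectable errors
    print(f"{mc.name}: ce_count={ce} ue_count={ue}")
    # Per-DIMM counters help pin the problem on a single module (e.g. DIMM_A7).
    for dimm in sorted(mc.glob("dimm*")):
        label = (dimm / "dimm_label").read_text().strip()
        dimm_ce = (dimm / "dimm_ce_count").read_text().strip()
        dimm_ue = (dimm / "dimm_ue_count").read_text().strip()
        print(f"  {dimm.name} ({label}): ce={dimm_ce} ue={dimm_ue}")
```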
[08:57:07] but maybe they are related- loss of power leading to cpu and memory weirdness
[08:58:44] 2020, so I am guessing no
[08:58:58] I am going to restart it
[08:59:45] if it is the module, better to have it rejected and run a server with less memory (especially for backup sources) than an unstable one
[08:59:46] Purchase date 2020-04-27
[08:59:51] so maybe it is still in warranty by 1 day
[09:00:10] maybe we can try to get that DIMM ordered today
[09:00:13] oh, true, my math is not good
[09:00:21] we might need kwakuofori to escalate this quickly to willy
[09:03:54] the errors are ongoing
[09:04:30] noted. will ping Willy but likely to be addressed in the second part of the day
[09:05:13] and unlikely it's under warranty (or only just) but let's see
[09:07:20] kwakuofori: isn't it 3 years since purchase date?
[09:10:54] it is. I'm just hoping the dates tally.
[09:12:17] So that date is from netbox and this is the date from the task: https://phabricator.wikimedia.org/T246007#6098817
[09:12:32] So it may be the case that the warranty expires tomorrow or on the 30th!
[09:12:55] It will be tight anyways
[09:15:48] the slot hasn't been disabled automatically
[12:01:14] So: "building a continuous process backing up newly uploaded files with a max lag of 20 seconds": 20 minutes
[12:01:25] "Debugging why new files were not detected because pymysql, contrary to documentation, was in autocommit mode": 9 hours
[12:07:05] :-/
[13:20:34] Emperor: did you see the ms-be2071 alert?
[13:24:48] no, let me see
[13:25:04] sigh
[13:25:09] /dev/sda1 246M 246M 172K 100% /srv/swift-storage/objects0
[13:33:09] that should be better
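On the 12:01 note about pymysql's autocommit behaviour: whether autocommit is on or off determines if a long-lived connection sees rows committed after its current transaction started, which is exactly the "new files were not detected" symptom in a polling loop. A minimal sketch (hypothetical host, database and table names) that sets the mode explicitly instead of relying on the driver default:

```python
#!/usr/bin/env python3
# Sketch: poll for newly uploaded files with an explicit autocommit setting,
# so each poll reads fresh data instead of a stale REPEATABLE READ snapshot.
import time
import pymysql

conn = pymysql.connect(
    host="m1-replica.example.wmnet",  # hypothetical host
    user="backup_poller",             # hypothetical user
    password="...",
    database="mediabackups",          # hypothetical database
    autocommit=True,                  # be explicit rather than trusting the default
)

last_seen = 0
while True:
    with conn.cursor() as cur:
        cur.execute("SELECT id, title FROM new_uploads WHERE id > %s", (last_seen,))
        for row_id, title in cur.fetchall():
            print("new file:", title)
            last_seen = max(last_seen, row_id)
    # With autocommit off you would need conn.commit() here so that the next
    # poll starts a fresh transaction and can see newly inserted rows.
    time.sleep(20)
```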
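For the ms-be2071 alert (a 100% full /srv/swift-storage/objects0), a quick check is to report current usage and whether the path is actually a separate mount; a suspiciously small filesystem at a storage path is sometimes a sign the data disk is not mounted and writes landed on the mount point. A sketch using only standard library calls, with the path taken from the df output above:

```python
#!/usr/bin/env python3
# Sketch: report usage for a Swift object filesystem and whether it is a mount.
import os
import shutil

path = "/srv/swift-storage/objects0"
usage = shutil.disk_usage(path)
pct = 100 * usage.used / usage.total
print(f"{path}: {usage.used / 1e6:.0f}M / {usage.total / 1e6:.0f}M used ({pct:.0f}%)")
print("is a mount point:", os.path.ismount(path))
```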