[00:00:57] RECOVERY - MariaDB sustained replica lag on s5 on db2178 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2178&var-port=9104
[00:23:17] RECOVERY - MariaDB sustained replica lag on s5 on db2171 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2171&var-port=13315
[00:39:23] RECOVERY - MariaDB sustained replica lag on s5 on db2123 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2123&var-port=9104
[00:46:09] RECOVERY - MariaDB sustained replica lag on s5 on db2111 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2111&var-port=9104
[00:51:41] PROBLEM - MariaDB sustained replica lag on s5 on db1230 is CRITICAL: 196 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1230&var-port=9104
[00:51:47] PROBLEM - MariaDB sustained replica lag on s5 on db1130 is CRITICAL: 221 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1130&var-port=9104
[00:52:19] PROBLEM - MariaDB sustained replica lag on s5 on db1200 is CRITICAL: 262 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1200&var-port=9104
[00:52:33] PROBLEM - MariaDB sustained replica lag on s5 on db1210 is CRITICAL: 246 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1210&var-port=9104
[00:52:47] PROBLEM - MariaDB sustained replica lag on s5 on db1144 is CRITICAL: 272 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1144&var-port=13315
[00:52:51] PROBLEM - MariaDB sustained replica lag on s5 on db1185 is CRITICAL: 270 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1185&var-port=9104
[00:53:11] PROBLEM - MariaDB sustained replica lag on s5 on db1161 is CRITICAL: 318 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1161&var-port=9104
[00:53:11] PROBLEM - MariaDB sustained replica lag on s5 on db1213 is CRITICAL: 322 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1213&var-port=13315
[00:59:53] RECOVERY - MariaDB sustained replica lag on s5 on db1183 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1183&var-port=9104
[02:01:45] PROBLEM - MariaDB sustained replica lag on s5 on db1154 is CRITICAL: 1126 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13315
[02:05:57] RECOVERY - MariaDB sustained replica lag on s5 on db1154 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13315
[02:06:55] RECOVERY - MariaDB sustained replica lag on s5 on db1185 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1185&var-port=9104
[02:08:01] RECOVERY - MariaDB sustained replica lag on s5 on db1210 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1210&var-port=9104
[02:08:39] RECOVERY - MariaDB sustained replica lag on s5 on db1161 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1161&var-port=9104
[02:12:27] RECOVERY - MariaDB sustained replica lag on s5 on db1144 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1144&var-port=13315
[02:15:33] RECOVERY - MariaDB sustained replica lag on s5 on db1230 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1230&var-port=9104
[02:15:41] RECOVERY - MariaDB sustained replica lag on s5 on db1213 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1213&var-port=13315
[02:38:29] RECOVERY - MariaDB sustained replica lag on s5 on db1200 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1200&var-port=9104
[04:03:37] RECOVERY - MariaDB sustained replica lag on s5 on db1130 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1130&var-port=9104
[05:34:25] a delete without a limit..... classic!
[06:20:18] arnaudb: is db1127 depooled for a reason? cloning or such? can it be repooled?
[07:43:11] let me check
[07:44:33] https://gerrit.wikimedia.org/r/c/operations/puppet/+/969989/ it has been swapped. marostegui: I'll keep it depooled until Monday and will set it aside for decommission
[07:45:48] excellent! for tracking those things, can you create a decommission task and make it a subtask of that provisioning one? so we can keep track of the hosts that are somewhat ready to be decommissioned
[07:45:57] so we don't have to bother you and ask all the time :)
[07:45:58] sure!
[07:46:11] thanks!
[11:14:06] backup1010 got bad partitioning - I could fix it through a painful set of cloning, lvm and mdadm commands, but given it has no data yet, I will just reimage it
[12:14:05] I'm glad you are going for bookworm <3
[15:02:23] Amir1: I really want to get pc4 onto bookworm
[15:02:27] Before it is in production
[15:02:34] Would you let me do it next week? XD
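The [05:34:25] remark and the replication lag alerts above point at a large unbatched DELETE as the likely culprit. A minimal sketch of the usual mitigation, chunking the delete with LIMIT and pausing between batches so replicas can keep up; the database, table, condition and batch size are placeholders, not the statement that actually caused the lag:

  # some_db.some_table and the "expired" condition are made-up placeholders
  # repeat a bounded DELETE until no rows remain, sleeping so replication catches up
  while true; do
    rows=$(mysql -N some_db -e "DELETE FROM some_table WHERE expired = 1 LIMIT 1000; SELECT ROW_COUNT();")
    [ "$rows" -eq 0 ] && break
    sleep 1
  done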
[15:55:46] o/ it's been a while since I came here with some pesky mariadb issues... but I have an "interesting" one :P
[15:55:50] T349695
[15:55:51] T349695: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695
[15:56:27] do you have any hint at what config value I should tweak to prevent mariadb from using too much memory?
[15:56:56] there are a few mentioned here but I'm not sure where to start https://mariadb.com/kb/en/mariadb-memory-allocation/
[15:58:16] dhinus: Probably the most obvious one is to reduce the innodb buffer pool size; however, I doubt that is the root cause. Reducing it might just give you a bit more time until it crashes again
[15:59:05] I checked that it's already lower than the recommended "70% of RAM", but I can try reducing it more
[15:59:05] If it is a pattern, I would try to see if there's a specific query causing issues
[15:59:32] Well the recommendation is that, but you can reduce it to whatever number you want. That's going to affect how "warm" the database is
[15:59:51] So the more memory you have there, the more tables will be stored in memory, and the faster the queries using those tables will be
[15:59:53] makes sense. I also suspect some specific query, but it's hard to find it...
[16:00:18] for a couple of days it happened at the same time, but now it's happening at totally different times
[16:00:20] But again, this might be a matter of time until it hits the threshold again. In other words, it is probably something else
[16:00:41] Maybe you can enable the slow query log and see if there's something massive there
[16:01:23] yeah I was thinking that as well. or also a query timeout
[16:01:24] Another option would be to monitor SHOW FULL PROCESSLIST periodically and check what was there before each crash
[16:01:58] You can use pt-kill to kill read queries that take more than X time
[16:01:59] Like we have on the wikireplicas
[16:02:22] I'd reduce the buffer pool size, but you definitely need to keep investigating because the crashes are likely to keep happening
[16:08:00] thanks, I will try a combination of these strategies!
[17:05:17] marostegui: sorry I was afk
[17:05:20] sure, definitely
[17:05:52] as long as you do it early :D
[17:11:23] yeah
[17:11:30] I will do it next week, early in the week
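On the buffer pool suggestion at [15:58:16]: innodb_buffer_pool_size is dynamic in current MariaDB, so it can be lowered without a restart and then persisted in config. A sketch only; the 16G figure and the config path are illustrative, and ToolsDB's real settings are managed elsewhere:

  # shrink the pool on the running instance (dynamic since MariaDB 10.2)
  mysql -e "SET GLOBAL innodb_buffer_pool_size = 16 * 1024 * 1024 * 1024;"
  # persist it so a restart keeps the new value (path and filename are placeholders)
  printf '[mysqld]\ninnodb_buffer_pool_size = 16G\n' | sudo tee /etc/mysql/mariadb.conf.d/99-buffer-pool.cnf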
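The slow query log from [16:00:41] can likewise be enabled at runtime to catch the "something massive". The threshold and file path below are assumptions for the example, not the values actually used:

  # log anything running longer than 10 seconds to a dedicated file
  mysql -e "
    SET GLOBAL slow_query_log_file = '/var/log/mysql/toolsdb-slow.log';
    SET GLOBAL long_query_time = 10;
    SET GLOBAL slow_query_log = 1;
  "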
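For the periodic SHOW FULL PROCESSLIST idea at [16:01:24], a crude snapshot loop so there is something to inspect after the next OOM kill; the interval and output path are made up for the example:

  # append a timestamped processlist snapshot every minute
  while true; do
    { date; mysql -e "SHOW FULL PROCESSLIST\G"; } >> /var/log/mysql/processlist-snapshots.log
    sleep 60
  done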
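And for pt-kill as mentioned at [16:01:58], a sketch of a watcher that targets long-running read queries, in the spirit of (not identical to) the wikireplicas setup; the 300-second threshold, check interval and connection details are illustrative:

  # check every 30s and kill SELECTs that have been running for more than 300s
  pt-kill --host 127.0.0.1 --user root --ask-pass \
          --match-command Query --match-info '(?i)^select' \
          --busy-time 300 --interval 30 \
          --print --kill-query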