[00:00:57] RECOVERY - MariaDB sustained replica lag on s5 on db2178 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2178&var-port=9104
[00:23:17] RECOVERY - MariaDB sustained replica lag on s5 on db2171 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2171&var-port=13315
[00:39:23] RECOVERY - MariaDB sustained replica lag on s5 on db2123 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2123&var-port=9104
[00:46:09] RECOVERY - MariaDB sustained replica lag on s5 on db2111 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2111&var-port=9104
[00:51:41] PROBLEM - MariaDB sustained replica lag on s5 on db1230 is CRITICAL: 196 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1230&var-port=9104
[00:51:47] PROBLEM - MariaDB sustained replica lag on s5 on db1130 is CRITICAL: 221 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1130&var-port=9104
[00:52:19] PROBLEM - MariaDB sustained replica lag on s5 on db1200 is CRITICAL: 262 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1200&var-port=9104
[00:52:33] PROBLEM - MariaDB sustained replica lag on s5 on db1210 is CRITICAL: 246 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1210&var-port=9104
[00:52:47] PROBLEM - MariaDB sustained replica lag on s5 on db1144 is CRITICAL: 272 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1144&var-port=13315
[00:52:51] PROBLEM - MariaDB sustained replica lag on s5 on db1185 is CRITICAL: 270 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1185&var-port=9104
[00:53:11] PROBLEM - MariaDB sustained replica lag on s5 on db1161 is CRITICAL: 318 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1161&var-port=9104
[00:53:11] PROBLEM - MariaDB sustained replica lag on s5 on db1213 is CRITICAL: 322 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1213&var-port=13315
[00:59:53] RECOVERY - MariaDB sustained replica lag on s5 on db1183 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1183&var-port=9104
[02:01:45] PROBLEM - MariaDB sustained replica lag on s5 on db1154 is CRITICAL: 1126 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13315
[02:05:57] RECOVERY - MariaDB sustained replica lag on s5 on db1154 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13315
[02:06:55] RECOVERY - MariaDB sustained replica lag on s5 on db1185 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1185&var-port=9104
[02:08:01] RECOVERY - MariaDB sustained replica lag on s5 on db1210 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1210&var-port=9104
[02:08:39] RECOVERY - MariaDB sustained replica lag on s5 on db1161 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1161&var-port=9104
[02:12:27] RECOVERY - MariaDB sustained replica lag on s5 on db1144 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1144&var-port=13315
[02:15:33] RECOVERY - MariaDB sustained replica lag on s5 on db1230 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1230&var-port=9104
[02:15:41] RECOVERY - MariaDB sustained replica lag on s5 on db1213 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1213&var-port=13315
[02:38:29] RECOVERY - MariaDB sustained replica lag on s5 on db1200 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1200&var-port=9104
[04:03:37] RECOVERY - MariaDB sustained replica lag on s5 on db1130 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1130&var-port=9104
[05:34:25] a delete without a limit..... classic!
[06:20:18] arnaudb: is db1127 depooled for a reason? cloning or such? can it be repooled?
[07:43:11] let me check
[07:44:33] https://gerrit.wikimedia.org/r/c/operations/puppet/+/969989/ it has been swapped. marostegui: I'll keep it depooled until Monday and will set it aside for decommission
[07:45:48] excellent! for tracking those things, can you create a decommission task and make it a subtask of that provisioning one? so we can keep track of the hosts that are somewhat ready to be decommissioned
[07:45:57] so we don't have to bother you and ask all the time :)
[07:45:58] sure!
[07:46:11] thanks!
[11:14:06] backup1010 got bad partitioning - I could fix it through a painful set of cloning, lvm and mdadm commands, but given it has no data yet, I will just reimage it
[12:14:05] I'm glad you are going for bookworm <3
[15:02:23] Amir1: I really want to get pc4 onto bookworm
[15:02:27] Before it is in production
[15:02:34] Would you let me do it next week? XD
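The [05:34:25] remark and the replication lag alerts above point at a large unbatched DELETE as the likely culprit. A minimal sketch of the usual mitigation, chunking the delete with LIMIT and pausing between batches so replicas can keep up; the database, table, condition and batch size are placeholders, not the statement that actually caused the lag:

  # some_db.some_table and the "expired" condition are made-up placeholders
  # repeat a bounded DELETE until no rows remain, sleeping so replication catches up
  while true; do
    rows=$(mysql -N some_db -e "DELETE FROM some_table WHERE expired = 1 LIMIT 1000; SELECT ROW_COUNT();")
    [ "$rows" -eq 0 ] && break
    sleep 1
  done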
[15:55:46] o/ it's been a while since I came here with some pesky mariadb issues... but I have an "interesting" one :P
[15:55:50] T349695
[15:55:51] T349695: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695
[15:56:27] do you have any hint at what config value I should tweak to prevent mariadb from using too much memory?
[15:56:56] there are a few mentioned here but I'm not sure where to start https://mariadb.com/kb/en/mariadb-memory-allocation/
[15:58:16] dhinus: Probably the most obvious one is to reduce the innodb buffer pool size; however, I doubt that is the root cause. Reducing it might just give you a bit more time until it crashes again
[15:59:05] I checked that it's already lower than the recommended "70% of RAM", but I can try reducing it more
[15:59:05] If it is a pattern, I would try to see if there's a specific query causing issues
[15:59:32] Well the recommendation is that, but you can reduce it to whatever number you want. That's going to affect how "warm" the database is
[15:59:51] So the more memory you have there, the more tables will be stored in memory, and the faster the queries using those tables will be
[15:59:53] makes sense. I also suspect some specific query, but it's hard to find it...
[16:00:18] for a couple of days it happened at the same time, but now it's happening at totally different times
[16:00:20] But again, this might be a matter of time until it hits the threshold again. In other words, it is probably something else
[16:00:41] Maybe you can enable the slow query log and see if there's something massive there
[16:01:23] yeah I was thinking that as well. or also a query timeout
[16:01:24] Another option would be to monitor SHOW FULL PROCESSLIST periodically and check what was there before each crash
[16:01:58] You can use pt-kill to kill read queries that take more than X time
[16:01:59] Like we have on the wikireplicas
[16:02:22] I'd reduce the buffer pool size, but you definitely need to keep investigating because the crashes are likely to keep happening
[16:08:00] thanks, I will try a combination of these strategies!
[17:05:17] marostegui: sorry I was afk
[17:05:20] sure, definitely
[17:05:52] as long as you do it early :D
[17:11:23] yeah
[17:11:30] I will do it next week, early in the week
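On the buffer pool suggestion at [15:58:16]: innodb_buffer_pool_size is dynamic in current MariaDB, so it can be lowered without a restart and then persisted in config. A sketch only; the 16G figure and the config path are illustrative, and ToolsDB's real settings are managed elsewhere:

  # shrink the pool on the running instance (dynamic since MariaDB 10.2)
  mysql -e "SET GLOBAL innodb_buffer_pool_size = 16 * 1024 * 1024 * 1024;"
  # persist it so a restart keeps the new value (path and filename are placeholders)
  printf '[mysqld]\ninnodb_buffer_pool_size = 16G\n' | sudo tee /etc/mysql/mariadb.conf.d/99-buffer-pool.cnf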
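The slow query log from [16:00:41] can likewise be enabled at runtime to catch the "something massive". The threshold and file path below are assumptions for the example, not the values actually used:

  # log anything running longer than 10 seconds to a dedicated file
  mysql -e "
    SET GLOBAL slow_query_log_file = '/var/log/mysql/toolsdb-slow.log';
    SET GLOBAL long_query_time = 10;
    SET GLOBAL slow_query_log = 1;
  "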
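For the periodic SHOW FULL PROCESSLIST idea at [16:01:24], a crude snapshot loop so there is something to inspect after the next OOM kill; the interval and output path are made up for the example:

  # append a timestamped processlist snapshot every minute
  while true; do
    { date; mysql -e "SHOW FULL PROCESSLIST\G"; } >> /var/log/mysql/processlist-snapshots.log
    sleep 60
  done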
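And for pt-kill as mentioned at [16:01:58], a sketch of a watcher that targets long-running read queries, in the spirit of (not identical to) the wikireplicas setup; the 300-second threshold, check interval and connection details are illustrative:

  # check every 30s and kill SELECTs that have been running for more than 300s
  pt-kill --host 127.0.0.1 --user root --ask-pass \
          --match-command Query --match-info '(?i)^select' \
          --busy-time 300 --interval 30 \
          --print --kill-query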