[09:29:48] FYI there is no root mysql access from cumin1002
[10:46:23] yeah, this needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/983169 first
[10:46:48] and https://phabricator.wikimedia.org/T352974
[12:03:33] PROBLEM - MariaDB sustained replica lag on s1 on db1219 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1219&var-port=9104
[12:05:03] RECOVERY - MariaDB sustained replica lag on s1 on db1219 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1219&var-port=9104
[16:28:43] marostegui: This is really dhinus's project but I have one specific question about https://phabricator.wikimedia.org/T353093: Should it be possible for client behavior, however pathological, to cause an OOM? Is there really no way to cap mariadb's memory usage?
[16:31:43] andrewbogott: not really, although there are caps for resources (max queries, max connections, max statement time, etc.)
[16:32:30] it can also be limited through pt-query-killer, which you have already the infra there for wikireplicas (you would only need to tune it)
[16:33:11] for example, there are metrics of memory consumption per thread
[16:34:44] So there's no way to limit the total aggregate resource usage, only the usage per query?
[16:35:24] (Also I note that you don't mention 'ram' in any of those limits...)
[16:35:43] there is select * FROM memory_by_user_by_current_bytes;
[16:36:34] although it has a reported bug: https://jira.mariadb.org/browse/MDEV-23936
[16:38:25] So in theory we can cap ram per query, and also cap the number of queries
[16:38:44] I need to go, but hopefully I helped with some ideas
[16:38:45] (which, capping ram per query would involve detecting that it's over the limit and killing)
[16:39:06] thanks jynus
[16:43:28] memory_by_user_by_current_bytes is unfortunately 10.6 only (that's one good reason to upgrade)
[16:43:46] there might be some other useful tables in performance_schema that I haven't looked into yet
[16:51:17] dhinus: 10.4 is EOL in 6 months, so 10.6 is waiting for you there :)
[16:52:05] yep I guess we'll have to bite the bullet :)
[16:52:06] I don't know why toolsdb doesn't have a query killer like we do on wikireplicas
[16:52:25] But I think your team should give it a thought
[16:52:31] does the query killer kill based on memory used, or only on query running time?
[16:53:06] is the wikireplicas one this thing? https://wikitech.wikimedia.org/wiki/Query_killer
[16:57:25] It's based on running time
[16:57:38] That is old, we use pt-kill
[16:57:54] You can check on any of the clouddb hosts
[16:58:04] Issue a ps aux and you'll see it running
[16:58:26] That link above is a query killer we use in production
[17:08:47] what's the advantage over setting max_statement_time?
[17:10:11] for a started, that didn't exist when killers were set up
[17:10:14] *starter
[17:10:52] the other is that you can tune more specifically what to kill, rather than killing everything
[17:11:25] dhinus: it gives you a lot more flexibility
[17:11:34] What jynus said basically
[17:12:35] I see, it might be worth trying!
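
Editor's sketch: the closing point of the discussion, that a query killer lets you "tune more specifically what to kill" where max_statement_time applies one limit to everything, might be illustrated roughly as below. This is not how pt-kill is implemented and not the wikireplicas or toolsdb configuration. The `Process` fields, the `select_victims` and `kill_statements` helpers, the user names and the one-hour threshold are all hypothetical. Only the `KILL QUERY <id>` statement is real MariaDB syntax. The sketch filters on running time only, matching the "It's based on running time" answer above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Process:
    """One process-list row. Field names are hypothetical, chosen for this sketch."""
    id: int
    user: str
    command: str          # e.g. "Query" or "Sleep"
    time_s: int           # seconds the current statement has been running
    info: Optional[str]   # statement text, if any


def select_victims(
    processes: Iterable[Process],
    max_time_s: int,
    protected_users: Set[str],
    only_selects: bool = True,
) -> List[int]:
    """Return ids of statements a time-based killer would target.

    Unlike a single global time limit, the filters can exempt some users
    and restrict killing to read-only statements.
    """
    victims = []
    for p in processes:
        if p.command != "Query":
            continue  # idle or non-query threads are left alone
        if p.user in protected_users:
            continue  # e.g. admin or replication accounts (assumed names)
        if p.time_s <= max_time_s:
            continue  # still within the allowed running time
        is_select = (p.info or "").lstrip().upper().startswith("SELECT")
        if only_selects and not is_select:
            continue  # never interrupt writes or DDL in this mode
        victims.append(p.id)
    return victims


def kill_statements(ids: Iterable[int]) -> List[str]:
    """Render the statements an operator could review before running them."""
    return [f"KILL QUERY {i};" for i in ids]
```

With made-up rows, a long-running `SELECT` from an ordinary user is selected, while a protected user, a sleeping connection, a short query and a long `ALTER TABLE` are all skipped. A global time limit would not distinguish between those cases.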