[09:13:02] Do we have a process for engaging in RfCs? apropos T191804 someone has started https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)#RfC:_Increasing_the_maximum_size_for_uploaded_files and while I'm not entirely opposed, if we start seeing a lot of 5G files that would previously have been edited down to 4G, that could be quite a lot more storage
[09:13:03] T191804: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804
[09:15:15] why is that an RFC in the first place? :/
[09:15:35] Emperor: I would first engage on the ticket- whoever started the RfC may not know the whole story. Only now is arnaud running the schema change, so it is definitely not ready
[09:19:27] Emperor: do you want me to comment on ticket or do you want me to?
[09:23:39] *you to do it yourself?
[09:25:20] jynus: please do
[09:36:02] Emperor: see if my comment on phab reflects your thoughts well too
[09:39:33] seems fair enough to me, thank you
[12:56:18] I think there is something wrong with the memory usage of tools-db-1 (MariaDB 10.4) -- looks a bit like a memory leak? https://phabricator.wikimedia.org/T349695#9322218
[12:57:42] 'show global status like "memory_used";' reports about 600MB used, but the actual memory used by the mysql process grows almost constantly up to 64GB, where it's killed by the OOM killer
[12:59:11] I would agree with that
[12:59:44] the version is 10.4.29 and I don't see any memory leaks reported in the changelog for that version
[13:00:43] also the problem only appeared a few weeks ago, so it's weird
[13:14:49] I see a couple of threads online suggesting to use jemalloc, e.g. https://stackoverflow.com/a/60488432/1765344
[13:15:30] does it make sense to try this? I know nothing about jemalloc :)
[13:17:40] dhinus: can't say for sure but did you pin the innodb buffer pool?
[13:19:18] I tried decreasing it from 32GB to 20GB and that helped, but eventually it went OOM anyway
[13:19:43] it went back to 32GB after the last couple of restarts, because I hadn't modified the config file
[13:21:25] I can try lowering it to 20GB permanently, but it seems it simply makes the leak slower (it took 3 days instead of 1 to reach the OOM)
[13:23:39] I've just lowered it temporarily to 10GB, so hopefully it won't crash again during the weekend... but it doesn't look like a long-term solution
[13:26:38] that's weird, if it's pinned why is it going further
[13:26:50] I'm not sure what could be the reason
[13:27:45] this is a VM so there could be additional weirdness that you don't see when MariaDB is installed on a physical host
[13:31:46] lowering innodb_buffer_pool_size from 32G to 10G instantly freed about 20G of system memory (as shown by "top")
[13:32:16] but the total is still 35G, so 20G more than what I set :)
[13:32:23] (total shown by "top")
[13:37:01] innodb is one part of it, I think mariadb also puts aside some memory for temporary tables
[13:37:54] reducing the innodb buffer just frees more memory for temp tables to grow into
[13:38:52] https://mariadb.com/kb/en/mariadb-memory-allocation/ ?
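A minimal SQL sketch of the checks discussed above (the server's own memory accounting vs. the buffer pool and temp-table settings); the 10G value is just the figure mentioned in the log, not a recommendation, and any SET GLOBAL change still has to be mirrored in the config file or it is lost on restart, as happened above:

    -- What the server itself thinks it has allocated
    SHOW GLOBAL STATUS LIKE 'memory_used';

    -- Current buffer pool size; in 10.4 the pool can be resized online,
    -- so it can be lowered without a restart (as when dropping 32G -> 10G above)
    SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';
    SET GLOBAL innodb_buffer_pool_size = 10 * 1024 * 1024 * 1024;
    -- Note: SET GLOBAL does not persist across restarts; update the config file too.

    -- In-memory temp table limits, and how often queries create (or spill) temp tables
    SHOW GLOBAL VARIABLES WHERE Variable_name IN ('tmp_table_size', 'max_heap_table_size');
    SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';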
[13:39:55] I know nothing about it, but in the majority of cases it is an ongoing query/transaction- that should be the first suspect to look at (not only on regular client connections, but also procedures & internal automations)
[13:40:27] given I think it is used by external processes, I would try to debug the processlist first for long-running transactions
[13:41:24] check performance_schema or sys for obvious memory consumption- memory profiling is costly but can be enabled on performance_schema too
[13:42:00] temp tables could be one of the things memory is being used for
[13:42:18] as well as query buffers or other caches
[13:50:58] thanks, will have a look
[13:57:10] not saying it is not mariadb
[13:57:47] but given we have only seen those leaks on analytics and cloud, I would bet first on a long running query pattern
[13:58:49] the thing I'm confused about is: if it's caused by a query or process, it should use more and more memory and then eventually free that memory when the query completes. but instead taavi the available memory keeps on going down.
[13:59:10] SHOW PROCESSLIST has some long queries (I've also enabled the slow query log) but nothing that lasts for hours
[14:01:00] sorry taavi that was an involuntary autocomplete :)
[14:01:27] note memory usually is allocated statically by buffers- that is- it never gets released back to the OS (e.g. by the buffer pool), just made available for other data
[14:02:42] (except per-user memory)
[14:03:33] also, I believe jemalloc was compiled statically already
[14:03:44] for the WMF package
[14:18:29] dhinus: I believe a few days ago I suggested in the task to try to monitor the output of show processlist. I'd still suggest to do so and see if you can catch what was running right before each crash and see if there's a pattern there
[14:20:38] marostegui: yes, I've monitored a few times and I didn't see anything odd or too long. of course it's possible something happens just before the crash and I don't see it... but the new thing I discovered today is that the memory keeps on going down quite constantly
[14:20:47] so it doesn't seem to be a "one off" query that causes the OOM
[14:21:30] it could just be a large transaction
[14:21:51] I could try to talk to MariaDB but we don't have enough data to be able to show anything
[14:22:59] is there something I could collect that would be useful?
[14:23:47] I'm not convinced about the large transaction, because the available memory is going down as we speak, yet nothing long-running is shown at the moment in the processlist
[14:25:19] Given the history of toolsdb I am more on the side of thinking that this is something related to the usage and not MariaDB itself
[14:26:16] I agree on that, also because it started suddenly
[14:27:19] but maybe it could be a combination? I would like at least to map where the memory is being used, and what is using all that memory given it's not innodb_buffer_pool_size
[14:28:30] dhinus: is that host running bullseye?
[14:28:43] yes, on cloudvps
[14:28:56] maybe migrate to 10.6?
[14:29:01] and see how it goes there?
[14:29:22] I think it might be worth trying, yes
[14:29:24] we are migrating to 10.6 in production, we've been running it for a year now
[14:31:00] would you recommend upgrading the replica and promoting that to primary? or an in-place upgrade?
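A sketch of how the processlist/transaction hunting described above could look, using MariaDB's information_schema columns (MEMORY_USED/MAX_MEMORY_USED are per-connection figures, separate from the buffer pool); the LIMITs are arbitrary:

    -- Connections currently holding the most per-user memory
    SELECT id, user, db, command, time, state, memory_used, max_memory_used
      FROM information_schema.processlist
     ORDER BY memory_used DESC
     LIMIT 10;

    -- Open transactions ordered by age; a long-lived transaction can keep memory
    -- and undo around even when nothing looks long-running in SHOW PROCESSLIST
    SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id, trx_rows_modified, trx_query
      FROM information_schema.innodb_trx
     ORDER BY trx_started
     LIMIT 10;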
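On the jemalloc question, one quick read-only check of which allocator the running binary was built/linked against, assuming the variable is available on this build:

    -- Reports e.g. "system" or "jemalloc x.y.z"
    SHOW GLOBAL VARIABLES LIKE 'version_malloc_library';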
[14:32:26] dhinus: I would recommend the replica first to make sure nothing weird happens
[14:33:19] the upgrade is just as easy as: stop MariaDB, remove the package, install the new one, start MariaDB and run mysql_upgrade
[14:33:30] thanks, I will probably give it a go next week
[14:34:39] let me know how it goes
[14:35:06] sure!
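A short sketch of sanity checks on the replica after the upgrade steps above, before considering a promotion; it assumes a standard replication setup and nothing specific to toolsdb (\G is just the mysql client's vertical-output terminator):

    -- Confirm the new server version after mysql_upgrade
    SELECT VERSION();

    -- Replication health: Slave_IO_Running / Slave_SQL_Running should be "Yes",
    -- Seconds_Behind_Master should trend to 0, Last_SQL_Error should be empty
    SHOW SLAVE STATUS\G

    -- And keep watching whether the server-reported memory still diverges from the process RSS
    SHOW GLOBAL STATUS LIKE 'memory_used';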