[00:05:42] that was the attempt before I thought of a different reason for the deadlock, and Amir was pretty sure it would solve it. Amir's patch should be safe cause we maintain an in-process cache for any changed options, so no matter when we save the updates to the DB, we see the new option values. I do agree that was risky [00:06:43] did you do a codesearch for things that query the user_properties table directly? [00:08:12] I'm looking at https://codesearch.wmcloud.org/search/?q=%5Cbuser_properties%5Cb&i=nope&files=&excludeFiles=&repos= , it's not a small list [00:08:26] there's a bunch of jobs, maintainence scripts. The only important one is gender cache, [00:10:54] I think your patch "Defer cleaning up user_properties rows" is premature while your theory is unproven [00:13:12] there are no delete queries in any of the backtraces, right? [00:13:27] no delete queries [00:14:18] do you have a theory of how extra delete queries lead to a deadlock? it shouldn't just depend on transaction time, there needs to be a change in lock order, right? [00:15:02] https://phabricator.wikimedia.org/T286521#7213689 [00:15:25] I've been reading that delete gets a next-key lock, which includes a gap lock [00:15:40] and insert needs an exclusive lock [00:16:20] so more deletes => more gap locks => higher probability of a deadlock, [00:17:10] should be reproducible [00:18:44] like with 2 sql shells doing deletes/inserts? [00:22:02] yeah, I'm trying it [00:24:47] seems like TRX2 already has the lock -- why does it need to wait for TRX1? [00:25:24] maybe if they delete different rows? [00:26:18] note that they're even doing the operation on the different users [00:27:34] the lock that the second transaction holds is X, while it wants X insert intention lock. [00:28:37] from what I understood from reading the docs these are different and not compatible [00:29:14] but I'm not entirely sure, I've learned about mysql locks more today then the entirety of what I knew before [00:31:17] locally it is possible to do "set global innodb_status_output_locks=1" and maybe verify this [00:34:48] I guess most of what you say is verified by the deadlock description you posted [00:34:53] with adjacent user IDs [00:44:43] I can merge it, except for the CI failure, but I'm still not convinced it's going to do anything [00:47:13] Lemme try to poke it a bit more [00:51:20] interesting -- I can reproduce a gap lock if I delete a row that's already been deleted [00:51:40] just deleting a row that actually exists doesn't seem to do it [00:53:37] ah, and I triggered a deadlock that way [00:55:58] got it down to an easy test case, I'll post this on the bug [01:04:32] I think I know a 1-liner fix [01:07:07] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/704649 [01:08:21] good catch, let's try that [01:09:38] my theory was that multiple threads were trying to delete default options simultaneously -- after one succeeds, the next still tries to delete [01:10:01] after I finish messing with hooks and know exactly what's changed, I want to make the DB logic there different - only delete if we actually need to delete, and upsert instead of delete/insert [01:10:24] but the issue you fix there will delete nonexistent rows all the time [01:10:28] know exactly what's change = know exactly which options were modified and not fetch all options and compare [01:16:47] the order of upserts should be consistent [01:17:10] if one thread upserts two rows, and another thread does it in the opposite order, they will deadlock [01:18:51] oh.. didn't know that. [01:19:24] I just tested this with my two shells [01:19:53] that is the classical cause of a deadlock -- locks acquired in a different order [01:21:00] in case of user options, could probably sort them by up_property.. but I think I'm trying to add more consistency where it's not needed or possible [01:22:46] my concern was what if now oldValue is null for us, another thread inserts someting there, and our insert will be ignored. but it's not clear really which one should win in this case [01:28:07] got too excited about the patch that the tests failed. uploaded new version.. [01:29:22] maybe you want replace rather than upsert [01:30:35] it seems like no matter what we need to do some deletes, some pure inserts and some replaces. [01:31:59] with multi-row upsert the current implementation does not really allow different updated values for different rows [01:32:22] Aaron has a patch to fix that, allowing you to build an expression involving the values that would be inserted [01:32:32] but replace() does that naturally [01:36:45] according to https://mariadb.com/kb/en/replace/ there are a few differences between REPLACE and INSERT ON DUPLICATE KEY UPDATE, specifically REPLACE is more like a DELETE followed by an INSERT, so maybe REPLACE will have the same locking issues [01:37:39] but maybe it is a conditional delete -- you can imagine it not doing a gap lock in the way DELETE/INSERT does [01:40:32] the doc has an example of a transaction [01:40:57] and it does a select FOR UPDATE, conditional delete and then insert [01:41:11] so no gap locking