[12:44:41] kostajh: hi, so I had thoughts about the assert-no-error [12:45:08] I am tempted to either do the deep investigation that would explain what causes the deadlock and get that solved somehow [12:46:53] or have api-testing lib/actionapi.js createAccount method to somehow hold a global Lock to prevent concurrent calls to `createaccount` [12:47:32] fun thing, we have similar code in wdio-mediawiki :] [12:49:02] heh, yeah [12:49:23] one thing i was exploring today (after looking through https://mochajs.org/#parallel-tests), is to create a `fixtures.cjs` file that is loaded once, globally at the start of the tests. [12:49:45] cause as I understand nodejs, the POST requests are awaited so they block [12:49:56] we would then create the `const someName` for each user in that file, create all users needed for all tests serially in that file, then make use of the existing users in the tests [12:50:04] but the async code still let several requests to be emitted concurrently [12:50:25] kind of annoying, because you then need a central registry for all the test users, but it would get around the issue [12:50:41] isn't there a mutex system in nodejs? [12:51:19] AIUI, because we are using the `--parallel` flag, there's no way to tell node to not fire a POST request while a request in another test is running [12:51:20] mutex('createaccount').acquire().then( doRequest ) [12:51:26] (I am making things up) [12:51:39] "When Mocha runs in parallel mode, test files do not share the same process, nor do they share the same instance of Moch" [12:51:47] with --parallel it is different process [12:51:49] ah yeah [12:51:50] great [12:52:15] with all those terms we already have lost 99.999999% of the planet ;) [12:53:04] then I am wondering what is causing the deadlock, the original task was indicating some issue with "only" the first 14 bytes of the username to be used to lock the table [12:53:32] but if I remember properly api-testing creates account that do not share any common prefix (AliceXyZ , BobZyX9 etc) [12:53:38] so it must be a different problem [12:54:15] my lunch though would be to have mediawiki/core to detect the deadlock error and issue some magic MySQL command that could give us more details as to why the deadlock happens [12:54:27] after that I have brewed myself a coffee and went to brain dump here [12:56:57] there is also https://mariadb.com/docs/reference/mdb/system-variables/innodb_print_all_deadlocks/ :D [14:04:00] hashar: they do share a common prefix. I tried (not very hard) adding a unique prefix, but that broke the tests, let me look at that again quickly [14:04:51] kostajh: cause from some debug log I was looking at it seemed like the queries were creating Alice / Bob etc usernames [14:04:57] I might have looked at the wrong log though [14:05:39] I also could not find from where the 14bytes limit comes from [14:05:49] maybe it is dynamically determinated based on the number of rows in the db [14:05:57] or comes from some obscure default in mariadb [14:06:39] + from the documentation it looked like the field having an index would disable the gap locking mechanism [14:06:52] but who knows really [14:08:34] https://mariadb.com/kb/en/innodb-lock-modes/ has : Gap locks are disabled if the `innodb_locks_unsafe_for_binlog` system variable is set (until MariaDB 10.4), or the isolation level is set to `READ COMMITTED`. [14:09:34] the quibble images have 10.3 [14:11:55] innodb-locks ON [14:11:55] innodb-locks-unsafe-for-binlog FALSE [14:12:12] that is from the quibble container ( /usr/sbin/mysqld --verbose --help|grep innodb-locks ) [14:13:09] and: transaction-isolation REPEATABLE-READ [14:14:06] I am not sure I want to process the whole descriptions at https://mariadb.com/kb/en/set-transaction/#isolation-levels [14:15:31] in REPEATABLE-READ "nnoDB locks the index range scanned, using gap locks or next-key (gap plus index-record) locks to block insertions by other sessions into the gaps covered by the range" [14:16:18] what is the gap? [14:18:25] as I understand it blocks all records next to the one being affected [14:18:45] so if you have a transaction inserting "Username1234" [14:18:59] and the gap is 8 bytes [14:19:39] if a second transaction tries to insert a record with a field matching the same 8 bytes prefix it is rejected cause of the lock held by the first transaction [14:20:07] hence trying to insert Username1235 is rejected if another is inserting Username1234 [14:20:22] well that is how I understand it based on the deadlock dump anomie pasted on the tas [14:20:22] k [14:21:00] tentatively if we reverse() the username or use a reversed microseconds timestamp as a suffix for the username, we would not be subjet to that gap locking [14:21:28] cause 5321Username would not overlap with 4321Username [14:21:39] well that is my fringe intuition theory [14:21:48] maybe DBAs folk can assist on that front [14:30:50] hashar: I'm not sure, but it looks like putting user save options into a DeferredUpdate in the LocalUserCreated hook will work around the problem [14:37:44] bah [14:38:14] anyway I tried prefixing unique strings to usernames, that doesn't fix the issue [14:38:21] :-\\\\ [14:38:34] I think the problem relates to how we write to user_properties during the same transaction (I think?) as creating the entry in user table [14:39:16] I do see that some extensions are calling $user->saveSettings() in LocalUserCreated hook; that's not needed because AuthManager calls saveSettings right after the hook. And indeed removing the $user->saveSettings in extension hook will adjust the stack trace of the exception [14:39:53] maybe we can get the DBQuery log for the requests or dump the output of `show engine innodb status;` [14:40:05] and putting AuthManager's saveSettings() call in a DeferredUpdate still causes the problem [14:40:42] :-\ [14:41:36] I can get you the DBQuery log [14:41:50] but I think we have that in CI already? [14:42:02] gotta go pick up the kids, that's all for now :) [14:45:20] same, going to get them to walk a bit outside [14:45:32] I might look at having MySQLServer shutdown to do a show engine innodb status dump [14:45:43] much like we have an option to dump the db [14:45:46] & [16:23:27] https://gist.github.com/kostajh/fd7d0bfc73ecc5bdaed4a31875d0c88a [16:29:02] kostajh: well done! [16:29:08] guess you can attach it to the task [16:29:24] n bits 240 on user_properties [16:30:29] which sounds like a different lock than the one reported on the task [16:33:54] so hmm no idea :\ [16:48:23] I think I have a clue [16:49:39] if I don't set any options in LocalUserCreatedHook, the deadlock goes away [16:50:27] so, I think I can experiment based on that observation [17:11:24] hashar: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/753522/1 fixes the problem for me locally. Haven't yet thought through the broader implications of it though, it might not be OK to do in a deferred update. [17:24:04] kostajh: yeah that delegate the deadlock to deferred update [17:24:10] and if those are run serially it will be fine [17:24:16] I have zero idea of the impact though [17:25:42] I could imagine stuff relying on the options to be written back immediately [17:26:27] like submit your Special:Preferences, the page get reloaded but might end up showing out of date pref [17:27:35] but don't quote me on that :-] [18:08:09] hashar: I think I can put it in the presend queue. I need to patch some extensions to not call saveSettings in LocalUserCreated. Then I think it will work [18:24:28] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/753522/3 🤞 [18:32:42] er, this one rather https://gerrit.wikimedia.org/r/c/mediawiki/core/+/753522/4 [18:42:31] https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php72-docker/25503/console that works [20:16:52] I am out, got ci jenkins upgraded :)