[04:28:47] marostegui: i _did_ remember to set an alarm for this morning ;)
[04:30:32] haha nice!
[04:30:37] no
[04:30:41] try not to fall asleep in the next 30 minutes!
[04:31:05] It is great to have all the morning silence, isn't it?
[04:31:08] You are so welcome
[04:34:46] it's a good thing i just got some early-morning kitty snuggles to comfort me :P
[04:35:03] see?????
[04:35:13] mornings are the best
[04:40:31] ಠ_ಠ
[04:40:54] marostegui: did the --only-move-slaves run go smoothly?
[04:41:00] (and are you running with DEBUG=1?)
[04:41:04] yeah it went all fine
[04:41:08] (no!)
[04:41:11] I will for the switch
[04:44:00] ok 🤞
[04:54:27] o/ (sort of)
[05:31:20] Sum up: https://phabricator.wikimedia.org/T288500#7274559
[06:13:57] marostegui: i'm looking at https://jira.mariadb.org/browse/MDEV-21873
[06:14:08] which prompted me to look at the plugins loaded on various codfw primaries
[06:14:53] db2123/db2105/db2129, all 10.4.x, have semisync_master.so loaded
[06:15:01] db2104, our favourite node, has semisync_slave.so loaded
[06:15:40] let me run this check against all primaries
[06:15:56] From the paste I did earlier I can see: Executing 'INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so''
[06:16:01] But not sure if that was on db2104 or db2107
[06:16:16] I would assume db2107
[06:19:16] kormat: What about s7 one? As we'll do that one soon, it would be good to see if it is the same compared to for instance...s3 eqiad master
[06:19:26] Or m2 master
[06:20:39] current s7 primary looks ok (has master.so loaded).
current s7 candidate also looks ok (has slave.so loaded)
[06:21:37] m2 master has no semisync plugin loaded
[06:22:48] marostegui: note: i added a feature last year that doesn't try to load/unload the semisync plugins if we're on 10.3 or later
[06:22:54] i'm wondering if that was a mistake
[06:23:59] the mariadb jira entry i linked above says that the plugins don't actually exist any more
[06:26:14] and indeed, on 10.1 there are plugin .so files on disk. for 10.4 there are not
[06:30:54] kormat: ah, I forgot about your last addition
[06:31:13] yeah, in 10.4 it is integrated as a config option not a plugin anymore (which was very celebrated)
[06:34:02] i'll play around in pontoon with the 'plugin' loaded or not, and see if i can reproduce anything
[06:39:37] good, let me know if I can help
[07:25:10] * Emperor admires all the early morning people
[09:39:13] marostegui: testing to see if toggling semi-sync on/off changes values is easy enough to test at least
[09:39:21] +1
[09:39:27] That'd be scary
[12:10:04] marostegui: no luck reproducing _anything_ so far
[12:10:12] what i'd like to try is the same thing we did yesterday on s2/eqiad,
[12:10:21] but this time with the semi-sync settings copied from db2104
[12:11:34] semi-relatedly, WTB a tool that can look at a db instance and print out all the important settings/statuses. e.g. repl thread status, seconds behind master, gtid in use?, binlog_format, semi-sync master/slave enabled, etc
[12:12:24] Rpl_semi_sync_master_clients too
[12:34:37] kormat: we have the check_mariadb.py (I think that was the name) that outputs a series of values
[12:34:53] I can check in a bit (I'm having lunch)
[12:35:25] and +1 for that test, with current values and with db2104's values
[12:42:42] ah, yes.
check_mariadb.py became db-check-health
[12:43:14] it contains almost none of what i was looking for, but it does have some extra stuff that i'd like included in my dream tool
[12:44:01] yeah, what I mean is that maybe we can adapt that one for the other things
[12:56:41] right. i knew you were wrong, i just wasn't sure on the specifics. thanks for clearing that up!
[13:15:23] :)
[13:16:44] Going to merge m5-master proxy for codfw
[13:16:50] Nothing should be using it, but just a heads up
[13:17:17] 👍
[13:18:57] marostegui: ohh
[13:19:29] well i've solved the mystery of how some servers end up with the wrong semi-sync settings
[13:19:44] hit me!
[13:19:53] profile::mariadb::core sets $semi_sync to either 'master' or 'slave' (or 'standalone', but we don't care)
[13:20:19] production.my.cnf conditionally includes blocks depending on what the setting is
[13:20:35] so a primary _candidate_ will get the 'slave' settings, which only configure rpl_semi_sync_slave_*
[13:20:50] meaning they'll get the upstream mariadb defaults for rpl_semi_sync_master_*
[13:21:10] when we do a switchover, puppet updates /etc/my.cnf with the new master semi-sync settings, but they won't take effect until mariadb is restarted
[13:21:27] But the script itself applies the changes when we do the switch, no?
[13:21:34] nope!
[13:21:38] Oh
[13:21:40] it only touches *_enabled
[13:23:00] Mystery solved, but that means that all candidate masters are consistent, so db2104 isn't special in that sense, is it?
[13:24:28] if we hadn't restarted mariadb, a host which had gone primary -> candidate -> primary would end up with the 'correct' settings
[13:25:12] my best guess right now is that db2104 happens to have the upstream default (very long timeout) settings, and there's something in s2/codfw that's causing it to be less reliable than usual
[13:25:56] so in that case db2107 (old master, which was reimaged) would have the right settings
[13:25:59] as it was restarted, right?
[13:26:15] it'll have the right _slave_ settings
[13:26:21] it won't have any master settings
[13:26:43] yeah, which is right
[13:27:31] I remember we had issues with semi sync and switchovers, but I cannot really recall what it was
[13:27:39] But I am pretty sure we had things or at least one thing
[13:27:50] marostegui: there used to be an issue about things failing when trying to load the plugins on >10.3.3
[13:27:53] * marostegui goes to phabricator search
[13:28:02] kormat: maybe it is that yeah
[13:28:08] (there could be other issues too, that's just the one i know of)
[13:28:47] But I remember j4ime and myself talking/investigating it a long time ago
[13:29:04] But maybe it is what you mention
[13:29:58] I am still trying to go thru some phab searches
[13:30:03] https://gerrit.wikimedia.org/r/c/operations/puppet/+/711489 is a quick fix for the puppet thing. i'm not going to push it for a while though
[13:30:17] because we still don't have an explanation for why things are slow/not working
[13:30:43] marostegui: something else i'd like to try is setting rpl_semi_sync_master_trace_level
[13:31:54] huh. it's already set to non-zero
[13:32:24] oh, ew. requires restarting mariadb to create a trace file
[13:33:52] I cannot find anything :(
[13:33:57] It must be what you mentioned
[13:35:10] don't beat yourself up about me being right. it happens a lot
[13:36:47] * marostegui prefers not to comment
[13:37:01] :D
[13:55:15] marostegui: https://phabricator.wikimedia.org/T288500#7275512
[13:55:42] ok, i'm _not_ running this today. my brain is drained from the early morning
[13:55:53] Why waiting 30 minutes?
[13:56:03] Like, nothing would happen in those 30 minutes no?
[13:56:11] marostegui: because that's been the case in the 2 switchovers
[13:56:20] i've no idea how that could be relevant, but..
[13:56:30] Yeah, it doesn't hurt
[13:56:37] unlike 5am UTC
[13:56:40] which hurts a Lot
[13:56:50] Maybe you can do that test at 05:00 UTC
[13:56:53] To simulate the time too
[13:56:57] Just in case you know
[13:57:12] ಠ_ಠ
[13:57:13] ಠ_ಠ
[13:57:14] ಠ_ಠ
[13:57:19] It might be relevant!
[13:57:23] I think you need to try
[13:57:31] i refuse to live in such a reality
[13:57:46] You definitely need to replicate the exact conditions
[13:58:18] marostegui: as this is a public, logged channel, i'm doing my very best to not say the thoughts i'm thinking about you right now
[13:58:24] but be assured, i _am_ thinking them
[13:58:48] That I am right and you need to set your alarm at 4:30AM?
[13:58:58] Listen, the hour might be the key here
[13:59:03] Best to discard it
[13:59:34] ¡te odio!
[14:00:03] hahahaha
[14:00:09] Even with the "¡" very well
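
The finding from 13:19 to 13:21 was that a promoted candidate keeps running with whatever rpl_semi_sync_master_* values it started with, because the switchover only touches *_enabled and the my.cnf change needs a restart to take effect. A small check along the lines below could catch that. This is a minimal sketch and not part of the switchover script or db-check-health. The function name semi_sync_drift and the EXPECTED table are hypothetical, and the expected values are placeholders, since the real ones live in production.my.cnf and are not shown in this log. The running values are assumed to come from something like SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync%' and are passed in as a plain dict, so the sketch needs no database connection.

```python
from typing import Dict, List

# Placeholder expectations per role. These numbers are illustrative only;
# the real values are defined in production.my.cnf (not shown in the log).
EXPECTED: Dict[str, Dict[str, str]] = {
    "master": {
        "rpl_semi_sync_master_enabled": "ON",
        "rpl_semi_sync_master_timeout": "100",
    },
    "slave": {
        "rpl_semi_sync_slave_enabled": "ON",
    },
}


def semi_sync_drift(role: str, running: Dict[str, str]) -> List[str]:
    """Return one message per variable whose running value differs from
    what is expected for the given role ("master" or "slave").

    `running` maps variable names to their current values, e.g. the rows
    of SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync%' turned into a dict.
    """
    problems: List[str] = []
    for name, want in EXPECTED[role].items():
        have = running.get(name)
        if have is None:
            # Variable missing from the input entirely.
            problems.append(f"{name}: not reported, expected={want}")
        elif str(have).strip().upper() != want.upper():
            problems.append(f"{name}: running={have}, expected={want}")
    return problems


if __name__ == "__main__":
    # A host that was just promoted: *_enabled was flipped by the switchover,
    # but the timeout is still whatever the server started with.
    promoted = {
        "rpl_semi_sync_master_enabled": "ON",
        "rpl_semi_sync_master_timeout": "10000",
    }
    for line in semi_sync_drift("master", promoted):
        print(line)
```

Run against every primary after a switchover, a check like this would flag hosts that are waiting on a restart to pick up their master settings. The same dict-in, messages-out shape could be extended with the other items from the 12:11 wishlist (binlog_format, Rpl_semi_sync_master_clients and so on) without changing the comparison logic.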