[07:49:53] Morning o/
[07:49:59] o/
[07:50:05] * kormat hisses
[07:51:11] brains
[07:51:28] Wait, no, I mean "Morning! How is everyone this fine day?"
[07:51:59] it's Monday, how could it be fine?
[07:52:45] ☝️
[08:42:56] marostegui: i'm wondering if switching from semi-sync-slave to semi-sync-master on a running mysqld is actually supported
[08:43:26] you mean turning one off and the other one on?
[08:43:29] yeah
[08:43:31] the docs say:
[08:43:33] > If a server needs to be able to switch between acting as a primary and a replica, then you can enable both the primary and replica system variables on the server.
[08:43:53] because in the middle of the switch, you're in fully-sync replication for a bit
[08:44:01] You can try on the testing hosts, db1124 and db1125 (I would expect it to be fully supported)
[08:44:07] Yeah, but that would be just a few ms
[08:44:08] no?
[08:44:08] if any slave lags at that point, you're not in a good place
[08:44:55] oh, i'm wrong. the default repl is async, right?
[08:45:11] marostegui: it doesn't matter how short the window is if a bad thing can cause a hang during it
[08:46:12] * Emperor is having flashbacks to Slony-I clusters
[08:46:34] kormat: yeah, and if the master hangs the slaves will most likely show lag
[08:47:13] i've found things like this, that maybe didn't get ported to mariadb: https://bugs.mysql.com/bug.php?id=89370
[08:53:41] Not sure what the master timeout is, I think it is 1 second in our env, but maybe we can look at decreasing it during the switch
[08:58:03] mariadb doesn't have the patch from that bug applied: https://github.com/MariaDB/server/blob/10.7/sql/semisync_master_ack_receiver.cc#L284-L287
[08:58:10] (now, maybe mariadb fixes the issue in a different way, that i can't say)
[08:59:11] yeah, mariadb has its own implementation of semi-sync
[08:59:23] which is supposed to be fully compatible with mysql
[09:01:10] the code in this file looks _very_ similar between mariadb 10.7 and mysql 5.7
[09:01:21] (mysql 8.0 has changed things quite a lot)
[09:02:18] anyway, this is just speculation
[09:36:20] marostegui: one thing i could do - skip setting semi-sync master to on if it's already on
[09:36:35] the --only-slave-move run already enables semi-sync master for the candidate primary
[09:36:53] we can try that too
[09:36:54] this isn't a fix. but it might be a workaround.
[09:37:08] because the only hangs we've seen are during the --skip-slave-move run
[09:37:31] yeah, and setting that specific flag
[09:37:35] yeah
[10:34:24] jynus: thanks for your input on db-switchover's history, that was really useful. 💜
[10:34:48] thanks to you for fixing bugs there!
[10:35:27] that was supposed to be a script to document stuff in bad code, not really a proper solution
[12:05:49] uff. just noticed that we set a custom prompt depending on whether an instance is a primary or not, but this doesn't get changed during primary switchover. e.g. db2090, the current s4 primary, is proudly announcing that it is a replica
[12:06:26] ohh. it's set in /root/.my.cnf, which is no longer managed?
[12:08:00] yeah, that's it
[12:14:56] seems relevant: https://phabricator.wikimedia.org/rOPMD1b9f13b11f4f69173a7d73adf5aef165567db6ce
[12:15:53] * kormat nods
[12:16:14] as usual, a temp thing becomes permanent :-)
[12:17:13] 2016...
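A minimal sketch of the workaround floated at 09:36 (only enable semi-sync on the master side if it is not already on, so the --skip-slave-move run does not re-apply what --only-slave-move already did), plus the timeout reduction suggested at 08:53. This is not the actual db-switchover code; pymysql, the helper name, the hostname and the 1000 ms value are illustrative assumptions.

    # Hedged sketch, not wmfmariadbpy/db-switchover code.
    import pymysql

    def ensure_semisync_master(conn):
        """Enable rpl_semi_sync_master_enabled only if it is currently OFF."""
        with conn.cursor() as cur:
            cur.execute("SELECT @@GLOBAL.rpl_semi_sync_master_enabled")
            (enabled,) = cur.fetchone()
            if enabled:
                return  # already on: leave it alone (the proposed workaround)
            # Shrink the ack timeout (milliseconds) first, so a lagging replica
            # cannot stall commits for long during the window where both the
            # master and slave sides of semi-sync are enabled. 1000 ms is an
            # assumed value, not a recommendation.
            cur.execute("SET GLOBAL rpl_semi_sync_master_timeout = 1000")
            cur.execute("SET GLOBAL rpl_semi_sync_master_enabled = ON")

    # e.g. try it on one of the testing hosts mentioned above (hostname is a placeholder)
    conn = pymysql.connect(host="db1124.example.org", read_default_file="/root/.my.cnf")
    ensure_semisync_master(conn)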
[12:17:57] jynus: for you: https://gerrit.wikimedia.org/r/c/operations/puppet/+/713257 :)
[12:23:23] so the blocker at https://gerrit.wikimedia.org/r/c/operations/puppet/+/321888 was heartbeat being restarted
[12:23:59] I am going to guess that happened at some point?
[12:24:38] all heartbeats have been restarted in the last ~4 months or so,
[12:24:42] as we changed how heartbeat runs
[12:25:46] check the other patch and take it over to see if something else is pending, or abandon it, please
[12:26:09] my guess is it was pending on a reboot back in 2017 and then it got forgotten
[12:26:54] how does one take over a patch on gerrit again? i'm not seeing anything in the ui
[12:27:27] doesn't have to be anything formal tbh
[12:28:11] I cannot remove myself as owner :-(
[12:29:51] in any case, "ensure => absent" was missing
[12:30:49] should I abandon it, or could you do it? I don't think a 4-year-old patch will be too useful
[12:31:48] i'll abandon it, but also glance over it to see what might need to still be (re-)done
[12:31:54] there's no way it'll apply these days anyway
[12:32:36] yeah
[12:36:03] that happened when manuel had just been onboarded, so I barely had time to handle all the fires at the time
[12:36:30] or I had time for the fires, but not the followups
[12:37:36] i think putting this change at the bottom of the priority list makes absolute sense in that sort of circumstances :)
[12:38:19] sorry, at that time I did what I could, sorry if that causes issues later on
[12:50:18] jynus: i was looking a bit into the issue you mentioned in the meeting - i _suspect_ this is a reference to it: https://phabricator.wikimedia.org/T161007#3127963
[12:51:46] I don't fully understand my own comment, it could be related (would need more context) but the one I actually meant was an x1 outage in the incident docs
[12:51:51] let me see if I can find it
[12:53:20] I think I understand, there I am suggesting not to enable semisync cross-dc
[12:53:29] yeah, that's my reading too
[12:53:44] but I remember issues with semisync within a dc on one of the non-standard configs
[12:54:00] like Primary -> candidate primary -> replica
[12:54:42] and also during some dc switchover, there were increased errors due to some semisync config
[12:54:48] on the new version
[12:55:17] there are a lot of past issues with semisync weirdness, let me see if I can find some
[12:55:33] thanks
[12:55:48] looking at the git history for wmfmariadbpy, it doesn't look like the semisync logic has changed there since it was introduced
[12:55:48] i know it is documented, I just have to find it among the many issues
[12:55:52] hah, ack
[12:55:53] jynus: I remember we had something with it during switchovers, but I couldn't find any phabricator task about it last week
[12:56:07] yeah, trying to find some
[12:56:11] I only found https://phabricator.wikimedia.org/T161007
[12:56:38] And what stevie mentioned last week about removing the plugin if it was a 10.4 host (as the plugin is no longer a plugin)
[12:56:45] kormat, I think if it is anywhere, it would be in the wmfmariadbpy repo
[12:57:08] so changing the switchover behaviour in response to those issues
[12:57:23] jynus: right - but my point is i can't find such a change in behaviour
[12:57:25] but let me try to find relevant tickets
[12:57:48] then it is possible the issues predate the script
[12:58:00] so before the script we had a checklist
[12:58:08] db-switchover was first added on 2018-06-12
[12:58:15] the task above is from 2017
[12:58:16] yeah, so they likely predate that
[13:00:36] this is an interesting semisync analysis, from when volans used to be a dba (!): https://phabricator.wikimedia.org/T131753
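Side note on the 12:56 remark about the plugin no longer being a plugin: on recent MariaDB (10.3 and later, so certainly a 10.4 host) semi-sync is built into the server, and there is nothing to install or uninstall. Below is a hedged sketch of what "only handle the plugin on older hosts" could look like; the function names and the uninstall handling are assumptions, not wmfmariadbpy code.

    # Hedged sketch, not wmfmariadbpy code. Assumes pymysql.
    import pymysql

    def semisync_is_builtin(conn):
        """True if this server ships semi-sync built in (MariaDB >= 10.3)."""
        with conn.cursor() as cur:
            cur.execute("SELECT VERSION()")
            (version,) = cur.fetchone()          # e.g. '10.4.21-MariaDB-log'
        major, minor = (int(x) for x in version.split(".")[:2])
        return (major, minor) >= (10, 3)

    def remove_semisync_plugins(conn):
        """Uninstall the semi-sync plugins, but only where they still are plugins."""
        if semisync_is_builtin(conn):
            return  # 10.3+/10.4: nothing to remove
        with conn.cursor() as cur:
            cur.execute(
                "SELECT PLUGIN_NAME FROM information_schema.PLUGINS "
                "WHERE PLUGIN_NAME LIKE 'rpl_semi_sync%'"
            )
            for (name,) in cur.fetchall():
                cur.execute(f"UNINSTALL PLUGIN {name}")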
[13:02:03] oh, I think I remember now
[13:02:10] and it was exactly what you said, kormat
[13:02:21] but it won't be useful for your issue
[13:02:39] the problem comes when the other dc replica happens to be the last to be switched
[13:02:57] during replica move
[13:03:30] probably not relevant to your current issue (?)
[13:04:02] there was some edge case where the way replicas were moved could cause too large of a timeout
[13:05:19] but if the original primary had semisync disabled, I don't see how that would be relevant in your case
[13:05:42] let me find the other x1 issue with replication
[13:09:22] I think it is this: https://wikitech.wikimedia.org/wiki/Incident_documentation/2017-05-03_x1_outage , but the semisync was just a theory that was not backed by any evidence, so it was removed from the final report
[13:10:04] which makes sense, because disabling the code fixed the issue
[13:10:31] leaving only "Investigate the cause of the high write latency on db masters during / after the switch over. Is this something we need to expect during switches"
[13:12:15] ah, i see
[13:14:01] and I see where x1 becomes relevant
[13:14:09] x1 used to have only 1 replica
[13:14:30] so if the replica went down, it went "sync" to the other dc
[13:15:12] that is not the case anymore
[13:15:12] ahh 💡
[13:15:22] relevant for semi-sync
[13:15:34] but I am afraid it would not match your issue
[13:15:56] unless you can see a case where suddenly only 1 host in the other dc would be an available replica
[13:16:05] so sadly probably not super-relevant :-(
[13:17:12] ok. that's useful to know, in any case, even if it's just to rule it out for this specific case
[13:17:44] I just remembered semi-sync and timeout, but it happened years ago
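The x1 case above is essentially a topology hazard: with a single remaining semi-sync ack source, and that source in the other DC, every commit on the primary either waits a WAN round-trip or stalls for up to rpl_semi_sync_master_timeout before falling back to async. A switchover pre-check could refuse to proceed in that situation; the sketch below is purely illustrative (the hostname and the minimum of two ack sources are assumptions, and no such check exists in the tooling discussed here).

    # Hedged sketch of a pre-check; not part of db-switchover or wmfmariadbpy.
    import pymysql

    def semisync_master_state(conn):
        """Return (number of connected semi-sync replicas, semi-sync master status)."""
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_%'")
            status = dict(cur.fetchall())
        return (int(status.get("Rpl_semi_sync_master_clients", 0)),
                status.get("Rpl_semi_sync_master_status", "OFF"))

    conn = pymysql.connect(host="db2090.example.org", read_default_file="/root/.my.cnf")
    clients, state = semisync_master_state(conn)
    if state == "ON" and clients < 2:
        raise RuntimeError(
            f"only {clients} semi-sync ack source(s) attached; if that is a "
            "cross-DC replica, commits become latency-bound on the WAN"
        )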