[07:49:53] <Amir1>	 Morning o/
[07:49:59] <marostegui>	 o/
[07:50:05] * kormat hisses
[07:51:11] <Emperor>	 brains
[07:51:28] <Emperor>	 Wait, no, I mean "Morning! How is everyone this fine day?"
[07:51:59] <Amir1>	 it's Monday how could it be fine?
[07:52:45] <kormat>	 ☝️
[08:42:56] <kormat>	 marostegui: i'm wondering if switching from semi-sync-slave to semi-sync-master on a running mysqld is actually supported
[08:43:26] <marostegui>	 you mean turning one off and the other one on?
[08:43:29] <kormat>	 yeah
[08:43:31] <kormat>	 the docs say:
[08:43:33] <kormat>	 > If a server needs to be able to switch between acting as a primary and a replica, then you can enable both the primary and replica system variables on the server.
[08:43:53] <kormat>	 because in the middle of the switch, you're in fully-sync replication for a bit
[08:44:01] <marostegui>	 You can try on the testing hosts, db1124 and db1125 (I would expect it to be fully supported)
[08:44:07] <marostegui>	 Yeah, but that would be just a few ms
[08:44:08] <marostegui>	 no?
[08:44:08] <kormat>	 if any slave lags at that point, you're not in a good place
[08:44:55] <kormat>	 oh, i'm wrong. the default repl is asynch, right?
[08:45:11] <kormat>	 marostegui: it doesn't matter how short the window is if a bad thing can cause a hang during it
[08:46:12] * Emperor is having flashbacks to Slony-I clusters
[08:46:34] <marostegui>	 kormat: yeah, and if the master hangs the slaves will most likely show lag
[08:47:13] <kormat>	 i've found things like this, that maybe didn't get ported to mariadb: https://bugs.mysql.com/bug.php?id=89370
[08:53:41] <marostegui>	 Not sure what's the master timeout, I think it is 1 second in our env, but maybe we can look at decreasing it during the switch
[08:58:03] <kormat>	 mariadb doesn't have the patch from that bug applied: https://github.com/MariaDB/server/blob/10.7/sql/semisync_master_ack_receiver.cc#L284-L287
[08:58:10] <kormat>	 (now, maybe mariadb fixes the issue in a different way, that i can't say)
[08:59:11] <marostegui>	 yeah, mariadb has its own implementation of semi sync 
[08:59:23] <marostegui>	 which is supposed to be fully compatible with mysql
[09:01:10] <kormat>	 the code in this file looks _very_ similar between mariadb 10.7 and mysql 5.7
[09:01:21] <kormat>	 (mysql 8.0 has changed things quite a lot)
[09:02:18] <kormat>	 anyway, this is just speculation
[09:36:20] <kormat>	 marostegui: one thing i could do - skip setting semi-sync master to on if it's already on
[09:36:35] <kormat>	 the --only-slave-move run already enables semi-sync master for the candidate primary
[09:36:53] <marostegui>	 we can try that too
[09:36:54] <kormat>	 this isn't a fix. but it might be a workaround.
[09:37:08] <kormat>	 because the only hangs we've seen are during the --skip-slave-move run
[09:37:31] <marostegui>	 yeah, and setting that specific flag
[09:37:35] <kormat>	 yeah
[10:34:24] <kormat>	 jynus: thanks for your input on db-switchover's  history, that was really useful. 💜 
[10:34:48] <jynus>	 thanks to you for fixing bugs there!
[10:35:27] <jynus>	 that was supposed to be a script to document stuff in bad code, not really a proper solution
[12:05:49] <kormat>	 uff. just noticed that we set a custom prompt depending on whether an instance is a primary or not, but this doesn't get changed during primary switchover. e.g. db2090, the current s4 primary, is proudly announcing that it is a replica
[12:06:26] <kormat>	 ohh. it's set in /root/.my.cnf, which is no longer managed?
[12:08:00] <kormat>	 yeah, that's it
[12:14:56] <jynus>	 seems relevant: https://phabricator.wikimedia.org/rOPMD1b9f13b11f4f69173a7d73adf5aef165567db6ce
[12:15:53] * kormat nods
[12:16:14] <jynus>	 as usual, a temp thing becomes permanent :-)
[12:17:13] <Emperor>	 2016...
[12:17:57] <kormat>	 jynus: for you: https://gerrit.wikimedia.org/r/c/operations/puppet/+/713257 :)
[12:23:23] <jynus>	 so the blocker at https://gerrit.wikimedia.org/r/c/operations/puppet/+/321888 was heartbeat being restarted
[12:23:59] <jynus>	 I am going to guess that happened at some point?
[12:24:38] <kormat>	 all heartbeats have been restarted in the last ~4 months or so,
[12:24:42] <kormat>	 as we changed how heartbeat runs
[12:25:46] <jynus>	 check the other patch and take over it to see if something else is pending/abandon it please
[12:26:09] <jynus>	 my guess it was pending on a reboot back in 2017 and then it got forgotten
[12:26:54] <kormat>	 how does one take over a patch on gerrit again? i'm not seeing anything in the ui
[12:27:27] <jynus>	 doesn't have to be anything formal tbh
[12:28:11] <jynus>	 I cannot remove myself as owner :-(
[12:29:51] <jynus>	 in any case, "ensure => absent" was missing
[12:30:49] <jynus>	 should I abandon it, or could you do it?- I don't think a 4 yo patch will be too useful
[12:31:48] <kormat>	 i'll abandon it, but also glance over it to see what might need to still be (re-)done
[12:31:54] <kormat>	 there's no way it'll apply these days anyway
[12:32:36] <jynus>	 yeah
[12:36:03] <jynus>	 that happened when manuel had just been onboarded, so I barely had time to handle all fires at the time
[12:36:30] <jynus>	 or I had time for the fires, but not the followups
[12:37:36] <kormat>	 i think putting this change at the bottom of the priority list makes absolute sense in that sort of circumstances :)
[12:38:19] <jynus>	 sorry, at that time I did what I could, sorry some cases issues later on
[12:39:06] <jynus>	 *causes
[12:50:18] <kormat>	 jynus: i was looking a bit into the issue you mentioned in the meeting - i _suspect_ this is a reference to it: https://phabricator.wikimedia.org/T161007#3127963
[12:51:46] <jynus>	 I don't understand well my own comment, it could be related (would need more context) but the one I actually meant was an x1 outage on icindent docs
[12:51:51] <jynus>	 let me see if I can find it
[12:53:20] <jynus>	 I think I understand, there I am suggesting not to enable semisync cross dc
[12:53:29] <kormat>	 yeah, that's my reading too
[12:53:44] <jynus>	 but I remember issues wiht semysinc within dc on one of the non-standard configs
[12:54:00] <jynus>	 like Primary -> candidate primary -> replica
[12:54:42] <jynus>	 and also during some dc switchover, there was also increased errors due to some semisync config
[12:54:48] <jynus>	 on the new version
[12:55:17] <jynus>	 it is a lot of past issues with semisync weirdness, let me see if I can find some
[12:55:33] <kormat>	 thanks
[12:55:48] <kormat>	 looking at the git history for wmfmariadbpy, it doesn't look like the semisync logic has changed there since it was introduced
[12:55:48] <jynus>	 i know it is documented, just I have to find it among the many issues
[12:55:52] <kormat>	 hah, ack
[12:55:53] <marostegui>	 jynus: I remember we had something with it during switchovers, but I couldn't find any phabricator task about it past week
[12:56:07] <jynus>	 yeah, trying to find some
[12:56:11] <marostegui>	 I only found  https://phabricator.wikimedia.org/T161007
[12:56:38] <marostegui>	 And what stevie mentioned last week about removing the plugin if it was a 10.4 host (as the plugin is no longer a plugin)
[12:56:45] <jynus>	 kormat, I think it it is somewhere, it would be on wmfmariadbpy repo
[12:57:08] <jynus>	 so changing the switchover behaviour as a response of those issues
[12:57:23] <kormat>	 jynus: right - but my point is i can't find such a change in behaviour
[12:57:25] <jynus>	 but let me try to find relevant tickets
[12:57:48] <jynus>	 then it is possible the issues predate the script
[12:58:00] <jynus>	 so before the script we had a checklist
[12:58:08] <kormat>	 db-switchover was first added in 2018-06-12
[12:58:15] <kormat>	 the task above is from 2017
[12:58:16] <jynus>	 yeah, so likely predate that
[13:00:36] <jynus>	 this is interesting semysinc analysis, when volans used to be a dba (!): https://phabricator.wikimedia.org/T131753
[13:02:03] <jynus>	 oh, I think I remember now
[13:02:10] <jynus>	 and it was exactly what you said, kormat 
[13:02:21] <jynus>	 but it won't be useful to your issues
[13:02:39] <jynus>	 the problems comes when the other dc replica happens to be the last to be switched
[13:02:57] <jynus>	 during replica move
[13:03:30] <jynus>	 probably not relevant to your current issue (?)
[13:04:02] <jynus>	 there was some edge case where the way replicas were moved could cause too large of a timeout
[13:05:19] <jynus>	 but if the original primary had semisync disabled, I wouldn't see how that would be relevant in your case
[13:05:42] <jynus>	 let me find the other x1 issue with replication
[13:09:22] <jynus>	 I think it is this: https://wikitech.wikimedia.org/wiki/Incident_documentation/2017-05-03_x1_outage , but the semisync was just a theory that was not backed by any evidence, so it was removed from the final report
[13:10:04] <jynus>	 which makes sense, because disabling the code fixed the issue
[13:10:31] <jynus>	 leaving only "Investigate the cause of the high write latency on db masters during / after the switch over. Is this something we need to expect during switches"
[13:12:15] <kormat>	 ah, i see
[13:14:01] <jynus>	 and I see where x1 comes to relevancy
[13:14:09] <jynus>	 x1 used to have only 1 replicas
[13:14:30] <jynus>	 so if the replica went down, it went "sync" to the other dc
[13:15:12] <jynus>	 it is not the case anymore
[13:15:12] <kormat>	 ahh 💡
[13:15:22] <jynus>	 relevant for semi-sync
[13:15:34] <jynus>	 but I am afraid would not match your issue
[13:15:56] <jynus>	 unless you can see a case where suddenly only 1 host on other dc would be an available replica
[13:16:05] <jynus>	 so sadly probably not super-relevant :-(
[13:17:12] <kormat>	 ok. that's useful to know, in any case, even if it's just to rule it out for this specific case
[13:17:44] <jynus>	 I just remembered semi-sync and timeout, but it happened years ago