[04:55:19] o/
[04:55:26] o/
[04:55:50] you are here cause this affects plwiki right?!
[04:55:54] :)
[04:56:02] Exclusively :)
[05:08:56] if we don't want to attempt the switchover tomorrow again, I would need to undo the topology
[05:08:57] thoughts?
[05:13:20] I am going to undo the topology
[05:13:21] It mostly depends on whether we can figure out what's wrong with the script today, right?
[05:13:27] yeah
[05:13:46] but I don't want to have another spof with that host, so I am going to undo it. It is a bit of a pain but I'd rather be safe
[05:13:58] Sure
[05:18:03] we've not used the script from codfw in years
[05:18:39] I've communicated to the community that we used read only time but aborted it, and we'll communicate a new date once we know
[05:18:49] Thanks
[05:20:19] Once I have finished with the topology rollback, I will enable puppet on the hosts
[05:27:46] Are you moving eqiad as well or leaving it under the candidate?
[05:28:00] no, I will move eqiad too
[05:28:16] Ok
[05:29:28] GTID also needs to be re-enabled
[05:31:05] I will also probably go ahead and do some data consistency check as well
[05:31:10] once I have enabled gtid
[06:21:10] Started the data check
[07:03:32] marostegui: what script failed? db-switchover?
[07:03:39] yeah
[07:03:48] hurmmm
[07:03:57] it didn't fail with an error, just got stuck forever after * The replication lag is acceptable: 0 (lower than the configured or default timeout)
[07:04:02] (there was no lag of course)
[07:04:26] I couldn't find what it was stuck on, my guess is it failed while running some of the mysql commands, but I didn't find any running at the time on the hosts
[07:04:28] when you have a chance, can you open a task with whatever details you have?
[07:04:36] I don't have more than that :(
[07:04:41] Apart from the ctrl+c failure
[07:04:50] I'll take anything
[07:04:57] ok i will open a task
[07:07:19] <3
[07:10:52] kormat: https://phabricator.wikimedia.org/T288500
[07:46:23] at least the BT tells you what it was trying to do at the time
[07:46:55] marostegui: did you get the same backtrace from both attempts?
[07:47:26] marostegui: also, we do use db-switchover for doing primary swaps in codfw, just not usually when it's the active dc, no?
[07:51:27] i guess your point was specifically about using db-switchover in codfw when it is the active dc
[07:51:53] kormat: I got the same backtrace from cumin2001 and cumin1001 yep
[07:52:11] kormat: Yeah, my point was about codfw when active
[07:52:31] marostegui: would you mind putting the cumin1001 backtrace into a phab paste too anyway?
[07:52:32] kormat: the db-switchover script worked perfectly last week with the m2 failover (but that is eqiad of course)
[07:52:37] gotcha
[07:52:42] sure
[07:53:13] if i'd been around (*cough*), i'd have been able to mention that there's a hidden way to enable debug logging for db-switchover :(
[07:53:39] `DEBUG=1 db-switchover ...`
[07:53:44] I did some strace too, let me see if i can find that
[07:56:23] marostegui: do you know what time you did the --only-slave-move migration?
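The hang described above happened right after db-switchover printed its replication-lag check. The script's actual code is not shown in this log, but a lag gate of that general shape, written against pymysql, might look like the sketch below; the function name, credentials file, and timeout values are illustrative only, not taken from the real tool.

```python
# Illustrative sketch only -- not the real db-switchover code. It shows the kind
# of replication-lag gate that would print the message the script hung after.
import os
import time
import pymysql

def wait_for_acceptable_lag(host, timeout=5.0, poll_interval=0.5):
    """Poll Seconds_Behind_Master until it is lower than `timeout` seconds."""
    conn = pymysql.connect(host=host,
                           read_default_file=os.path.expanduser("~/.my.cnf"),
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        deadline = time.time() + timeout
        while time.time() < deadline:
            with conn.cursor() as cur:
                cur.execute("SHOW SLAVE STATUS")
                status = cur.fetchone()
            lag = status["Seconds_Behind_Master"] if status else None
            if lag is not None and lag < timeout:
                print(f"* The replication lag is acceptable: {lag} "
                      "(lower than the configured or default timeout)")
                return True
            time.sleep(poll_interval)
        return False
    finally:
        conn.close()
```

Whatever runs immediately after a check like this is where the process stalled, which is why the conversation below narrows in on the statements the script issues next.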
[07:56:35] I think it was around 4:30AM UTC
[07:56:41] ok
[07:57:08] i see a bunch of mariadb journald messages around that time, but strangely nothing after that until 05:14, which is way after the db-switchover attempts were aborted
[07:57:15] (on db2104, that is)
[07:57:23] I am going to paste my strace
[07:59:13] kormat: wc -l /home/kormat/strace
[07:59:14] 7566 /home/kormat/strace
[07:59:15] XD
[07:59:46] that is cumin1001
[08:00:46] Emperor: my plan to walk you through reimaging a host in s7/codfw today is now postponed, just fyi :)
[08:00:48] kormat: from db2104 logs, looks like I did start around 4:25 AM UTC
[08:00:59] the replicas move that is
[08:01:03] ack
[08:11:23] kormat: FE :)
[08:11:55] I'm talking to Filippo in a bit about Swift in any case
[08:16:08] I can also show you bacula if for some reason filippo is not available
[08:16:19] Emperor: expn:FE?
[08:17:30] "Fair enough"?
[08:17:55] ah! :)
[08:18:05] * sobanski was just guessing
[08:18:16] that does seem more probably than Fuck Everything
[08:18:19] *probable
[08:25:43] kormat: after your last task comment, am I good to proceed on thursday with m3 (it is eqiad only too) or do you think I should wait?
[08:25:43] Fair Enough, not the other thing /o\
[08:26:33] marostegui: what we can do is run a db-switchover temporarily for a core section in eqiad today
[08:26:37] and see if that works or not
[08:27:07] we can use the test hosts for that?
[08:27:12] (if there are things I can look at that might be helpful, do shout, but I suspect I lack too much context to be any use)
[08:27:28] marostegui: we could, but even if it works it doesn't give me very high confidence
[08:27:33] something fucky is going on here
[08:27:43] Emperor: ty :)
[08:28:06] kormat: We can also try m3 with the debug flag on thursday too
[08:28:23] [noc@ has a mail from someone complaining they're being defamed; I can't see what they're unhappy about, but should I just forward to legal anyway? ]
[08:28:31] marostegui: if you don't mind it potentially failing and needing manual recovery, that seems fine
[08:28:57] kormat: That's the good thing about m3, that if I don't reload the proxies nothing will happen
[08:29:20] Emperor: yeah, that's the usual procedure, but it is normally done by the clinic duty person: https://wikitech.wikimedia.org/wiki/SRE_Clinic_Duty
[08:29:47] marostegui: Ah. Thanks.
[08:29:56] marostegui: ok cool
[08:30:20] kormat: It makes me less scared to run it against m3 without reloading the proxies than a core section in eqiad, with a lot more replicas
[08:30:32] 👍
[08:30:43] i'll go test to see if it still works in pontoon, at least ;)
[08:31:12] I am 99% sure it is a codfw related only thing
[08:34:25] somehow this isn't very reassuring. :)
[08:34:31] * Emperor tries to resist the pun
[08:34:57] Emperor: oh please do share :)
[08:35:12] kormat: maybe we can try to manually replicate all the mysql commands the script tries until the point where it got stuck and see what we get
[08:35:24] 😬
[08:36:33] kormat: well, marostegui has obviously noticed something fishy about the codfw setup
[08:36:39] :D
[08:38:36] marostegui: oh, huh. just realised this issue happens before it gets to the point of prompting you if you're sure you want to do this
[08:39:19] kormat: yeah it does
[08:39:34] kormat: that is why I was suggesting to replicate all the steps till there
[08:39:55] yeah ok that sounds less insane now :)
[08:47:16] marostegui: it looks like literally all it does (in terms of making changes) is `SET GLOBAL rpl_semi_sync_slave_enabled = 0` followed by `SET GLOBAL rpl_semi_sync_master_enabled = 1` on the to-be-new primary
[08:50:13] which works fine
[08:50:30] from mysql.py, i assume
[08:50:34] yeah
[08:50:57] this might be pymysql specific
[08:53:21] writing a minimal test python script
[09:01:39] aaand it works no problem
[09:02:57] of course :(
[09:11:09] marostegui: how would you feel about me running db-switchover again against prod (with DEBUG=1 this time)? as it doesn't get to the prompt, the only changes it makes are the semi sync ones above
[09:12:08] kormat: if we are fully sure it won't do anything (like just jumping over the yes/no question), I think it is harmless
[09:15:48] if we provide --skip-slave-move, it won't do anything else until it gets to the prompt
[09:16:07] (if you _don't_ provide --skip-slave-move, it'll move all the replicas before it prompts you :/)
[09:16:58] ok, here goes. i'll add a !log
[09:18:40] marostegui: aaand it worked perfectly :/
[09:18:50] (i've reverted the semi sync change now)
[09:19:19] of course.... :(
[09:20:30] one difference is that you ran it from a root shell
[09:20:33] i can test that too
[09:20:41] and that db2104 had slaves
[09:20:49] ohh. mmmm.
[09:21:04] that seems much more likely to be relevant
[09:24:01] if you are confident, we can try the switchover tomorrow
[09:24:07] and run it with the debug mode
[09:24:12] confident? why would i be _confident_?
[09:24:42] I mean if there's not much else you can see by not running it live
[09:24:43] right now i'm questioning my own existence
[09:36:16] marostegui: do you happen to have the console output of the --only-slave-move run?
[09:36:27] let me check
[09:38:15] kormat: https://phabricator.wikimedia.org/P16988
[09:38:38] great, thanks
[09:38:46] gives me something else to stare hopelessly at :)
[09:38:54] XDDDDD
[09:41:04] marostegui: is there a way to have an sql query return the number of rows in `show slave hosts`?
[09:42:49] let me check
[09:43:52] I know mysql used to have an information_schema slavehosts table
[09:43:56] Double checking mariadb
[09:45:16] there is SLAVES_CONNECTED in information_schema.global_status
[09:45:32] kormat: maybe this can also work: select * from information_schema.processlist where command = 'Binlog Dump';
[09:45:40] which has the same value as the number of entries in 'show slave hosts'
[09:46:03] marostegui: that also gives the same result, ok :)
[09:47:50] ah. a wild goose chase. i think db-schema-change makes an invalid assumption about our semi-repl config
[09:48:12] (leading to a misleading but harmless warning)
[09:49:36] yeah. oh well.
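kormat's minimal test script is not pasted in the log, but given the statements quoted above, a sketch of that kind of test could look like the following. The host name, credentials file, and the exact revert values are assumptions; the two `SET GLOBAL` statements and the `Binlog Dump` count query are the ones discussed above.

```python
# Sketch of a minimal pymysql test along the lines described above: flip the
# semi-sync settings on the to-be-new primary and count connected replicas.
# The actual test script was not shared in the log; details here are assumed.
import pymysql

HOST = "db2107.codfw.wmnet"  # example target; substitute the candidate primary

conn = pymysql.connect(host=HOST, read_default_file="/root/.my.cnf")
try:
    with conn.cursor() as cur:
        # The only changes db-switchover makes before the prompt:
        cur.execute("SET GLOBAL rpl_semi_sync_slave_enabled = 0")
        cur.execute("SET GLOBAL rpl_semi_sync_master_enabled = 1")

        # Equivalent of counting the rows in SHOW SLAVE HOSTS:
        cur.execute(
            "SELECT COUNT(*) FROM information_schema.processlist "
            "WHERE command = 'Binlog Dump'"
        )
        (replicas,) = cur.fetchone()
        print(f"{replicas} replicas connected")

        # Revert the semi-sync change, as was done after the test.
        cur.execute("SET GLOBAL rpl_semi_sync_master_enabled = 0")
        cur.execute("SET GLOBAL rpl_semi_sync_slave_enabled = 1")
finally:
    conn.close()
```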
[09:50:47] btw, from 10.5 we should start using show replicas as they will eventually deprecate show slave status
[09:51:35] marostegui: if we're lucky they don't get rid of it until after all our hosts are 10.5+ 🤞
[09:51:50] otherwise we need to start writing code that works with both old and new approaches
[09:52:01] yeah, I don't think they'll deprecate it on 10.4
[09:58:17] marostegui: let's schedule the maintenance again for tomorrow morning (😭)
[09:58:24] kormat: excellent
[09:58:32] I will communicate that and prepare puppet patches
[09:58:37] if i can't figure out something in the meantime, at least we have a chance of seeing it in action
[10:01:17] kormat: https://gerrit.wikimedia.org/r/c/operations/puppet/+/711114
[10:17:51] I am going to take a break for early lunch and probably later start a media backup process on final hw
[10:19:25] kormat / marostegui : I'm going to lazily assume you don't need me to kibbitz at 6am tomorrow? [please? :) ]
[10:20:10] Emperor: correct :)
[10:20:22] Emperor: Not this week!
[10:20:29] yeah, that ^
[10:21:49] :'(
[10:21:59] <-- not a morning person
[10:22:20] Emperor: oh believe me, i feel your poain
[10:22:22] *pain
[10:22:47] Emperor: we normally do about 1 primary switchover per month on average. they always require Pain O'Clock
[10:23:12] but unless you are involved in the specific switch, you generally won't be needed
[10:25:01] alarm set for 6am 😭
[10:25:13] I don't even use an alarm!
[10:25:59] is 05:00 UTC minimum traffic time? I'd sort of expect plenty of US traffic then
[10:26:37] Emperor: We don't really have such a minimum traffic time as we share the wikis across servers; we normally start early so that if something goes wrong, we have more time to debug during the day
[10:27:34] Emperor: this is the affected master: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=2&orgId=1&var-server=db2107&var-port=9104&from=now-24h&to=now as you can see, there's no such valley there
[10:28:14] Emperor: Number of edits across all the wikis https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&refresh=5m&from=now-24h&to=now
[10:28:23] In this operation only writes are affected, reads are not affected
[10:32:59] Mmm, interesting; there's a bit of a diurnal pattern, but it's not pronounced
[10:34:06] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=7&orgId=1&var-site=All&var-group=All&var-shard=All&var-role=All
[10:34:23] ^ all the writes together
[11:52:27] anyone left on my side?
[11:55:13] 30
[12:08:20] marostegui: one (last?) suggestion of something we can try before tomorrow: move all replicas in s2/eqiad beneath the candidate primary there, and then test the semi-sync conf changes. we can then move the replicas back, without ever changing primary. thoughts?
[12:08:34] sure thing yeah
[12:08:35] we can
[12:08:43] can you place the commands to run on the task?
[12:08:49] just to have them handy
[12:08:52] they are just 2 no?
[12:09:07] 4 including reverts, but yeah
[12:13:40] https://phabricator.wikimedia.org/T288500#7271948
[12:14:01] gooood
[12:14:14] I will switch that to codfw hosts :p
[12:18:16] well ok then. :)
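On the `show replicas` point raised at 09:50: MariaDB 10.5 added `SHOW REPLICA STATUS` as an alias for `SHOW SLAVE STATUS`, so "code that works with both old and new approaches" could be as small as a try/fallback helper like the hedged sketch below. This assumes pymysql and is not code from any WMF tool.

```python
# Sketch of the kind of compatibility shim that would be needed if SHOW SLAVE
# STATUS were ever removed before all hosts run 10.5+; purely illustrative.
import pymysql

def replica_status(conn):
    """Return replication status, preferring the 10.5+ statement."""
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        try:
            cur.execute("SHOW REPLICA STATUS")   # MariaDB >= 10.5
        except pymysql.err.ProgrammingError:
            cur.execute("SHOW SLAVE STATUS")     # older releases
        return cur.fetchone()
```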
[12:18:30] Ah wait
[12:18:36] I didn't get your comment entirely
[12:18:47] I thought you wanted to do that tomorrow before we do the actual switch
[12:19:37] (to test the semi sync + codfw combination)
[12:19:49] As the eqiad combination was done this past 8th of August with the m2 switch
[12:20:27] i'm wondering if it could be related to the number of replicas
[12:20:31] m2 is ~trivial
[12:21:04] yeah
[12:21:13] But yeah, if you want to try that test, +1
[12:21:32] So you don't think codfw can be a factor here?
[12:21:41] it could, somehow. i just can't think of how
[12:21:50] and as it's primary i don't want to test there if i can help it
[12:21:56] yeah
[12:22:26] +1 to try eqiad so we can discard/confirm the number of replicas being a factor
[12:22:39] I have a data check running against all the databases there, but I can stop it and resume it once you are done
[12:23:11] ah, yeah please
[12:23:58] ok, let me stop it
[12:24:13] stopped
[12:24:18] ok, here goes
[12:29:09] [ERROR]: db1156/(none) failed to be moved under the new master
[12:30:29] no reason given :(
[12:30:40] Sometimes they do fail, and a retry fixes it
[12:30:55] I am checking its logs
[12:31:20] nothing really obvious there
[12:31:40] ok. i'm going to retry i guess.
[12:31:48] yeah, but you need to start replication there
[12:31:53] I am on that host, do you want me to do it?
[12:32:03] done
[12:32:17] ok, let's try again
[12:32:23] db1162 also not replicating, fixing
[12:32:58] ok, it's caught up now
[12:34:07] it's hanging on `Executing 'SET GLOBAL rpl_semi_sync_master_enabled = 1'`
[12:34:13] for db2107...
[12:34:27] ok, just got past that
[12:34:45] for db2107?
[12:35:25] and failed again with `[ERROR]: db1156/(none) failed to be moved under the new master`
[12:36:44] mmm, db1156 is the sanitarium master, that's the only thing different about it compared to the rest of the hosts
[12:37:14] maybe increase the timeout, just in case
[12:37:22] i already increased it to 30s
[12:37:34] And no idea which command is failing on db1156?
[12:37:40] cause replication does get stopped
[12:39:17] oh, apologies, the logs are misleading. the set global rpl_semi_sync_master above was _probably_ run against db1122 (s2/eqiad primary)
[12:39:50] yeah, that makes more sense
[12:39:51] marostegui: i _think_ it's timing out waiting for db1156 to catch up on replication..?
[12:40:07] the hacky logging i added is way better than nothing, and still god-awful
[12:40:24] kormat: but I am not seeing db1156 changing its master
[12:40:27] it remains on db1122
[12:42:43] i'm honestly very close to saying this codebase is a write-off, and needs to be rewritten :(
[12:44:24] kormat: can you try with move-replica?
[12:44:25] the code calls replication.move, and if the result isn't success, it prints a fixed error message and exits.
[12:44:30] and see if that is able to move db1156?
[12:44:38] as far as I know move-replica is used by db-switchover anyways
[12:45:12] trying
[12:45:38] that worked
[12:45:44] yeah, just saw the log
[12:45:47] coincidence?
[12:46:14] who knows at this point
[12:47:01] i'm going to just manually db-move-replica the rest of the replicas for now
[12:47:07] ok
[12:49:24] done
[12:49:43] instead of using db-switchover here, i'm going to use my custom test python script to set the semi-sync config on db1162 first
[12:49:45] and see what happens
[12:50:03] ok
[12:50:13] that works with absolutely no problem
[12:50:38] of course...
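One way to confirm what marostegui observed above, db1156 still pointing at db1122 after the attempted move, is to read `Master_Host` from `SHOW SLAVE STATUS` on the replica. A small sketch (pymysql assumed; hostnames are examples taken from the log):

```python
# Quick check of which master a replica is currently attached to, as was in
# question for db1156 above; credentials file and hostnames are placeholders.
import pymysql

def current_master(replica_host):
    conn = pymysql.connect(host=replica_host,
                           read_default_file="/root/.my.cnf",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
            return row["Master_Host"] if row else None
    finally:
        conn.close()

print(current_master("db1156.eqiad.wmnet"))  # still db1122, or the candidate?
```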
[12:50:53] So it is not the replicas from what we can see; only the codfw factor is left
[12:51:00] And I am sure it will work fine tomorrow :/
[12:51:39] i tested with db-switchover, also no problems
[12:51:43] going to undo the setup now
[12:52:16] siigh. db-switchover can't do the revert.
[12:52:22] ok, manually doing everything with db-move-replica
[12:52:47] double check that GTID is enabled once you are done
[12:57:35] the topology looks correct now. checking GTID.
[13:03:11] 4 aren't. fixing.
[13:04:27] pro-fess-ional:
[13:04:29] `for h in $(nodelist 'A:db-section-s2 and A:eqiad'); do inst=$(host-to-instance s2 $h); echo $inst; mysql.py -h $inst -e "show slave status\G" | grep -i using_gtid; done`
[13:04:32] all good now
[13:04:34] XD
[13:04:42] good
[13:04:56] s2/codfw also has gtid enabled everywhere
[13:05:19] yeah, I enabled it earlier after the rollback
[14:47:18] * lmata waves
[14:48:33] uh oh
[14:48:38] * kormat waves cautiously back
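The GTID check at 13:04 is a shell one-liner over `mysql.py`; an equivalent check-and-fix pass could also be done from Python, roughly as below. The host list and credentials file are placeholders, and the re-enable sequence (`STOP SLAVE`, `CHANGE MASTER TO MASTER_USE_GTID = Slave_Pos`, `START SLAVE`) is standard MariaDB syntax rather than anything quoted from the log.

```python
# Sketch of the GTID sanity check/fix described above; illustrative only.
import pymysql

HOSTS = ["db1156.eqiad.wmnet", "db1162.eqiad.wmnet"]  # example s2/eqiad replicas

for host in HOSTS:
    conn = pymysql.connect(host=host, read_default_file="/root/.my.cnf",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
            if row and row["Using_Gtid"] == "No":
                print(f"{host}: GTID disabled, re-enabling")
                cur.execute("STOP SLAVE")
                cur.execute("CHANGE MASTER TO MASTER_USE_GTID = Slave_Pos")
                cur.execute("START SLAVE")
            else:
                print(f"{host}: Using_Gtid = {row['Using_Gtid'] if row else 'n/a'}")
    finally:
        conn.close()
```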