[08:36:07] Emperor: ack, will read in a few :)
[08:39:30] Y
[08:39:32] TY even
[08:41:34] * Emperor may need more coffee
[08:53:31] dump restore on db2118 finished after 16h
[08:54:15] that seems a bit long, how many threads did you use?
[08:54:33] i just followed the docs, so whatever the default is
[08:55:23] `recover-dump --help` doesn't mention the default
[08:58:13] 16 is the default, which is reasonable
[08:58:57] I can have a look later at the metrics to see if there is room for improvement, it didn't use to take more than 12 hours
[09:13:04] volans: your CR worked, thanks! 💜
[09:15:02] you're welcome, 50 EUR :D
[09:15:54] volans: i'll happily accept your donation
[09:20:06] jynus: something seems.. wrong with this dump
[09:20:12] ?
[09:20:13] it says it was dumped from db2118. which isn't a backup source
[09:20:22] and, unfortunately, db2118 is exactly the host i'm trying to restore
[09:20:34] where does it say it?
[09:20:52] `dbprov2001:/srv/backups/dumps/latest/dump.s7.2021-08-24--00-00-02/metadata`
[09:21:22] ah, no, that is the host where it was replicating from
[09:21:40] which I guess was true, given it was the old primary there?
[09:21:46] it was, yes
[09:21:59] gtid should just solve it for you
[09:22:28] how do you use it in this case?
[09:22:45] as documented, nothing changes (unless the docs are wrong :-)
[09:22:47] let me see
[09:23:02] the docs say to use host+binlog+position
[09:23:09] the host is itself, so that doesn't seem like it's going to be useful
[09:23:13] there's nothing i see about using gtid
[09:23:28] https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Enabling_replication_on_the_recovered_server is what i'm looking at
[09:24:24] yeah, I blame marostegui for that part
[09:24:25] :-D
[09:26:19] the docs are not technically wrong, but it is not what I would do
[09:26:30] no need to use binlogs
[09:26:37] take the GTID line
[09:26:45] 0-180359185-3380903947,171970590-171970590-196269578,171970664-171970664-2362613365,171978767-171978767-4484858466,171978861-171978861-433436917,180355111-180355111-139571514,180359185-180359185-72003839,180363275-180363275-7694462,180363371-180363371-19393161,180367395-180367395-554641873
[09:28:50] and apply it like on the regular recoveries
[09:29:06] SET GLOBAL gtid_slave_pos = '';
[09:29:09] then
[09:29:13] CHANGE MASTER TO master_use_gtid = slave_pos;
[09:29:38] jynus: I'm off today but you can always correct the docs, right?
[09:30:14] `Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.'`
[09:30:42] uf
[09:31:17] can I log in and see it for myself?
[09:31:23] by all means
[09:31:37] it is db2118, right?
[09:31:40] yes
[09:36:33] yeah, it is referencing old binlogs from the old primary master from years or months ago
[09:37:13] so what we can do is change the full string to just reference the current master, and it should be ok
[09:38:14] so that would be 180355206
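For reference, a minimal sketch of the GTID-based setup described above, assuming the gtid_slave_pos value is the full GTID line from the dump's /metadata file and that `db2121.codfw.wmnet` merely stands in for whatever the current primary of the section is (the replication user and password are left out):

```sql
-- Sketch only: the GTID line is abbreviated, and the master host is a placeholder.
SET GLOBAL gtid_slave_pos = '0-180359185-3380903947,...';  -- full GTID line from /metadata
CHANGE MASTER TO
  MASTER_HOST = 'db2121.codfw.wmnet',
  MASTER_USE_GTID = slave_pos;
START SLAVE;
-- SHOW SLAVE STATUS\G is where the "Got fatal error 1236 ... Could not find
-- GTID state requested by slave" message quoted above then showed up.
```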
[09:38:16] does this mean the gtid in /metadata is bogus?
[09:38:25] not bogus
[09:38:39] there are many that are super-old but cannot be "cleaned up"
[09:38:46] that is probably why binlogs are preferred
[09:39:17] I think neither binlogs nor gtid are perfect, so something in between is probably the right solution
[09:39:40] using the gtid of only the primary at the time of the backup
[09:40:15] which makes gtid not a very simple solution, I think there are some tickets about that
[09:40:20] jynus: ok - as the docs don't cover this, and i definitely don't understand the implications here, can you get it working again, and add it to the docs?
[09:40:32] I can get it working again
[09:40:46] but regarding docs - not sure, because normally those would work
[09:40:52] but this is more of a wider issue with gtid
[09:41:38] the data is ok, but not sure what is the best way to fix gtid complications
[09:41:49] for backups, but also for dbs in general
[09:42:57] using binlogs is a workaround that only works when there have not been recent topology changes
[09:44:49] it looks like this is an issue that will happen anytime we need to restore a primary (or, in this case, a node that was very recently primary)
[09:45:26] it is wider than that
[09:45:35] it will happen any time replication is handled
[09:45:38] jynus: i'm just trying to deal with this instance right now
[09:45:45] I know, I am on it
[09:46:13] I am trying not to break it after 16 hours, so going slowly :-)
[09:48:12] please allow me to do some extra checks on data to be 100% sure
[09:52:48] if things go well (that is why I am testing slowly) i will be able to write a procedure that works every time
[09:52:55] 👍
[09:52:55] (I think)
[09:53:22] it is true that the gtid issues were known, but I think I never thought about this one
[10:01:22] I think the issue is different from what I thought
[10:01:30] and it cannot be fixed on backups
[10:01:56] because db2118 has been rebuilt
[10:02:12] 😬
[10:02:14] but has the same server id - because it depends on the ip
[10:02:25] it is trying to reference itself for replication
[10:02:47] as those events were written originally to itself
[10:03:09] good news - it has a solution
[10:03:19] we do good old binlogs
[10:03:26] which is easy to get from gtid
[10:03:40] then we replicate without gtid until we get to the new server
[10:04:20] if gtid can be enabled after that, cannot say
[10:04:55] I guess it should work, otherwise we would have found this easier earlier
[10:04:57] the binlog+pos in the /metadata file don't seem to match the GTID
[10:05:07] yeah, I mean easy to find
[10:05:09] select binlog_gtid_pos("db2118-bin.002581", 110134450);
[10:05:14] 180363371-180363371-19393161,171970664-171970664-2362613365,171978861-171978861-433436917,180367395-180367395-554641873,171978767-171978767-4484858466
[10:05:15] I can tell you how at a later time
[10:05:30] that seems to imply there's a problem with the gtid recorded in /metadata
[10:05:32] we got to the binlog with gtid, then we can check the binlogs
[10:06:02] no, metadata is ok, but the gtid model is broken for how we use it
[10:06:17] and it is not an us-only thing
[10:06:33] we are referencing ids for a host that no longer exists
[10:06:47] when recovering a backup, we should change the gtid of the server
[10:06:53] or when rebuilding it
[10:07:04] otherwise it will want to reference its own binlogs, from the same id
[10:07:30] but those no longer exist, despite the same data on the same server being available
[10:07:41] the binlogs are still there
[10:07:59] yes, but only on the new primary
[10:08:08] no, db2118 still has its old binlogs
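As an aside, the state being debated here can be inspected directly on db2118 with standard MariaDB commands; this is only a sketch of what to look at, not a fix:

```sql
SHOW BINARY LOGS;                          -- the old db2118-bin.* files still present on disk
SELECT @@global.gtid_binlog_pos   AS written_to_own_binlogs,
       @@global.gtid_slave_pos    AS replicated_from_others,
       @@global.gtid_current_pos  AS effective_gtid_position;
```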
[10:08:13] they could be physically there, sure
[10:08:20] but that is not something gtid likes
[10:08:31] from the point of view of gtid, "we have already executed those"
[10:08:41] so "I don't like them"
[10:08:45] so to speak
[10:09:07] so for gtid there are 2 servers, the old one before corruption and the new one
[10:09:23] and the idea is to trick it, without gtid, into treating them as the same :-)
[10:09:28] which is the issue
[10:09:50] advance until the old binlogs are not needed
[10:10:20] and that explains why manuel used binlog here, it is not a great admin experience :-)
[10:11:47] so, to summarize, at least for now
[10:12:30] find the equivalent of 180367395-180367395-554641873 / db2118-bin.002581:110134450 on db2121
[10:12:40] start replication without gtid
[10:12:53] wait until we go over the switchover time
[10:13:29] then try reenabling gtid after it, when it no longer has to reference binlogs from the "virtually different previous instance"
[10:14:00] I can document that, and even automate to some extent
[10:14:16] how do you map gtid to binlog position? i've only managed to find binlog_gtid_pos, which goes in the opposite direction
[10:14:17] but the issue is with reusing gtids after a "reimage"
[10:14:50] kormat - we search for it - more or less getting the binlog for the time on the replica, and then lots of greps :-D
[10:15:10] and yes, gtid was supposed to solve binlog issues :-D
[10:15:23] "universal identifier"
[10:15:51] do you want to try on your own first?
[10:15:59] we can both do it and double check with each other
[10:17:31] heartbeat and gtid usually simplify the process
[10:17:57] but this is in part why many large installations deploy their own pseudo-gtid
[10:17:59] :-(
[10:18:04] Scrolling up I'm somewhat overwhelmed, so +1 to documenting this.
[10:18:35] the problem is a mix of a big mariadb bug and wmf arch, more of the first than the second
[10:19:20] I have my coords for binlog
[10:19:50] do you wanna give it a try yourself first?
[10:19:56] yep, will do
[10:20:17] and one thing that is super confusing
[10:20:27] gtid tracks executed transactions, binlogs track offsets
[10:20:40] which makes it easy to make off-by-one mistakes
[10:21:13] jynus: so, to clarify, i should be looking for the binlog on db2118 that was in use at the time of the backup?
[10:21:20] correct
[10:21:27] ok
[10:21:30] the one equivalent to 180367395-180367395-554641873
[10:21:54] ok, which we can tell because the domain id/server id are the id of db2118, right?
[10:22:02] or db2118-bin.002581:110134450
[10:22:06] yes
[10:22:25] those are the 2 things you have from the metadata
[10:22:35] ok, now i'm confused
[10:22:48] i'm looking for a binlog position. but the metadata contains a binlog position. is that the wrong position?
[10:22:59] it is the right position on the wrong server
[10:23:14] because it was the primary at the time of the backup
[10:23:28] and that is ok, normally we ignore the binlog and use gtid
[10:23:49] (that's not true. we normally use binlog, at least IME)
[10:24:04] ok, then let me reword it
[10:24:17] in case of a master switchover you cannot use binlog
[10:24:27] for what it's worth, the reason I used binlogs on that wiki page was to let the server replicate without messing up with gtid until it was totally in sync with its master.
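For reference, the "find the equivalent position on db2121" step summarized earlier can also be done in plain SQL instead of grepping mysqlbinlog output; the file name below is just an example (it happens to be the one identified later in the conversation):

```sql
SHOW BINARY LOGS;  -- on db2121: pick the binlog file covering the backup timestamp
SHOW BINLOG EVENTS IN 'db2121-bin.002436' FROM 4 LIMIT 200;
-- Page through until the Gtid event for 180367395-180367395-554641873 shows up;
-- per the off-by-one caveat above (gtid marks completed transactions, binlog
-- coordinates are offsets), the coordinate to replicate from is the position
-- of the event *after* that transaction.
```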
[10:24:36] and you were right, marostegui
[10:24:44] to avoid exactly what happened above
[10:24:50] but then there is this other bug with gtid
[10:25:09] the primary switch prevents using the binlog
[10:25:17] and the server rebuild prevents using the gtid
[10:25:40] because gtid refuses to accept the server was put "back in time"
[10:25:51] after the recovery :-(
[10:26:48] and that is because it is not easy to clean up old gtids
[10:26:58] (I think)
[10:27:20] wait - don't i want to find out the _db2121_ binlog position that corresponds to db2118-bin.002581/110134450 ?
[10:27:20] (db2121 being the new primary)
[10:27:20] because once we have that, we're sorted, no?
[10:27:50] so with binlog you always want the binlog referencing the direct primary
[10:28:22] but the one you have on metadata was the position of the at-the-time master of the backup source
[10:28:56] i can't tell if you're answering my question or not
[10:29:10] sorry, I may have copied the wrong metadata
[10:29:25] db2118-bin.002581:110134450
[10:29:56] we are both saying the same, aren't we?
[10:30:06] i can't tell :P
[10:31:10] just search for the equivalent to db2118-bin.002581:110134450 / GTID: 180367395-180367395-554641873 on db2121
[10:31:26] db2121-bin.002439 / 437786208
[10:31:29] I think those are the ones from the metadata
[10:31:42] mmm, I got different coords
[10:32:15] what i did: `select binlog_gtid_pos("db2118-bin.002581", 110134450);` on db2118
[10:32:46] db2118 or db2121?
[10:32:59] looked at the binlogs on db2121, saw that db2121-bin.002439 was active at the time of the swi... oh. i should have been looking at the _Backup_ time. not the switch time.
[10:33:05] sobanski: db2118
[10:33:05] that is data from the 25 at 6am, I think that is
[10:33:18] *don't
[10:33:44] I'll keep quiet and not confuse things anymore
[10:33:47] ok. dump is from 2021-08-24 00:00:02
[10:34:00] binlog would be db2121-bin.002436
[10:34:25] what i got was db2121-bin.002436:130479946
[10:34:47] and double checked the transaction before and after to check with the server data
[10:36:06] for search, I used the gtid - which works for greps, not in practice :-D
[10:36:32] what i've got: db2121-bin.002436 / 130479291
[10:36:47] I think you got the previous transaction
[10:37:11] remember what I said about "gtid marks completed transactions, binlogs mark offsets" :-)
[10:37:37] offset == the space just before the next transaction to be executed
[10:37:53] ok - went to next transaction
[10:38:03] (well, strictly the next gtid entry)
[10:38:10] and i also get 130479946
[10:38:15] cool :-)
[10:38:20] this is a very error prone method
[10:38:50] so when I used to do it with manuel in very bad situations, we always did it independently and rechecked :-)
[10:39:00] so there you have your coords
[10:39:07] to apply on doc
[10:39:16] let me stop replication
[10:39:37] and now give you full control
[10:39:52] of db2118
[10:40:07] set it up, let's make sure it is going
[10:40:20] and then think how to solve this issue for the future
[10:40:24] i'm going to do a 'reset slave all' to start
[10:40:29] +1
[10:40:39] there could be garbage on relays, etc.
[10:41:45] and once we get to binlogs after 6:30 on the 25 GTID "it should just work" LOL
[10:43:25] I can double check coords if you need it
[10:43:55] but I think you got it
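A sketch of the non-GTID setup being applied at this point, assuming `db2121.codfw.wmnet` as the primary's address and leaving out the replication user and password:

```sql
STOP SLAVE;
RESET SLAVE ALL;                 -- drop stale connection info and relay logs ("garbage on relays")
CHANGE MASTER TO
  MASTER_HOST     = 'db2121.codfw.wmnet',
  MASTER_LOG_FILE = 'db2121-bin.002436',
  MASTER_LOG_POS  = 130479946,
  MASTER_USE_GTID = no;          -- classic file+position replication until past the switchover
START SLAVE;
SHOW SLAVE STATUS\G
```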
[10:44:18] it's reporting `Seconds_Behind_Master: 0`, which is making me nervous
[10:44:24] heh
[10:44:31] because of the same server id, I guess?
[10:44:54] that is the same issue as gtid
[10:45:02] except it doesn't fully break replication
[10:45:08] oh god, right
[10:45:25] so can you see why I say it is a more global issue than backups
[10:45:45] not saying that I cannot do things differently there, ofc
[10:45:47] wait, if the server_id in the binlogs is db2118's, doesn't that mean it won't apply any of the transactions?
[10:46:03] good question
[10:46:39] `Seconds_Behind_Master: 187953` now
[10:46:43] I think it may (with 90% confidence)
[10:47:05] we will see soon if replication breaks
[10:47:15] "yaay"
[10:47:33] let's double check a transaction that would have been missed
[10:47:37] or I will
[10:47:46] yeah, please
[10:47:54] and we can also run compare later, which we should have done anyway
[10:49:16] yeah, they are there
[10:49:32] replication is unlikely to survive such a large gap
[10:49:43] but still, compare will be in order after so much manual handling
[10:49:44] yeah 🤞
[10:49:50] ack
[10:49:56] but all of this is super scary
[10:50:05] 💯
[10:50:17] on the good news side, we have snapshots that are a bit more reliable in this aspect
[10:50:28] as they bundle their internal gtid state and data together
[10:50:36] and are more frequent
[10:50:52] so in a big emergency, just fall back to snapshots
[10:51:01] but this is a real issue
[10:51:03] for backups
[10:51:07] and for replication in general
[10:52:31] I think we will need manuel on monday to see how to better solve this, as he was more aware of this than me
[10:52:38] and that is why he was using binlogs
[10:52:49] BUT binlogs won't work for topology changes
[10:52:53] This did all look quite alarming
[10:53:12] Emperor: in this case, looks were _not_ deceiving.
[10:53:14] well, no user impact == not a big deal today
[10:53:26] I cannot say anything about tomorrow :-)
[10:53:46] and as I said to kormat once, the backups are there and the data is safe
[10:53:55] now about recovering it... :-P
[10:54:25] "recovery costs extra" and all that
[10:54:41] my biggest worry was not to waste kormat's time with another 16 ours of recovery
[10:54:47] *hours
[10:54:59] I knew it had some solution
[10:55:34] I think this is big enough for delaying until monday for a big meeting, sobanski?
[10:55:58] not that I wouldn't like to discuss it more now, but I really need to finish some clinic duty stuff
[10:59:38] I think it's worth a dedicated meeting
[10:59:46] +1 to me
[11:00:28] Let's discuss on Monday so we can plan for the most efficient way to run it
[11:39:45] well crap. replication broke.
[11:39:51] Last_SQL_Error: Error 'Duplicate entry '951915-edit' for key 'pr_pagetype'' on query. Default database: 'eswiki'. Query: 'INSERT /* WikiPage::doUpdateRestrictions */ INTO `page_restrictions` (pr_page,pr_type,pr_level,pr_cascade,pr_expiry) VALUES (951915,'edit','autoconfirmed',0,'20210826151244')'
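For context, a quick way to look at the row behind that duplicate-key error (pr_pagetype is the unique index over pr_page and pr_type on MediaWiki's page_restrictions table); this only reads data, and the columns are the ones named in the failing INSERT:

```sql
SELECT pr_page, pr_type, pr_level, pr_cascade, pr_expiry
FROM eswiki.page_restrictions
WHERE pr_page = 951915
  AND pr_type = 'edit';
```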
[11:42:51] that transaction is from 2021-08-25T15:12:44
[11:43:15] we should also check the one that inserted that value beforehand too
[11:44:45] but it is many times very difficult to know the exact origin
[11:45:35] the current status doesn't have that row
[11:46:43] let me grep the binlogs
[11:49:57] my guess is replication did some transactions before erroring out
[11:50:24] as that value is inserted very soon after the place we started replicating to
[11:50:55] so on one of the START SLAVE commands, it started, replicated a few records, then failed
[11:51:41] replication ran for almost 30 minutes before it errored
[11:52:01] yeah, which would explain just a few records of drift
[11:52:16] only a few records got inserted incorrectly
[11:52:45] so my theory is not that it was in this run
[11:53:01] but on previous "start slave"s
[11:54:09] ok. so we're back to data we can't trust.
[11:54:19] at this point i'm inclined to restore from a snapshot, and then run a full check.
[11:54:42] if the idea is to have something working ASAP +1
[11:55:17] I am sure if we had used the new coords from the beginning it would have worked, too
[11:55:25] but we didn't know it was going to fail
[11:56:26] one tip when recovering dumps is, if there is space, shut down and copy the data to a different dir as a quick backup
[11:57:09] e.g. to prevent copy and paste errors, or I guess issues with gtid like this
[11:57:43] let me check the date of the last snapshot
[11:58:13] ah, you mean after restoring the dump, but before trying to enable replication?
[11:58:25] yeah, because it has happened to me recovering for 12 hours
[11:58:33] then pasting the wrong coord
[11:58:47] and trashing that :-/
[11:59:26] in this case, it was not that, I think it was the gtid - starting and stopping
[11:59:26] ack
[11:59:35] but it could help too
[11:59:41] ok, well, seeing as no matter what i do it won't be done today,
[11:59:47] i might as well re-try the dump-restore
[12:00:20] I can force a new dump too, which would be painless
[12:00:34] how long does that take to generate?
[12:00:39] or recover you a snapshot in 1:30m
[12:00:43] let me see
[12:01:05] my point is that if we recover a snapshot, i'd still want to do a full mysqlcheck run on it
[12:01:33] which won't be done before end-of-business today
[12:01:59] I would do (and help) with whatever you decide indeed
[12:02:15] let me check how much time it took for the last dump of s7 on codfw
[12:03:58] 186 minutes
[12:04:19] alright, tell you what:
[12:04:28] if you can please trigger a new dump of s7/codfw,
[12:04:33] sure I can
[12:04:46] i'll re-restore the existing dump, and we'll see if our recovery method above actually works
[12:04:53] if it fails for whatever reason, i can restore the fresh dump on monday
[12:05:15] looks good to me
[12:05:19] great, thanks :)
[12:05:20] I would add one more thing
[12:05:24] go on
[12:05:41] I know it failed because of gtid, but not sure the exact cause
[12:06:10] if it was because of same gtid + topology change
[12:06:24] or because I think the instance wasn't completely destroyed
[12:06:31] (correct me if I am wrong)
[12:06:53] e.g. there are gtid references on the mysql tables
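The "gtid references on the mysql tables" presumably means MariaDB's persistent replication state, which survives anything short of a from-scratch rebuild; it can be inspected (read-only) like this:

```sql
-- mysql.gtid_slave_pos is where MariaDB persists the GTIDs it has replicated;
-- stale rows here are the kind of leftover being discussed.
SELECT * FROM mysql.gtid_slave_pos ORDER BY domain_id, sub_id DESC LIMIT 20;
```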
[12:07:07] would it make sense to try what you said but starting from 0?
[12:07:18] as in mysql_install_db
[12:07:40] and of course you can pre-dump grants and the heartbeat table and other stuff you need
[12:08:13] do you see my - more of a question than a suggestion?
[12:15:02] new dump is ongoing at dbprov2002: dump.s7.2021-08-27--12-14-16
[12:15:06] BTW
[12:15:57] that won't have the gtid issue for sure because it will reference the current primary
[12:16:08] and binlogs will be usable
[12:27:51] jynus: i believe 'reset slave all' specifically drops the old gtid info
[12:28:04] uf
[12:28:20] i'll do: restore dump -> copy /srv/sqldata -> start replication -> see what happens
[12:28:26] (with reset slave all)
[12:28:30] that's indeed what I would expect too, don't get me wrong
[12:28:37] if that doesn't work, then we can try more scorched-earth approaches
[12:28:39] that's ok to me
[12:28:44] it would be good to know what is the minimum required fix
[12:28:53] yes, that was my thought
[12:29:38] and if you have space, please see if you can keep a duplicate after dump and before start
[12:29:46] faster to try things :-D
[12:31:28] we've seen weird things being kept hidden on binlogs being read on start and mysql.* tables in the past
[12:32:08] as in, replication leftovers, see: https://jira.mariadb.org/browse/MDEV-11969
[12:32:09] that's what i was suggesting doing
[12:32:39] that's not the one
[12:33:27] I cannot find one, but we got an answer where devs say "it is impossible to remove leftovers in your situation"
[12:33:40] (as long as replication kept being up)
[12:33:45] but I think it doesn't apply here
[12:33:58] because they would be kept on the primary only in this case
[12:34:01] so all good
[12:36:25] i'm going to open a task to keep track of this attempt
[12:36:31] thanks
[12:39:24] I am pretty sure that the issue requires the topology change AND recovering to a host that is being used as the old GTID reference, for referencing transactions written by itself, and that it wouldn't happen on other replicas
[12:40:55] what about just restoring a primary?
[12:41:15] well, that doesn't require anything
[12:41:24] uh?
[12:41:30] as it won't replicate from others
[12:41:38] no START SLAVE there :-)
[12:41:45] ok, primary in the secondary DC :P
[12:42:02] it is impossible to reference itself in that case
[12:42:07] we have backups per-dc
[12:42:36] not saying it is a niche case
[12:42:47] the use case should work
[12:43:08] i'm not sure we're understanding each other
[12:43:20] say db1136 (s7/eqiad primary) needed to be restored
[12:43:30] the metadata for the latest dump would reference.. db1136
[12:43:38] are you saying we'd restore from the codfw dumps?
[12:44:09] well, I think you wouldn't ever recover a primary, you would promote a replica first
[12:44:38] but yeah, there are no references cross-dc
[12:44:50] i don't know what you mean by that
[12:45:07] well, you cannot recover a primary AND have it available
[12:45:13] even on a secondary dc
[12:45:17] sure?
[12:45:47] you can partially recover it (e.g. a table)
[12:45:49] or a row
[12:46:01] ok, i'm lost again
[12:46:03] but then you don't stop replication, you are just yoloing the recovery
[12:46:38] how do you do maintenance on a primary instance of a secondary dc while keeping replication going?
[12:46:54] where's the assumption coming from that replication would be running in the secondary dc?
[12:47:21] then if it is not running, there is no reason for not failing it over
[12:47:38] if you fail it over, you automatically run into the situation above
[12:47:40] promote another host to master
[12:48:03] for the host, but not for the full datacenter :-)
[12:48:39] the main issue is the backups cannot predict where the new primary will be failed over
[12:49:23] at most we could automate what we did, and find the transaction on the binlog of all host on backup?
[12:49:30] *all hosts
[12:50:05] in a way, that is what gtid does - provide a global transaction id
[12:50:44] but we cannot replicate using it because of its flaws - trying to be too "intelligent"
[12:51:57] ending up reimplementing gtid, something like https://github.com/openark/orchestrator/blob/master/docs/pseudo-gtid.md :-(
[12:52:39] Or if the issue is only because of shared gtid
[12:53:08] making sure on host rebuilding to never reuse server ids
[12:53:32] that might be simpler, iff there's some easy way to do so
[12:53:45] things need a lot of thinking and testing, I really don't see a simple solution
[12:53:49] we can't change server ids
[12:54:17] at least, until recently we relied on them being == the server's ipv4 address
[12:54:20] that might be less true now
[12:54:27] but still, i'd be very wary of making that change
[12:54:41] it wouldn't need to be for existing hosts
[12:54:52] only for those rebuilt ("new instances")
[12:54:57] I am sure there are other options
[12:55:11] in other ways, I mean
[12:55:50] GTID promised solving all our problems :-)
[12:56:21] and it works nicely* (* - as long as you don't have to do maintenance on the servers)
[12:56:27] :-D
[12:57:35] the backups part is the easy part I think - a workaround for alternative coords wouldn't be that hard
[12:58:02] you don't even need all binlog coords, maybe just more than 1, and then move the replica topology around
[12:58:38] e.g. after backup, record the binlog coords for the current primary and for the candidate
[12:58:46] (in a consistent way)
[12:59:21] "binlog coordinate tracking redundancy"
[12:59:45] or a script that finds it easily, like we did
[12:59:50] after the fact
[13:01:27] I am open for more suggestions
[13:29:37] Emperor: i'm not going to get to your puppet/systemd woes today i'm afraid
[13:33:47] kormat: no problem, I'm worrying about swift today
[13:34:26] (AFAICT the plan for moss is 2 servers in each DC with asynchronous replication between them which is quite a scary thought)
[13:35:01] oh good :)
[15:59:14] the dump completed at 13:55:09 UTC
[15:59:24] on dbprov2002
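A speculative sketch of the "binlog coordinate tracking redundancy" idea floated above: at backup time, capture coordinates on both the current primary and the failover candidate, so a later restore can use whichever host ends up being the primary. Nothing below exists in the backup tooling; it only illustrates what would need to be recorded on each candidate host.

```sql
SHOW MASTER STATUS;                                   -- File / Position on this host
SELECT @@global.gtid_binlog_pos AS gtid_at_backup;    -- the matching GTID set on this host
```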