[04:30:00] jynus: https://phabricator.wikimedia.org/P17120
[04:30:44] I guess the alert will clear once the postprocess is finished?
[04:30:49] And everything gets rotated?
[05:29:34] Amir1: https://phabricator.wikimedia.org/T290057#7323672
[06:25:34] marostegui, yes, but I will have a look, maybe the backup distribution is bad after the latest rebalancing for reimages / may need cleanup
[06:45:07] so it was a combination of bad timing, pending s4 switchover, pending s4 optimization and ongoing recoveries
[06:45:53] I have made some space, but it will be better organized after the s4 change
[07:10:35] good! thanks for double checking
[08:37:01] marostegui: so, db2118 caught up on replication overnight 😅
[08:37:17] the next steps would be running checks, which will take a long time
[08:37:39] so, what i want to do instead is make a copy of the current good state, reset to the old state one last time,
[08:37:54] and see if we can correctly handle the dump->switchover period of replication using GTID
[08:38:21] that'll take maybe 2-3h to do. and then i can switch back to the good current state, and start running checks
[08:38:40] sound reasonable? (or, at least, not less unreasonable than usual? ;)
[08:40:23] sobanski: now that the db2118 situation seems to be under control, i see that there's some parsercache discussion that i missed over the last week. i'll try to catch up on it today
[08:48:36] marostegui, as promised: https://gerrit.wikimedia.org/r/c/operations/puppet/+/715919
[08:49:05] have a look when you can and we can merge (even you can do it if I am not around) before or around the switchover taking place
[08:59:48] kormat: sounds good to me indeed
[09:00:19] jynus: thanks! +1ed
[09:06:05] kormat: I've assigned you to https://gerrit.wikimedia.org/r/c/operations/software/+/715926 per our discussion yesterday, but feel free to get other eyes on it if you like :)
[09:07:35] kormat: thanks. It's mostly about the order of the next steps at this point.
[09:08:27] sobanski: i'm currently trying to reconstruct the timeline of what happened when so i can try to make sense of timo's analysis of things
[09:08:46] having 3 different tasks (that i can find) for this _really_ doesn't help
[09:10:15] kormat: do you want me to create a meta one?
[09:10:25] marostegui: do you enjoy having working knees?
[09:10:32] (random, unrelated question. naturally)
[09:10:36] It is definitely useful
[09:10:36] aww, you two are so cute ;-p
[09:11:05] * kormat grins
[09:11:47] oh, bah I got confused by the gerrit UI
[09:12:05] Emperor: SNAFU
[09:14:13] jynus: do you want me to merge the change monday morning or friday evening? The switchover is scheduled for monday and the candidate master is already reimaged
[09:14:35] we don't really have to be super-accurate
[09:15:08] I think we'd better leave the sunday snapshot going and merge it once the switchover completes
[09:15:31] sounds good to me!
[09:15:40] either you can, once everything looks good
[09:15:46] or I can later in the day
[09:16:22] sure no worries, we can speak after the switch
[09:28:53] marostegui: thanks. It's not as much as I want it to be :(
[09:29:03] It is pretty good I think
[09:31:03] basically brought the fs usage to the state it was in May: https://grafana.wikimedia.org/d/000000377/host-overview?from=now-6M&orgId=1&refresh=5m&to=now&var-cluster=mysql&var-datasource=eqiad%20prometheus%2Fops&var-server=db1136&viewPanel=12
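A minimal sketch of the GTID-based repointing kormat outlines at 08:37–08:38, assuming MariaDB's slave_pos tracking; the actual source host, copy mechanism, and coordinates are not in the log, so everything below is a placeholder:

```
# Sketch only: SOURCE_HOST is a placeholder and the copy step is elided.
# 1. Preserve the caught-up state of db2118 before resetting to the old state.
sudo mysql -e "STOP SLAVE"
#    ... copy /srv/sqldata aside (snapshot / transfer of choice) ...
# 2. After restoring the old state, repoint replication using the replica's
#    recorded GTID position instead of a binlog file/offset from the dump:
sudo mysql -e "CHANGE MASTER TO MASTER_HOST='SOURCE_HOST', MASTER_USE_GTID=slave_pos; START SLAVE"
# 3. Watch whether GTID carries it cleanly across the dump->switchover gap:
sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Using_Gtid|Gtid_IO_Pos|Seconds_Behind_Master|Last_.*Error'
```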
[09:36:50] Emperor: sigh. i think we should look at switching our pontoon env to use the mariadb::core and mariadb::core_multiinstance roles
[09:36:58] because not being able to test stuff is teh suck
[09:40:51] kormat: you might be right; I think it has the change that makes mariadb::misc::db_inventory use mariadb::service now, so that should be more useful.
[09:41:06] But I need to rebase and redo the branch I'm working on ATM
[09:41:24] * kormat nods
[09:42:12] (but yes, "ask kormat nicely to make at least one multi-instance server in pontoon" is lurking on my TODO list)
[09:43:03] * kormat notes that Emperor is unaware of who might actually end up doing the work there ;)
[10:02:18] I would appreciate a review of: https://phabricator.wikimedia.org/T288594#7324150
[10:04:14] marostegui: i totally get why you'd want a review, but there's too much voodoo there for me 😨
[10:04:46] So basically double check that I am writing the correct database to ignore wikitech when replicating back from eqiad to codfw (as pre-DC-switchover work)
[10:04:57] so once we are in eqiad, everything BUT wikitech replicates to codfw
[10:08:22] how that comment looks to me: https://usercontent.irccloud-cdn.com/file/ZN8qaoro/image.png
[10:09:11] hahahaha
[10:10:11] marostegui: i can't believe you're asking me to sacrifice my last remaining sanity like this, but can you point me to the accursed task for the wikitech migration?
[10:10:42] So essentially before the DC switch we have to configure replication back from eqiad to codfw, so once we start writing on eqiad data gets transferred to codfw normally. The s6 eqiad master is now configured with multi-source replication, as it replicates from codfw AND from m5 (but only for labswiki, which is wikitech). So once we are in eqiad, codfw will also receive m5 updates, but we want to ignore them, as codfw doesn't have labswiki yet and if it receives any, it will break replication
[10:11:07] kormat: The doc is cleaner than the task: https://docs.google.com/document/d/1fOyK3cScppj9mDTBoqMjyIGPb-Mx_naHj8v9vV9Ccho/edit
[10:11:28] But the task is: https://phabricator.wikimedia.org/T167973
[10:11:36] I suggest not reading the task XD
[10:16:53] I hate you 🥀
[10:17:13] marostegui: your proposal looks sane*
[10:17:21] (* for a very stretched version of sanity)
[10:18:09] \o/
[10:18:13] thanks :)
[10:19:30] * kormat grumbles about how she had been deliberately and successfully avoiding any details of that migration up until today
[10:28:22] kormat: what database work involves sanity
[10:28:34] Ugh, all this rebasing and squashing into single commits for review is making a real mess of my git history
[10:28:49] RhinosF1: `systemctl stop mariadb; rm -rf /srv/sqldata` ;)
[10:29:15] kormat: well that involves no database
[10:29:24] RhinosF1: see? :)
[10:29:29] Emperor: gerrit driving you mad?
[10:29:51] kormat: yes, no database work
[10:37:08] kormat: https://gerrit.wikimedia.org/r/c/operations/puppet/+/715934 was a fix you suggested :)
[10:46:03] also now updated https://gerrit.wikimedia.org/r/c/operations/puppet/+/714358 let's see if the PCC is now happy...
[10:47:12] https://puppet-compiler.wmflabs.org/compiler1002/891/ \o/
[11:56:17] Emperor: "git commit --amend"? sorry if you know this one too :D
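A minimal sketch of the amend-based flow kormat is pointing at, in reply to the 10:28:34 rebase/squash grumble; the remote and branch names are placeholders, and git-review is only one of the ways to push to Gerrit:

```
# Sketch: update an existing Gerrit change in place instead of stacking
# fixup commits that later need squashing.
git add -p                    # stage the follow-up edits
git commit --amend --no-edit  # fold them into the existing commit; the
                              # Change-Id footer stays, so Gerrit records
                              # this as a new patchset of the same change
git review                    # or: git push origin HEAD:refs/for/BRANCH
```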
[11:56:45] https://www.mediawiki.org/wiki/Gerrit/Tutorial/tl;dr
[11:57:07] Getting to know how gerrit interacts with git takes a lot of time, but I find it much better than github :D
[12:23:17] Contrariwise, I liked gitlab's MR-based workflow, and wrote a tutorial for my last job on using git rebase to make better MRs :)
[12:26:34] Emperor: soon (TM)
[12:56:54] ok, i can confirm that mariadb does _not_ skip transactions whose GTID contains the current server's domain_id/server_id.
[12:58:00] --replicate-same-server-id is almost certainly still required, though
[14:15:07] kormat: I've downtimed db2118 for a day, so it doesn't appear on icinga
[14:15:18] spoilsport
[15:14:12] As mentioned earlier (and having double-checked my previous employer are OK) - https://gitlab.wikimedia.org/MVernon/rebasing_demo
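A small sketch around the 12:56–12:58 finding, with placeholder values and assuming replicate_same_server_id is exposed as a (non-dynamic) server variable on these MariaDB versions:

```
# Sketch: GTID alone does not make a replica skip its "own" events, so the
# server_id check in the replication stream still applies.
sudo mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN
               ('server_id', 'gtid_domain_id', 'replicate_same_server_id')"
# Assuming it is still a startup-only option, enabling it would mean a
# my.cnf entry plus a restart rather than SET GLOBAL:
#   [mysqld]
#   replicate_same_server_id = 1
```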