[08:03:01] jynus: what do you mean with https://phabricator.wikimedia.org/T288803#7280854?
[08:03:20] Available backup source on the standby DC (db1150
[08:03:44] if this is done after dc switchover, you would invert that and Available backup source on the primary DC (db2139
[08:03:54] ah your comment is related to sources
[08:04:02] yes, Re: my edit
[08:04:07] yes, codfw is primary and eqiad passive, otherwise......see my above comment
[08:04:17] we'd end up with 10.4 primary and 10.1 passive, which we don't want
[08:04:24] (and also failing over directly to a 10.4)
[08:04:47] I updated my wording to clarify I am commenting on my edit, not your comment
[08:05:08] yep, makes sense
[08:05:35] and re:backup sources, I have obviously no comment about upgrade times, etc.
[08:05:49] lots of moving pieces for now
[08:06:00] I understand
[08:06:12] So far, nothing to be done (from your side or my side), I will keep you posted if there are changes
[08:06:47] I wanted to reflect reality on the summary
[08:07:04] e.g. it said "Reimage dbprov2002"
[08:07:16] which was already reimaged, probably due to copy and paste
[08:10:16] yeah, copy&paste error for sure
[08:11:23] I am also interested on timeline of your decision, as it helps me prepare the puzzle of reimages of backup sources :-)
[08:11:39] will keep you posted
[10:15:22] marostegui, sobanski: i'm about to respond to https://phabricator.wikimedia.org/T280599#7280595. i'm inclined to say that given how things are looking, i'll promote new machines in pc2 + pc3 (in codfw) during next week. then wait a week, for baseline metric collection. and then let discussion tools start rolling out.
[10:15:50] once that's done, _then_ we can start looking at increasing the retention time, as that's a very discrete setting
[10:16:32] if we did the other order - increase retention time back to 30d, and then rollout DT and discover that the combination isn't working for whatever reason, then we'd need to decrease retention time again, which takes longer
[10:18:30] that plans looks good to me yeah
[10:18:49] will you switch eqiad too next week?
[10:19:25] Makes sensie to me
[10:19:35] Sense even
[10:23:21] marostegui: i think it's less urgent, but yeah, either next week or the week after.
[10:32:53] yeah
[12:09:13] marostegui: alright, time to test out the semi-sync stuff in s2/eqiad. perfect thing for a friday afternoon.
[12:09:21] XDDDD
[12:13:54] WCPGW?
[12:14:18] Beat me to it
[12:14:19] uh?
[12:14:29] what's WCPGW?
[12:15:06] What could possibly go wrong
[12:15:22] ah ok...I guess I need an translator for those XD
[12:17:15] With varying emphasis on _possibly_ depending on how sarcastic you're feeling :)
[12:17:19] marostegui, YOLO
[12:17:24] :(
[12:18:33] and with YOLO, I mean Yolo, California
[12:20:49] marostegui: ok, starting state, before doing any moves: https://phabricator.wikimedia.org/P17019
[12:21:17] (yes i've written one of my famously artistic bash scripts for this)
[12:21:23] xddddd
[12:21:50] Now I had to check if Yolo, California actually exists
[12:21:54] jynus: you obviously love owls, right?
[12:22:02] Spoiler warning, it does
[12:22:30] kormat: do we get to pedant your script? :)
[12:22:44] Emperor: at your own peril, sure ;)
[12:26:21] first --only-slave-move run failed to move db1182. re-running with --timeout 45 🤞
[12:45:11] kormat: I've done things like https://github.com/wtsi-ssg/irods_migrate/blob/f3f8c95a34e4baf3ad6d61c3508ba5b49185f73e/mark_resource_unusable.sh#L52 so I'm not one to talk really :)
[12:46:43] * kormat gazes at that line very suspiciou... oh. oh my god. that's.. evil genius?
[12:47:05] or, maybe, https://github.com/wtsi-ssg/ceph-disk-utils/blob/dbec53ae865196199c4f15fdc602c83efc6a57ea/ceph_remove_failed_osd.sh#L101-L105 (which has 4 paragraphs of explanation beforehand)
[12:47:16] <-- overly partial to vile shell hacks
[12:48:13] * kormat winces
[12:48:17] bravo. ish. :)
[12:48:36] I'm starting to like you Emperor
[12:48:57] :D
[12:49:55] lol
[12:51:01] ok, second attempt worked. going to wait a bit before doing more.
[13:11:31] there, a bit has been waited
[13:13:38] aaand no luck reproducing the issue, sigh.
[13:15:02] ok. manually undoing everything like a sad panda 🥀
[13:20:13] soooo the codfw factor is the only one left?
[13:20:31] marostegui: db2104 is a more specific factor, but.. yeah.
[13:20:56] true
[13:21:24] so once we attempt the s7 switch we can either discard or confirm if it is db2104 or codfw
[13:21:55] "yaaay"
[13:23:06] "DBAs do not like codfw"
[13:26:47] Reedy: it might be mutual
[13:27:00] marostegui: on the "plus" side, i'm learning even more about mysql replication 😭
[13:27:37] Don't you love it?
[13:30:45] marostegui: i love it with the same magnitude and direction as my love for you
[13:31:00] So huge! Amazing!
[13:31:13] kormat: your joy is unconfined... in the lower bound?
[13:33:27] Emperor: precisely. :)