[05:33:50] dhinus: https://phabricator.wikimedia.org/T365424#9980124
[07:02:12] volans: Let me know when you want to do https://phabricator.wikimedia.org/T369882, there's no time constraint, it can be done anytime today, so whenever works for you, just let me know. No rush
[07:13:13] marostegui: give me just the time to go through the procedure in the task and I'll ping you
[07:21:54] volans: no rush, take your time
[07:24:05] marostegui: ack, using 1021 as a testbed is a good idea!
[07:24:41] dhinus: cool, if you want to coordinate with btullis, that'd be great (I pinged him in the decom task, so he doesn't do it until he's synced with you)
[07:25:00] marostegui: for questions should I just ask here or better in private to not pollute the channel?
[07:25:36] marostegui: ok
[07:26:44] volans: arnaudb will answer them, as it is a good practice, I'll be reading and if there's something not clear, I can try to help
[07:27:36] sounds good
[07:29:14] dhinus: do you have any DB running stretch?
[07:29:19] I'd like to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/761927
[07:30:34] dhinus: nevermind, it was already done in a different patchset
[07:30:59] ok cool :)
[07:31:11] dhinus: anything still running buster?
[07:31:42] arnaudb, marostegui: so looking at the preparatory step of checking the config I noticed that the new master has 10.6.16 while the old one 10.6.17 and the latest in the fleet is 10.6.18. Should we consider to upgrade it before switching?
[07:32:50] marostegui: that's more likely, let me check
[07:33:10] dhinus: Thanks (it is EOL btw)
[07:33:23] volans: usually we don't bother, but you can try to upgrade the host before to experience a bit more of the server's lifecycle
[07:33:53] volans: I'd prefer if we don't, we try not to have a master running a higher version than the replicas (it is not a big deal and will most likely run fine, but just in case)
[07:34:20] it is already like that :D
[07:34:33] some replicas are .16 and the current master in codfw is .17
[07:34:50] ack, leaving as is
[07:35:01] volans: yeah, but most of them are 17. But anyway, you can if you want, it is not a big deal
[07:35:23] nah, let's dilute the excitement, not all together :)
[07:35:29] dhinus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054271
[07:35:39] dhinus: Gave it -2 for now
[07:36:22] marostegui: we have a task to upgrade or remove all buster hosts but there are a few left. I think none of those is using the db puppet classes though.
[07:36:39] https://os-deprecation.toolforge.org/
[07:36:44] dhinus: I can try to merge and if not...we can revert, whatever you prefer
[07:38:24] should be safe but let me do a quick search for that puppet class
[07:38:32] thanks
[07:41:37] arnaudb, marostegui: ok the procedure looks good. Do we need the final steps too? (From (If needed): Depool db2121 for maintenance.)
[07:42:04] last bit is that one of the final steps says "Change db2121 weight to mimic the previous weight db2218:" but there is no step that says save the current weight :D
[07:42:45] for the rest all looks good and fairly trivial (if there is no issue :D )
[07:44:58] volans: we rely on phabricator pastes for missing info usually, but it's indeed a missing instruction :D
[07:45:16] should I add it to switchover-tmpl.py ?
[07:45:31] sure!
[07:46:28] we don't use switchover-tmpl.py anymore do we?
[07:46:35] doh
[07:46:50] volans: you'd need to add it to https://gitlab.wikimedia.org/toolforge-repos/switchmaster/-/tree/main/switchmaster?ref_type=heads
[07:46:52] those templates
[07:46:56] that's what I found with codesearch, I hoped it was used by the toolforge tool too
[07:47:06] ack
[07:47:06] thx
[07:56:33] template changes for later
[07:56:45] arnaudb, marostegui: I think we can start whenever you want. I've created a tmux on cumin2002, you can attach with: sudo -i tmux attach -rt T369882
[07:56:46] T369882: Switchover s7 master (db2121 -> db2218) - https://phabricator.wikimedia.org/T369882
[07:57:39] * arnaudb is attached
[07:57:58] the only request I have is that because I'm also oncall, if anything else happens I might ask you to take over
[07:58:35] ack, be sure to mention the step/state at which you left things, no problem otherwise!
[08:00:03] volans: I will be ready in like 10 minutes, but arnaudb has done this a million times, so there's no need to wait for me anyway
[08:00:59] ok
[08:02:20] from the config diff, anything special I should check? the semi-sync bits are expected and there is no statement/row line so they have both the same (and I've checked on orchestrator they have both statement)
[08:02:59] I generally try to spot any major difference (i.e. version, semi_sync, missing field)
[08:03:19] do you see anything unexpected here?
[08:03:47] nothing raising my attention outside of what you said about minors
[08:04:14] ok to proceed then?
[08:04:19] yes!
[08:04:49] why do we downtime also eqiad's hosts?
[08:05:19] I think this is a safety measure to avoid alarms while running the procedure, given that we're supposed to be monitoring the situation
[08:05:30] k
[08:05:30] but I'd be glad to have a refresher from marostegui on this ↑
[08:06:15] volans: I checked earlier today and it was all as expected
[08:06:21] so you're good
[08:06:25] <3
[08:08:38] arnaudb: let me know how often you want me to stop before proceeding ;)
[08:08:53] I keep an eye on your session volans
[08:08:59] if you don't see me scream, that's ok
[08:09:44] lol, ok, but I wouldn't know if you stop looking :D
[08:10:24] I keep the terminal on display so I don't! if it moves/stops moving I check
[08:11:31] edit looks good?
[08:11:36] yup
[08:11:39] I see that the candidate master bit is changed later
[08:12:02] it is still a candidate master!
[08:12:20] correct
[08:12:21] the coup is yet to happen
[08:13:43] how safe is this magic box? :D
[08:13:51] a lot to very
[08:14:06] I just monitor on orchestrator that it does what it promises?
[08:14:25] indeed, I also tend to keep a tab open with alertmanager for peace of mind
[08:14:45] any specific filter on AM?
[08:15:11] I usually discard warnings and try to narrow down to DP
[08:15:18] nothing fancy
[08:16:12] ok
[08:17:08] proceeding with the replicas move
[08:17:21] great start :D
[08:17:30] it failed on the select @@read_only;
[08:17:42] first time I see this 🤔
[08:18:41] they are both RO because we're in codfw, so I guess all expected
[08:18:49] yep
[08:18:59] that's why we pass --read-only-master
[08:19:08] btw $ sudo -i tmux attach -rt T369882
[08:19:08] no sessions
[08:19:08] T369882: Switchover s7 master (db2121 -> db2218) - https://phabricator.wikimedia.org/T369882
[08:19:17] on cumin2002 marostegui
[08:19:20] not 1002
[08:19:23] ah! thanks!
[08:19:24] (i did the same :p)
[08:19:26] does it have some debug logging saved somewhere?
I don't see any debug option
[08:19:51] nah it's quite rigid afaik
[08:19:58] I haven't had to debug with it yet
[08:21:26] you guys want to try yourselves or you want me to help with the troubleshooting?
[08:21:27] it failed here https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy/-/blob/main/wmfmariadbpy/cli_admin/switchover.py#L107
[08:21:55] I've executed manually the query on both and it's ok
[08:22:55] marostegui: is it possible it's just a credential issue for the way I'm running things?
[08:23:14] volans: I just checked and I can reach them fine
[08:23:18] (sudo -i)
[08:24:28] I would retry the command just to ensure it is a hard failure and not transient
[08:24:28] volans: can you try from cumin1002?
[08:24:30] so we can isolate?
[08:24:33] sure
[08:24:35] volans: that also works yeah
[08:24:49] https://phabricator.wikimedia.org/P66476 for what it's worth
[08:24:53] ok same error again on cumin2002 trying on 1001
[08:24:55] *1002
[08:25:10] yep I've checked that too
[08:26:27] created same tmux on 1002 too
[08:26:43] * arnaudb (attached)
[08:26:49] ready to run once you're attached
[08:27:05] go!
[08:27:21] and here it works
[08:27:27] that's interesting
[08:27:35] we need to check what it is
[08:27:48] indeed
[08:27:53] it is worrying
[08:28:07] let's finish the switch and then we can double check
[08:28:11] sure
[08:29:09] it has started to move hosts
[08:31:40] why doesn't the lag of the replicas go down after the move?
[08:31:52] dhinus: I've merged, if you see something strange let me know
[08:32:34] volans: because the replication on the master is still stopped (as expected)
[08:32:40] once it is started, it will go down
[08:33:21] marostegui: my "quick search" became a bit longer as I got lost trying to use Cumin :D volans: is there a way to do something like P:mariadb::packages_wmf with cloudvps hosts?
[08:33:26] sure, I meant when started, I see it stops and starts for each host, but apparently the others caught up, just db2220 is a bit slower to catch up
[08:33:47] sometimes transactions take a bit more time to be handled
[08:34:03] there is no real predictability on replag recovery
[08:34:09] dhinus: https://openstack-browser.toolforge.org/puppetclass/profile::mariadb::packages_wmf :-P ???
[08:34:34] also T179816
[08:34:34] T179816: Cumin: create external backend for WMCS Puppet API - https://phabricator.wikimedia.org/T179816
[08:37:03] volans: I'm not sure that page is actually accurate, it only includes direct references, not indirect ones through other classes
[08:37:12] I know :D
[08:37:40] * dhinus subscribes to T179816
[08:37:43] for projects with their own puppetmaster and puppetdb you can run cumin within the project and use the same query (like deployment-prep)
[08:37:53] dhinus you have to implement it, not subscribe :D
[08:38:05] it's not in the TODO, there was no interest by WMCS on that at the time
[08:38:23] and if it doesn't report properly the things it might be misleading
[08:38:23] * dhinus likes, subscribes and hits the bell button
[08:38:47] replicas move status: 3 to go
[08:38:54] sounds like volunteering to me, dhinus ;p
[08:39:19] +1
[08:41:42] last one to go
[08:42:12] Emperor: hahaha
[08:42:38] volunteering for many things equates to volunteering for none :P
[08:43:53] looks good volans
[08:44:15] the semi-sync warning is expected?
[08:44:26] yeah it's not enabled everywhere
[08:44:34] orchestrator looks good to me
[08:44:57] nothing screaming in AM
[08:45:16] yep, everything looks ok!
[08:46:03] ok continuing with the safe stuff
[08:46:45] the patch is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1053827, review welcome :)
[08:47:15] +1ed!
[08:48:22] ack merging
[08:48:57] next steps are faster than the first ones
[08:49:08] ack
[08:49:49] puppet merge completed
[08:50:38] are we a GO for the actual switch?
:)
[08:50:51] lgtm volans ! jump when you feel like it :)
[08:51:27] ack going
[08:51:54] 🤞🤌
[08:52:12] :D
[08:52:22] seems all good, saying yes
[08:52:37] arnaudb: looks good to you?
[08:52:46] yep!
[08:53:38] famous curl printed here
[08:53:52] yes but for later
[08:54:01] (it's in the doc now, you can let it slide)
[08:54:31] lgtm
[08:54:36] monitoring confirms
[08:54:58] agree
[08:55:19] orchestrator too, just a bit of lag
[08:55:40] although show slave status showed 0, but I guess they measure it in a different way
[08:55:54] ok to proceed with dbctl?
[08:56:23] yep
[08:56:45] LGTM
[08:56:50] same!
[08:57:15] running puppet
[08:57:50] after the curl, replag seen on orchestrator will be trustworthy again
[08:57:56] ok
[08:58:08] atm it's all askew
[08:58:45] cleanup tasks
[08:59:48] orch back in order
[08:59:57] nice
[09:00:17] isn't a dbctl commit missing?
[09:00:32] sudo dbctl config commit -m "Depool db2121 T369882" at the end
[09:00:33] T369882: Switchover s7 master (db2121 -> db2218) - https://phabricator.wikimedia.org/T369882
[09:00:34] but indeed
[09:00:42] ok waiting
[09:00:48] nah go for it
[09:00:55] it's a missing instruction
[09:01:11] oh no, wait
[09:01:14] it's a noop change
[09:01:18] so you don't have to commit
[09:01:25] right it's just metadata
[09:01:32] noop for mw config
[09:01:35] yep
[09:02:00] zarcillo seems correct
[09:02:05] agreed
[09:02:17] do we need maintenance on db2121?
[09:02:38] don't bother, schema change will depool it automatically
[09:03:14] but if I set the weight without depooling it will be pooled directly with full weight right?
[09:03:33] and it has only warm cache as a master, not as a replica (I guess query patterns are quite different)
[09:03:54] the weight at which it's pooled will help warming up the cache
[09:04:22] but it's depooled atm
[09:04:31] volans: yes, the order is correct
[09:04:34] We should first depool
[09:04:38] Then change the weight
[09:04:46] And then repool (ideally in small steps)
[09:04:52] ok so the (If needed): should be removed?
[09:04:59] Otherwise you are going from 0 to who knows what
[09:05:04] exactly
[09:05:48] volans: so normally the switchover is done for a schema change, so you can just depool, change the weight and ping whoever needed the schema change (me)
[09:05:56] So I'll run the schema change and repool when done
[09:06:06] ack
[09:06:20] I'm restoring the old weight including API ok?
[09:06:26] yep
[09:07:50] arnaudb: LMK if edit looks ok
[09:08:07] lgtm!
[09:09:43] edit does the commit too, weird, there was no commit message
[09:10:16] it's a noop change iirc
[09:10:20] since the host is depooled
[09:10:24] so no commit here
[09:11:26] the diff is there https://phabricator.wikimedia.org/P66478
[09:11:29] and it did commit
[09:11:41] I just didn't recall that dbctl edit does commit automatically
[09:11:47] no two step as usual
[09:12:01] sorry I'm stupid
[09:12:06] mixing lines in the output
[09:12:14] volans: we forgive you for that
[09:12:31] ofc it's a noop
[09:13:04] marostegui: all done, all yours, task updated until the schema change
[09:13:21] volans: you can just remove that step and close the task as fixed
[09:13:29] thank you for doing it :)
[09:14:34] thank you, it was helpful and we found an issue from cumin2002 :D
[09:14:45] yeah, volans arnaudb can you try to see what the issue there could be?
[09:15:26] sure
[09:16:01] thanks :*
[10:03:05] if I read correctly the code, db-switchover does ask for confirmation only after the slave move, so a "safe" command to test it could be this one (that I guess will fail anyway in the preflight because the new master is not a replica of the old one)
[10:03:11] db-switchover --replicating-master --read-only-master --skip-slave-move db2121 db2218
[10:03:19] (same command of the switch we did today)
[10:03:38] * volans will also put a print and exit, but to be on the safe side
[10:06:41] [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain
[10:07:31] {hotfix removed, will send patch}
[10:08:00] volans: should we capture this on a task so there's a record of it?
[10:10:47] yeah
[10:10:52] I'll open one
[10:10:53] creating the ticket
[10:11:01] ah
[10:11:08] go ahead
[10:11:11] ack!
[10:13:06] https://phabricator.wikimedia.org/T370029
[10:15:09] thx updated
[11:01:17] hello everyone, back in the office. let me know here or in private if I need to do anything for you!
[11:01:28] welcome back!
[11:01:36] kwakuofori: welcome back
[11:01:45] thanks, volans! welcome to you!
[11:01:45] hi. Good break?
[11:01:56] marostegui: thanks!
[11:02:09] Emperor: it was indeed! thanks for asking
[11:02:22] 👍
[11:33:42] my findings so far: https://phabricator.wikimedia.org/T370029#9980901
[12:37:46] volans: it looks like it relates to this issue: https://phabricator.wikimedia.org/T355157
[12:39:01] arnaudb: in which way?
[12:40:19] we had an issue with the certificate bundle on orchestrator, which was the origin of the verify=False flag
[12:43:11] I've tried with True and works fine for me
[12:45:00] ack
[14:50:17] fyi I've restarted s6's dump, it had failed because of T367781
[14:50:17] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
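
On the CERTIFICATE_VERIFY_FAILED error from the 10:06 message and the verify=False / verify=True exchange around T370029: a minimal sketch of what the two settings mean at the TLS level, using only Python's standard ssl module. The function name build_tls_context and its ca_bundle parameter are hypothetical and are not taken from wmfmariadbpy or the actual patch. It only illustrates that a verifying client needs a trust store containing the issuing CA, otherwise the handshake is rejected with an error of the kind quoted above.

```python
import ssl
from typing import Optional


def build_tls_context(verify: bool = True,
                      ca_bundle: Optional[str] = None) -> ssl.SSLContext:
    """Return a client-side TLS context (hypothetical helper, for illustration).

    verify=True checks the server certificate chain and hostname against the
    system trust store, or against ca_bundle when a path is given.
    verify=False skips both checks and is at best a temporary workaround.
    """
    if verify:
        # create_default_context() enables CERT_REQUIRED and hostname checks.
        return ssl.create_default_context(cafile=ca_bundle)
    context = ssl.create_default_context()
    # check_hostname has to be disabled before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


if __name__ == "__main__":
    strict = build_tls_context()
    lax = build_tls_context(verify=False)
    print("strict:", strict.verify_mode, strict.check_hostname)
    print("lax:", lax.verify_mode, lax.check_hostname)
```

With the strict context, a server whose chain ends in a CA missing from the trust store fails the handshake, which is why a host with a different CA bundle can behave differently from another host running the same command.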