[06:53:25] creating a ganeti VM, cookbook step runs DNS update but the diff I get includes an unexpected change:
[06:53:29] -asw-a-eqiad 1H IN A 10.65.0.17
[06:54:54] I don't see a matching change in DNS repo so it should be from someone editing mgmt in netbox?
[06:55:54] mutante: https://netbox.wikimedia.org/extras/changelog/65107/
[06:56:06] and the icinga alert was alerting tonight
[06:57:15] volans: thanks! hmm.. it seemed unusual to delete only mgmt
[06:57:22] I think asw-a8-eqiad
[06:57:25] were offlined
[06:57:26] but looking at it .. Host asw-a-eqiad.eqiad.wmnet not found: 3(NXDOMAIN)
[06:57:32] so the main IP is already gone
[06:57:34] should be 2
[06:57:50] the actual one is asw2-a-eqiad.mgmt.eqiad.wmnet
[06:57:52] I guess then it can't hurt much and I can accept the diff
[06:57:58] but if XioNoX or topranks could confirm
[06:58:01] would be better
[06:58:05] yea
[06:58:06] *agree but...
[06:58:41] yeah the very old switch stack got decom
[06:59:06] aha:) So ok to go ahead and delete mgmt, right?
[06:59:28] yep
[06:59:36] thanks, doing!
[06:59:57] https://phabricator.wikimedia.org/T218734
[07:02:42] I would leave a comment to remember running the dns cookbook but I think the actual issue is jclark doesn't have the needed access
[13:20:57] legoktm, rzl: FYI I've released and deployed the latest Spicerack today, AFAICT the downtime/remove_downtime for services works as expected. Only nit is that it needs to match the whole string, because icinga-status uses fullmatch(), we should probably add that to the docstring.
[13:21:45] p.s. you got lucky for the live-test that I was releasing today anyway, I think that a bit more coordination could have helped here ;)
[14:02:37] jbond: why is pcc lying to me? :(
[14:04:13] e.g. https://puppet-compiler.wmflabs.org/compiler1002/31038/db2112.codfw.wmnet/index.html - the diff is lying about the current prod config
[14:04:37] looking
[14:05:13] if you compare with icinga, you'll see that the check already has #p.age appended to it
[14:13:23] I don't see the "#page" at https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db1163&service=MariaDB+read+only+s1
[14:13:33] db1163
[14:13:44] kormat: i think it is because it relies on mediawiki::state which reads a file directly from the puppet master /etc/conftool-state/mediawiki.yaml
[14:14:24] on the compilers we have "primary_dc: eqiad
[14:14:25] "
[14:14:47] mutante: sure?
[14:15:15] jbond: oh :/
[14:15:22] jbond: that's.. problematic
[14:16:02] kormat: yea, no string "page" on the host overview for that one? https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=db1163
[14:16:13] mutante: sorry, let me rephrase
[14:16:18] mutante: yes, that's expected.
[14:16:24] the related CR is actually going to change that
[14:16:31] but that's been the standard up until now
[14:17:00] then the compiler says that it's going to change
[14:17:12] I just looked at the first host
[14:18:10] kormat: https://phabricator.wikimedia.org/T290665 ill see if there is a quick fix later today/tomorrow
[14:18:10] if it's just about codfw and not eqiad.. then ACK
[14:20:36] jbond: wonderful, thank you!
[14:27:29] kormat: no probs
[14:45:17] kormat: intermittent "(Can Not Connect to MySQL)." on Phabricator
[14:46:02] mutante: thanks, looking.
[15:10:43] volans: nod -- full-match is the expected behavior but you're right, it's only documented on the icinga-status side, will send a fix
[15:10:56] thx!
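To illustrate the fullmatch() nit volans mentions above: Python's re.fullmatch() only succeeds if the pattern consumes the entire string, so a partial service name will not match an Icinga service description. A minimal standard-library sketch of that behaviour (illustration only, not the actual spicerack/icinga-status code):

    import re

    service = "MariaDB read only s1"
    # fullmatch() requires the whole string to match...
    print(re.fullmatch("MariaDB read only s1", service) is not None)  # True
    print(re.fullmatch("MariaDB read only", service) is not None)     # False
    # ...whereas match() would accept a prefix, which is what a caller might expect.
    print(re.match("MariaDB read only", service) is not None)         # True

Hence the note above: when removing a downtime by service name, the full Icinga service description has to be passed.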
[15:54:55] volans: thanks, let me add that to our /Coordination subpage
[16:01:26] use the following command to view the live-test dc switch on cumin1001: tmux attach -t dc-switch-live-test
[16:01:51] can't find session dc-switch-live-test
[16:01:56] sudo?
[16:02:02] as root
[16:02:05] ok :)
[16:02:22] jelto: did you want the rest of us to use -r as well so that our tmuxes are read-only?
[16:02:48] I think most of you can use -r
[16:04:57] for non-roots, we have a google meet with screensharing, I'll put the link in _security
[16:05:14] (but all the discussion will happen here, on IRC)
[16:05:30] no live-commentary? :-P
[16:05:30] the services switch doesn't have a --live-test mode
[16:05:39] we can --dry-run it though
[16:06:00] like so?
[16:06:01] (which would go before the cookbook name iirc)
[16:06:08] before the cookbook name
[16:06:11] is a cookbook option
[16:06:16] not the specific cookbook one
[16:06:21] *it's a general cookbook option
[16:06:22] 👍
[16:06:33] cookbook -h
[16:06:41] and cookbook sre.switchdc.services -h
[16:06:46] are always your friends ;)
[16:07:00] is there a reason services doesn't have a --live-test? should we add one?
[16:07:01] I tried that and I just got "#--- Switch Datacenter for Services args=['-h'] ---#"
[16:07:11] jelto: lgtm
[16:07:19] rzl: ah, because services is a directory
[16:07:23] right
[16:07:33] dry_run=True, all good in that sense
[16:07:53] so 00-reduce-ttl-and-sleep would be the first cookbook we would like to execute, right?
[16:07:56] that's a good question if we should support -h for directories, but offtopic
[16:08:01] legoktm: I think --live-test wouldn't actually do anything different from just running the full cookbook, volans can check me
[16:08:14] assuming you did it in the same direction, e.g. eqiad->codfw right now
[16:08:17] indeed, if you run it with DC inverted
[16:08:23] would be the same AFAIR
[16:08:36] as it should be a noop
[16:08:39] mediawiki live-test has to do things like skip setting RO, but for services we don't have any steps like that
[16:08:42] unlike the mediawiki one
[16:08:46] right
[16:08:52] jelto: yep
[16:09:43] okay then I will execute the first cookbook now 00 for services
[16:09:50] 🚀
[16:10:47] I think we need to exclude "mwdebug" from this
[16:10:55] we should add elevator music to that cookbook
[16:13:48] where is the "label swift did not match regex..." coming from?
[16:15:29] conftool apparently
[16:16:06] should I wait because of the swift error or can this be investigated later?
[16:16:15] no, you can keep going
[16:16:30] legoktm: in dry-run mode verbose is also activated
[16:16:33] that's the expected output since swift isn't one of the services we're switching
[16:16:34] and conftool is pretty verbose
[16:16:45] ack
[16:16:50] * volans not sure at which line you're looking at though
[16:16:50] okay then I will keep going with 01-switch-dc
[16:17:03] volans: is all of that output logged to a single file somewhere? I see it isn't in -extended.log
[16:17:12] yes it should
[16:17:21] for the directory though
[16:17:30] oh no yep there it is
[16:17:45] /var/log/spicerack/sre/switchdc/services-extended.log
[16:18:46] in the real thing, here's where we'll stop and make sure everything is still healthy
[16:18:57] before running 02-restore-ttl I mean :)
[16:19:18] makes sense :)
[16:19:37] woot
[16:19:50] so, if we want to do a services live test, we can rerun the same thing without --dry-run
[16:19:57] I don't know that we need to do that, though
[16:20:17] there's not much complexity there and I don't think we've touched it since last time, right?
[16:20:48] the only issue I noticed is that mwdebug needs to be added to MEDIAWIKI_SERVICES so it gets excluded (and also added in the MW cookbooks)
[16:20:49] so now I keep going with sre.switchdc.mediawiki cookbooks
[16:21:05] legoktm: yeah, smart - want me to write up a task, or will you just remember?
[16:21:52] I wrote it down in my notepad
[16:21:55] 👍
[16:22:16] jelto: I'll defer to others but I'd be inclined to run the MW cookbooks with --dry-run first, and then --live-test after
[16:22:30] +1 on --dry-run first
[16:22:57] like that? cookbook --dry-run sre.switchdc.mediawiki eqiad codfw --live-test
[16:23:16] I think only --dry-run, I'm not sure if we've ever tested --dry-run and also --live-test?
[16:23:26] I guess there's no reason it wouldn't work
[16:23:45] volans: ^? now I'm curious
[16:24:05] (but the more useful test is probably just one or the other)
[16:24:14] args lgtm
[16:25:30] dry_run is embedded in all the spicerack modules and in some cookbooks that do something specific
[16:25:30] ok I will start 00-disable-puppet now, ping me to stop :)
[16:25:42] will surely "work" using both options for some definition of work
[16:25:51] depends how the cookbook or libraries use the results
[16:25:56] of things that don't change
[16:26:06] nod
[16:26:08] jelto: go ahead
[16:28:50] heh, the real advantage of slower warmup scripts was we didn't have to sit around waiting for this in dry-run mode :P
[16:29:34] this time.sleep is the last line of the cookbook btw, so if we were really impatient we could just ^C out of it
[16:29:38] I don't feel strongly though
[16:29:47] jelto: I think you can ctrl+c...yeah, what rzl said
[16:30:08] everything else looks good, modulo legoktm's point about mwdebug
[16:30:39] the ERROR is just from the ctrl-c, we're all good
[16:31:10] jelto: wait
[16:31:20] this would be a scary question since codfw is the wrong DC, but we're in dry-run
[16:31:24] oh
[16:31:27] forgot about dry run
[16:31:33] so it won't actually warm anything up
[16:31:33] jelto: continue :)
[16:31:54] yeah, I thought we were in --live-test inversion mode and was expecting eqiad
[16:32:00] (when we live-test instead, it'll actually do the warmup, but it'll swap DCs for this step so that we warm up the passive DC instead of the active one)
[16:32:06] > Warmup completed in 0:00:00.000162
[16:32:15] efficiency!
[16:32:19] lol
[16:32:30] and I see six runs as expected
[16:32:54] we should maybe clarify that `The script will re-run until execution time converges.` log line, but not a blocker
[16:33:53] those cumin failures are expected in dry-run, we didn't actually kill any processes so they're still running
[16:34:19] yep
[16:34:45] ok the next cookbooks should be executed rather quickly, right? because 02-set-readonly has user impact
[16:34:55] during the real thing, yes
[16:35:43] instead of stopping for approval at each step, you'll stop before 02- and then continue all the way through 07- without asking -- instead just keep an eye on IRC, and stop if we're all yelling stop :)
[16:35:58] plus probably stop if the output is full of errors or whatever
[16:36:12] alright
[16:36:52] (but there's no user impact expected today, during either --dry-run or --live-test, just to be clear)
[16:37:19] sorry hit ctrl+c :(
[16:37:32] no worries, you can restart from where you were
[16:37:54] all you lose is the nice [PASS] [ERROR] states, so nbd
[16:38:04] output lgtm so far but I haven't been checking individual DB hostnames or anything
[16:38:29] legoktm: oh, did you happen to verify we did the right thing wrt x2?
[16:38:56] I guess that was before the last switch so it's already tested, but
[16:39:12] * legoktm looks
[16:39:42] jelto: hang on a sec :)
[16:39:48] ack
[16:41:19] it should've just excluded x2 entirely but I do see it being queried in the logs
[16:42:21] hrm
[16:42:34] no, that's just for the heartbeat part
[16:42:41] ahh, I was about to ask
[16:43:39] lgtm, it did not try to set it read only
[16:43:53] 👍
[16:44:08] so continue with 06-set-db-readwrite?
[16:44:19] +1 from me
[16:44:35] yep, fire away
[16:46:40] cool, and in the real thing this is where we'll stop again for a while
[16:47:08] to make sure everything is basically in good shape, and switch back quickly if we need to (unlikely)
[16:47:58] go ahead when ready though
[16:50:02] woot
[16:50:40] the only note I have is rzl's comment about improving the message about the warmup script
[16:50:52] no reason we have to do that before Tuesday IMO
[16:51:14] indeed
[16:51:49] https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Phase_8_-_post_read-only says "The parsercache hosts and x2 will need to manually be updated in tendril"
[16:51:57] but I see
[16:51:59] 2021-09-09 16:49:53,540 DRY-RUN jelto 9449 [DEBUG remote.py:651 in _execute] Executing commands ['mysql --skip-ssl --skip-column-names --batch -e "UPDATE shards SET master_id = (SELECT id FROM servers WHERE host = \'db2142.codfw.wmnet\') WHERE name = \'x2\'" tendril'] on 1 hosts: db1115.eqiad.wmnet
[16:53:32] I think my patch related to x2 may have accidentally fixed this? cc: kormat, marostegui ^
[16:54:32] ready for live test?
[16:55:04] don't forget this will make some noise in the SAL, worth !logging something ahead of it
[16:55:27] blah blah live test blah blah no real user impact expected but we're monitoring blah blah etc
[16:55:38] so DC_FROM eqiad -> DC_TO codfw is correct now?
[16:57:52] eqiad -> codfw is correct yep
[16:57:55] args lgtm
[16:58:00] +1
[16:59:50] the good news is the service names are in the cookbook, so the mwdebug fix doesn't require a spicerack release afaict
[17:04:03] yep :)
[17:05:18] cache warmup in eqiad is correct -- we might see some appserver latency alerts in eqiad, they're okay
[17:05:35] +1
[17:06:48] you can see the spike at https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200&from=now-1h&to=now
[17:08:09] first warmup took 30 seconds, all the rest took ~15
[17:09:21] I continue with 02-set-readonly ok?
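As an aside on the warmup behaviour observed above (six runs; the first took ~30 seconds, the rest ~15): a minimal sketch of what "re-run until execution time converges" could mean in practice, assuming a simple stop-when-no-longer-improving loop. Function names and the 10% threshold are illustrative assumptions, not the real warmup script:

    import time

    def run_until_converged(warmup, max_runs=10, improvement=0.10):
        """Re-run the `warmup` callable until a run is no longer
        meaningfully faster than the previous one (here: <10% faster)."""
        previous = None
        for run in range(1, max_runs + 1):
            start = time.monotonic()
            warmup()
            elapsed = time.monotonic() - start
            print(f"warmup run {run}: {elapsed:.1f}s")
            if previous is not None and elapsed > previous * (1 - improvement):
                break  # converged: caches are warm, runs have stabilized
            previous = elapsed

The actual cookbook's convergence criterion may differ; the point is only why the first run is slow (cold caches) and the later ones settle to a stable time.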
[17:10:50] I still see php processes on mwmaint2002, do we skip killing them in live-test mode?
[17:10:53] checking
[17:11:06] yeah we do skip it, okay
[17:11:13] yeah, since it's active
[17:11:46] however all the systemd jobs have correctly disappeared from mwmaint1002
[17:12:14] 👍
[17:12:22] jelto: lgtm to continue
[17:12:49] jelto: go ahead and practice doing these steps one-after-another without pausing, if you want
[17:13:09] and also practice watching in here in case we yell stop :)
[17:13:16] ack
[17:14:22] nice
[17:14:26] woot
[17:15:46] for today I can continue right? Next week we would look a little bit more if everything seems healthy?
[17:16:26] yeah, here we would test edits, look at dashboards, etc
[17:16:44] we don't want to wait *forever* with maintenance disabled, but we would pause and check things out
[17:16:53] there was an icinga alert for scs-c1-eqiad.mgmt.eqiad.wmnet rebooting just now
[17:17:03] yeah I saw that, it has to be coincidence though right?
[17:17:48] topranks, XioNoX: around? ^^
[17:17:55] I guess we could rerun and see if it reboots again :P hard to imagine though
[17:18:30] yeah, I think it has to be a bad coincidence
[17:18:58] should I rerun some cookbook or continue?
[17:19:00] the only major traffic we sent was the warmup script, and that finished minutes before the alert fired
[17:19:03] what did happen in parallel?
[17:19:08] but yeah
[17:19:19] we're running the DC switchover live test (eqiad -> codfw)
[17:19:59] but it runs the warmup process against eqiad to avoid sending a bunch of requests to codfw, impacting real traffic
[17:20:15] yeah coincidence
[17:20:37] ok, thanks for looking :)
[17:20:51] jelto: I think you can continue now
[17:20:54] +1
[17:21:17] legoktm: rzl https://phabricator.wikimedia.org/T238036#7342571
[17:21:49] oh, perfect
[17:23:36] legoktm: have we ever talked about running 08-start-maintenance first, and moving all the other 08- cookbooks to 09-?
[17:23:49] since there's more and more stuff in phase 8 now
[17:24:15] cc volans if you're still nearby, and happen to have context on it ^
[17:24:19] no, but I think that would be reasonable
[17:24:33] I think the jobrunner step should probably be 08 still though
[17:24:40] mm that's true
[17:24:50] and it's quick anyway
[17:25:10] I think 08 was the catchall for all the cleanup stuff after we're back in RW and safe and sound
[17:25:11] so 08 is "get everything MediaWiki running again" and 09 is updating other things and resetting TTLs
[17:25:19] no problem to add additional steps
[17:25:28] the TTL should be the last probably
[17:25:33] yeah agree
[17:25:48] if you have some priority post-RW steps
[17:25:53] and I think in particular, maintenance scripts are the last thing people will actually be *waiting* for
[17:25:55] leave them in 08 and move the others in 09
[17:26:27] not nearly as much as they're waiting for read-write, but it's still a little time-sensitive especially until the WDQS dispatcher rewrite
[17:27:15] yeah, we won't be at 100% edit rate if maintenance doesn't update the WDQS lag status, which puts a hold on all Wikidata bots
[17:27:47] so I guess the only question is, do we want to do this before Tuesday or not
[17:27:59] we should probably start holding off on last-minute changes, but the rename is pretty low-risk IMO
[17:28:40] I think as long as someone dry runs it before Tuesday then it's fine
[17:29:09] +1
[17:29:15] cool
[17:29:27] imagine! phase nine. what a time to be alive
[17:30:00] nice job jelto :)
[17:30:05] everything else lgtm
[17:30:12] and yeah +1, smoothly operated
[17:30:18] we'll have a 5 phase lead on the MCU once again
[17:31:58] thanks for the support! I leave the tmux session open for a bit in case somebody wants to check the backlog/output?
[17:32:17] I think no need, it's all in /var/log/spicerack/sre/switchdc/mediawiki-extended.log
[17:32:38] btw you should ask people joining to keep the terminal large enough
[17:32:45] okay then I will close the session
[17:32:46] as tmux will resize to the smallest one
[17:32:57] for your readability
[17:33:30] he tried using https://wikitech.wikimedia.org/wiki/Collaborative_tmux_sessions#Create_sessions_with_fixed_size but it didn't seem to work
[17:33:35] i tried some fancy tmux setting which forces the screen size but it was not working. So yes I will put a disclaimer next time to use bigger screens
[17:33:48] lol, great for trying
[17:33:52] that only works as of bullseye I think :(
[17:34:04] at least if it's the same setting that I was looking at last time
[17:34:26] cumin2002 is on bullseye
[17:35:02] really!
[17:35:05] hmmmm.
[17:35:12] since a while!
[17:35:47] but do we need to do another dry-run / live test on cumin2002 then first to be sure everything is working as expected?
[17:38:16] heh, I think we should do the actual switchover from cumin1001, and try bullseye next time
[17:38:34] why?
[17:39:00] do you think it's not that big of a change? or have people been running cookbooks from 2002 regularly?
[17:40:09] cumin2002 is a regular cumin host used by people and has been on bullseye since before the summer. The /var/log/spicerack directory was created on July 7th
[17:41:11] ok, I take that back then
[17:42:36] I can run the --dry-runs on cumin2002 if you renamed/added stage 9 before Tuesday if that helps. But I would also be fine just to stick with cumin1001. The screen size issue is not too bad
[17:43:11] you say that now ;) it turns out to be a lot more of an issue when a couple dozen people are connected, somebody always has a 10x24 terminal or something
[17:43:25] just because they don't realize it's important, easy mistake to make
[17:44:24] we'll just have to nag people again, like we did last time :p
[17:44:43] but I think doing the dry-runs to verify the stage 9 stuff on cumin2002 sounds good
[17:44:45] haha yep
[17:45:01] okay :)
[17:46:11] I can send a CR for the renames but I'll be offline this afternoon and tomorrow, so I won't be able to help test unless we do that on Monday
[17:46:34] (also happy to let someone else do that rename too, no reason it has to be me)
[17:48:21] I'm also not sure if https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/718936 will land before Tuesday or not, but no big deal either way
[17:57:35] mailed https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/720075/ but feel free to merge it without me, I'll be offline for the week in about 1h
[17:58:26] I'll take a look shortly
[17:58:57] and j.elto and I just tested, the tmux on cumin2002 has the feature to force the screen-size to the person running the commands
[18:00:14] I can do the dry-runs on Friday again, also on cumin2002 if that's ok for you.
[18:00:51] but I might ask around if someone more experienced wants to look over my shoulder :)
[18:01:15] sounds good to me
[18:01:43] perfect
[18:02:27] I'm out for today o/
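One sketch of why the proposed 08 -> 09 rename discussed above is low-risk but still changes behaviour: the switchdc steps encode their intended order in the numeric prefix (00-, 01-, ... 08-), so renaming the post-read-write cleanup steps from 08-* to 09-* is enough to push them after 08-start-maintenance, assuming the steps are run in ascending prefix order as they are named and documented. A toy illustration only; apart from 08-start-maintenance, 02-set-readonly and 00-disable-puppet, the step names below are hypothetical placeholders, not the real cookbook filenames:

    # Hypothetical step list; sorting by name shows the resulting order.
    steps = [
        "09-restore-ttl",        # hypothetical: TTL restore moved to the end
        "08-start-maintenance",  # stays in phase 8 so maintenance (and WDQS lag updates) resume ASAP
        "09-update-misc",        # hypothetical renamed cleanup step
        "02-set-readonly",
        "00-disable-puppet",
    ]
    for step in sorted(steps):
        print(step)
    # 00-disable-puppet, 02-set-readonly, 08-start-maintenance,
    # 09-restore-ttl, 09-update-misc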