[02:09:22] if you'd like to see the systemctl/awk/xargs pipeline I ended up with https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/701053
[07:44:45] jbond: is it possible to annotate `Hosts:` lines in git commits with comments?
[10:54:39] kormat: not 100% sure what you mean can you give an example (although i think the answer is no)
[11:09:27] jbond: so for example if you look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/701335
[11:09:39] each Hosts: line contains machines from different db sections
[11:09:53] but you can't tell that from looking at it. you have to manually look up all the hostnames to see it
[11:10:11] but if i could put, say, `# pc1` at the end of a line, i could indicate to humans what the line contains
[11:12:17] kormat: ahh i see, that is not supported but it may not be too hard to add. that specific bit of parsing is done by zuul, let me try and find a pointer
[11:12:28] i found where it is
[11:12:32] i think i can send a patch
[11:12:55] ahh cool happy to review but i dont have +2 on that repo
[11:13:22] https://phabricator.wikimedia.org/source/integration-config/browse/master/jjb/operations-puppet-catalog-compiler.yml$56
[11:13:43] yep thats it
[11:14:44] kormat: looking at that it would work as is if the comments were above the Hosts: line as opposed to on the end
[11:15:10] jbond: the pcc-parsing bit would allow that, yes, but i suspect the "commit footer" checks would be upset
[11:15:18] ahh true
[11:15:37] also, as i discovered earlier, putting a `# asdf` comment at the start of a line in a git commit just gets stripped ;)
[11:16:56] ahh TIL :)
[11:24:24] jbond: https://gerrit.wikimedia.org/r/c/integration/config/+/701370 - if that looks sane to you, i can then loop in someone with +2 rights
[11:25:29] ack, looking
[16:41:12] I deployed a fact earlier today that's making some noise on puppet (postgres_replica_initialised), fix here https://gerrit.wikimedia.org/r/c/operations/puppet/+/701428/
[17:02:55] kormat: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/701433/
[17:03:02] no CR needed, but just FYI :)
[18:36:01] rzl: effie promising results from onhost tier so far, but one weird thing we're looking into. data at https://phabricator.wikimedia.org/T264604#7175809
[18:36:36] huh, interesting
[18:36:36] TLDR: given 12 concurrent requests to MW, we see ~ 12/s on onhost tier, and <=0.3/s on backends
[18:36:44] except for that one weird case
[18:38:08] this was my first time using memkeys for stuff on localhost, took me a while to realize I had to specify a different network interface
[18:38:27] I ended up getting what I wanted to see from `sudo memkeys -i lo -p 11210`
[18:38:33] but I don't know if that's the "right" way to do it
[18:39:36] confusingly, running `sudo memkeys -i eth0` with the default port on an app server actually does show a lot of traffic
[18:39:52] which surprised me given nothing is running on the standard memc port, (except maybe nutcracker?)
[18:40:03] but as I understand it, what I was looking at was outgoing traffic,
[18:40:30] which surprised me that it would just be scanned for and ingested the same way as incoming traffic, but I guess it's all the same from an observer
[20:56:08] we're going to start the dry-run / live test of the switchover now
[20:56:22] if you'd like to follow along and watch, on cumin1001 $ sudo -i tmux attach -rt switchdc
[20:56:33] * volans braces for impact
[20:56:42] wkandek, apergos, topranks ^^
[20:56:59] heh
[20:57:16] don't use too small windows, tmux will resize the smallest one
[20:57:48] I'm in
[20:59:48] for context, the dry run is just running the cookbooks with the normal dry run flag. the live test is that we're going to basically switch from codfw -> eqiad, and skip any steps that would do bad things to eqiad or apply them to codfw instead
[21:00:20] uh
[21:00:42] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/switchdc/mediawiki/__init__.py#20
[21:00:45] for context
[21:00:48] all well and good but do I need to flip through the available windows or something?
[21:01:06] no, I haven't started yet
[21:01:12] because right now I just see the "hi" (and btw I'm a screen user not tmux so, about flipping through windows... :-P)
[21:01:14] it should be sequential
[21:01:17] ah ha, good good
[21:01:32] * apergos goes back to lurking
[21:01:53] volans: just to make sure, for the dry run I should still go eqiad -> codfw? whereas the live test is codfw -> eqiad right?
[21:02:06] tmux is just screen except you press ctrl-b any time your muscle memory wants to press ctrl-a
[21:02:20] that's what I usually do, the dry-run you can technically do it both ways
[21:02:26] (and then you also don't listen to any of the tmux nerds who get mad when I say that)
[21:02:36] as it doesn't do anything RW, some RO might fail due to not the expected status
[21:02:39] and that's ok
[21:03:10] args lgtm, verified dry run
[21:03:15] +1
[21:03:47] legoktm: you can copy/paste just the name of the steps and it runs them (was obvious but just in case)
[21:04:13] yep
[21:04:45] (ah I misremembered, it doesn't log to SAL until live test, but the early !log certainly doesn't hurt)
[21:04:59] yes dry-run doesn't !log
[21:05:00] even though it's asking me to confirm, it's still a dry run right?
[21:05:08] yes says
[21:05:09] DRY-RUN: START - Cookbook sre.switchdc.mediawiki.00-warmup-caches
[21:05:28] * volans sees the trembling fingers
[21:05:31] :D
[21:05:42] oh, and I missed 00-reduce-ttl
[21:05:55] ahaha I didn't think about the "Rerun until execution time converges" in dryrun
[21:06:02] Warmup completed in 0:00:00.000122 <-- record time
[21:06:08] lol
[21:06:16] xD
[21:06:20] feel free to scroll anytime
[21:06:23] I assume I can scroll up somehow
[21:06:40] ctrl b + [
[21:06:42] then scroll
[21:06:43] IIRC
[21:06:47] ctrl-b, left bracket puts you in whatever-it's-called mode
[21:06:51] it really is the same as screen then :-P
[21:07:02] RO clients can't do that, we see whatever you're seeing when you scroll
[21:07:04] s/a/b/ for many things
[21:07:27] ah, gotcha
[21:07:32] now if you look at
[21:07:38] ok nevermind :D
[21:07:53] oh, that's one of the new pipelines
[21:08:42] wait no, ignore me, the new pipelines ran fine
[21:08:47] or didn't run
[21:08:54] yeah, it looks...right? the units are still enabled so
[21:09:01] only the checks marked with safe=True got run for real
[21:09:20] so the `systemctl disable` was skipped while the `systemctl is-enabled` checks actually ran
[21:09:25] yeah, I got thrown off by the "fail" cumin output, forgot we expected that
[21:09:26] and it looks as expected I think, we can check that they pass doing the dry-run in the other direction
[21:09:29] later
[21:09:38] sorry for the misdirect
[21:09:53] > DRY-RUN: Failed to get siteinfo
[21:10:02] what did it say about certificate
[21:10:04] ?
[21:10:08] at the top
[21:10:22] Certificate did not match expected hostname: api.svc.codfw.wmnet.
[21:10:25] DRY-RUN: Certificate did not match expected hostname: api.svc.codfw.wmnet.
[21:10:25] is that from switching the siteinfo url to https?
[21:10:32] I guess so
[21:10:36] it has api.svc.eqiad.wmnet though
[21:11:31] _siteinfo_url: str = "https://api.svc.{dc}.wmnet/w/api.php?action=query&meta=siteinfo&format=json&formatversion=2"
[21:12:10] legoktm@cumin1001:~$ curl https://api.svc.codfw.wmnet/
[21:12:10] curl: (60) SSL: no alternative certificate subject name matches target host name 'api.svc.codfw.wmnet'
[21:12:45] but hitting https://api.svc.eqiad.wmnet/ works fine
[21:14:03] let me try in isolation
[21:14:18] FYI what I do is the first few lines of https://etherpad.wikimedia.org/p/volans-tmp
[21:14:22] and then get a mediawiki instance
[21:14:46] using the one with dry-run true ofc
[21:15:09] m.get_siteinfo('codfw') fails
[21:15:54] with eqiad succeeds
[21:17:04] when I run `openssl s_client -connect api.svc.codfw.wmnet:443` it presents me with the cert for api.svc.eqiad.wmnet
[21:17:45] welp
[21:17:57] no alternative cns?
[21:18:01] with curl fails too
[21:18:10] volans: yeah, just found the same, was about to say
[21:18:24] as a workaround, should we switch _siteinfo_url back to http, and pass X-Forwarded-Proto again?
[21:18:30] so it's definitely an issue in the fleet, not the cookbook
[21:18:37] and file a task to fix the cert issue for real, when we have a little more time
[21:18:44] :/
[21:19:13] the eqiad variant is hardcoded in a few services too: https://codesearch.wmcloud.org/search/?q=api.svc.(eqiad%7Ccodfw).wmnet&i=nope&files=&excludeFiles=&repos=
[21:19:28] ohdear
[21:20:36] +1 to reverting back to HTTP and filing a task
[21:20:44] I guess I take it back, we have to fix that before we can switch
[21:21:58] e.g. https://gerrit.wikimedia.org/g/operations/puppet/+/production/hieradata/role/common/mediawiki/appserver/api.yaml#23 in particular
[21:23:06] can envoy have both names?
[21:23:12] at least I think we need to, I'm not sure what happens if we switchover and that setting says "eqiad" in codfw
[21:24:49] has any test from ats to mediawiki backends in codfw ever been performed since we have envoy?
[21:25:16] or to rephrase: do we know it will work traffic wise?
[21:25:45] because if that's using https and hits the same issue... that would be no good ;)
[21:26:04] yeah, that's what I was thinking about, but I was misreading these date stamps because between 2020 and 2021, time is fake
[21:26:15] tldr, we did the envoy change *before* the last DC switchover
[21:26:26] so, we've done an extremely large test from ATS to MW backends :D
[21:26:49] I'm not sure if this eqiad/codfw name was messed up all along though, or if that's more recent
[21:27:02] still looking
[21:27:16] so...is api.svc.eqiad special in some way that it always goes to the right DC?
[21:27:32] shouldn't be, afaik
[21:29:10] beside that, should we continue the dry-run to see if there is any other issue?
[21:29:23] yeah, sgtm
[21:29:41] I'll have to logoff soon~ish given current time
[21:29:41] ok
[21:30:55] although they look good, please double check with the DBAs that those are still good, I know kor.mat was playing with heartbeat recently
[21:31:08] * legoktm adds to list
[21:33:25] can you scroll back a second?
[21:33:30] so just adding some context, not sure if it changes anything above
[21:33:32] to which part?
[21:33:34] to the check RW query
[21:33:47] but the cert in question does have all the production SANs (e.g. *.wikipedia.org) as well as these:
[21:33:50] DNS:api-ro.discovery.wmnet, DNS:api-rw.discovery.wmnet, DNS:api.svc.eqiad.wmnet
[21:33:59] and the caches connect to them using those api-XX.discovery.wmnet fine
[21:34:11] err no, they use the public names for SAN matching I think
[21:34:25] but either way, probably the "right" fix is for the client sides to use the discovery hostnames?
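The SAN mismatch described at 21:33:47 to 21:33:50 can be reproduced offline. The following is a minimal sketch, not the check that curl or spicerack actually perform: `san_matches` and `cert_covers` are hypothetical helper names, the matching is a simplified version of left-most-wildcard hostname matching, and the SAN list contains only the entries quoted in the log.

```python
from typing import Iterable


def san_matches(hostname: str, san: str) -> bool:
    """Return True if hostname matches one SAN entry.

    Simplified rule: labels must match one for one, and a "*" is only
    honoured as the entire left-most label.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    san_labels = san.lower().rstrip(".").split(".")
    if len(host_labels) != len(san_labels):
        return False
    if san_labels[0] == "*":
        # The wildcard covers exactly one label, the left-most one.
        return host_labels[1:] == san_labels[1:]
    return host_labels == san_labels


def cert_covers(hostname: str, sans: Iterable[str]) -> bool:
    """Return True if any SAN entry on the certificate covers hostname."""
    return any(san_matches(hostname, san) for san in sans)


# SAN entries quoted in the log (the production wildcard list is abbreviated).
SANS = [
    "*.wikipedia.org",
    "api-ro.discovery.wmnet",
    "api-rw.discovery.wmnet",
    "api.svc.eqiad.wmnet",
]

print(cert_covers("api.svc.eqiad.wmnet", SANS))  # True
print(cert_covers("api.svc.codfw.wmnet", SANS))  # False
```

With that SAN set, a client asking for api.svc.codfw.wmnet is rejected even though the codfw backend itself is healthy, which matches the curl error above.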
[21:34:47] but we want to hit a specific DC
[21:34:48] can it before the switch though?
[21:35:00] oh, hmmm
[21:35:08] legoktm: ok was wondering why db2142 has RW enabled, it's x2 master in codfw
[21:35:12] add to the list please ;)
[21:35:20] well, you can always connect to a specific DC, if you use both the DNS name and the SAN for that DC
[21:36:16] volans: oo, nice catch
[21:36:17] we could use the IP from svc.$DC and the discovery as DNS name
[21:36:25] oh I see now
[21:36:58] the codfw cluster has the same SAN set as the eqiad one (only svc.eqiad in SAN not svc.codfw)
[21:37:04] yep
[21:37:27] I moved my list from a notepad to https://etherpad.wikimedia.org/p/2021-switchdc-testing
[21:38:12] k
[21:38:14] so yeah, long-term, probably fix the certs (codfw should have codfw SAN) and everything would be fine
[21:38:17] keep going?
[21:38:22] +1
[21:40:21] btw I can scroll even in RO mode with my trackpad :D
[21:41:02] legoktm: I guess we could make the output less verbose for the enabled ones
[21:41:08] but it's a wishlist, not a blocker
[21:41:09] so yeah, the biggest issue is all those services that seem to be puppetized to contact api.svc.eqiad.wmnet (parsoid, citoid, etc?)
[21:41:18] yeah, > /dev/null
[21:42:22] great
[21:42:45] I guess we can't do the live test until we get the siteinfo thing figured out
[21:42:47] ditto for restbase.svc.eqiad in the same instances I see
[21:42:47] given that siteinfo is called in multiple places, I'm not sure if trying the reverse live_test makes any sense at this point
[21:43:06] the two examples at the bottom of this are typical: https://gerrit.wikimedia.org/g/mediawiki/services/citoid/+/ea68e28ecde37a7fbcdd8ecfbd33b8d8358c4234/config.prod.yaml
[21:43:14] yeah -- we could hot-patch the http url back in, just so that we can live-test everything else
[21:43:29] (url + x-forwarded-proto header)
[21:43:45] but that won't fix the other issue linked above, right?
[21:44:11] right, the only purpose would be to see if the live-test exposes any other problems
[21:44:14] ok
[21:46:24] I'm going to edit /usr/lib/python3/dist-packages/spicerack/mediawiki.py on cumin1001
[21:46:42] with the inverse of https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/700963/
[21:47:19] ack
[21:49:37] I think I re-ran all the steps that hit siteinfo
[21:49:41] legoktm: fyi, it's normal for appserver latency alerts to fire in the passive DC when we run the warmup -- may want to downtime those before the live test
[21:50:14] appserver and apiserver now, I guess :D
[21:50:16] we could add it to the cookbook (downtime if in live test)
[21:50:26] I guess we still want it to alert during the real thing
[21:50:46] btw are we hitting the api_appserver also with the site urls or only api urls?
[21:50:58] I didn't follow all the details of the changes
[21:51:09] "High average GET latency for mw requests on appserver in codfw"?
[21:51:16] that's the one
[21:51:51] (with live traffic you'd worry about high latency causing saturation so you'd expect that alert too, but with the warmup script that's not an issue)
[21:53:33] volans: we're sending https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/mediawiki/files/maintenance/mediawiki-cache-warmup/urls-server.txt to both clusters
[21:53:41] to each server in both clusters, I mean
[21:53:49] ack
[21:54:08] and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/mediawiki/files/maintenance/mediawiki-cache-warmup/urls-cluster.txt to only the appserver cluster, each url once and loadbalanced
[21:54:12] (as before, no changes)
[21:54:27] great, tx
[21:54:30] *thx
[21:55:00] afk 1m
[21:55:20] so is it ctrl-b d to detach then? as it seems you might be done for now
[21:55:24] and also it's 1 am here
[21:55:43] apergos: yes
[21:55:46] awesome
[21:56:22] thanks for having me along as a lurker, and may this have been the most interesting part of the whole procedure :)
[21:57:03] * legoktm stabs his bouncer
[21:57:14] apergos: thanks for lurking :)
[21:57:17] I enjoyed it, too.
[21:57:19] huh, TIL Greece is one hour further than most of western europe
[21:57:21] back
[21:57:23] lurking I mean.
[21:57:39] eest, we and kenya are timezone buddies
[21:57:52] I think we're going to try the live test switchover now, unless anyone objects?
[21:58:07] good luck!
[21:58:15] legoktm: maybe clean the screen so we distinguish the 2 runs
[21:58:53] apergos: yeah, makes sense I guess. especially with turkey being another hour already
[21:58:54] https://en.wikipedia.org/wiki/Time_in_Turkey#/media/File:Time_zones_of_Europe.svg
[21:59:15] ugh, not what I meant
[21:59:19] but I pride myself for knowing almost every country name between germany and greece, and they all have +1
[21:59:34] greece is the first and last one in that particular straight line to have +2
[21:59:51] --live-test codfw eqiad
[21:59:53] lgtm
[21:59:57] legoktm: unless rzl objects for me ok to go ahead
[22:00:00] https://en.wikipedia.org/wiki/UTC%2B03:00#/media/File:Timezones2008_UTC+3_gray.png
[22:00:05] * volans checking args
[22:00:30] args is ok
[22:00:33] *are
[22:01:13] it would be nice if the cookbook could say "DC_FROM: codfw, DC_TO: eqiad"
[22:01:47] legoktm: wait I can show you that
[22:01:48] oh like at the top of the menu? yeah, it would have to be a directory-level thing I guess
[22:02:15] haha or we just add a 00-verify-direction cookbook that prints them out :D
[22:02:25] or I guess volans-style, it prints them out and then makes you type them back in
[22:02:36] I think that would be step -01
[22:02:42] volans: ok, waiting :p
[22:03:01] legoktm: no sorry I thought in the logs there was
[22:03:03] but not explicit
[22:03:14] same thing cookbook_args=['--live-test', 'codfw', 'eqiad']
[22:03:34] yeah, it'd add a little confidence to remind you the order is DC_FROM, DC_TO
[22:03:43] agree
[22:03:48] especially in the upside-down live-test universe
[22:03:52] I'm going to keep going
[22:03:55] 👍
[22:03:57] ack
[22:04:41] for the warmup it should say it
[22:04:43] explicitly
[22:04:49] yeah, it'll make you confirm
[22:06:24] wow look at that, it's quicker the second time
[22:06:31] why does it warmup codfw?
[22:06:47] because it will harm the cluster if we were warming up eqiad
[22:07:00] so --live-test inverts for the purpose of testing
[22:07:02] yeah it's just a live test thing
[22:07:13] when we really move eqiad->codfw, we'll warm up codfw
[22:07:17] wkandek: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/switchdc/mediawiki/00-warmup-caches.py#23
[22:07:41] > Warmup completed in 0:00:16.973384
[22:08:36] and in the output, mw2308.codfw.wmnet is an api_appserver, so we're now warming up that cluster too
[22:08:47] nice
[22:09:42] ruh roh
[22:09:43] oof
[22:09:51] Too few arguments. ?
[22:09:57] rzl: that was my question, we are warming up codfw, so not everything is reversed i guess?
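The "DC_FROM: codfw, DC_TO: eqiad" banner wished for at 22:01:13 could look roughly like the sketch below. It assumes the argument order shown in cookbook_args above; `build_parser` and `direction_banner` are hypothetical names and this is not the actual switchdc argument parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: flag and positional names are assumptions that
    # mirror cookbook_args=['--live-test', 'codfw', 'eqiad'] from the log.
    parser = argparse.ArgumentParser(prog="switchdc-sketch")
    parser.add_argument("--live-test", action="store_true")
    parser.add_argument("dc_from")
    parser.add_argument("dc_to")
    return parser


def direction_banner(args: argparse.Namespace) -> str:
    """Spell out the direction so the DC_FROM, DC_TO order is explicit."""
    mode = "LIVE-TEST" if args.live_test else "REAL"
    return f"{mode}: DC_FROM: {args.dc_from}, DC_TO: {args.dc_to}"


args = build_parser().parse_args(["--live-test", "codfw", "eqiad"])
print(direction_banner(args))  # LIVE-TEST: DC_FROM: codfw, DC_TO: eqiad
```

The "type them back in" variant suggested at 22:02:25 would compare operator input against the same two values before continuing.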
[22:10:04] wkandek: yeah exactly
[22:10:14] wkandek: that's why we run with --live-test, instead of just inverting the args and running everything as normal
[22:10:24] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/switchdc/mediawiki/__init__.py#20
[22:10:24] wkandek: remember that we're currently testing going from codfw -> eqiad.
[22:10:27] (that, and we don't actually want to go read-only)
[22:10:46] got it.
[22:11:09] it looks like the
[22:11:10] "systemctl list-units 'mediawiki_job_*' --no-legend " "| awk '{print $1}' | xargs systemctl disable"
[22:11:12] is what's failing
[22:11:40] rc=123,
[22:11:56] command: 'systemctl list-units 'mediawiki_job_*' --no-legend | awk '{print $1}' | xargs systemctl disable'
[22:11:59] sudo systemctl list-units 'mediawiki_job_*' --no-legend is empty
[22:13:16] hmm
[22:13:24] I think "systemctl stop mediawiki_job_*" entirely got rid of them?
[22:13:51] I see slices with system-mediawiki_* but not units
[22:13:57] the timers are gone from `systemctl list-timers` and `systemctl list-units`
[22:14:32] which, is what we wanted in the first place tbh
[22:15:58] btw, if it wasn't already mentioned you can see the debug logs at /var/log/spicerack/sre/switchdc/mediawiki-extended.log (the ones without -extended are without DEBUG)
[22:16:11] ok, I think I need to go back and figure out whether we need both systemctl stop and systemctl disable and probably get rid of one of them
[22:16:54] nod
[22:18:04] but crons and timers are disabled on mwmaint2002
[22:18:42] keep going?
[22:23:52] perfect time for a netsplit
[22:24:16] indeed
[22:25:27] what's our "IRC is unusable" backup plan? Slack I guess? :|
[22:25:44] wall ;)
[22:25:54] if it's before you set RO just wait for it to come back
[22:26:11] if it's during RO, keep going until RW I guess, Slack is a decent fallback
[22:26:24] in the past i had a conference call open with j.oe when we did it together
[22:26:32] but if nothing else is visibly wrong, don't extend the read-only period just for the IRC outage
[22:26:33] you two could have one just in case
[22:26:46] it's quicker to talk than write and switch windows
[22:26:52] +1
[22:27:00] to what rzl said
[22:27:12] yeah
[22:27:57] ok, I'm going to keep going now
[22:28:11] it skipped the crontab checks I think
[22:28:14] but I guess is ok
[22:28:15] yeah, adding voice is a matter of taste I think -- I don't like it personally, I prefer to have everything in writing, but whatever's easier for you
[22:28:46] yeah, the crontab part is untouched since last time I think, no need to comment out the earlier stuff in order to test it IMO
[22:29:06] I checked the crontab manually to verify
[22:29:07] ack hopefully you'll have the time to retest that step later on
[22:29:11] 👍
[22:29:18] moving on sgtm
[22:29:38] actually...
[22:29:46] 03-set-db-readonly will change that x2 master
[22:29:54] not sure if it might cause issues
[22:30:04] not in live test, right?
[22:30:13] it will set it RO
[22:30:15] in codfw
[22:30:16] ohh, do we actually run it on -- yeah
[22:30:28] ctrl+c'd
[22:30:35] and was not, although it was weird
[22:30:40] that it's in RW
[22:30:41] in codfw
[22:30:50] but dunno the current status if it was for a reason
[22:32:20] yeah, digging to see if I can find anything
[22:32:29] is there a way to test 03-set-db-readonly and have it skip x2?
[22:32:42] It might have already done it
[22:32:42] | read_only | ON |
[22:33:20] but now I'm not 100% sure it was RW earlier, it was different than all the others
[22:33:33] but I might also have misread it
[22:33:52] https://phabricator.wikimedia.org/T269324#6815006
[22:33:58] x2 is supposed to be writable
[22:34:13] and paged
[22:34:19] whoop, guess that answers that
[22:34:34] also TIL how loud my victorops noise is when it takes over my headphones unexpectedly
[22:34:38] switch to -operations?
[22:34:42] let me set it
[22:35:57] | read_only | OFF |
[22:36:12] thanks
[22:36:24] 👍
[22:36:29] sorry took longer was rusty on the syntax :D
[22:37:28] so...1) this is a problem for live test, that x2 should not be RO, and then 2) after switching, x2 in eqiad should be also set to RW?
[22:37:40] I guess so
[22:37:57] to be checked with dbas
[22:38:32] yeah that's a m.anuel or s.tevie question
[22:40:38] so...I suppose we need to keep going otherwise codfw will be in a weird state?
[22:41:55] 08-start-maintenance will revert 00-disable-puppet
[22:42:06] 08-restore-ttl will revert 00-reduce-ttl
[22:43:25] RO was set in codfw where it should already be that
[22:43:36] and the db readonly apart from x2 should have been a noop
[22:43:47] yeah, no need to 07-set-readwrite
[22:44:08] I think it will do it in eqiad, so noop there too
[22:44:24] yeah unless x2 is supposed to be RO in eqiad? haven't checked
[22:44:35] the ticket said it's supposed to be RW in both
[22:44:46] ah okay, then yeah agreed noop
[22:44:58] select @@GLOBAL.read_only; returns 0
[22:45:37] do we want to do 04-switch-mediawiki, and 05-invert-redis-sessions for completeness?
[22:45:55] x2 isn't in use (yet), when it is, it'll be RW in both, same as parser cache in that regard.
[22:46:07] ah thanks Krinkle
[22:46:15] in that case the long-term solution is exclude it from 02-set-readonly I guess
[22:46:28] but not sure if it's time to do that yet, or if they're just testing for now
[22:46:53] still a DBA question
[22:47:20] I don't know how much of this script will be left when we're active-active. for now, it's not an issue one way or the other if writes are refused in codfw x2, but it'd probably make for easier testing and less logstash noise if it wasn't readonly.
[22:48:02] haha I didn't mean THAT long-term :D but noted
[22:48:11] it was added in https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/662631
[22:48:43] well, I'm optimistically anticipating that this might be our last switchover of this magnitude.
[22:48:56] unless we do 2 per year
[22:49:02] do we? I don't remember now.
[22:49:06] 2 per year is the intent yeah
[22:49:12] ok, then probably not the last one
[22:49:16] for now though
[22:49:25] do we have a short term plan for active/active RW too?
[22:49:43] it does look like x2 is permanently RW in both DCs, per that ticket, so I guess we should exclude it from 02-set-readonly as of now
[22:49:52] I agree
[22:49:59] it doesn't matter what happens to x2 right now as it's not yet in use, it's idle/unused. once we use it, it'll be RW in both from the get go
[22:50:04] (sorry I know that's what Krinkle was already saying, I just wasn't clear on the timeline)
[22:50:14] but it should be set RO or not during the RO time?
[22:50:46] I'd say treat it like parser cache if that helps. I don't think it needs to be RO at any time, given circular replication
[22:51:04] if that replication is already bi-di today then it'd be fine to let both be RW today already
[22:51:20] if not, then I guess it'd be safer to keep RW-RO, RO, RO-RW like the rest
[22:51:48] I don't know if the bi-di part of x2 was enabled yet in mysql
[22:52:12] yes it's bi-di
[22:52:50] ok, well, then RW all the way.
[22:52:54] is it RO in codfw now?
[22:53:04] no, RW in both
[22:53:29] ok, I'd take that as signal from DBAs that it's okay to keep RW
[22:53:53] and either way, no worries since nothing can read/write to it now anyway, it's isolated/unused
[22:54:04] let's still talk to them to confirm :) but I want to be mindful of volans's time also, is there anything else to do with this test today?
[22:54:27] I think just run the last two cookbooks to undo and then start filing bugs :)
[22:54:33] ack
[22:54:36] sgtm
[22:56:57] is running puppet, have patience
[22:57:18] oh btw, I forget where we're at with adding a five-minute sleep after 00-reduce-ttl, now that the warmup script doesn't always take that long
[22:57:23] did we ever do that, or just talk about it?
[22:57:46] I guess we just talked about it, that step was fast in --live-test
[22:57:56] # TODO: add sleep for previous TTL, skipped for now because the warmup step is longer than that
[22:57:58] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/switchdc/mediawiki/00-reduce-ttl.py#24
[22:58:22] oops, great job rzl
[22:58:34] we can use 10s in live test so that we test the code but with a different sleep
[22:59:46] legoktm: are you reverting the live hack?
[23:00:16] oh, will do
[23:00:29] and add it to the list ;)
[23:00:55] I guess we'll need one more spicerack release to fix that, the systemctl disable stuff, and removing x2 from core dbs
[23:01:09] yeah I guess so
[23:01:21] ping me when it's all ready and I'll make it...
[23:01:36] ah I was about to ask if there were docs on how to cut a release
[23:02:03] it's "complicated" for various reasons, ownership, gpg key, pypi account, etc...
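The sleep discussed at 22:57:18 to 22:58:34 (wait out the previous TTL after 00-reduce-ttl, but only 10s in live test) might be sketched as follows. The 300-second value is taken from the "five-minute sleep" remark; the function names, the injectable `sleep` parameter, and skipping the wait entirely in dry-run are assumptions for illustration, not the cookbook's code.

```python
import time
from typing import Callable

PREVIOUS_TTL_SECONDS = 300    # the "five-minute sleep" mentioned in the log
LIVE_TEST_SLEEP_SECONDS = 10  # shorter wait suggested for --live-test


def ttl_wait_seconds(live_test: bool) -> int:
    """Pick the wait so live test exercises the same code path, only faster."""
    return LIVE_TEST_SLEEP_SECONDS if live_test else PREVIOUS_TTL_SECONDS


def wait_for_old_ttl(
    live_test: bool,
    dry_run: bool,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Sleep until records cached with the old TTL have expired.

    Assumption: a dry run reports the wait but does not actually sleep.
    """
    seconds = ttl_wait_seconds(live_test)
    if not dry_run:
        sleep(seconds)
    return seconds


print(wait_for_old_ttl(live_test=True, dry_run=True))   # 10
print(wait_for_old_ttl(live_test=False, dry_run=True))  # 300
```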
[23:02:12] oof, got it
[23:03:16] we could just add some patches to the debian package
[23:03:43] nah don't worry it's easy enough to do a release for me
[23:05:28] ok
[23:05:58] thanks everyone for all the testing and checking, especially rzl and volans :)
[23:06:23] great, I'll head to bed then ;) ttyl
[23:06:46] sleep well volans, thanks for the late night and have a good long weekend
[23:07:01] thx, you too
[23:10:17] legoktm: need anything from me? otherwise I'm going to check out for the day too and get some dinner
[23:10:25] nope!
[23:10:41] happy to either make some of those cookbook/spicerack changes tomorrow or review yours, whichever you like
[23:11:32] once I have everything filed/documented I'm going to go offline for food/dinner etc and then I'll be back in the late evening to sync up with the DBAs
[23:11:47] ack :)
[23:11:50] 👍
[23:52:09] https://etherpad.wikimedia.org/p/2021-switchdc-testing has links to all the bugs I just filed
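On the "systemctl disable stuff" from 22:11:09 to 22:11:59: with no matching units, xargs still ran `systemctl disable` once with no arguments, which printed "Too few arguments." and surfaced as rc=123. In the shell pipeline, GNU xargs has `-r` / `--no-run-if-empty` for exactly this case. The same guard as a Python sketch is below; `unit_names` and `disable_command` are hypothetical helpers, the unit names are made up, and this is not the spicerack fix.

```python
import shlex
from typing import List, Optional


def unit_names(list_units_output: str) -> List[str]:
    """First column of each non-empty line, like `awk '{print $1}'`."""
    return [
        line.split()[0]
        for line in list_units_output.splitlines()
        if line.strip()
    ]


def disable_command(list_units_output: str) -> Optional[str]:
    """Build a `systemctl disable` command, or None when nothing matched."""
    units = unit_names(list_units_output)
    if not units:
        # Nothing to disable: skip instead of calling systemctl with no units.
        return None
    return "systemctl disable " + " ".join(shlex.quote(u) for u in units)


# Hypothetical `systemctl list-units --no-legend` output for illustration.
sample = (
    "mediawiki_job_example.timer loaded active waiting example\n"
    "mediawiki_job_other.timer   loaded active waiting other\n"
)
print(disable_command(sample))
print(disable_command(""))  # None
```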