[01:10:33] Krinkle: is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/809326 ready to go now?
[01:32:43] TimStarling: I'll badger again when he's up, message dispatched.
[01:33:39] I've rolled out module_deps>mainstashdb a bit meanwhile, will do a big wiki in a few minutes (perhaps dewiki) and then either group1 or everything after that.
[01:34:01] the wrstats switch uneventful?
[01:58:07] yes, seems to be working, you can see the incr traffic in https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=now-7d&to=now
[01:58:59] nice
[02:02:55] we can test mcrouter cross-dc hashing
[02:03:01] During the verbose logging for the rl->mainstash testing, I did notice something odd, or at least not obviously correct. We seem to be making half a fetchSecondsSinceHeartbeat queries on a load.php request, including 4 against 4 different replicas of the same core wiki db, and one against s4/commons, but with a heartbeat query asking for shard=s5 (for dewiki).
[02:03:13] e.g. https://logstash.wikimedia.org/goto/fb8eb877af18ebd460187880c524d2da
[02:03:22] half a dozen*
[02:11:38] doing that query against all replicas is apparently normal behaviour for a cache miss, it is LoadMonitor::computeServerStates()
[02:15:27] expiry time for this cache is 0.5s
[02:15:42] LoadMonitor::POLL_PERIOD_MS
[02:17:21] it depends on the server clock being correct, if the clock is wrong by 100ms it could end up doing all the queries
[02:19:26] Ah, I've hit that one. It's in memc for 60s and uses a dc-local lock so only one server does it; I must've hit that, I guess.
[02:22:53] And the condition comes from lagDetectionOptions
[02:22:55] in wmf-config
[02:23:09] which we set once and then re-use for all LBs, even the external/cross-wiki ones
[02:24:29] wikiadmin@10.64.16.175(commonswiki)> SELECT * FROM heartbeat.heartbeat WHERE shard != 's4' LIMIT 1;
[02:24:29] Empty set (0.001 sec)
[02:31:53] here's a little test I did showing that cross-DC mcrouter basically works: https://paste.tstarling.com/p/VICXCj.html
[02:33:34] I would not expect any problem with cross-DC hashing, based on the mcrouter config, but I think this proves that it works?
[02:34:41] yeah, we tested something like that earlier already afaik for the stats. I trust that it generally works.
[02:35:05] you're also concerned about the gutter pool?
[02:36:42] also/mainly, yes. And also more generally to sync up with them on how things are changing so that people aren't surprised when incidents happen and everything looks weird/different. E.g. what, if anything, do they want to manage differently given /$dc/mw being in general use, and specifically a confidence check that the gutter pool will work correctly during TKO events where a shard is unresponsive, which happens a fair bit.
[02:36:58] that's basically a rewording of my message to j.oe
[02:37:40] we can simulate a netsplit with iptables, see what happens
[02:40:42] well, unless we can also remotely alter and update the brains of the people responding to and investigating mcrouter/memc when stuff happens, I think we've hit a point where we should sync up just for sync's sake. Or we could queue it up and forge ahead further to test wikis meanwhile, but that's postponing the inevitable and increasing the stack size of things to document and talk through retroactively.
[02:44:06] sure, but I figure it's better if we can answer any questions
[02:45:34] if you say "I don't know if TKO/gutter will work" and joe says "OK, check that", then that is a delay until next week, right?
[02:55:11] okay, yes we can test it. That'd be cool actually. We'd expect it to fail to reach and e.g. not fall back to a dc-local gutter pool but use the same gutter pool as it otherwise would if reached from the primary dc. The decision to depool is made by each mcrouter separately, but generally each will end up using the same gutter pool and hash the same way during that downtime. We can empirically prove/test that.
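
As an illustration of the LoadMonitor behaviour discussed around 02:11-02:17 above, here is a rough Python sketch of the poll-period caching, not MediaWiki's actual code: within POLL_PERIOD_MS the cached state is reused, and a cache miss triggers a heartbeat query against every replica in the section. Function and variable names are made up for the example.

    import random
    import time

    POLL_PERIOD = 0.5  # seconds, cf. LoadMonitor::POLL_PERIOD_MS

    _state_cache = {}  # section -> (fetched_at, {replica: seconds since heartbeat})

    def fetch_seconds_since_heartbeat(replica):
        # Stand-in for the real per-replica heartbeat.heartbeat query.
        return random.uniform(0.0, 0.3)

    def get_server_states(section, replicas):
        now = time.time()
        cached = _state_cache.get(section)
        if cached and now - cached[0] < POLL_PERIOD:
            # Fresh enough: reuse the cached states, no queries issued.
            return cached[1]
        # Cache miss or stale entry: poll every replica in the section, which
        # is why a single load.php request can fire half a dozen heartbeat
        # queries when it happens to be the one refreshing the state.
        states = {r: fetch_seconds_since_heartbeat(r) for r in replicas}
        _state_cache[section] = (now, states)
        return states

    print(get_server_states("s5", ["db1", "db2", "db3", "db4"]))
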
[02:55:14] Sounds good to me.
[02:55:48] wrote some stuff at https://docs.google.com/document/d/1EVBFol4pEE6A_YBw5RW3dH4i8QdpPndQDIHYyshQbmc/edit# to be refined before/during next meeting, which I'd recommend we have before we move multi-dc user traffic past testwikis.
[03:01:17] TimStarling: okay let's roll it out then
[04:19:04] if you're on codfw the /eqiad/mw route fails over to the codfw gutter pool, haven't tested it yet but that's what the config says
[04:19:40] if we want it to return false instead of using the local gutter pool, we will have to configure that
[04:20:28] iirc there's a thing in mcrouter where certain operations are buffered during TKO failure and replayed later
[04:20:51] I think we use it for deletes by default
[04:21:13] so during failover or gutter maintenance etc they go there but they also go again when the shard is back
[04:21:38] and I imagine we also made it apply to SET for /mw-wan which are functionally deletes
[04:21:57] although that drew the complaint from SRE that we're using memc as a relational database
[04:22:46] I don't disagree with that assessment, after we finished a full review of WANCache vs the on-host tier and basically concluded it was impossible to support.
[04:23:29] so anyway, it would be *possible* to have those noreply/background writes not be lost during failover if we wanted to, but given both are kind of short-lived, probably not worth it.
[04:23:48] and they're not meaningful deletes or resets afaik
[04:25:09] returning false might be better than split-brain local incr ops creating zero/near-zero values that we then possibly return again on get, similarly.
[04:25:37] although I imagine there are cases where a functional local temporary state isn't too bad.
[04:25:55] gutter pool is limited to 10s storage anyway so it's all pretty innocent in the end
[04:27:05] the 'get' would be the more important one to review, not the set/incr throwaway ops. E.g. the rate limiter: does that allow infinite ops if it returns false, or does it assume the limit is reached?
[04:27:32] where do you get 10s from? config says "failover_exptime": 600
[04:27:58] I confirmed at https://github.com/facebook/mcrouter/wiki/List-of-Route-Handles#failoverwithexptimeroute that it is seconds
[04:28:02] https://wikitech.wikimedia.org/wiki/Memcached_for_MediaWiki#Magic_numbers
[04:28:24] you're right. it's 5min for gutter, 10s is on-host tier. I recalled the wrong meeting notes
[04:28:52] pretty sure 600s is 10 mins not 5
[04:29:16] indeed
[04:29:54] well, at least I'm consistent in having made the same brain fart when I edited that page, permalinking the line in Git. https://github.com/wikimedia/puppet/blob/9f9e389dac7830c7aabbf95803f51fd288cc6357/hieradata/role/common/mediawiki/appserver.yaml#L10
[04:31:12] I need to go afk pretty soon for about 45 mins
[04:31:27] I'm not used to time-based values being amenable to base-10 intuition, I somehow see that quite consistently as 5 min, not sure why. I have no excuse for my type 1 system
[04:31:30] I'm planning on doing a TKO test when I get back, maybe you'll be asleep by then?
[04:31:54] probably not, but I am signing off for today now. been working since the morning.
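
As an aside on the rate limiter question at 04:27:05: whether a false/empty result from the cache should fail open or fail closed is a policy choice. A minimal Python sketch of the two options follows; the names are hypothetical, not MediaWiki's actual rate limiter API.

    def is_rate_limited(cache, key, limit, fail_open=True):
        # 'cache' is any object with a memcached-like get().
        count = cache.get(key)
        if count is None or count is False:
            # Backend unavailable (or the route returned false with no
            # fallback): failing open allows unlimited ops for the duration
            # of the outage; failing closed throttles legitimate users.
            return not fail_open
        return int(count) >= limit
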
[04:33:19] I imagine it'll likely involve some Linux concepts I'm not super familiar with, so I'll try to reproduce it tomorrow perhaps, or we can do it then. Either way, don't wait for me to deploy the config change.
[04:34:20] got it
[04:34:42] when you said "let's roll it out then" earlier I wasn't sure whether to believe you
[04:34:56] I will deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/809326 imminently
[04:36:16] if I have a script doing incr in a loop, and I interrupt the cross-DC network, I expect that the values will soon start being routed to the gutter pool and so I'll see the values reset to zero
[04:37:45] then I'll restore the network, and we can see how the value changes when mcrouter notices the hosts are reachable again
[04:39:43] ack, it might be interesting to also test the more common scenario of a specific memc host being unreachable, though perhaps the other way around to not impact prod, e.g. make one codfw memc unreachable from both an eqiad and a codfw mw server, see codfw rehash to gutter, and whether eqiad goes to its own gutter vs false, diverges vs false, and recovers.
[04:40:52] anyway gotta go now. happy to either do a do-over tomorrow or do it only tomorrow. I'm interested mainly to learn, not that I don't trust the results obviously :)
[04:41:12] bearing in mind tomorrow is Saturday for me
[04:41:24] aye, okay, some other day then.
[04:41:28] bye now
[04:41:31] bye
[08:30:08] tested cross-DC failover, reported methods and results at https://phabricator.wikimedia.org/T278392#8064503
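
For reference, the incr-in-a-loop test described at 04:36:16 could look roughly like the sketch below (the actual method used is in the paste and the Phabricator comment linked above). Assumptions not taken from the log: the pymemcache library, mcrouter listening locally on 127.0.0.1:11213, and the key name; the iptables line in the comment is one way to simulate the netsplit.

    # In another shell (as root), cut the cross-DC path, then remove the rule later:
    #   iptables -A OUTPUT -d <remote-memcached-ip> -p tcp --dport 11211 -j DROP
    import time
    from pymemcache.client.base import Client

    client = Client(("127.0.0.1", 11213), connect_timeout=1, timeout=1)
    key = "/eqiad/mw/tko-test-counter"  # routing prefix selects the eqiad/mw pool

    client.set(key, "0")
    while True:
        try:
            value = client.incr(key, 1)  # new value, or None if the key is gone
            if value is None:
                # Key missing, e.g. requests were rehashed to the (empty) gutter
                # pool after a TKO; re-seed so the counter visibly resets to zero.
                client.set(key, "0")
                value = 0
            print(time.strftime("%H:%M:%S"), value)
        except Exception as exc:
            print(time.strftime("%H:%M:%S"), "error:", exc)
        time.sleep(1)
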