[09:27:55] errand+lunch
[09:51:56] Lunch
[13:06:36] greetings
[13:26:27] inflatador: o/
[13:26:36] o/
[14:56:06] FYI: starting rolling-restart of codfw shortly re: https://phabricator.wikimedia.org/T309720
[15:00:50] \o
[16:17:28] hmm, i wonder in which contexts textcat doesn't work. the user testing that was linked in email said searching an arabic title on english wiki got no results, but using a Special:Random title from ar.wikipedia seems to work fine:
[16:18:40] https://en.wikipedia.org/w/index.php?search=%D8%A3%D9%88%D8%B3%D9%83%D8%A7%D8%B1_%D8%B4%D8%AA%D8%B1%D8%A7%D9%88%D8%B3&title=Special:Search&profile=default&fulltext=1
[16:24:07] Probably won't make the Puppet window, but if you have PRs, Ryan should be able to help, or I can help in ~2h
[16:26:22] * ebernhardson shrugs and instead looks into why saneitizer writes in prod don't generate the same requests they do locally
[16:47:34] oh well that's fun, phpdbg segfault :( It's certainly not a well-loved project
[16:47:44] ;(
[16:48:13] Fired up the rolling operation cookbook again; it timed out waiting for some large shards... hopefully won't happen again
[16:50:48] pondering more about what happens with those large shards and my previous investigation, i'm now thinking about this line in the docs: "Synced flush is a best effort operation. Any ongoing indexing operations will cause the synced flush to fail on that shard. This means that some shards may be synced flushed while others aren't. See below for more."
[16:53:04] we know from looking at the code that seeing a full index recovery means the sync_id didn't match. The other difficulty is the line right after that one: "The sync_id marker is removed as soon as the shard is flushed again." I suspect we can't really engineer our way around that, as flushing happens automatically as needed and isn't limited strictly by the refresh time. hmm
[16:54:06] i guess this goes back to why they suggest pausing writes :P
[16:56:46] ebernhardson: so do we want a successful flush so that the sync_id marker is removed which prevents a mismatch between the `sync_id` and the actual ID? thus why we'd want to pause writes, so that the synced flush doesn't risk failing?
[16:56:50] or am i misunderstanding
[16:58:40] ryankemper: we want a successful flush, particularly on the largest indices, to ensure they restore from disk. I think what the docs are saying is that if elasticsearch decides to flush, which it will do every 30s per shard (with ~120 shards per host, or ~4 flushes/s) and sometimes more depending on indexing load, we lose the sync_id and are forced to pull the shard over the network
[16:58:52] although that sounds too strict for what we are observing, the reality is probably somewhere in the middle
[17:00:04] that would suggest though that the node has to come back within 30s and that the fraction of valid indices drops quickly over that 30s, which doesn't seem to match what it actually does
[17:00:55] there must be more complexity they aren't detailing in the docs (re https://www.elastic.co/guide/en/elasticsearch/reference/6.8/indices-synced-flush.html )
[17:01:08] > there must be more complexity they aren't detailing in the docs
[17:01:12] but that never happens! /s
[17:01:17] xD
[17:01:32] my hope is that once we get on 10GB, it won't be as big of an issue
[17:01:40] 10Gbps networking, that is
[17:01:50] Okay, reading their explanation of what a synced flush is helps; I was mainly thinking of just a normal flush
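(For reference: the sync_id behaviour discussed above can be inspected directly against the 6.8 HTTP API. The sketch below is a minimal illustration only, not the actual cookbook or Cirrus code; it assumes direct access to a cluster node via the placeholder `ES` URL and the Python `requests` library. It triggers a best-effort synced flush and flags shard copies that don't share a sync_id marker, i.e. shards that would recover over the network after a restart.)

```python
"""Rough sketch (not the real tooling): trigger a synced flush on an
Elasticsearch 6.8 cluster and report shards whose copies lack a shared
sync_id marker. Hostname/port are placeholders."""
import requests

ES = "http://localhost:9200"  # placeholder; point at any cluster node

# Best-effort synced flush: shards with ongoing indexing will fail the
# synced flush and simply end up without a sync_id (per the quoted docs).
resp = requests.post(f"{ES}/_flush/synced").json()
print("synced flush totals:", resp.get("_shards"))

# sync_id markers show up in shard-level stats under commit.user_data.
stats = requests.get(f"{ES}/_stats", params={"level": "shards"}).json()
for index, data in stats["indices"].items():
    for shard_num, copies in data.get("shards", {}).items():
        ids = {c.get("commit", {}).get("user_data", {}).get("sync_id")
               for c in copies}
        if len(ids) > 1 or None in ids:
            # Copies disagree (or lack a marker): restarting a node holding
            # one of them would force a network recovery instead of a
            # recovery from local disk.
            print(f"{index}[{shard_num}]: sync_id mismatch {ids}")
```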
[17:02:17] hmm, how far off are we from 10G? I suppose i haven't looked directly, but i was under the impression this year's hardware upgrades were supposed to put it there
[17:03:42] ebernhardson: codfw has replacement hosts waiting to be brought into service (come to think of it, I was going to start on that this week and then the cloudelastic stuff happened)
[17:03:49] that's the only blocker on flipping codfw up to 10gbps
[17:04:17] ahh, ok that makes sense. I'm pretty bad at keeping track of where hardware is in the pipeline
[17:05:19] anyways, i guess i just like thinking about these things; it's not the end of the world to wait. maybe we might want to increase how long it waits before timing out
[17:05:53] yeah, we might need to
[17:06:03] I'll start getting the patches up today to bring 'em into service
[17:06:16] But yeah, agreed; networking aside, we want to better understand why the local recoveries might fail
[17:06:21] cool, maybe we can work on that during SRE pairing
[17:06:46] sounds like a plan
[17:06:55] +1
[17:11:00] ebernhardson: inflatador: so wrt the synced flush stuff, I'm thinking we put the frozen writes block back in, but without waiting for the write queue to drain?
[17:12:00] ryankemper: i suspect we have to wait for the queue to drain, otherwise it backlogs more and more.
[17:12:38] ebernhardson: how much of a problem is the backlogging though? isn't the idea we'd clear the backlog once the rolling operation is done?
[17:13:31] IIRC the original reason to wait for the queue to drain was a fear that we would get so backlogged we would never recover... although I might be confusing that with the justification for the wdqs streaming updater :P
[17:13:37] ryankemper: hard to say, we don't have any defined service level to meet so it's all arbitrary. Most use cases can withstand a bit of delay, although wikidata tends to build some UIs that expect a quicker return
[17:15:08] if the write freeze (w/o waiting to drain) makes a noticeable difference in our local index recoveries then I'd hope that would make it worth it
[17:15:44] ryankemper: hmm, i suppose with us being close to the jobrunner peak capacity there could be some worry about the backlog continuing to grow, but i suppose i'm doubtful. either the runner can run all the jobs being inserted + x% more, and the queue declines, or the queue would have grown regardless of pausing writes
[17:16:15] 'cause unfortunately, waiting for the queue to drain after each host sounds reasonable in theory, but in practice it felt like it would make rolling operations fail >=50% of the time
[17:16:16] i suppose if x% is very small it could lead to a very long time to drain the queue
[17:17:49] so perhaps we try freezing writes followed by a brief sleep, without waiting for the queue to fully drain, and see if that seems to improve things?
[17:17:51] I'm headed to lunch; I have a tmux window up on cumin1001 if you need to monitor the cookbook. It might time out again
[17:17:59] inflatador: ack, I'll keep an eye on it
[17:18:11] thx, should be back in time for pairing
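(The drain-time worry in the exchange above is simple queueing arithmetic: with a backlog B and a runner that processes only x% more jobs than arrive, drain time is B divided by that x% surplus. A throwaway illustration follows; every number in it is invented, not measured.)

```python
# Back-of-the-envelope illustration of the "small x%" concern above.
# All numbers are made up; only the shape of the relationship matters.
def drain_seconds(backlog_jobs: float, arrival_rate: float, headroom_pct: float) -> float:
    """Time to drain a backlog when the job runner sustains
    arrival_rate * (1 + headroom_pct/100) jobs per second."""
    surplus = arrival_rate * headroom_pct / 100.0
    return float("inf") if surplus <= 0 else backlog_jobs / surplus

backlog = 500_000   # jobs queued up while writes were frozen (invented)
arrival = 1_000     # jobs/s still being inserted (invented)
for x in (1, 5, 25):
    hours = drain_seconds(backlog, arrival, x) / 3600
    print(f"{x:>2}% headroom -> {hours:.1f} h to drain the queue")
```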
[17:19:09] ryankemper: not as easy to test; we removed the related bits from cirrus to simplify the 7.x transition.
[17:20:37] wouldn't be super hard to put back, but it requires a bit of a hack because the URL requested varies between 6.x and 7.x
[17:20:42] bleh, I'd forgotten about that detail
[17:20:49] (and writes have to support both versions at the same time)
[17:21:42] yea, that was the other half of the calculus: restarts take a while even with freezing writes, and the freezing code was annoying to generalize to work in both contexts at the same time
[17:23:48] dinner
[18:24:57] lunch
[18:26:47] back. As predicted, the codfw rolling operation timed out
[19:22:05] inflatador: there is already some documentation on the issue we had during the cluster restart: https://wikitech.wikimedia.org/wiki/Search#Rolling_restarts
[19:22:16] look at the "things that can go wrong" section
[19:22:56] thanks, this is a great resource. I just need to read it more closely ;(
[20:05:56] back
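(On the timeout side, the wait that kept failing during the rolling operation amounts to polling cluster health until shards finish initializing/relocating. The sketch below is illustrative only and is not the real spicerack cookbook; the query parameters are standard `_cluster/health` options, while the host and timeout values are placeholders marking where a longer wait for large shards would go.)

```python
"""Illustrative only (not the actual cookbook): after restarting a node,
poll _cluster/health until the cluster is green again, with a timeout
generous enough for large shards to finish recovering."""
import time
import requests

ES = "http://localhost:9200"  # placeholder; point at any cluster node

def wait_for_green(timeout_s: int = 3600, poll_s: int = 30) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # wait_for_status makes the request itself block up to poll_s,
        # so no extra sleep is needed between iterations.
        health = requests.get(
            f"{ES}/_cluster/health",
            params={"wait_for_status": "green", "timeout": f"{poll_s}s"},
        ).json()
        moving = health["initializing_shards"] + health["relocating_shards"]
        print(f"status={health['status']} "
              f"unassigned={health['unassigned_shards']} moving={moving}")
        if health["status"] == "green":
            return True
    return False  # caller decides whether to retry, extend, or alert

if __name__ == "__main__":
    if not wait_for_green():
        raise SystemExit("timed out waiting for the cluster to go green")
```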