[05:16:56] We've got some red indices in eqiad psi:
[05:17:02] https://www.irccloud.com/pastebin/8U2SSSGe/
[05:17:42] Looks like a few smaller indices. I assume we can just manually reindex these, but not sure
[05:21:09] cluster was green before starting the operation (decom'ing two hosts), so it theoretically shouldn't have dipped to red
[05:26:02] however the `_cluster/allocation/explain` output is complaining that `"cannot allocate because all found copies of the shard are either stale or corrupt"`, so that might be why
[05:26:51] indeed both `elastic1086-production-search-psi-eqiad` and `elastic1097-production-search-psi-eqiad` are marked as `"in_sync": false`
[13:01:04] greetings
[13:01:22] Can we bring back the decommed hosts?
[13:01:27] will check
[13:26:36] inflatador: why remove tags?
[13:29:00] RhinosF1 trying to reduce the noise in operations, I will check on what those tags should be used for, but neither seems appropriate IMHO
[13:30:07] You have more experience with this system than me, so if they are in fact appropriate, feel free to re-add
[13:30:18] inflatador: I've listed it as a follow-up on the incident report that's been started
[13:30:32] because it distracted from the actual error
[13:31:09] That's fine, as long as it doesn't ping everyone over and over
[13:31:33] inflatador: can you add the tags back then?
[13:31:59] currently in a 1x1 with my boss, you can add them back
[13:33:25] ok
[13:54:04] Trey314159 Do you have any experience running the reindex cmds listed at https://wikitech.wikimedia.org/wiki/Search#Recovering_from_an_Elasticsearch_outage/interruption_in_updates ? I was about to give the first one a shot, we lost some indices last night as shown in the scrollback
[14:02:53] inflatador: I have some, but not a lot with lost indices. Do you think it can wait an hour until the Wednesday meeting?
[14:03:02] We should have more people to ask
[14:03:13] Trey314159 yeah, will definitely wait
[14:03:18] cool
[14:27:19] \o
[14:27:32] if the red indices are small wikis, it takes like 15 minutes to build from scratch
[14:28:42] looks like elwikinews, nawikibooks and scnwiki in eqiad? I'll start the rebuilds (per https://wikitech.wikimedia.org/wiki/Search#Full_reindex)
[14:29:11] hmm, i wonder if that really needs the chunk handling, seems overly complex for this purpose
[14:30:03] alternatively, i suppose we could snapshot/restore from codfw to eqiad, but these are so small it's probably irrelevant
[14:32:37] meh, our --startOver flag doesn't work. It says 'Blowing away index to start over...ok' but then a few lines later 'The alias is held by another index which means ...'
[14:32:46] easy enough, just requires manually deleting the missing index
[14:32:49] the red one
[14:36:45] ebernhardson I can delete those if you like
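(A minimal sketch of the checks behind the red-index report at 05:17 and the allocation complaint at 05:26; the psi endpoint and port are taken from the curl commands later in this log and are an assumption about which cluster to query.)

```bash
#!/bin/bash
# Sketch only: the two diagnostics implied above, against the psi endpoint/port
# used elsewhere in this log (an assumption that it is the right cluster).
ENDPOINT=https://search.svc.eqiad.wmnet:9643

# List only the indices currently in red health.
curl -s "$ENDPOINT/_cat/indices?health=red&v"

# Ask the cluster why an unassigned shard cannot be allocated; with no body this
# explains the first unassigned shard it finds (a JSON body of the form
# {"index": ..., "shard": 0, "primary": true} targets a specific one). This is
# where the "stale or corrupt" message quoted at 05:26 comes from.
curl -s "$ENDPOINT/_cluster/allocation/explain"
```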
[14:38:17] inflatador: no biggie, everything should be recovering now. All the indexing was pushed into the job queue, so it might be a couple of minutes before it's fully populated, but it should repair now
[14:39:07] the process amounted to: curl -XDELETE https://search.svc.eqiad.wmnet:9643/
[14:39:18] mwscript extensions/CirrusSearch/maintenance/UpdateOneSearchIndexConfig.php --wiki nawikibooks --cluster eqiad --indexSuffix general --startOver --indexIdentifier now
[14:39:36] and then: mwscript extensions/CirrusSearch/maintenance/ForceSearchIndex.php --wiki nawikibooks --cluster eqiad --queue
[14:39:40] for each of the three indices
[14:40:52] seems i should also update the docs, they didn't really say what i expected here
[14:42:50] nawikibooks and elwikinews are fully recovered now, counts match in eqiad and codfw. scnwiki was a bit bigger (~13k docs) and will take a few more minutes
[14:45:55] all the counts match now, everything should be back to norminal
[14:46:52] doing a really naive comparison with the following, but for three indices it's easy enough: for cluster in eqiad codfw; do echo $cluster; curl https://search.svc.$cluster.wmnet:9643/_cat/indices/elwikinews_general,nawikibooks_general,scnwiki_general; echo; done
[14:47:56] i suppose when we do a reindex like that though we do lose any secondary data in weighted_tags (link recommendations, ORES topics, etc.). Not sure what the best bet is there ... hmm
[14:48:21] similar with popularity_score, although popularity doesn't get used on these indices anyways
[14:48:52] maybe i'll write a quick python loop later today to suck that data out of codfw and push it into eqiad
[14:49:05] longer term i guess we need some other source of truth that knows what that data should be?
[14:50:41] i guess i don't even need a python loop, can probably have the elastic remote-reindex api load the data cross-cluster
[14:55:13] oh, these were all _general indices, so nothing else to recover. weighted_tags are only used in the content index
[14:55:20] even easier :)
[14:59:10] * inflatador scribbles notes
[15:39:15] did something happen at the end of July that dropped our WDQS update lag below SLO? https://grafana.wikimedia.org/d/yCBd7Tdnk/wdqs-wcqs-lag-slo?orgId=1&from=now-90d&to=now&var-cluster_name=wdqs&var-lag_threshold=600&var-slo_period=30d
[15:49:40] mpham: I will ask in the Wednesday meeting in a bit when we finish the current discussion
[15:52:18] mpham: sadly no clue, that's a David question
[15:53:28] ok thanks, i'll file a ticket to keep track of it when he's back
[16:04:52] quick workout, back in ~40
[16:52:12] back
[17:25:04] ryankemper thanks for restarting the cookbook, I didn't notice it had died
[17:28:46] inflatador: I didn't check specifically for it, but I was imagining it was just the start datetime
[17:29:00] i.e. that we've gotten to the end of what the rolling operation cookbook will do for us, so just manually reimaging the last 2
[17:29:07] ACK, sounds good
[17:38:51] lunch/errands, back in ~1h
[18:43:23] back
[19:06:04] Trey314159: reimage is all done in eqiad, feel free to kick off the reindex whenever's convenient
[19:06:21] ryankemper: excellent! thanks!
[19:29:09] lunch break
[20:26:37] back
[20:58:27] meh, of course we also need an es710 branch of WikibaseCirrusSearch, everything gets a branch :P
[21:01:40] * ebernhardson isn't sure yet why WikibaseCirrusSearch, when run through es710, gets a 'term' query instead of a 'match' query. Suspect it's not really supposed to use term (which skips analysis chains and directly looks for the term in the index)
[23:28:40] heading out
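(A consolidated sketch of the per-wiki recovery sequence described at 14:39, not a verbatim copy of what was run. It assumes the same three wikis and the <wiki>_general naming from the 14:46 comparison; the concrete red index name was elided in the 14:39 curl, so it is resolved through the alias here rather than guessed.)

```bash
#!/bin/bash
# Sketch only: loop the 14:39 recovery steps over the three affected wikis.
set -e
ENDPOINT=https://search.svc.eqiad.wmnet:9643

for wiki in elwikinews nawikibooks scnwiki; do
    # Resolve the concrete index behind the <wiki>_general alias and delete it,
    # since --startOver alone refused ("The alias is held by another index ...").
    red_index=$(curl -s "$ENDPOINT/_cat/indices/${wiki}_general?h=index" | head -n 1)
    curl -XDELETE "$ENDPOINT/$red_index"

    # Recreate the index from scratch with a fresh index identifier.
    mwscript extensions/CirrusSearch/maintenance/UpdateOneSearchIndexConfig.php \
        --wiki "$wiki" --cluster eqiad --indexSuffix general --startOver --indexIdentifier now

    # Queue a full re-population of the new index through the job queue.
    mwscript extensions/CirrusSearch/maintenance/ForceSearchIndex.php \
        --wiki "$wiki" --cluster eqiad --queue
done
```

The 14:46 one-liner then serves as the sanity check that eqiad document counts match codfw once the job queue drains.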