[09:52:35] lunch
[09:54:26] lunch
[13:12:14] having bouncer issues again ;(
[13:21:47] o/
[14:41:06] \o
[14:42:28] o/
[14:51:40] .o/
[14:54:50] We've got a few shards that aren't replicating, such as `cuwiki_general_1728073763` on psi. I don't see it in the output of `/_cluster/allocation/explain`...I'll keep looking, but if anyone has ideas on it LMK
[14:56:59] curious, looking
[15:01:14] inflatador: I suspect by default you don't get all explanations
[15:01:26] seems like you can get it with: curl -s -XGET -HContent-Type:application/json -d '{"index":"cuwiki_general_1728073763", "shard":0, "primary":false}' https://search.svc.eqiad.wmnet:9643/_cluster/allocation/explain | jq .
[15:01:35] inflatador: it looks like it's because that API "finds an unassigned shard and explains why it can’t be allocated to a node", it doesn't report all of them
[15:01:50] yea, david has the rest :) I was looking at the same
[15:03:12] it looks like there are nodes that say yes, it should be ready to put the replicas somewhere?
[15:05:23] ebernhardson: I don't see those, did you specify other arguments to force a check on other nodes?
[15:05:49] I've been running `_cluster/reroute?retry_failed=true` after every reimage, not sure why it's still failing
[15:06:28] dcausse: hmm, nothing special. I ran: https://search.svc.eqiad.wmnet:9643/_cluster/allocation/explain -H 'Content-Type: application/json' -d '{"index": "cuwiki_general_1728073763", "primary": true, "shard": 0}'
[15:06:46] oh, silly me, primary should be false
[15:07:28] something I don't get is e.g. cirrussearch1102 fails with a Lucene index version error
[15:07:48] I thought that a node being named cirrussearch1XXX would mean it's running opensearch
[15:08:47] hmm, indeed those are curious
[15:09:01] and if it's running opensearch it should not fail with "IndexFormatTooNewException"...
[15:10:23] ah could be me misunderstanding the explanation output
[15:11:11] dcausse: looking closer I think you are right, the output is very odd. The explanation can be that it saw the shard fail too many times elsewhere?
[15:11:12] * dcausse rereads the doc
[15:14:25] or it's "max_retry" and the error is just for historical reasons and not related to the target node?
[15:14:58] might be it, the error string is always the same
[15:15:59] it's a curious error, it says failed on node R6WEXEKMT9yz3R4SuxSWdw but I don't see that node in the cluster state
[15:16:13] :/
[15:16:18] could be gone?
[15:16:24] that's what I'm thinking, yea
[15:17:22] calling /_cluster/reroute?retry_failed=true
[15:18:18] I see things moving but only 2 shards, perhaps we could bump this a bit?
[15:19:07] I tried to bump up the number of active recoveries to 20, maybe I missed a zero? One sec
[15:19:18] "node_concurrent_incoming_recoveries": "20"
[15:19:39] hmmm
[15:19:42] no I see 20... weird
[15:20:21] oh well we're green now
[15:20:39] inflatador: did you call /_cluster/reroute?retry_failed=true on all 3 clusters?
[15:21:14] Damn! I was calling it for CODFW again
[15:21:20] :)
[15:21:52] Thanks...the perils of relying on CMD-R too much ;(
[15:23:05] inflatador: if you have a minute some time today, could you have a look at https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/1148304 please (pull dependencies from GitLab rather than archiva)?
[15:23:27] please
[15:23:33] just called reroute on all 3 eqiad clusters, hopefully that clears things out.
[15:24:08] pfischer 👀 . Can you give me an example CURRENT_VERSION_NUMBER so I can test manually?
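A note on the allocation explain API used above: called with an empty body it picks a single unassigned shard and explains only that one, which is why `cuwiki_general_1728073763` didn't show up at first. Below is a minimal sketch for walking every unassigned shard and asking for an explanation of each; the host and port are copied from the psi commands above, and the loop itself is an illustration rather than an established runbook.

```bash
#!/bin/bash
# Sketch: explain every unassigned shard on one cluster.
# Host/port copied from the psi cluster commands above; adjust as needed.
HOST=https://search.svc.eqiad.wmnet:9643

# Default _cat/shards columns are: index shard prirep state ...; keep UNASSIGNED rows.
curl -s "$HOST/_cat/shards" | awk '$4 == "UNASSIGNED" {print $1, $2, $3}' |
while read -r index shard prirep; do
  primary=false
  [ "$prirep" = "p" ] && primary=true
  echo "=== $index shard $shard (primary=$primary) ==="
  curl -s -XGET -H 'Content-Type: application/json' \
    -d "{\"index\":\"$index\",\"shard\":$shard,\"primary\":$primary}" \
    "$HOST/_cluster/allocation/explain" | jq '.allocate_explanation? // .can_allocate'
done
```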
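Shards that have exhausted `max_retry` stay unassigned until a reroute with `retry_failed=true` is issued, and as the exchange above shows, it has to be sent to every cluster in the right datacenter. Here is a sketch of hitting all three eqiad clusters in one pass; only the psi port (9643) appears in the log, so 9243 and 9443 for the other two clusters are assumptions.

```bash
# Retry previously failed allocations on all three eqiad clusters.
# 9643 (psi) comes from the log; 9243 and 9443 are assumed for the other two.
for port in 9243 9443 9643; do
  echo "--- eqiad :$port ---"
  curl -s -XPOST "https://search.svc.eqiad.wmnet:$port/_cluster/reroute?retry_failed=true" \
    | jq '.acknowledged'
done
```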
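On the recovery-throughput question: `node_concurrent_incoming_recoveries` (default 2) only caps the receiving side, and `node_concurrent_outgoing_recoveries` (also default 2) on the source nodes can still throttle things, which is one possible reading of the "only 2 shards moving" observation. A sketch of checking the effective settings and raising the cap transiently, with the 20 mirroring the value quoted above:

```bash
# Check the effective recovery-concurrency settings (defaults included):
curl -s "https://search.svc.eqiad.wmnet:9643/_cluster/settings?include_defaults=true&flat_settings=true" \
  | jq . | grep -i concurrent

# Raise the incoming-recovery cap transiently (20 mirrors the value quoted above):
curl -s -XPUT -H 'Content-Type: application/json' \
  -d '{"transient":{"cluster.routing.allocation.node_concurrent_incoming_recoveries":20}}' \
  https://search.svc.eqiad.wmnet:9643/_cluster/settings | jq .
```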
[15:26:52] inflatador: sure: 0.3.156
[15:47:08] ryankemper sorry for late notice, I moved up pairing today as I can't make the normal time. Wanted to talk about T143553 since I've been struggling a bit w/it and you have more recent experience with LVS
[15:47:08] T143553: Switching search traffic between datacenters should be faster - https://phabricator.wikimedia.org/T143553
[15:57:20] inflatador: sounds good!
[16:00:59] pfischer +1'd
[16:31:35] dinner
[16:54:06] EQIAD is now 100% on OpenSearch!
[17:01:44] So we're getting alerts for `CirrusSearchTitleSuggestIndexTooOld` and I'm trying to figure out why. Both alerts are absented in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/mediawiki/maintenance/cirrussearch.pp#15 and it appears they were moved to k8s in T388538 ?
[17:01:48] T388538: Migrate discovery-search jobs to mw-cron - https://phabricator.wikimedia.org/T388538
[17:10:46] inflatador: that would mean they live in operations/alerts
[17:10:53] alertmanager
[17:11:59] ebernhardson I'm wondering if the alerts needed to be updated, or if the mw-cron stuff doesn't work...any ideas?
[17:12:34] inflatador: we intentionally have it absented, sec
[17:13:34] ebernhardson sorry, I dunno why I said 'alerts'...I meant jobs. I understand we moved the jobs to mw-cron, just curious if the new alerts are valid
[17:15:04] regardless, I guess we can re-enable the eqiad job
[17:15:17] the k8s-based one, that is
[17:15:20] inflatador: yea, I think we just need to enable the eqiad job in puppet? It's commented out afaik
[17:15:24] I can make you a patch, sec
[17:16:00] inflatador: I think this is it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151274
[17:17:30] That was fast!
[17:17:31] thanks
[17:31:44] lunch, back in ~40
[18:13:10] back
[18:16:12] Trey314159 my friend Ann is a huge fan of SpecGram! She was impressed when I told her I worked w/you
[18:17:08] inflatador: Hah, that's funny, I do autographs for a reasonable fee!
[18:21:57] * inflatador never knew he worked with a celebrity ;P
[18:56:26] inflatador: few mins late to pairing
[18:58:02] ryankemper ACK. np
[19:19:11] ebernhardson I know it's a weird time, but if you're around ryankemper & I are in https://meet.google.com/eki-rafx-cxi?authuser=0
[19:19:25] we're talking T143553
[19:19:27] T143553: Switching search traffic between datacenters should be faster - https://phabricator.wikimedia.org/T143553
[19:22:29] inflatador: sure, sec
[19:28:45] ryankemper: inflatador: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151300
[19:41:32] ryankemper: inflatador: https://gerrit.wikimedia.org/r/c/operations/dns/+/1151304
[20:31:12] ebernhardson ryankemper here's the final(?) patch for envoy: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151316
[20:34:04] makes sense, thanks!
[20:42:17] break, back in ~15
[21:12:50] non-LVS patch for updating conftool after the last eqiad row: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151294
[21:49:04] back
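Background for the T143553 / conftool discussion above: datacenter traffic for a discovery service is steered by pooling or depooling its dnsdisc record with confctl. A hypothetical sketch follows; the object name `search` is an assumption for illustration and is not taken from the patches linked above.

```bash
# Inspect, then flip, the pooled state of a discovery record for eqiad.
# 'search' is a hypothetical dnsdisc object name, used here only as an example.
confctl --object-type discovery select 'dnsdisc=search,name=eqiad' get
confctl --object-type discovery select 'dnsdisc=search,name=eqiad' set/pooled=true
```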
[22:04:05] ryankemper it's a little too late for today, but what do you think about doing a rolling restart of eqiad before we repool it? I remember doing the rolling operation for cloudelastic OpenSearch, but I think it'd be good to do it in prod as well
[22:04:57] SGTM
[22:05:15] cool, I'll give it a shot tomorrow
[22:20:36] I'm out for the day, but here's a patch for disabling alerts for relforge if anyone has time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151381
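For the rolling-restart plan above, the usual per-node sequence on an Elasticsearch/OpenSearch cluster is: disable replica allocation, restart the node, re-enable allocation, and wait for green before moving on. A condensed sketch of that loop under placeholder host and service names; in production this would normally be driven by the existing rolling-operation cookbook rather than by hand.

```bash
CLUSTER=https://search.svc.eqiad.wmnet:9643   # psi port from the log; placeholder cluster
for node in host1 host2 host3; do             # placeholder host list
  # Stop replica allocation so the restart doesn't trigger a shard shuffle.
  curl -s -XPUT -H 'Content-Type: application/json' "$CLUSTER/_cluster/settings" \
    -d '{"transient":{"cluster.routing.allocation.enable":"primaries"}}'
  ssh "$node" 'sudo systemctl restart opensearch'   # service name is an assumption
  # Re-enable allocation and block until the cluster is green again.
  curl -s -XPUT -H 'Content-Type: application/json' "$CLUSTER/_cluster/settings" \
    -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'
  curl -s "$CLUSTER/_cluster/health?wait_for_status=green&timeout=30m" | jq '.status'
done
```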