[12:34:38] ebernhardson: if you are around, I would appreciate your help on the spark-submit test runs.
[13:03:49] we still need to roll-restart EQIAD to finish T397227, I'll do that once I have a breather between meetings ;)
[13:11:39] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227
[13:43:32] \o
[13:43:37] pfischer: sure, just getting here, what's up
[14:19:15] * ebernhardson really thinks array_map in php should work on generators... pretty meh that it throws an exception
[14:49:05] * cormacparle waves
[14:49:18] cormacparle: howdy
[14:49:27] how's that A/B test on the completion suggester going?
[14:50:15] cormacparle: took longer to get the code out :( Merged June 30, lemme see if it made last week's trains
[14:50:44] yup, it made it into wmf.8, should be able to ship the test today
[14:50:52] ace!
[14:51:14] double-checking: we did roll out wmf.8 everywhere last week, so should be good to go
[14:52:06] i've got it on the deployment schedule now
[14:52:33] 👍
[14:58:20] I'll be 3-5' late for triage
[15:00:55] triage or staff meeting?
[15:03:58] oh, my bad. staff meeting
[15:58:21] Shall we run a quick triage (in 2')? https://meet.google.com/eki-rafx-cxi?authuser=0
[16:02:15] sure, sec
[16:06:34] Trey314159: are you around?
[16:09:32] pfischer: sudo -u analytics-search kerberos-run-command analytics-search yarn logs -applicationId application_1741864027385_485238 | less
[16:41:58] pfischer: I am around, but I didn't get a notification when you pinged me
[17:04:08] cirrussearch is unhappy in EQIAD
[17:04:26] is it part of some operational thing?
[17:04:37] yeah, I was restarting the cluster
[17:04:59] I'm not sure why it broke quorum but I'm gonna fail over to CODFW.
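[Editor's note: the exchange above ends with a broken-quorum diagnosis read off `_cat/health`. As an illustration only, here is a minimal sketch of parsing that endpoint's output to decide cluster health; it assumes the default `_cat/health` column layout (epoch, timestamp, cluster, status, ...) and is not the operators' actual tooling.]

```python
# Hypothetical sketch: classify a cluster from one line of `GET _cat/health`.
def parse_cat_health(line: str) -> dict:
    """Parse one line of default `_cat/health` output into a small dict."""
    fields = line.split()
    return {
        "epoch": int(fields[0]),
        "cluster": fields[2],
        "status": fields[3],  # green / yellow / red
    }

def is_healthy(line: str) -> bool:
    # During a broken-quorum event the request often times out entirely,
    # so an empty or partial response is treated as unhealthy as well.
    try:
        return parse_cat_health(line)["status"] == "green"
    except (IndexError, ValueError):
        return False
```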
Hopefully DNS discovery works
[17:06:25] kk
[17:07:48] OK, we're failed over to CODFW, let me check the dashboard
[17:08:04] certainly I'm getting some timeouts trying to hit search.svc.eqiad.wmnet:9243
[17:08:55] oh, actually it's timing out trying to fetch _cat/health; it returns the banner fine. That's the same cluster quorum problem
[17:09:40] [2025-07-07T17:09:33,281][WARN ][o.o.d.SeedHostsResolver ] [cirrussearch1100-production-search-eqiad] failed to resolve host [elastic1100.eqiad.wmnet]
[17:10:07] inflatador: the problem is that should say cirrussearch1100.codfw.wmnet maybe? (that's the log from cirrussearch1100:/var/log/opensearch/production-search-eqiad.log)
[17:10:16] cirrussearch1100.eqiad.wmnet even
[17:10:23] ebernhardson ah thanks, looking at it now
[17:10:43] best guess is we have old hostnames somewhere, perhaps in the seeds
[17:11:06] yup https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/eqiad/elasticsearch/cirrus.yaml#14
[17:11:21] wait, that's elastic. It shouldn't apply to opensearch
[17:11:23] checking
[17:12:41] hmm, opensearch.yml has cirrussearch seeds, "elastic1100" isn't found anywhere in the .yml :S
[17:12:57] some problem in master state? not sure...
[17:15:01] and gerrit seems down ;(
[17:15:44] nm, got it
[17:15:55] master info is here v
[17:15:56] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/eqiad/cirrus/opensearch.yaml
[17:18:57] hmm, none of the masters think they are master :S
[17:19:45] yeah, I'm shutting down 1100 to see if it helps
[17:20:24] 1122 became master
[17:20:49] still red, but maybe recovering
[17:21:00] looks like it
[17:21:52] looks like shutting down 1100 did the trick?
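[Editor's note: the SeedHostsResolver warning above comes from a stale hostname (elastic1100.eqiad.wmnet after the hosts were renamed to cirrussearch*). A quick sketch of the kind of check implied here, not the production tooling: walk a list of configured seed hosts and report any that no longer resolve.]

```python
# Sketch: find seed-host entries that fail DNS resolution.
import socket

def unresolvable_seeds(seed_hosts):
    """Return the subset of seed hostnames that fail DNS resolution."""
    bad = []
    for host in seed_hosts:
        # discovery.seed_hosts entries may carry a :port suffix; strip it.
        name = host.rsplit(":", 1)[0]
        try:
            socket.getaddrinfo(name, None)
        except socket.gaierror:
            bad.append(host)
    return bad
```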
It's curious, I did see the elastic1100.eqiad.wmnet resolve failure on multiple hosts,
[17:22:04] but grep doesn't find that string in the opensearch config anywhere
[17:22:48] hmm, no indices in active recovery, 2957 unassigned shards
[17:22:53] yeah, it did fix it. The red is from an orphan alias
[17:22:57] about to delete
[17:23:45] OK, we're back to yellow
[17:24:53] almost no replicas anywhere :S But hopefully it will start picking up the pieces
[17:25:01] 1405 primaries, and 1405 active shards :P
[17:25:46] let me bump up concurrent recoveries
[17:25:58] "explanation": "replica allocations are forbidden due to cluster setting [cluster.routing.allocation.enable=primaries]"
[17:26:16] ohhhhh, that is left over from the cookbook failure
[17:26:17] probably was set by the restart script, needs to be turned off
[17:31:15] ok I disabled that, it's now doing a ton of recovery operations
[17:32:15] OK, that's enabled
[17:32:24] oops, I guess you beat me to it
[17:34:37] only 34 unassigned, getting there
[17:35:08] ah, then I guess I won't bother to change recovery settings
[17:35:23] I'm gonna repool the DC as well
[17:35:39] ebernhardson ^^ LMK if you think this is a bad idea
[17:36:22] actually, let me fix 1100 and run the cookbook again before we do that
[17:38:40] inflatador: should be ok, but sure, let's make sure 1100 is not going to blow it back up again
[17:39:07] ebernhardson confirmed, I blew away the datadir on 1100 and started OS, it's joined the cluster cleanly
[17:39:14] kk
[17:40:35] OK, we are repooled
[17:44:06] Now, what's the best panel to track the outage? https://grafana.wikimedia.org/goto/_BugULyHg?orgId=1 is the first thing I've found
[17:44:34] thread pool queue is reasonable, it tells you how busy the cluster is
[17:45:18] got it. This may be the best, then?
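[Editor's note: the fix described above — clearing the leftover `cluster.routing.allocation.enable=primaries` so replicas can allocate again — goes through the standard OpenSearch cluster-settings API. A hedged sketch, with a placeholder host/port; the setting and endpoint names are the standard API, but this is not the operators' actual script.]

```python
# Sketch: re-enable shard allocation via PUT _cluster/settings.
import json
import urllib.request

def allocation_payload(value="all", scope="transient"):
    """Build the `_cluster/settings` body that re-enables allocation.

    Passing value=None clears the override entirely.
    """
    return json.dumps({scope: {"cluster.routing.allocation.enable": value}})

def reenable_allocation(base_url="https://search.svc.codfw.wmnet:9243"):
    # base_url is a placeholder; real calls also need cluster credentials.
    req = urllib.request.Request(
        base_url + "/_cluster/settings",
        data=allocation_payload().encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:  # network call, not run here
        return json.load(resp)
```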
https://grafana.wikimedia.org/goto/zkQW8YsNR?orgId=1
[17:45:42] yea
[17:54:00] created T398856 for the incident report, I'll flesh it out as time permits
[17:54:01] T398856: Write incident report for cirrussearch outage 2025-07-07 17:10-17:40 UTC - https://phabricator.wikimedia.org/T398856
[17:58:43] lunch, back in ~40
[18:49:44] pfischer: doh, this should have been more obvious earlier. The problem is the venv you built has: venv/bin/python3.10: Mach-O 64-bit arm64 executable, flags:
[18:50:08] will see if I can convince gitlab to build it from the branch, probably can
[18:51:08] back
[18:56:13] finishing off the eqiad restart now
[18:56:15] pfischer: selecting `publish_conda_env` from the pipelines page, it built https://gitlab.wikimedia.org/repos/search-platform/discolytics/-/packages/1444 which looks reasonable
[19:03:15] * ebernhardson finds, reviewing the reindexer output... it's way too spammy :P need to think about what would actually be important for it to print as it goes
[20:15:31] sigh... it turns out we don't inject the cirrusUserTesting=abc:xyz into the way autocomplete currently runs its queries... so the test is started but I don't think anyone passes the right api parameters (
[20:39:41] suspect that's going to be harder than expected... but maybe the experiment platform stuff would work around it. We have to provide the test in the api call because of caching, but with experiment platform caching would be handled for us
[20:43:10] ebernhardson: if you think it would work with the experiment platform, it'd be a good test case for us and probably for them, too. It's worth trying if you want to.
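[Editor's note: the venv bug above — a Mach-O arm64 `python3.10` built on an Apple-silicon laptop shipped into a Linux/amd64 environment — is the kind of thing `file` spots from header magic bytes. A hypothetical debugging helper, not the real pipeline; the magic numbers are the standard Mach-O and ELF header constants.]

```python
# Sketch: classify the first 8 bytes of an executable, like `file` does.
import struct

def binary_format(header: bytes) -> str:
    if header[:4] == b"\x7fELF":
        return "ELF (Linux)"
    if header[:4] == b"\xcf\xfa\xed\xfe":  # MH_MAGIC_64, little-endian on disk
        cpu = struct.unpack_from("<I", header, 4)[0]  # cputype field
        if cpu == 0x0100000C:  # CPU_TYPE_ARM64
            return "Mach-O 64-bit arm64 (macOS)"
        if cpu == 0x01000007:  # CPU_TYPE_X86_64
            return "Mach-O 64-bit x86_64 (macOS)"
        return "Mach-O 64-bit (unknown cpu)"
    return "unknown"
```

A `venv/bin/python3.10` that reports as Mach-O can never run on the Linux Hadoop workers, which is why rebuilding the conda env in CI (the `publish_conda_env` pipeline above) fixes it.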
[20:44:57] yea, it's probably not the worst idea, and it would allow starting tests without going through the deploy window
[20:59:54] will look into it, but iirc there was something about how we wanted a high % of a small subset of users (searchers vs page views), where they do a low % of a large set of users (everyone)
[21:51:43] ebernhardson: thanks for looking into it, and yes, now that you say it, I saw mentions of ARM 64 going by in the conda logs but didn't think about it.
[23:04:57] ebernhardson Trey314159 heads up, the production clusters have been restarted and the new plugins should be showing up. Hit us up if you notice anything amiss (ref T397227)
[23:04:59] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227
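[Editor's note: the sampling discussion above — enrolling a high percentage of a small subset of users (searchers) versus a low percentage of everyone — implies deterministic, stateless bucketing so that the same user always lands in the same bucket regardless of caching. An illustrative sketch only; the token and bucket names are made up, and only the hash-based idea is grounded in the log.]

```python
# Sketch: deterministic hash-based A/B bucketing for a sampled population.
import hashlib

def bucket_for(token, test_name, sample_rate, buckets=("control", "test")):
    """Return a bucket for `sample_rate` of tokens, or None if not enrolled."""
    digest = hashlib.sha256(f"{test_name}:{token}".encode()).digest()
    draw = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    if draw >= sample_rate:
        return None  # outside the sampled fraction
    # A separate byte of the digest picks the arm, independent of enrollment.
    return buckets[digest[8] % len(buckets)]
```

Because the bucket is a pure function of the token, the same assignment can be recomputed on every request, which is what makes it compatible with cached API responses keyed on the `cirrusUserTesting` parameter.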