[10:07:21] weekly update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2024-02-09
[10:49:58] errand + lunch
[14:32:19] o/
[14:59:13] cindy has an old version of the elastic plugins, wondering if that's the reason why it's failing on Acronyms, making a new image to see if that helps
[15:06:16] banning the cloudelastic nodes that need to migrate again...we will still have 6 hosts in the cluster
[15:30:39] I'm running a backfill for wikidatawiki for about 4 hours' worth of updates...the ForceSearchIndex.php script has been going for about a day and a half. Does that sound reasonable for 4hrs of updates?
[15:30:59] exact cmd: `ForceSearchIndex.php --wiki wikidatawiki 2024-02-07T19:00:14Z 2024-02-07T23:00:14Z`
[15:38:19] inflatador: hm... seems weird that start and end dates are not behind a flag, looking
[15:38:32] I may have messed it up then
[15:38:34] might be re-indexing the whole wiki
[15:39:04] just killed it
[15:39:49] I'll compose a new cmd and have you QC it if that's OK
[15:40:13] inflatador: should be: --wiki wikidatawiki --from START_DATE --to END_DATE --cluster CLUSTER
[15:40:19] sure
[15:41:10] dcausse thx. `ForceSearchIndex.php --wiki wikidatawiki --from '2024-02-07T19:00:14Z' --to '2024-02-07T23:00:14Z' --cluster cloudelastic`
[15:42:47] inflatador: looking at the scripts I got from Erik, it's a set of 3 commands
[15:42:53] pasting this somewhere
[15:48:05] inflatador: should be something along those lines: https://phabricator.wikimedia.org/P56589
[15:48:53] quick errand
[15:48:58] dcausse ACK thanks
[15:52:43] inflatador: oops, please ignore the --archive part, it does not have to run on cloudelastic
[15:54:20] hmm, I'm getting `Fatal error: no version entry for `--cluster``. Looking at wmf-config/CirrusSearch-production.php for hints
[15:54:58] updated https://phabricator.wikimedia.org/P56589 with the exact cmd I'm running
[15:55:12] ah...wait
[15:55:17] I did not set the wiki var
[15:56:03] OK, all seems well after I set that var
[15:59:39] \o
[16:00:49] in particular, that `--queue` option is super important for ForceSearchIndex.php. Without it I think it tries to do every write in-process
[16:07:49] Adding to my notes, sorry I messed that one up. The script (run with the correct args this time) is finished now, if we need to prune some jobs from the job runner or anything LMK
[16:10:28] I guess it'd be from the job queue/kafka, that is? https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1
[16:11:04] nothing to prune, it will work through whatever is there
[16:25:24] The backfill finished
[17:01:06] https://gerrit.wikimedia.org/r/c/operations/puppet/+/999088 is ready for review if anyone has time to take a look. cloudelastic1008 is in limbo due to HW problems, so moving on to cloudelastic1007
[17:01:28] workout, back in ~40
[17:04:28] o/
[18:36:42] DC Ops fixed cloudelastic1008, so should be able to finish that host today. Lunch first, though...back in ~40
[19:05:50] figured out why we are still getting cloudelastic writes: the read-only detection was casting a 0 returned by array_search into false. Fix: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/999974
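(For context on the 19:05:50 message: PHP's array_search returns the matching key, which can be 0, and 0 == false under loose comparison. The sketch below is a minimal illustration of that gotcha, not the actual CirrusSearch code; the variable and cluster names are made up.)

```php
<?php
// Minimal sketch of the PHP array_search gotcha; not the actual
// CirrusSearch code, and the cluster names are invented.
$readOnlyClusters = [ 'cloudelastic', 'codfw' ];

// array_search() returns the key of the first match: here 0 for
// 'cloudelastic'. With a loose check, 0 == false, so the cluster is
// wrongly treated as writable.
if ( array_search( 'cloudelastic', $readOnlyClusters ) == false ) {
	echo "writes allowed (buggy result)\n";
}

// Strict comparison distinguishes "found at key 0" from "not found".
if ( array_search( 'cloudelastic', $readOnlyClusters ) !== false ) {
	echo "cluster is read-only, writes skipped\n";
}
```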
[19:13:13] back
[19:13:15] {◕ ◡ ◕}
[19:19:45] cloudelastic1008 is fixed and it just finished reimaging. ryankemper or anyone else, if you have time to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/998498, that should finish its migration
[19:20:36] inflatador: looking
[19:29:36] LGTM
[19:30:51] ryankemper excellent, thanks
[19:48:27] looks like adding/removing the elastic repo config affected the ES keystore. That likely means we can't restart ES services until this is fixed...working on it now
[19:51:05] https://gerrit.wikimedia.org/r/c/operations/puppet/+/809276/2/modules/elasticsearch/manifests/instance.pp probably something like this
[19:51:40] seems plausible
[19:51:56] inflatador: won't the `unless` statement prevent that from running though?
[19:54:47] ebernhardson I think so...but this is a new problem w/cloudelastic1008, definitely didn't happen with the previously migrated cloudelastic hosts. Might just be a circular dependency...still checking
[19:56:16] stack traces look to be coming from `org.elasticsearch.repositories.s3.S3RepositoryPlugin`
[19:58:39] but the 2 smaller clusters look fine...hmm
[19:59:32] hmm, looking
[19:59:47] and I def deleted the repo config from all 3, looking at my cumin history now
[20:00:19] permissions for chi look fine on 1008, where are you seeing the chi logs? It's not in /var/log/elasticsearch somehow
[20:00:46] ebernhardson oh you're right, I was looking at psi
[20:01:22] journalctl doesn't have anything useful
[20:01:27] indeed, hmm
[20:02:12] I just re-ran puppet, let's see if that makes a diff
[20:02:24] -erence, that is...it did make a diff
[20:03:53] inflatador: it's dying with a timeout, can we increase the timeout?
[20:04:27] should be configurable w/systemd, checking
[20:06:29] nothing explicit in /lib/systemd/system/elasticsearch_7@.service , checking the jvm options file
[20:08:26] going to manually add a timeout of 5m to the unit file and see what happens
[20:08:39] seems plausible
[20:09:45] OK, just started...let's see what happens
[20:11:51] hmm, it's doing something.
[20:12:22] just when I was about to dust off strace ;P
[20:12:45] Increasing the limit seems to have done the trick
[20:12:49] the timeout, that is
[20:12:59] not sure why, but it seems elastic needed more time than systemd expected
[20:13:37] Y, I noticed our unit file explicitly ignores `TimeoutStopSec`
[20:14:14] anyway, glad it has nothing to do with the keystore
[20:14:33] will get a puppet patch up to tweak the startup value
[20:28:16] OK, here's the CR https://gerrit.wikimedia.org/r/c/operations/puppet/+/1000018
[20:28:59] +1
[20:29:36] ACK, thanks
[20:39:51] fell just a little short of the 1,000,000th CR ;(
[21:11:10] quick break, back in ~20
[21:28:57] back
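(For reference on the 20:08–20:28 timeout discussion: a systemd override for a longer startup window would look roughly like the drop-in below. This is a hedged sketch assuming the relevant knob is TimeoutStartSec; the path and value are illustrative, not the contents of the puppet patch linked at 20:28:16.)

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/elasticsearch_7@.service.d/timeout.conf
# (sketch only; the real change went through the puppet CR above)
[Service]
# Give Elasticsearch up to 5 minutes to finish starting before systemd
# gives up; the stock default of 90s proved too short on cloudelastic1008.
TimeoutStartSec=300
```

A drop-in like this only takes effect after `systemctl daemon-reload` and a restart of the affected instance.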