[08:28:13] wrt https://phabricator.wikimedia.org/T321310: elastic&relforge eqiad restarts are all done, still need to do codfw and cloudelastic tomorrow
[08:28:36] eqiad & relforge elastic restarts*
[10:33:56] lunch
[13:24:38] o/
[14:33:22] \o
[14:37:27] o/
[14:41:32] hmm, only index that failed dumps last week is zhwiki-general, with a PartialShardFailureException. It retried a few times over 2s and gave up. I guess it needs a backoff
[14:42:00] they are all node_not_connected_exception, perhaps we were having network issues at that moment
[14:42:22] I'm still seeing chatter about eqiad row D here and there
[14:42:29] (no idea if that was a factor)
[14:42:56] have to decide what to do about it being slow though, took 10 days to run the dump and we start a new dump every 7 days :)
[14:43:45] it's not particularly easy to do parallel/async requests from php, suspect our only real option is to dump multiple wikis in parallel from the bash side
[14:43:55] not that controlled concurrency is easy from bash either :P
[14:45:07] LOL
[14:45:57] can we do it from Elastic itself? Maybe a snapshot?
[14:46:10] we do a single dump at a time apparently
[14:46:16] not really, what we want here are not really backups but publicly readable dumps in json format
[14:46:25] we could do it from hadoop
[14:46:26] oh yeah
[14:46:37] the completion suggester is running 4 in parallel
[14:46:44] also, rebooting eqiad ATM
[14:46:59] dcausse: i'd have to check, what is it using to run parallel queries?
[14:47:01] per T321310 if interested
[14:47:15] dcausse: or is that from bash?
[14:47:21] ebernhardson: yes
[14:47:24] xargs I guess?
[14:48:16] yes: /usr/local/bin/expanddblist all | xargs -I{} -P 4 sh -c "..."
[14:48:55] error handling is a bit messier tho IIRC
[14:50:43] will have to see if I can get xargs to call a bash function or something, this does some small bits of extra stuff per iteration: https://github.com/wikimedia/puppet/blob/production/modules/snapshot/files/systemdjobs/dumpcirrussearch.sh#L100-L112
[14:51:27] Maybe gnu parallel would help? https://www.gnu.org/software/parallel/
[14:51:55] same general problem, we want to invoke more bash code for each iteration as opposed to a single command
[14:52:18] i think it's possible...but never done that
[14:52:30] worst case could make a second bash script to be invoked by xargs/parallel
[14:52:42] I have a friend who swears by parallel, been awhile since I've used it though
[14:53:00] i use it for some things, but generally things xargs can't do like split an input pipe to multiple output processes
[14:57:57] Quick workout, back in ~30
[15:44:24] back from workout but forgot I need to go to parent/teacher conf. ryankemper are you around to watch the eqiad reboot? It's running in a tmux window on cumin2002
[16:10:15] Looks like it's working, I'm gonna step out for ~30m and try to hit the 2nd conference
[16:51:18] back
[16:58:37] ebernhardson: you've probably heard the song Hallelujah, which is by Leonard Cohen.. though there are lots of covers. One cover was even in Shrek! (Jeff Buckley's version is the best, IMO.)
[17:04:46] Trey314159: interesting, yea i've certainly heard that before
[17:04:56] maybe in shrek, but probably before that :)
[17:05:07] i thought it was much older i suppose
[17:06:37] Yeah, the original is from 1984. BTW, Polyphia is good stuff!
[17:07:01] their ABC song is weird, but other than that i like most of what they do :)
[17:22:01] OK, rebooting codfw now, got turned around and rebooted eqiad by accident ;(
[17:43:02] still at "-1.0%" on a shard recovery...?
[17:46:05] we're up to 0.0!
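[editor's note: a minimal sketch of the "xargs calling a bash function" idea discussed above. `dump_one_wiki` is a hypothetical stand-in for the per-wiki steps in dumpcirrussearch.sh, and the wiki list is faked with printf instead of expanddblist; the real point is `export -f` plus invoking `bash -c` (not `sh -c`), since exported functions are only visible to a child bash.]

```shell
#!/bin/bash
# Hypothetical per-wiki function standing in for the extra steps done
# per iteration in dumpcirrussearch.sh (lines 100-112 of the real script).
dump_one_wiki() {
    local wiki="$1"
    # ...per-wiki setup and the actual dump invocation would go here...
    echo "dumped ${wiki}"
}
# export -f makes the function visible to the child bash that xargs spawns.
export -f dump_one_wiki

# Stand-in for: /usr/local/bin/expanddblist all
# -P 4 runs up to four dumps concurrently, one function call per wiki.
printf '%s\n' enwiki zhwiki frwiki |
    xargs -I{} -P 4 bash -c 'dump_one_wiki "$1"' _ {}
```

[editor's note: GNU parallel can reportedly invoke exported functions the same way (`parallel -j4 dump_one_wiki ::: ...`), which may sidestep the "second bash script" workaround; per-wiki failures still need checking since xargs only reports an aggregate exit status.]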
Also, I'm eating lunch but will keep an eye out regardless
[18:51:02] forgot about the wdqs data reload again, I need to put that on my calendar ;(
[18:51:53] update: it's still going in a tmux window on cumin2002
[18:52:05] we're up to "Processing wikidump-000000716.ttl.gz" if that means anything
[18:52:16] yea it takes awhile, just the first step of generating the data files takes like a day
[18:52:48] should be all good by monday, just have to remember to validate the instances and transfer the databases then
[18:52:56] * ebernhardson isn't entirely sure how to validate the data :P
[19:13:20] wait for users to scream? ;P
[20:23:00] Looks like codfw just finished up
[20:32:08] All we've got left are `wcqs100[1-6]`, I'll get started on those
[20:32:44] ryankemper sounds good
[21:00:18] ryankemper FYI, I can't find our elasticsearch-oss package in bullseye repos anymore, maybe I'm missing something? https://apt-browser.toolforge.org/
[21:02:22] looks like component/elastic710 is missing, I vaguely remember us trying to move the elasticsearch-oss pkg into thirdparty/elastic710, but it's not there either
[21:02:48] inflatador: yeah I'm kind of in the same place (vague memories but also confused :D)
[21:05:38] creating a phab task now, will link it shortly
[21:07:13] inflatador: https://apt.wikimedia.org/wikimedia/pool/thirdparty/elastic710/e/
[21:07:41] maybe it's that?
[21:08:25] i'm a bit confused on whether it needs to be in a distro specific path as well tho since that doesn't say anything about bullseye specifically
[21:09:35] yeah, this is weird
[21:09:43] https://phabricator.wikimedia.org/T321414
[21:13:30] about to step out, but it looks like my test server throws errors when I add the repo file. Despite that, it has no problem finding and installing the elasticsearch-oss pkg
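[editor's note: for context on the apt discussion above, a hedged sketch of what the repo entry in question would typically look like under the usual apt.wikimedia.org suite/component layout. The suite name and whether `thirdparty/elastic710` is actually published for bullseye are assumptions; that availability gap is exactly what T321414 tracks, and the real file is puppet-managed.]

```
# /etc/apt/sources.list.d/elastic710.list (sketch, not the actual puppet-managed file)
deb http://apt.wikimedia.org/wikimedia bullseye-wikimedia thirdparty/elastic710
```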