[01:25:12] Created the HW request ticket for dc-ops to take a look at the out-of-service codfw refresh hosts which are experiencing the IPMI failures blocking reimage: https://phabricator.wikimedia.org/T313369
[08:20:06] dcausse: would you have a moment for a quick chat?
[08:20:26] gehel: sure
[08:20:46] meet.google.com/aex-nwiw-qrj
[09:53:07] Lunch
[10:39:19] lunch
[13:58:21] inflatador: I have a refactoring workshop for this evening (my time), no one is scheduled yet. If you want to grab it, it's yours!
[13:59:48] gehel sounds good, will add myself to calendar
[14:00:36] I've added you. Make sure you have time to play with the code (https://github.com/gehel/exo-heating-system/tree/python) before then!
[14:16:37] ACK, will do
[14:16:46] quick errand, back in ~15-20
[14:36:28] back
[15:04:35] inflatador: I'm going through https://phabricator.wikimedia.org/search/query/8lWdbPH2_gba/#R and re-tagging things from #Discovery to #Discovery-Search
[15:05:12] (I know, our phab tags are confusing, but we're trying to improve)
[15:06:02] I'm going to close a few tickets as I go, but this is still going to add a lot of things for our next triage meeting. inflatador: if you have time to do another pass once I'm done and close or triage what makes sense, please do!
[15:06:08] gehel ACK, I can change them myself if you prefer
[15:06:56] Nah, I can do that first pass
[15:07:03] focus on the ES7 upgrade!
[15:15:53] inflatador: you can review https://phabricator.wikimedia.org/project/view/1849/ (the "needs triage" column)
[15:59:24] hello search SREs, I have a new spicerack release to install on cumin1001 but I see there is a sre.elasticsearch.rolling-operation currently running for codfw reimages
[16:00:53] how do we want to proceed? 1) install while running anytime, most likely will keep working, 2) wait for it to be in a sleep between reimages to install, 3) it will finish soon just wait, 4) stop it, install and resume it (if it has the capability)
[16:04:37] volans it's at a good stopping point, I'll stop it now
[16:05:24] ack, thanks
[16:05:28] sorry for the trouble
[16:06:03] volans np, it's stopped, feel free to deploy anytime
[16:07:46] inflatador: {done}, thanks a lot. You can resume anytime
[16:08:33] ACK
[16:09:20] Working out, back in ~30
[16:48:02] back
[16:59:40] no unmeeting for me today, have to take an early lunch
[17:50:13] looks like we got a couple of stuck shards in CODFW, trying to push 'em along now
[17:55:51] hmmph. Still 2 shards not moving after running the cookbook, nothing in recovery or explain. Any suggestions?
[17:56:13] they're enwiki_content replica shards if that helps
[17:58:15] hmm
[18:00:55] inflatador: hmm, I thought we changed reimage to not disable the replica allocation
[18:02:13] allocation looks all enabled in codfw
[18:05:00] hmm, enwiki_content in codfw has 2 shards per node with 16 shards and 3 replicas, gives 64 total shards, at 2 shards per node it needs at least 32 nodes
[18:05:15] and we have 33 in the cluster, that's probably too narrow
[18:05:51] i think we set that assuming 35
[18:06:52] we set it to 16*4 and 2 per node in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/608965/9/wmf-config/InitialiseSettings.php
[18:07:53] yea, my comment in that ticket assumes 36 nodes actually: "16 shards, 3 replicas (4 copies total). 64 total shards. Can lose 4 servers before fored yellow."
[18:08:07] * ebernhardson has apparently always been bad at spelling, who knew!
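A rough sanity check of the shard arithmetic discussed above, as a sketch only: it assumes a plain-HTTP Elasticsearch endpoint on localhost:9200 and uses the `enwiki_content` alias mentioned in the conversation; the setting path for the per-node cap is an assumption based on the standard Elasticsearch `total_shards_per_node` setting, not verified against the production config.

# 16 primaries x (1 primary + 3 replicas) = 16 * 4 = 64 shard copies.
# With at most 2 copies of this index per node, placing every copy needs
# ceil(64 / 2) = 32 data nodes, so a 33-node cluster leaves almost no slack.
# Count the shard copies the cluster actually knows about (one line per copy):
curl -s 'localhost:9200/_cat/shards/enwiki_content?h=index,shard,prirep,state,node' | wc -l
# Inspect the replica count and the per-node cap on the index:
curl -s 'localhost:9200/enwiki_content/_settings?pretty&filter_path=*.settings.index.number_of_replicas,*.settings.index.routing.allocation.total_shards_per_node'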
[18:09:10] We've got two shards initializing now, so should recover soon
[18:09:37] Could bring in the codfw refresh hosts now w/o decomming the old ones yet so we have additional capacity
[18:09:54] (We also have net new expansion hosts waiting in codfw, I just think the refresh hosts are closer to being done reimaging)
[18:09:56] what will the counts be once we bring in the new ones and decom the old ones?
[18:10:12] i remember we've lost a few due to hardware but i forget how many
[18:10:15] ebernhardson: With all the new expansion, 50 before accounting for the few hw failures, so like 47
[18:10:24] ryankemper: after decomming the old ones too?
[18:10:37] somehow i forgot we expanded that much
[18:10:37] ebernhardson: Yes
[18:10:54] ok, then bringing a few nodes in should be fine i suppose
[18:11:58] ah, thanks to whoever brought number 34 back
[18:12:20] inflatador: I didn't do anything, did a host briefly drop out of the cluster?
[18:13:56] ah yeah `elastic2048` came back from reimage recently
[18:14:43] ryankemper yeah, it was elastic2048. I thought it had failed reimage but it looks like it went thru
[18:15:02] I was gonna say the same
[18:15:06] P sure cookbook reported exit code 97
[18:15:11] I guess it finished in the background or something
[18:15:42] Yeah, v-olans was doing a maintenance so I killed the cookbook, I thought I got it before it started 2048 but I guess not
[18:15:58] anyway, starting the reimage process again
[18:19:44] inflatador: I started it an hour or so ago, so depending on when that maint thing was maybe I started it after you'd initially killed it?
[18:19:48] like did you have to kill it twice today
[18:23:17] ryankemper I think I did, it was in red when I came in (missing an index that wasn't actually being used, not a huge deal) and then for v-olans' cookbook deploy
[18:25:12] huh, weird clock skews reported in my reindexing. Should check where those times come from. the codfw reindex last 3 timestamps on the logs are Wed, 20 Jul 2022 20:18:56 GMT, Wed, 20 Jul 2022 19:46:13 GMT, Wed, 20 Jul 2022 19:47:54 GMT
[18:25:31] it should be 18:25 GMT right now
[18:26:17] oh i'm an idiot. Those are estimated completion times :P
[18:26:37] inflatador: ebernhardson: here's a quick patch to add three new hosts: https://gerrit.wikimedia.org/r/c/operations/puppet/+/815778. that will keep us at 36-37 hosts in codfw
[18:27:10] ryankemper +2'd, feel free to merge / puppet-merge anytime
[18:28:48] bringing in
[18:37:28] I forgot to specify the racks. Follow-up patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/815785
[18:44:20] maybe we could document which things are needed for new servers on wikitech?
I glanced at the first patch but had no clue if that was all we needed :)
[18:45:13] we have https://wikitech.wikimedia.org/wiki/Search#Adding_new_nodes but it seems to be missing these pieces
[18:49:13] yeah, we should def document that
[18:55:59] indeed https://wikitech.wikimedia.org/wiki/Search#Adding_new_nodes looks mostly outdated, I think it's basically describing an equivalent of what our automation does now
[18:56:42] pretty sure now the hieradata rack info and the site.pp role declaration are sufficient
[18:58:52] the `curl | awk` function in there still works tho
[19:06:58] es fails to come up until the second puppet run; before the second run (but after the first) basically all these units will be failed:
[19:07:08] https://www.irccloud.com/pastebin/Kaj9t75h/failed_units.log
[19:12:27] * ebernhardson is often amused how much our integration test for Cirrus likes catapults
[19:33:30] okay revamped https://wikitech.wikimedia.org/wiki/Search#Adding_new_nodes a little bit
[19:45:14] That big `curl | awk` function can probably just be simplified to some variant of `curl -s 'localhost:9200/_cat/recovery?active_only&v&h=index,shard,source_node,target_node,time,stage,bytes_percent'`
[19:45:53] I think maybe the only difference is the curl|awk function filters for relocating whereas i assume active_only is showing all types, but yeah pretty minor
[20:20:36] hmm, so some of the flakiness in cirrus test cases is that somehow not all spaces are input into the search field :S
[20:21:04] the integration tests. I added a 5 second delay and connected the chrome inspector so i can watch the browser as it works through tests on cindy. The test for 'catapult + amazing' searched for 'catapult+ amazing'
[20:21:21] and typing is already about the slowest thing it does here :P
[20:21:56] * ebernhardson will have to ponder ... we could set the input value directly instead of typing into the field, but then it's not testing what real users do
[20:34:51] hmm, we do set it directly, it's the browser integration itself that does this :( `browser.$( '#searchform [name=search]' ).setValue( search );`
[21:16:43] looks like it's omega's turn to get stuck in yellow...checking now
[21:26:51] some kinda weirdness related to elastic2061, looks like things are moving along now
[21:28:36] * ebernhardson gives up on the integration test failures for now, hope it all changes when we manage to replace cindy with a debian version newer than stretch...but i was avoiding doing that migration hoping i would get a new env using mwdd in place instead :P some day...
[21:29:18] hmm, I guess not moving along
[21:30:29] we've got another peer recovery that seems to be staying at 0%, have y'all seen this before? The last column (I think bytes_total?) is actually at "-1.0%"
[21:30:43] heh, never seen -1.0%. certainly seems odd
[21:31:41] it was doing this before, could be related to the new hosts we just brought online
[21:31:55] it's codfw port 9443?
[21:31:58] Y
[21:34:29] not seeing anything that makes sense yet :(
[21:34:37] could probably cancel the recovery and let it try again
[21:35:52] i dunno if it's related, but that's an index created by the reindexing process, it's not live yet
[21:36:32] it updated the index to have the proper number of replicas around 20:18:31 GMT and has been waiting the last ~18 minutes
[21:36:56] could be...I will say that there were 3 other shards coming to or going from a newly-deployed host (elastic2061) in the same state
[21:37:21] I kicked the service on elastic2061 and they haven't come back, but I guess there's a gentler way to do that?
[21:37:57] is this the right call? https://www.elastic.co/guide/en/elasticsearch/reference/7.17/cluster-reroute.html#cluster-reroute-api-example
[21:38:00] hmm, i'm not sure if there is a gentler way, i don't think we have ever tried. But often elastic has something
[21:38:30] inflatador: yea the cancel command there seems appropriate
[21:40:45] ebernhardson cool, let me cobble together a curl here
[21:42:11] looks like our spicerack uses the reroute route https://github.com/wikimedia/operations-software-spicerack/blob/637ab7b536c9cb8528afadcdcab964c2b24fed21/spicerack/elasticsearch_cluster.py#L645
[21:43:00] yea we use the cluster reroute api once in awhile, i think we've never particularly used the cancel option though
[21:43:30] sometimes elasticsearch is bad about canceling things (tasks marked cancellable never actually get canceled in some cases)
[21:44:22] in part it's because when we try and cancel a search operation, that doesn't actually cancel anything; instead it sets a flag that has to be checked by the code, and then the code abandons at an appropriate place. No clue if that applies here though
[21:51:41] well, it appears to have worked
[21:51:53] woo!
[21:52:15] API Calls here https://phabricator.wikimedia.org/P31556
[21:52:16] i gotta run now, appt with the dentist. Email if anything crazy, but i'm sure it will be fine :)
[21:52:29] ACK
[21:52:58] The API returns > 100,000 lines of json from that reroute call ;(
[21:55:06] oh crap, is it 5 PM CDT already? I guess I shouldn't have started the cookbook again
[21:56:35] ryankemper are you able to babysit the cookbook run? We can meet and talk more about the omega issue if you want, but most of it's in the scrollback
[21:57:51] inflatador: yup I can babysit
[21:58:25] ryankemper thanks, will be around for the next ~15 or so if you need anything
[22:21:37] see ya tomorrow!
[23:55:01] Unclear why cluster is not fully recovering
[23:55:24] Stuck in yellow status w/ unassigned shards, trying a `curl -s -X POST localhost:9200/_cluster/reroute?retry_failed=true` returns `acknowledged: false`
[23:56:38] For the main codfw cluster, the index failing is `yellow open enwiki_content_1658309446 4xlHzsgTRCq5s0ZosXKB9Q 16 3 8806054 124633 868.7gb 232.7gb`; it's the only index in the cluster that has 3 replicas (as opposed to 2 for most, or 0 for an index related to the reindex)
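For context on the reroute discussion above (the calls that were actually run are in the P31556 paste): a minimal sketch of diagnosing and cancelling a stuck replica recovery with the Elasticsearch 7.x APIs. The index name is taken from the conversation, but the shard number, node name, and the plain-HTTP localhost:9200 endpoint are placeholders, not the real values that were used.

# Ask the master why a particular replica copy is unassigned or stuck:
curl -s -H 'Content-Type: application/json' 'localhost:9200/_cluster/allocation/explain?pretty' -d '
{ "index": "enwiki_content_1658309446", "shard": 0, "primary": false }'
# Cancel the stuck replica recovery so allocation is retried from scratch
# (the reroute response embeds the resulting cluster state by default, which is
# likely why the call in the paste returned so much JSON):
curl -s -H 'Content-Type: application/json' -X POST 'localhost:9200/_cluster/reroute?pretty' -d '
{ "commands": [ { "cancel": { "index": "enwiki_content_1658309446", "shard": 0, "node": "elastic2061-production-search-codfw" } } ] }'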