[00:20:09] Incident report for CODFW omega https://wikitech.wikimedia.org/wiki/Incidents/2024-03-27_Elasticsearch_Omega_Cluster_Failure
[01:43:27] https://etherpad.wikimedia.org/p/resurrect_omega has some rough notes on next steps for Omega, feel free to add/change/edit as needed
[01:52:10] thanks! I'll do another pass on the incident report after dinner
[10:29:25] errand+lunch
[12:34:57] dcausse: I ran into a bug shortly after rolling out the new ES bulk sink. The fix is ready: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/111. Could you have a look, please?
[12:43:33] pfischer: lgtm
[13:05:01] thx
[14:04:39] I can’t make it to our retrospective today.
[14:40:55] I probably won't be able to make it to the retro either. Augustin is still sick and more than likely to wake up around then (if not, I might actually wake him up myself to make sure he sleeps tonight).
[14:41:02] I'll cancel retro for today :/
[14:41:43] @team: could you make sure to send your status updates to https://etherpad.wikimedia.org/p/search-standup early? Public holiday tomorrow, so I'd like to send the update tonight if possible.
[14:49:12] errand
[14:55:28] codfw-omega back to yellow
[15:01:26] codfw-omega back to green
[15:02:12] next step is to requeue the writes
[15:28:23] \o
[15:28:34] inflatador: awesome!
[15:29:36] ebernhardson: yeah, I found out about elasticsearch-node last night; that was the key. Updating docs, but this is basically what I did: https://etherpad.wikimedia.org/p/resurrect_omega
[15:30:11] oh interesting, an out-of-cluster bootstrap routine
[15:50:05] Added the procedure to the search docs: https://wikitech.wikimedia.org/wiki/Search#Cluster_Quorum_Loss_Recovery_Procedure
[15:54:09] inflatador: well done! glad it worked
[16:02:37] Me too, sorry it was necessary in the first place ;( . Working out, back in ~40
[16:02:38] \o
[16:03:30] things break, it's no worry :) If anything, our time to depool the cluster and move traffic to eqiad was the only real problem. Things go wrong, but fixing them should perhaps move outside the mw deploy process
[16:44:08] back
[16:46:43] ryankemper: would you mind doing a quick pass of cat/indices between eqiad and codfw? I don't anticipate data loss since we banned all the nodes ahead of time, but just wanna make sure things look reasonably similar
[16:53:40] inflatador: sure
[17:00:59] inflatador: yeah, I see an identical number of indices, and only (unsurprising) small differences in doc counts since I ran the commands a few secs apart
[17:03:43] another tool you could plausibly use is the check_indices.py script from CirrusSearch. It figures out from config what should exist, and then checks the configured clusters to see if those things do exist
[17:03:46] cool, I still need to requeue the writes, will get started on that shortly
[17:04:05] things == indices
[17:05:12] * ebernhardson is running it currently, partially out of curiosity. Report will show up in mwmaint1002:~ebernhardson/check_indices.20240328T170300.json
[17:07:34] Cool
[17:09:53] it complains about glent indices that should have been deleted but were not, toolhub_lists since it hasn't been whitelisted yet, and a failed testwiki reindex that needs to be cleaned up on cloudelastic
[17:10:02] but otherwise, it looks like everything it expects is there
[17:10:37] toolhub_lists needs to be whitelisted?
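(Editor's note: a minimal Python sketch of the cat/indices comparison discussed above, assuming only the standard Elasticsearch _cat/indices API. The endpoint URLs and the delta threshold are placeholders, not the actual production addresses; ryankemper did the equivalent by hand, and check_indices.py remains the more thorough check.)

```python
# Hedged sketch: compare index names and doc counts between two clusters
# via the standard _cat/indices API. The base URLs below are placeholders;
# substitute the real eqiad/codfw omega endpoints (and TLS/auth settings).
import requests

CLUSTERS = {
    "eqiad-omega": "https://search.svc.eqiad.wmnet:9443",  # placeholder endpoint
    "codfw-omega": "https://search.svc.codfw.wmnet:9443",  # placeholder endpoint
}

def cat_indices(base_url):
    """Return {index_name: doc_count} from _cat/indices in JSON form."""
    resp = requests.get(
        f"{base_url}/_cat/indices?format=json&h=index,docs.count", timeout=30
    )
    resp.raise_for_status()
    return {row["index"]: int(row.get("docs.count") or 0) for row in resp.json()}

eqiad = cat_indices(CLUSTERS["eqiad-omega"])
codfw = cat_indices(CLUSTERS["codfw-omega"])

# Indices present on only one side are the clearest sign of data loss.
print("only in eqiad:", sorted(eqiad.keys() - codfw.keys()))
print("only in codfw:", sorted(codfw.keys() - eqiad.keys()))

# Small doc-count deltas are expected when the two snapshots are taken a few
# seconds apart; only flag the larger differences (threshold is arbitrary).
for name in sorted(eqiad.keys() & codfw.keys()):
    delta = abs(eqiad[name] - codfw[name])
    if delta > 100:
        print(f"{name}: eqiad={eqiad[name]} codfw={codfw[name]} (delta={delta})")
```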
[17:14:25] for the requeue command https://wikitech.wikimedia.org/wiki/Search#Recovering_from_an_Elasticsearch_outage/interruption_in_updates , "wiki" is the index name without the part past the underscore? Like for "zuwikibooks_archive" I'd just put "zuwikibooks", right?
[17:14:48] inflatador: yea
[17:15:10] cool, will get it started shortly
[17:24:36] inflatador: (less important) I see notifications are disabled for `elastic110[3-7]` in icinga. These hosts aren't in the search cluster yet, but I don't see a service implementation ticket up
[17:25:44] ryankemper: should be covered by T353878
[17:25:45] T353878: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878
[17:25:59] ryankemper: actually no
[17:26:04] I guess that's eqiad
[17:26:29] You might have to create a ticket then, sorry about that
[17:28:08] ryankemper: FWIW I never killed that decom cookbook, it's still running as my user on cumin2002
[17:30:04] oh, didn't realize
[17:30:08] yeah, we should have finished that off already
[17:30:09] okay, exiting it
[17:30:44] Alright, hopefully spamming ctrl+c didn't mess anything up :P think it should be okay tho
[17:33:59] other random wrangling (sorry for the noise, just lots of doozies here): I see `cloudelastic1003` marked as decommissioning in netbox whereas the other cloudelastic hosts are marked as `offline`. Perhaps it was an incomplete decom, will try to scrounge for phab context
[17:34:39] yeah, cloudelastic1003 failed some part of decom. I handed the ticket off to dc-ops but may not have tagged it properly
[17:35:54] AFAICT the decom cookbook sets them to decommissioning but not offline, so I guess it doesn't get marked as offline until dc-ops handles it, maybe
[17:40:15] Okay, reopened the associated dc-ops ticket (https://phabricator.wikimedia.org/T358046#9670481). Suspect there was just a step or two missed by dc-ops
[17:54:05] Heading to my son's birthday lunch, back in ~1h
[17:57:31] Okay, dc-ops got cloudelastic1003 taken care of
[17:57:50] Separately, I made a ticket for `elastic110[3-7]`: https://phabricator.wikimedia.org/T361268
[18:12:06] dinner
[18:58:27] back
[19:20:59] reminder to self: wdqs1013 still needs data transfer
[19:22:01] inflatador, ryankemper, dcausse (and probably others): big thanks and congratulations on the recovery of omega! We've learned a few things and it's good to see them documented!
[19:27:07] will be a few mins late to pairing
[19:35:55] gehel: np, sorry it was necessary
[19:38:06] just requeued writes, hopefully done soon
[19:42:29] ebernhardson: I rolled out the new ES sink today and it currently fails due to a large page (~5.6 MB)
[19:48:16] I’ll temporarily roll back to the last working version.
[19:48:38] pfischer: hmm, I had hoped those would get fixed on the cirrus side :( ok
[19:54:00] Caused by: java.lang.IllegalArgumentException: The request entry sent to the buffer was of size [5606355], when the maxRecordSizeInBytes was set to [4194304]
[21:04:32] ebernhardson: looks like the codfw omega backfill is done if you wanna re-enable it in mwconfig
[21:04:45] inflatador: alrighty, can do
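(Editor's note: a hedged Python sketch related to the maxRecordSizeInBytes failure above. It fetches a page's indexed CirrusSearch document through the MediaWiki API and estimates its serialized size against the 4194304-byte limit from the stack trace. The wiki URL and page title are hypothetical, the prop=cirrusdoc response shape is an assumption about the CirrusSearch API, and the JSON size is only an approximation of what the updater actually ships in a bulk request.)

```python
# Hedged sketch: estimate how large a page's CirrusSearch document is when
# serialized, to spot pages that would trip the sink's maxRecordSizeInBytes
# limit (4194304 bytes in the stack trace above).
import json
import requests

API = "https://en.wikipedia.org/w/api.php"   # hypothetical wiki endpoint
TITLE = "Some very large article"            # hypothetical page title
MAX_RECORD_SIZE = 4_194_304                  # maxRecordSizeInBytes from the error

resp = requests.get(API, params={
    "action": "query",
    "prop": "cirrusdoc",      # assumed: CirrusSearch exposes the indexed doc here
    "titles": TITLE,
    "format": "json",
    "formatversion": "2",
}, timeout=30)
resp.raise_for_status()

# Assumed response shape: query -> pages -> cirrusdoc (list of docs with "source").
for page in resp.json()["query"]["pages"]:
    for doc in page.get("cirrusdoc", []):
        size = len(json.dumps(doc.get("source", doc)).encode("utf-8"))
        status = "EXCEEDS" if size > MAX_RECORD_SIZE else "within"
        print(f'{page["title"]}: ~{size} bytes serialized ({status} the {MAX_RECORD_SIZE}-byte limit)')
```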