[09:30:42] errand+lunch
[12:21:42] going to unban relforge1004
[12:27:24] disabled row awareness allocation on relforge
[13:17:32] o/
[13:19:52] I set row awareness in Relforge because Puppet was failing. I should probably look into that again ;(
[13:24:03] inflatador: issue is that apparently opensearch was not restarted to pick up the row attributes and thus failed because it could not find the row attribute
[13:24:39] disabled row awareness to get the cluster back to green, will then restart the nodes to pick up the row attributes (which are properly set in the yaml file)
[13:24:47] and then re-enable row awareness
[13:25:02] waiting for the cluster to rebalance at the moment
[13:51:18] dcausse cool. I'm using relforge small alpha to work on T391151, if this is causing problems let me know (basically I'm just banning/unbanning hosts)
[13:51:19] T391151: Ensure ban.py cookbook can ban not-yet-existing hosts - https://phabricator.wikimedia.org/T391151
[13:51:50] inflatador: sure, no worries
[14:49:01] restarting opensearch_1@relforge-eqiad to pick up the row attributes
[14:50:55] done
[16:32:54] dinner
[17:03:10] Landlord’s doing a walkthrough sometime in the next couple hours so I’ll have spotty availability, ping if you need anything
[18:29:46] ryankemper since fixing ban.py is gonna be a pain, I went ahead and wrote a playbook: https://gitlab.wikimedia.org/repos/search-platform/sre/ansible-playbooks/cirrussearch_ban . We can go over it at pairing today, but LMK if you have questions
[18:31:31] Ack! My other idea is to just have the cookbook output curl commands for us
[18:31:35] lunch, back in ~40
[18:31:38] If the playbook already works tho that’s great
[18:32:15] yeah, it took a few hours but I've been testing on relforge and no problems so far. I'll test on cloudelastic once I get back from lunch
[19:13:48] back, looks like we're getting a latency alert
[19:14:10] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad)
[19:17:10] Hooray! We got the new CirrusSearchThreadPoolRejectionsTooHigh alert and it pointed to the culprit: https://grafana.wikimedia.org/goto/52qVaK0NR?orgId=1
[20:02:31] Confirmed that the ban playbook works in Cloudelastic...even if there is a mix of names known to the cluster and names not known to the cluster, OpenSearch will correctly ban the nodes it knows about
[20:24:02] deleting orphan indices from prod codfw before we start our first row of reimages
[20:34:56] deleting the orphan indices made the CODFW chi cluster go back to green! Although that makes me a bit scared, I did triple-check the list and none had aliases to live indices
[20:35:07] https://phabricator.wikimedia.org/P74639 indices deleted FWIW
[20:58:33] ryankemper Patch for migrating CODFW row A is up if you have time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134761
[21:02:36] inflatador: 3’
[21:04:19] ACK
[21:42:20] since we're not going to start the migration today, I just re-enabled puppet on codfw
[22:01:57] ack
[22:05:30] ryankemper This is the slack thread I've been using for the migration: https://wikimedia.slack.com/archives/C055QGPTC69/p1743616112246029 . If you need a code review that you feel the rest of DPE can handle, feel free to stick it there
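For reference, a minimal sketch of the cluster-settings calls behind the row-awareness toggling and the node bans discussed above, assuming both are driven by dynamic cluster settings; CLUSTER_URL and the node names are hypothetical placeholders, not the real endpoints or hosts:

```bash
# Disable shard-allocation awareness by clearing the attribute list
# (re-enable by setting it back to "row"):
curl -s -XPUT "$CLUSTER_URL/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.awareness.attributes": null}}'

# Ban nodes by name via allocation exclusion. Names the cluster does not
# know about are stored but match nothing, which is why a mix of known and
# not-yet-existing hosts still bans the known ones:
curl -s -XPUT "$CLUSTER_URL/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.exclude._name": "example-node-1,example-node-2"}}'
```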
[22:15:05] ryankemper you can also use https://etherpad.wikimedia.org/p/elastic-2-opensearch-T388610
[22:15:05] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[22:21:31] +1'd, we also need to think about special handling for masters...similar to https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Adding_new_masters/removing_old_masters
[22:23:30] We should probably pre-stage a patch that adds the new master hostnames. But I'm curious about what happens if a host is restarted when it has config for a not-yet-existing master in its yaml file. Will it be OK if it can only reach X number of masters, or will it spin forever?
[22:23:43] This is something we need to test in relforge
[22:50:25] we also need to figure out what's going on with the orphan aliases...looks like we have another, `zhwiktionary_titlesuggest_1744057983` (production zhwiktionary_titlesuggest is an alias for `zhwiktionary_titlesuggest_1743670321`). Maybe the automation that cleans them up doesn't work/is disabled in CODFW?
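A hedged sketch of one way to spot such orphans: list every *_titlesuggest_* index that no alias currently points at. CLUSTER_URL is a hypothetical placeholder, and the output would still need a manual check against live aliases before deleting anything, as was done for P74639:

```bash
# Indices that appear in _cat/indices but never as an alias target
# are candidates for orphan cleanup. Requires bash (process substitution).
comm -23 \
  <(curl -s "$CLUSTER_URL/_cat/indices/*_titlesuggest_*?h=index" | sort) \
  <(curl -s "$CLUSTER_URL/_cat/aliases?h=index" | sort -u)
```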