[13:13:06] \o
[14:09:40] * ebernhardson mutters: struct,values:array> is incompatible with struct,values:array>
[14:09:44] afaict, those are the same :P
[14:14:29] i wonder if this is why i had mjolnir writing parquet files directly and then adding partitions to hive instead of writing through the sql interface...
[14:26:02] getting some occasional 503/master not discovered from codfw psi...checking out now
[14:31:59] hmm, seeing on chi too. Must be a bad host or hosts in LB pool
[14:41:28] :S
[14:42:44] * ebernhardson realizes there is also metadata in the spark schema that can't be written through hive...guess i need to also migrate the write+add partition routines for this table
[14:44:01] Haven't found the problem child yet...all individual psi hosts seem to be responding to /_cat/nodes
[14:51:14] nginx error logs are blank across the board...checking logstash
[14:57:23] OK, cirrussearch2078 is our problem...looks like it failed to reimage and booted back into elastic. Not sure why it didn't fail health checks
[15:05:22] created T392367, but my best guess is that it's a depool ratio thing. Once we lose x% of hosts in the pool, pybal won't depool any more since that could make the user experience even worse
[15:05:22] T392367: Review Elastic/OpenSearch health checks and nginx logging - https://phabricator.wikimedia.org/T392367
[15:05:57] meh, found a spark ticket related to my issue: "this is not a bug but an expected behaviour because an user-defined type is not compatible with its internal data type."
[15:06:15] although they at least left the bug open
[15:06:21] (for 5 years :P)
[15:07:32] ???
[15:08:09] maybe they should distinguish user vs internal better?
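The depool-ratio guess above can be sketched roughly. This is a hypothetical simplification of pybal's depool-threshold safeguard, not its actual implementation; the function name and the 0.5 threshold are assumptions for illustration:

```python
def can_depool(pool_size, currently_depooled, depool_threshold=0.5):
    """Would depooling one more host keep the pooled fraction at or
    above the threshold? Hypothetical sketch of pybal's
    depool-threshold behaviour; 0.5 is an assumed value, not the
    production setting."""
    pooled_after = (pool_size - currently_depooled - 1) / pool_size
    return pooled_after >= depool_threshold

# With a 10-host pool, early depools are allowed, but once half the
# hosts are already out, further depools are refused
print(can_depool(10, 0))  # True
print(can_depool(10, 5))  # False
```

This matches the failure mode described: once enough hosts are unhealthy, the load balancer stops depooling and keeps routing to the bad host.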
not sure
[15:08:24] If that is their philosophical position, they should at least craft a better error message
[15:09:00] indeed
[15:10:24] well, depooling the non-existent host caused some pybal and SUP errors...repooled for now, will get a patch up to remove it for good
[15:28:33] meh, lastpass also being janky this morning...my passwords are all available in the vault, but not from the in-page accessors. Love software that updates itself :P
[15:35:29] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137782 patch up for fixing conftool data if anyone has time to look
[15:36:42] done
[15:38:11] ACK, thanks!
[16:07:02] OK, load balancer pools are in better shape now. Working out, back in ~40
[17:00:28] nack
[17:00:30] or back
[17:00:50] ryankemper did some minor updates to https://gitlab.wikimedia.org/repos/search-platform/sre/cirrussearch_shard_checker/-/tree/main?ref_type=heads after merging your changes from Thurs. Everything looking good so far
[17:23:08] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137272 got the +1, we'll need to test its effects on relforge/cloudelastic before merging to prod. g-ehel's comment that shards may try to relocate seems plausible
[17:23:44] I'm also guessing we can set these values in the API without restarting the service
[17:25:32] break, back in ~30
[17:41:14] back
[17:49:47] hmm, maybe we **can't** update node attributes without restarting ES/OS.
I'm not seeing any methods but GET in https://docs.opensearch.org/docs/latest/api-reference/nodes-apis/index/
[17:50:05] afaict that's true, attributes are only loaded on startup
[17:52:36] ah, sounds like we might need to roll-restart after changing this then
[18:01:17] ryankemper I just updated T391392, LMK your thoughts on whether we should merge/roll-restart now or wait till the OS migration is done
[18:01:18] T391392: Use profile::netbox::host instead of regex.yaml for Cirrussearch rack/row awareness - https://phabricator.wikimedia.org/T391392
[18:13:00] lunch, back in ~1h. cirrussearch2094 is done reimaging, should join the cluster once puppet finished
[18:13:19] finisheS, that is
[19:02:51] manually did a full run-through of mjolnir data collection and training, looks like the updated code should work
[19:44:42] back
[20:05:26] hmm, looks like https://search.svc.codfw.wmnet:9243/_cat/shards/apifeatureusage-2025.02.10 only has a single replica? Is that normal for apifeatureusage shards?
[20:11:49] inflatador: hmm, don't think that's normal
[20:12:29] should be defined by operations/puppet modules/profile/files/apifeatureusage/templates/apifeatureusage_7.0-1.json
[20:12:33] which has number_of_replicas: 2
[20:12:44] oh, shards. Yea it's configured for 1 shard, 2 replicas
[20:14:44] I'd say bump it up to 3 for the migration, but we haven't had any problems yet. This is the first time I've even seen a shard without any working replicas
[20:15:14] plus I know those are pretty big indices, probably'd want to do the math before adding
[20:18:32] We want 2 replicas for everything
[20:18:53] Giving dog a shower cause he ran through a poison oak bush, can take a look after
[20:21:08] NP, the shard eventually replicated.
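For contrast with the startup-only node attributes: the allocation-awareness setting that consumes those attributes is a dynamic cluster setting in Elasticsearch/OpenSearch, so it can be changed via PUT `/_cluster/settings` without a restart. A minimal sketch of building that request body; the attribute names `row` and `rack` are assumptions based on the rack/row discussion above, not confirmed production values:

```python
import json

def awareness_settings_body(attributes):
    """Build the request body for PUT /_cluster/settings that enables
    shard allocation awareness on the given node attributes.
    Attribute names (e.g. ["row", "rack"]) are illustrative."""
    return {
        "persistent": {
            "cluster.routing.allocation.awareness.attributes": ",".join(attributes)
        }
    }

print(json.dumps(awareness_settings_body(["row", "rack"])))
```

The asymmetry is the crux of the roll-restart question: the awareness *setting* is dynamic, but the per-node *attribute values* it reads are only loaded when each node starts.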
Just kind of food for thought
[20:22:05] (2 replicas meaning 3 total counting primary ofc)
[20:22:34] Also, I'm cranking https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137069 thru PCC, hopefully that will catch any regex errors before they cause reimaging problems in eqiad
[22:42:59] OK, cirrussearch2071 is done...19 hosts left in CODFW (21 if you count the misbehaving 2078 and 2091)
[22:49:54] updated T388610...see ya Weds!
[22:49:54] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
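To make the replica arithmetic from the apifeatureusage discussion explicit (1 primary shard with `number_of_replicas: 2` means 3 copies of the data in total), a quick sketch:

```python
def total_copies(primary_shards, replicas):
    """Total shard copies held by the cluster: each primary shard
    plus `replicas` replica copies of it."""
    return primary_shards * (1 + replicas)

# apifeatureusage indices: 1 primary shard, number_of_replicas: 2
print(total_copies(1, 2))  # 3
```

This is why "only has a single replica" was alarming: with large daily indices, dropping from 3 copies to 2 (or fewer) during a rolling migration leaves little margin before data becomes unavailable.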