[09:43:23] errand+lunch
[12:14:40] gehel: if you have a minute: https://gitlab.wikimedia.org/repos/maven/wmf-jvm-parent-pom/-/merge_requests - I could use some of that for the ongoing migration (T367405)
[12:14:41] T367405: Migrate existing Java packages to deploying to Gitlab, including new version of parent pom, validation that all dependencies are available, and validation that deployment to production still works - https://phabricator.wikimedia.org/T367405
[12:26:32] sigh... java.lang.UnsatisfiedLinkError: no opensearchknn_faiss in java.library.path: [/usr/java/packages/lib,...]
[12:27:17] and it's crashing opensearch...
[12:37:07] needs to set LD_LIBRARY_PATH but unsure where...
[12:41:19] could not find another place than the systemd unit...
[12:46:35] actually could in jvm.options with -Djava.library.path=/path might be easier
[12:46:42] be*
[12:48:40] no it needs LD_LIBRARY_PATH... libopensearchknn_faiss.so is loading other shared objects from that folder...
[13:10:47] o/
[13:14:15] Just updated T388610 with our migration progress...we're at 5/112 hosts on OpenSearch
[13:14:15] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[13:14:42] nice!
[13:14:43] Also updated https://docs.google.com/document/d/1S4p03N_kJAF-tr4qDWi23ZG3LKiSDFLcMSogF02L9L8/edit?tab=t.0#heading=h.z19h2znxyd16 with some of the problems we've experienced so far
[13:30:42] dcausse OK if we do OpenSearch migration stuff at pairing? I've invited the other DPE SREs
[13:30:50] inflatador: sure
[13:32:42] Cool, thanks!
[14:54:42] I'm moving the Wednesday meeting 1h later to leave space for the P&T Staff Meeting
[15:50:54] Not going to make the Weds mtg, still working on the migration
[15:52:07] looks like the rolling-operation picked a host outside of row A. Not a huge deal, but we'll need to keep an eye on that
[16:00:15] will be 3min late
[16:54:43] dinner
[17:01:24] I'm wondering how the cookbook knows about hosts that aren't in site.pp, I guess it's using regex.yaml. I'll try another run, but I'm guessing we'll need to remove the non-row-A cirrus hosts from regex.yaml
[17:37:20] hmm, I dunno if that's a good solution. I'm not sure what that would do to hosts that get restarted
[17:37:51] or if puppet is run and they lose their row/rack awareness. Gotta think about this one... ryankemper (or anyone else), any suggestions?
[17:55:50] https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/cassandra/roll-restart.py#L40 we probably need to add something like this to limit the scope to one row
[17:57:41] if you're refactoring that cookbook it would benefit from using the SREBatchBase and SREBatchRunnerBase/SRELBBatchRunnerBase classes ;)
[17:58:46] volans thank you. do you have any specific cookbook in mind I should look at
[17:59:35] LB or not LB?
[18:00:03] the hosts are behind a load balancer, if that's what you mean
[18:00:18] yes, if they need to be depooled/repooled basically
[18:00:35] ACK, they should be depooled/repooled
[18:01:31] hmm, looks like there's an opensearch cookbook that does this already? https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/opensearch/roll-restart-reboot.py
[18:01:33] ok then git grep SRELBBatchRunnerBase should give plenty of examples of cookbooks using those classes. Yes ideally those classes should be moved to spicerack and benefit from better documentation. I can give you some pointers tomorrow, it's already pretty late around here today
[18:01:47] no worries, thanks for the tip!
[18:02:01] the current opensearch one seems to use the non-LB version though
[18:02:14] our stuff is more complicated as well...we have 2 clusters running on one host ;( . But it's a good start
[18:03:07] feel free to ping me tomorrow when you get online, most things can be overridden and there are plenty of hooks so in general it should allow to support most use-cases, but I guess we'll see
[18:03:24] np, enjoy your evening
[18:05:36] thx
[18:28:18] dcausse / ryankemper : did we see a change in response times on the internal WDQS cluster now that we have a smaller graph?
[18:45:13] back from lunch
[22:07:18] gehel: trying to find the best way to visualize that. here's one attempt but there's not a super strong signal (there might be a better metric to use though)
[22:07:25] https://www.irccloud.com/pastebin/XeU3tCcS/
[22:07:34] 99th percentile looks a little better
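
Note on the 17:55:50 suggestion to limit the rolling operation to one row: the idea could be sketched roughly as below. This is plain Python and does not use the cookbook or spicerack APIs. The host names, the `HOST_ROWS` mapping, and the helpers `hosts_in_row` and `batches` are all hypothetical, made up for illustration; how a real cookbook learns each host's row is not covered here.

```python
from typing import Dict, Iterator, List


def hosts_in_row(host_rows: Dict[str, str], row: str) -> List[str]:
    """Return the sorted hosts whose row matches `row` (case-insensitive)."""
    wanted = row.strip().lower()
    return sorted(
        host for host, host_row in host_rows.items()
        if host_row.strip().lower() == wanted
    )


def batches(hosts: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` hosts."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


# Hypothetical inventory: host name -> row. Not real hosts.
HOST_ROWS = {
    "search1001.example": "A",
    "search1002.example": "B",
    "search1003.example": "A",
    "search1004.example": "C",
    "search1005.example": "A",
}

if __name__ == "__main__":
    # Only row A hosts are ever handed to the rolling operation.
    for batch in batches(hosts_in_row(HOST_ROWS, "A"), size=2):
        print(batch)
```

Filtering the host list before batching keeps hosts from other rows out of the run entirely, without touching regex.yaml or the hosts' row/rack awareness.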