[09:45:32] weekly status update: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2025-04-04
[09:48:23] errand+lunch
[09:52:40] lunch
[13:17:11] o/
[13:40:45] o/
[13:54:22] \o
[14:35:09] o/
[15:20:59] hmm, mjolnir feature selection failed with some NaN values :S
[15:21:01] .o/
[15:23:19] :/
[15:23:59] as ever, it never gives enough info to know where exactly :P But i imagine i can use spark to trawl through enough data to figure out what has NaNs
[15:26:32] dcausse: curious on your thoughts... when gabriele needed range handling for input partitions in mjolnir i suggested pulling in discolytics, but he ended up using the bare functions instead of HivePartitionRange since it didn't really line up. I was poking around, but it's a medium-sized change to migrate all the HivePartition usage to the discolytics version, it's a bit different
[15:26:52] i suppose the question is, is it worthwhile? If we were actively changing mjolnir it might be, but with how little we change it, it seems like unnecessary risk?
[15:28:41] for example, doing it properly would probably mean migrating mjolnir to use the spec strings everywhere, but right now it's all independent options that get passed around. So it's a medium-sized airflow change as well
[15:29:06] but it would bring some extra consistency across things
[15:29:23] ebernhardson: we might have to adapt mjolnir at some point to use more historical data on a per-wiki basis, so perhaps it's worthwhile to do the migration when we address that?
[15:29:59] I'm all for more consistency
[15:30:28] yea, it would perhaps reduce mental load when working with it in the future, if it's more similar to everything else
[15:30:43] i suppose it makes sense to punt until we are going to do more mjolnir work though, doing it now is perhaps premature
[15:30:53] while I agree that if things are not broken we shouldn't touch them, sometimes by touching things you discover bugs and improve things
[15:32:10] we're starting to have some historical data, hewiki showed improvement already, perhaps we should do this sooner rather than later
[15:33:17] true, we are already starting to build up larger datasets. That work will require thinking about keeping different time ranges for different wikis
[15:33:27] or really maybe just enwiki is the odd one out
[15:33:36] although I'm not entirely sure I remember everything, and whether we need a per-wiki time range or if you added some sampling there
[15:33:47] enwiki already downsamples from the full 90 days
[15:34:04] although maybe we could revisit and see if there is improvement, the problem at the time was too much memory
[15:34:33] could we go with the same date filter for all and have a "generic" downsample method? unsure that's optimal tho
[15:35:18] we essentially have a thing that downsamples to a max of 27M (query, doc) pairs, and at least historically only enwiki hit that limit
[15:36:07] ok so expanding the date filter for enwiki would just read more data but still keep the same amount, drawback is that it
[15:36:19] won't necessarily be the latest data
[15:36:35] ?
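For the "use spark to trawl through enough data to figure out what has NaNs" idea at 15:23, a minimal pyspark sketch of counting NaNs per feature column; the table name and the assumption that features live in float/double columns are placeholders, not the real mjolnir schema:

```python
# Minimal sketch: scan a feature table for NaN values per column.
# The table name is hypothetical; substitute the real feature-vector table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("discovery.feature_vectors")  # placeholder table

numeric_cols = [
    f.name for f in df.schema.fields
    if f.dataType.typeName() in ("float", "double")
]

# Count NaNs for every numeric column in a single pass over the data
nan_counts = df.select(
    [F.sum(F.isnan(F.col(c)).cast("long")).alias(c) for c in numeric_cols]
).collect()[0].asDict()

for col, count in sorted(nan_counts.items(), key=lambda kv: -kv[1]):
    if count:
        print(f"{col}: {count} rows with NaN")
```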
[15:36:59] yes, with the way it's currently set up, extending the date range will cause enwiki to sample the 27M from a larger time period
[15:37:18] the sampling happens during feature collection, which is still in the "all wikis at once" part of mjolnir
[15:37:58] there might be benefits even on enwiki though, longer time periods should mean more things make it through the limit that a query must be performed by N users
[15:38:07] more long tail
[15:38:23] although it could be the same user, just over 13 months :P
[15:38:56] :)
[15:39:14] from the testing we did, it seems there isn't a big improvement on enwiki from using the latest data though, so maybe not a drawback if it's sampling from the longer time period
[15:40:09] sounds good, might mean we don't need to tune the date range on a per-wiki basis, which might make things a lot simpler
[15:40:10] there are also other parts i'd like to play with... this feature selection is currently done in spark but is awkward and uses abandoned libraries i had to port forward to spark 2. I'd like to consider if we could do that in bare python now with more memory in an instance
[15:40:34] not sure how bad the memory usage would be, but python feature selection is going to have much better upstream support instead of the weirdness of doing it in a distributed computation
[15:40:36] +1, always worried about this lib, it'll break at some point
[15:41:37] from talking to a few people at the opensearch conference, it doesn't sound like anyone uses spark ML
[15:41:51] (which this feature selection is quasi-part of)
[15:47:18] yes... flink-ml is a bit of the same, I doubt a lot of people use it
[16:05:08] hmm, well at least this isn't lying... mar 13 has 0 cols with nan, mar 20 has 33k, mar 27 has 28k... but not sure why yet :P
[16:09:56] they almost seem appropriate though... they are always for mean or stddev, and are paired with things like (min: 3.4e38, max: 0, sum: 0)
[16:10:19] * ebernhardson is mildly suspicious that it's e38 and not e-38... but whatever
[16:15:54] oh.. of course the dates are the expected date, not the real ones. mar 20 was collected apr 1, mar 27 on the 2nd. Those involved opensearch and the new plugin version
[16:37:15] so indeed, it looks like the old plugin would return 0's for everything, the updated plugin returns max float for the min value, and NaNs for mean and stddev. The NaNs are arguably more correct, the max value for min sounds like an accident of implementation
[16:37:27] in the case where there were no matches
[16:38:14] we have started collecting from opensearch?
[16:38:32] we have 2 opensearch nodes in codfw, they both have shards, and we collect from both clusters. So yea
[16:41:31] this is for features with no match, I suspect?
[16:41:55] yes, for example a document with only text matches will report NaN for title mean and title stddev
[16:42:02] which, mathematically, is appropriate
[16:42:09] the mean of the empty set is NaN
[16:42:38] but i think i will just force these all to 0's since it's what we have seen historically... just not sure the least hacky way to do that yet
[16:42:41] I remember some discussion on how to handle missing values but not sure anything was agreed, strange that the behavior changed, perhaps there's a missing flag now?
[16:42:50] sure
[16:42:52] oh maybe, i should poke through the plugin
[16:43:20] maybe also file a bug about the min value, reporting min as max float feels like an implementation detail
[16:43:29] yes...
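For the "just force these all to 0's" fallback at 16:42, a rough pyspark sketch of coercing the bad statistics back to the historical behaviour; the column names and the max-float sentinel check are illustrative assumptions, not the real feature names:

```python
# Sketch: coerce NaN feature statistics (and the suspicious max-float "min" values)
# back to 0.0, matching what the old plugin returned. Column names are illustrative.
from pyspark.sql import functions as F

FLOAT_MAX = 3.4028235e38  # sentinel observed for "min" when nothing matched


def zero_missing(df, stat_cols):
    """Replace NaN or max-float sentinel values with 0.0 in the given columns."""
    for c in stat_cols:
        df = df.withColumn(
            c,
            F.when(F.isnan(F.col(c)) | (F.col(c) >= FLOAT_MAX), F.lit(0.0))
             .otherwise(F.col(c)),
        )
    return df


# e.g. df = zero_missing(df, ["title_mean", "title_stddev", "title_min"])
```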
[16:43:51] i suppose max gets 0 since they know lucene only accepts >= 0 scores, so that was the starting point of the calculation
[16:45:24] heading out, have a nice week-end
[16:45:28] enjoy!
[16:46:10] .o/
[16:47:34] aww, the official opensearch docs link our blog :) https://opensearch.org/docs/latest/search-plugins/ltr/fits-in/
[16:56:29] so indeed the min value not coming back as 0 goes against the docs: "If the terms are not present in the document, then the result will be 0.", although it only mentions min/max/avg there
[16:58:34] ahh there is also a `missing_as_zero` parameter to feature logging, will have to play with that
[17:03:32] {◕ ◡ ◕}
[17:07:33] err, hmm. But that might not work in a mixed cluster with the old plugin not having the option :P Although i guess i could hack up a temporary plugin version for elasticsearch that accepts it
[17:08:01] but rolling-restarting the mixed cluster might also be tedious...
[17:09:23] I don't mind doing it, but I'm guessing we'll be finished w/ CODFW by next week if that matters
[17:09:42] i guess the solution is to limit feature collection to the eqiad cluster, prepare the patch for mjolnir to collect from opensearch, then switch when codfw is ready
[17:19:41] or i guess ignore the option and coerce the bad values in python... more pondering necessary :P
[17:58:56] inflatador: i think this will do as needed for now: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134268
[18:26:32] meh, the `missing_as_zero` option doesn't seem to work as expected :S at least my mean value still comes back NaN. Also curiously, reading the current code it seems like it shouldn't be necessary, when ctx.docFreq() == 0 it adds 0.0f to stats collection
[18:29:18] oh actually that's in the newer version, the 1.x plugin doesn't have that
[18:34:37] back from lunch, looking at patch now
[18:41:08] i also might have a fix for the 1.3.x ltr plugin, looks like it's fixed in 2.x. Would require re-releasing the plugin and updating the opensearch plugin .deb, which i doubt we will finish today, and i'm out next week, so stopping feature collection in codfw seems reasonable
[18:45:59] yeah, unless you think there is value in enabling it on relforge/cloudelastic. we could probably finish today
[18:47:07] finish **that** today, that is
[18:47:07] nah, feature collection needs all the production indices
[18:47:18] ack, we can wait then
[18:47:23] this is easy and will let things continue running and not break next week
[18:48:34] search-loader patch merged, running Puppet now
[18:48:40] thanks!
[18:49:37] * ebernhardson is suspicious that my local ltr plugin passes `./gradlew test` but fails `./gradlew build`
[18:50:04] but the test results in a .zip, so good enough :P
[18:50:16] {◕ ◡ ◕}
[18:55:53] patch looks to have worked as expected, i see the bulk daemon but not the msearch daemon which was disabled. Will re-run feature collection in airflow
[19:04:19] doh... turns out the devel_plugins option we have for building the opensearch image with an extra plugin needs more smarts, it installs my ltr plugin, but doesn't delete the already installed ltr plugin first. Fails on startup with two plugins with the same name
[19:15:19] Ò_Ó
[19:25:00] easy fix, although only as a hack :P just needs the right rm -rf .... but not sure how to make that work consistently for automated purposes. But worked well enough to test the plugin and make sure it does what i want
[19:38:46] alright, i think the fix is reasonable, have a patch to update the opensearch plugins .deb. No rush, but i'll be out next week so thought it worth getting it all prepped now: https://gerrit.wikimedia.org/r/c/operations/software/opensearch/plugins/+/1134285
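For the `missing_as_zero` experiment at 16:58, a rough sketch of how a feature-logging query carrying that flag could be sent from python. The index, feature set, endpoint, and exact payload shape here are assumptions to illustrate where the flag sits; they should be checked against the LTR plugin docs for the version actually deployed:

```python
# Rough sketch: feature-logging search with the LTR plugin's missing_as_zero flag.
# Index, feature set, and host are placeholders; verify the payload against the
# plugin documentation for the deployed version.
import requests

query = {
    "query": {
        "bool": {
            "must": [{"match": {"text": "example query"}}],
            # sltr query in a filter so it logs features without affecting scoring
            "filter": [{
                "sltr": {
                    "_name": "logged_featureset",
                    "featureset": "enwiki_features",          # placeholder
                    "params": {"query_string": "example query"},
                }
            }],
        }
    },
    "ext": {
        "ltr_log": {
            "log_specs": {
                "name": "log_entry0",
                "named_query": "logged_featureset",
                "missing_as_zero": True,
            }
        }
    },
}

resp = requests.get(
    "http://localhost:9200/enwiki_content/_search",  # placeholder endpoint
    json=query,
)
print(resp.json())
```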
[19:56:20] damn, I guess banning the cirrussearch nodes ahead of time didn't work
[19:56:32] I see plenty of shards on cirrussearch2056
[19:56:56] hmm
[19:58:41] I think that plugins patch needs an updated BUILD_VERSION in debian/rules and debian/changelog
[19:58:48] inflatador: doh, indeed it does
[19:59:23] the ban is clearly in cluster settings, and the name matches
[19:59:32] not sure how that would fail :S
[19:59:54] I'm guessing ban is a 'best-effort' thing and maybe it decided that it needed the space?
[20:00:19] no, it shouldn't be. The only best-effort part is that it won't remove shards from the node if it can't place them elsewhere
[20:00:51] i guess they were reimaged, did that clear the disks?
[20:01:08] wondering if maybe they came up with shards already part of it, and it never moved them away? But seems unlikely
[20:01:32] yeah, that wipes out the disks for sure. Maybe I didn't ban it in time
[20:01:56] I thought I pre-banned it before reimaging but I may have messed something up
[20:05:08] oh, that's curious
[20:05:21] inflatador: did you ban with the cookbook?
[20:05:28] or manually, since they didn't exist yet?
[20:06:03] No, I did it with curl, but I also ran the ban cookbook after it existed
[20:06:10] in persistent.cluster.routing.allocation.exclude._name you have a string with the hosts comma-separated,
[20:06:21] Guessing my original curl wasn't right
[20:06:24] in transient.cluster.routing.allocation.exclude._name you have an array containing two strings, one for each host
[20:06:39] i'm not sure which takes precedence, but it's curious they are different, and one is probably wrong
[20:06:57] * inflatador needs to fix the ban cookbook so it can pre-ban
[20:07:36] docs say it should be a comma-separated string, suggesting the persistent one is right and the transient one is wrong?
[20:07:54] or it could be they accept both, because it's curious it wasn't rejected as invalid types or something
[20:08:14] yeah, the transient was a one-off curl, and so that's probably why
[20:10:44] yea, looks like transient takes precedence, so that would make sense
[20:10:48] per https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-update-settings.html
[20:16:32] unrelated but curious documentation: "If you use Elasticsearch Service, use the user settings feature to configure all cluster settings. This method lets Elasticsearch Service automatically reject unsafe settings that could break your cluster."
[20:16:48] you'd think instead of making only their service reject unsafe settings, they could make it a bit more universal :P
[20:17:06] (i'm pretty sure Elasticsearch Service == their cloud offering)
[20:17:21] If I can get the cookbook fixed by the time we start migrating row B, that should keep this from happening again
[20:17:32] I can at least remove the invalid config though
[20:19:06] sadly, it looks like pretty much all shards on 2055 and 2056 are primaries :S oh well
[20:20:47] yeah, I screwed the pooch on that one. There were only 4 unassigned shards before 2056 ;(
[20:34:43] OK, I removed the bad transient config
[20:38:19] created T391151 to work on the ban cookbook
[20:38:19] T391151: Ensure ban.py cookbook can ban not-yet-existing hosts - https://phabricator.wikimedia.org/T391151
[21:03:36] OK, plugins patch merged. Let's see if my playbook works ;P
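Picking up the exclude._name discussion at 20:06, a sketch of applying the ban in the comma-separated string form the docs describe, set persistently, while clearing the stray transient value that takes precedence. Host names, the wildcard form, and the endpoint are placeholders for illustration:

```python
# Sketch: ban nodes by name using the comma-separated string form, set persistently,
# and reset the conflicting transient value (null clears a setting).
# Host names and endpoint are placeholders.
import requests

settings = {
    "persistent": {
        "cluster.routing.allocation.exclude._name": "cirrussearch2055*,cirrussearch2056*"
    },
    "transient": {
        # clear the transient value so it cannot shadow the persistent one
        "cluster.routing.allocation.exclude._name": None
    },
}

resp = requests.put("http://localhost:9200/_cluster/settings", json=settings)
resp.raise_for_status()
print(resp.json())
```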
[21:04:28] :)
[21:05:29] i suppose in theory, if the instances get the new plugin we can revert the mjolnir search daemon patch
[21:06:00] but also, since opensearch primaries can't have replicas in elasticsearch, the cluster will probably go red on restart
[21:06:44] if I actually publish the package, the newly-reimaged hosts will get it. Sounds like we don't want that
[21:07:48] I don't have to publish it today. If there are any concerns about how it works on a mixed cluster, I can do that after we're done
[21:08:49] or are you just saying that we can't restart the cluster specifically to apply the package update?
[21:09:51] the problem is just that because the new opensearch nodes have all primaries, there are no replicas for those shards
[21:10:11] and they are in... oh wait, i'm totally wrong
[21:10:24] ???
[21:10:31] i wonder how that works, checking some specific indices, i see primaries on opensearch with replicas on elasticsearch
[21:10:37] but that's supposed to not be possible?
[21:11:39] the plot thickens ;)
[21:12:24] but then we have a bunch of unassigned shards as well, which i had assumed were the primary/replica problem
[21:12:26] I just wanted to make sure that newly-reimaged hosts getting the latest version of the plugin is OK. I wasn't planning on restarting the cluster specifically to get the new package versions
[21:12:34] yes, it's fine to put the plugin on the hosts
[21:12:49] i was more worried just about the restarting-instances part of it
[21:13:09] cool, we're on the same page then
[21:13:36] so curiously i do see instances where the primary is opensearch and the replicas are unassigned, but also where the primary is opensearch and the replicas are elasticsearch. Curious
[21:13:37] I guess let's look at allocation/explain and see why those shards aren't being assigned
[21:15:25] `IndexFormatTooNewException` sounds a lot like the known issue with replicas
[21:16:35] yea, that's what we are expecting...
[21:16:52] my best guess would be if the replica was already running and opensearch was promoted to primary, it's fine
[21:17:00] but we can't initialize a replica from the opensearch primary
[21:17:22] since elasticsearch doesn't copy the index segments during normal operation, only when initializing the shard
[21:19:04] I just published the new plugin version btw
[21:19:11] awesome, thanks
[21:19:25] * inflatador should've finished the playbook years ago
[21:46:41] headed out, enjoy the weekend!
[22:04:40] heading out as well
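For reference, a minimal sketch of the allocation/explain lookup mentioned at 21:13 for checking why a replica stays unassigned. The index name, shard number, and endpoint are placeholders, and the failure reason (e.g. the IndexFormatTooNewException above) shows up in the explanation fields of the response:

```python
# Sketch: ask the cluster why a specific replica shard is unassigned.
# Index name, shard number, and endpoint are placeholders.
import requests

body = {
    "index": "enwiki_content",  # placeholder index
    "shard": 0,
    "primary": False,           # explain a replica copy
}

resp = requests.get(
    "http://localhost:9200/_cluster/allocation/explain",  # placeholder endpoint
    json=body,
)
explanation = resp.json()

# Print the top-level explanation, then any per-node decisions that give a reason
print(explanation.get("allocate_explanation"))
for node in explanation.get("node_allocation_decisions", []):
    reasons = [d.get("explanation") for d in node.get("deciders", [])]
    print(node.get("node_name"), reasons)
```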