[09:43:25] errand, back in 20'
[10:43:51] Morning search people! Absolutely no urgency, but if you happen to be interested in the puzzle of why WMDE's wikibase.cloud ES nodes take so long to start, feel free to take a look at https://phabricator.wikimedia.org/T328740 - we've done a ton more investigation but ultimately had no breakthroughs
[10:57:50] tarrow: I'll add a few comments on that task
[10:58:35] gehel: thanks! I think we've successfully pulled on lots of threads (hah) but just ended up with more knots :D
[11:00:29] tarrow: comments added (no great ideas, though).
[11:13:03] gehel: cheers! I guess it could also be a symptom of "lots of indices -> lots of shards"
[11:16:49] tarrow: yes, maybe. Do you know how many shards you have overall?
[11:18:28] tarrow: do you monitor CPU throttling with k8s? Looking at the stack, it seems to be consuming CPU checking for deprecated settings
[11:26:35] dcausse: not very effectively, clearly ;) I'm still looking
[11:28:19] looks like you request and limit 10 CPUs for elasticsearch... wondering why it's so slow...
[11:29:06] gehel: the cluster those logs are from has 83 indices, with one shard each per node (in a 3-node cluster)
[11:30:15] if it were GC bound I would expect the stack to be on lines requesting more heap, but here I only see CPU-related things...
[11:30:16] dcausse: n.b. most of those logs are from "staging", which has fewer (more like 2 CPUs) (https://github.com/wmde/wbaas-deploy/blob/main/k8s/helmfile/env/staging/elasticsearch.values.yaml.gotmpl)
[11:31:15] 83 indices * 1 shard * 3 replicas = not that many shards overall
[11:31:37] 2 full CPUs should not take 5 mins to check all the settings of 83 indices, even if they're wikibase ones, weird...
[11:31:58] That bootstrap process seems to be single-threaded, so not sure that more CPU would help much
[11:32:06] yeah, I didn't think it was being too crazy
[11:32:09] even 1 CPU
[11:32:31] Lunch, back later
[11:32:47] if CPU usage is throttled by k8s, that could explain it
[11:32:52] enjoy! Thanks for the thoughts!
[11:34:42] I would expect that bootstrap to saturate a single CPU. It would be interesting to see if that's also what the OS thinks
[11:35:44] We could set up a pairing session to get into more details synchronously
[11:37:41] It might also be interesting to see what happens if you create a new, empty cluster and reindex everything into it. Is there something corrupted in the current cluster state?
[11:37:49] And now, lunch for real
[11:39:46] lunch 2
[11:57:51] I'd welcome a sanity check by one of the Search SREs for https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/887999
[12:42:22] moritzm: looking
[12:43:13] cheers
[12:47:03] that SREBatchRunnerBase is nice!
[12:49:35] moritzm: looks good!
[12:56:55] thanks! will deploy this on Monday (and then use it for the restarts for the latest OpenSSL update)
[14:03:02] o/
[14:33:54] o/
[14:47:50] thanks moritzm, looks good
[15:46:23] \o
[15:46:58] turns out I need to fork and migrate the spark feature selection code to spark 3 as well. Or find another mrmr implementation :S
[15:48:51] :(
[15:54:58] there are some other implementations, will have to see how they deal with the large enwiki data though. Maybe porting it wouldn't be so bad; the mjolnir jvm piece only required changing the pom (but it only lightly integrates with spark, the computation is separate). Will see
[15:58:46] * gehel is going to the movies tonight. I'll skip the unmeeting
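For reference, a minimal sketch of the two checks suggested in the discussion above: the overall shard count and whether k8s is throttling the Elasticsearch pod's CPU. It assumes the ES HTTP API is reachable on localhost:9200 from inside the pod and a cgroup v1 layout; both are assumptions, not confirmed details of the wikibase.cloud setup.

```python
# Sketch of the checks discussed above: total shard count and k8s CPU throttling.
# Assumes ES is reachable on localhost:9200 and a cgroup v1 filesystem layout.
import requests

ES = "http://localhost:9200"  # hypothetical endpoint, adjust for the real deployment

# Overall shard count: _cluster/health reports active shards directly.
health = requests.get(f"{ES}/_cluster/health").json()
print("active shards:", health["active_shards"])

# Per-index breakdown, if the total looks surprising.
shards = requests.get(f"{ES}/_cat/shards", params={"format": "json"}).json()
print("shard rows:", len(shards))

# CPU throttling as seen from inside the pod (cgroup v1 path).
# nr_throttled > 0 and a growing throttled_time mean the kernel is pausing
# the JVM because it hit its CPU quota, which would stretch out bootstrap.
with open("/sys/fs/cgroup/cpu/cpu.stat") as f:
    for line in f:
        key, value = line.split()
        if key in ("nr_periods", "nr_throttled", "throttled_time"):
            print(key, value)
```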
[16:19:29] workout, back in ~40
[16:36:43] Weekly update: https://app.asana.com/0/0/1203947553662547
[16:49:31] back
[16:57:32] https://phabricator.wikimedia.org/P44251
[18:21:10] lunch, back in ~40
[19:07:41] back
[19:29:57] * ebernhardson plays the version guessing game updating mrmr to spark 3 (and a scala update, and a scalatest update, and a junit update, and who knows...)
[20:11:44] * ebernhardson finds it suspicious that the support library, whose top commit is 'issued version 1.4.1' (the version we use), doesn't compile from the repo due to a class they renamed, and then doesn't pass the test suite
[20:17:14] unclear how important the failures are though... 1.4435658E12 != 1.44359817E12
[21:03:47] it's running now... then I have to figure out how to validate that the feature selection still works properly
[21:04:35] I suppose the labels are still the same; if the final NDCG comes up similar to previous runs, maybe that's good enough. When we tested different feature selection algos initially they had wildly different NDCGs after training
[23:47:42] almost working :) completed up through model building, skipped the swift upload atm... now to double-check that the trained models are reasonable
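For context on the mrmr port discussed above: mRMR (minimum redundancy, maximum relevance) greedily picks features that share a lot of mutual information with the label and little with the features already chosen. Below is a rough single-machine sketch using scikit-learn's mutual information estimators; it is illustrative only, not the mjolnir/Spark implementation, and the function and argument names are made up for the example.

```python
# Illustrative greedy mRMR selection (not the mjolnir/Spark code).
# Score for a candidate feature = relevance to the label minus the average
# redundancy (mutual information) with the features already selected.
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k):
    """Return indices of k features chosen by greedy mRMR over numpy arrays X, y."""
    n_features = X.shape[1]
    relevance = mutual_info_classif(X, y)  # MI(feature_j; label) for every feature
    selected, remaining = [], list(range(n_features))
    while len(selected) < k and remaining:
        best, best_score = None, -np.inf
        for j in remaining:
            if selected:
                # Average MI between candidate j and the already-selected features.
                redundancy = np.mean(
                    [mutual_info_regression(X[:, [s]], X[:, j])[0] for s in selected]
                )
            else:
                redundancy = 0.0
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        remaining.remove(best)
    return selected

# e.g. selected = mrmr_select(features_matrix, relevance_labels, k=20)
```

A Spark-based implementation would presumably distribute the mutual-information computation over the training data; the greedy selection loop itself is the same idea.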