[07:14:04] we're still in a bad spot on the search cluster :/
[07:47:13] ran Erik's tool on elastic2054, seems to have stopped the rejections for now
[07:51:36] dcausse: I'm a bit late :/
[07:51:48] anything else needed on that cluster?
[07:53:59] gehel: (from the backscroll) the read_ahead settings are not really applied (perhaps because of software?), so while we fix that some nodes might start to do too many read IOs
[07:54:18] software raid, I mean
[07:56:17] "while we fix that": meaning there is a fix in progress? not clear from the backlog
[07:57:26] the workaround is a tool Erik has made that changes how loaded pages are handled by the kernel, telling it not to anticipate reads
[07:58:11] you call it like: sudo elasticsearch-madvise-random ELASTIC_PID
[07:58:45] the fix is to figure out why the read_ahead setting is not applied
[07:59:20] the tool is available at elastic2054.codfw.wmnet:~ebernhardson/elasticsearch-madvise-random
[07:59:27] meeting, back in 30'
[09:03:49] dcausse: do we have the sources of that tool somewhere?
[09:04:27] gehel: https://phabricator.wikimedia.org/P5883
[09:04:38] thx!
[09:05:54] this is a bit over my head, but it looks like running it on all nodes should not be too bad
[09:07:04] no, but it can be run on just the heavily loaded nodes
[09:07:40] problem is that running it once is not sufficient; as files are deleted/created it has to be rerun
[09:07:58] ofc :/
[09:08:03] hopefully the current state is stable enough to allow us to fix the root cause
[09:08:47] root cause being: make sure we have small enough read_ahead buffers at all filesystem / raid / block device layers
[09:09:56] yes, or switch to elastic 7, which solves this on its end with its new hybridfs
[09:11:25] it **should** not be too hard to set correct read_ahead buffers, but given how long we've had this issue, it seems that the real world disagrees with my simplistic assumptions
[09:14:17] hm, hybridfs is actually available in elastic 6.8, but not sure we want to upgrade to it just before elastic 7
[09:27:55] probably not worth it, but another reason to not delay 7.x too much
[09:46:24] lunch + errand
[10:07:53] lunch too
[14:51:16] \o
[14:52:07] while they say it was written for the same use case, it's not clear to me that hybridfs would fix ours. What they did was replace IO at merge time with NIO instead of mmap. I'm sure they have some data that says this is a valid thing, but we aren't under heavy merge load, so it doesn't seem like that would help us
[14:52:27] * ebernhardson doesn't see why it's so hard to call a 20-line JNI
[14:52:43] well, i know why it's hard, but i won't be supporting those users :P
[14:57:35] Will be ~15 mins late to weds meeting
[14:59:33] I wonder how we'd behave with niofs vs mmap+madvise random; I think they ultimately want to switch to mmap+madvise random once they drop java8 support
[15:00:26] someone created a plugin for that with java 8 support and a jni call, but it disappeared
[15:01:01] https://discuss.elastic.co/t/a-new-store-plugin-native-unix-store/157843
[16:16:58] Dinner time
[17:18:16] dcausse: you just received an interview invite for tomorrow. Can you make it?
[17:18:22] I know it's last minute
[17:40:34] gehel: yes, accepted
[17:41:27] Thanks!
[17:49:54] pondering, i wonder if logistic regression is even the right tool to apply... double-checking some definitions, this is included as a major assumption: "There should be no high correlations (multicollinearity) among the predictors."
[17:50:05] except, all the predictors i'm providing are correlated.
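To make the multicollinearity concern above concrete, here is a minimal sketch of checking predictors before fitting a logistic regression; the DataFrame, column names, and random data are hypothetical placeholders, not the actual features from the log:

```python
# Sketch: checking predictors for multicollinearity before a logistic
# regression fit. Column names and data are hypothetical placeholders.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Stand-in for the real predictors (one row per image, one column per score).
X = pd.DataFrame(np.random.rand(100, 3),
                 columns=["score_a", "score_b", "score_c"])

# Pairwise Pearson correlations: values near +/-1 flag collinear pairs.
print(X.corr())

# Variance inflation factors (computed with an intercept term); a common
# rule of thumb treats VIF above roughly 5-10 as problematic.
Xc = add_constant(X)
for i, col in enumerate(Xc.columns):
    if col != "const":
        print(col, variance_inflation_factor(Xc.values, i))
```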
[18:58:12] yeah, we're definitely violating that invariant
[18:58:19] this is a pretty classic classification problem, right?
[19:00:48] re logistic regression... not sure how reliable this answer is, but it seems to indicate that multicollinearity reduces the statistical power but doesn't otherwise "break" things: https://stats.stackexchange.com/a/350234
[19:02:28] looks like open nsfw uses a deep neural network... that makes sense https://github.com/yahoo/open_nsfw
[19:02:42] This part of the readme cracked me up:
[19:02:43] > These images were editorially labelled. We cannot release the dataset or other details due to the nature of the data.
[20:02:09] yea, it seems that while it's worse, it's not strictly bad. Also, i was surprised upon actually running correlations that the correlation is low; perhaps an artifact of the way it was trained
[20:02:53] i should perhaps try open_nsfw. i tried https://github.com/GantMan/nsfw_model/tree/master/nsfw_detector because, while it uses a closed dataset (they don't own the images), they use an open scraper and it's still maintained
[20:03:26] heh, that links a bit deeper than intended :) but that's the repo
[20:06:11] that's cool that the scraper is open
[20:06:25] but why would we even need to scrape? is out-of-the-box accuracy not good enough for our types of images?
[20:07:06] we don't need to scrape; i mean their model is trained using data from a scraper that goes out to social media sites to get images people have classified for them into various nsfw categories (think, named subreddits)
[20:07:33] so, essentially they are mostly open even if the exact dataset isn't available. You could recreate it if you wanted to; all the code is there
[20:08:57] gotcha
[20:09:39] also means the data is messy; it was claimed to be acceptable, but the performance i've gotten out of it so far isn't amazing :)
[21:08:23] * ebernhardson wonders a bit why this tf-rocm stops to recompile with llvm every other minute
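For reference, a hedged sketch of scoring an image with the nsfw_model package mentioned above, following the project's README (https://github.com/GantMan/nsfw_model); the model filename and exact entry points are assumptions and may differ between releases:

```python
# Sketch of scoring one image with GantMan's nsfw_model; model filename
# and function names follow the README and may vary by release.
from nsfw_detector import predict

# Load a pre-trained Keras model downloaded from the project's releases page.
model = predict.load_model("./nsfw_mobilenet2.224x224.h5")

# classify() returns, per image, probabilities over the model's five
# categories: drawings, hentai, neutral, porn, sexy.
results = predict.classify(model, "./example_image.jpg")
print(results)
```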