[09:18:00] dcausse: around? would you have a few minutes for a chat?
[09:18:07] gehel: sure
[09:18:24] meet.google.com/zqv-uncp-xsm
[09:45:16] errand + lunch
[10:34:25] lunch
[12:36:58] Hello everyone
[12:39:06] I just wanted to bring up the ERC Draft Review again.
[12:54:46] thanks ejoseph, i'll take a look today
[13:02:12] greetings
[13:02:57] ejoseph I reviewed it yesterday, LMK if I need to sign it or anything
[13:27:50] ejoseph I didn't have access but requested it
[13:30:12] infladator: If you have any feedback you can drop a comment, if not we are good
[13:30:43] cbogen_: I have sent you access
[13:49:38] ejoseph: thanks!
[14:41:24] we'll need Elastica 7.1.5 I think
[14:41:49] hits.total is now an array
[14:49:57] but bumping to Elastica 7.1.5 also brings a whole bunch of changes like the Type class being removed...
[15:08:50] sigh... they removed Type pretty early in https://github.com/ruflin/Elastica/commit/997989426dc37a1f4d103f477222869b7b6a87ab so not sure we can use a previous version to avoid a massive change
[15:12:25] \o
[15:12:29] not sure we have great options, fork Elastica from an earlier version and bring in the necessary changes (hits.total change or allow the rest_total_hits_as_int param + anything else we'll discover), or bump to Elastica 7.1.5, possibly removing the use of the Type class in master to avoid a massive change
[15:12:31] o/
[15:12:38] dcausse: there is some query param about revering the hits structure
[15:12:44] reverting*
[15:12:58] yes but it's disallowed by Elastica
[15:13:02] oh :S
[15:13:16] yea the migration path in elastica doesn't really seem thought out or intentional
[15:14:07] dcausse: i suppose we need to roll a vendor upgrade with the train that deploys the es7 branch? annoying but doable
[15:14:41] yes but I'm afraid of a very big change by removing Type
[15:14:50] I wonder if we could do this in master instead
[15:25:11] hmm, I'm not entirely sure how yet but there is probably some way by dropping down to the elasticsearch-php client. I don't know that they necessarily think about version upgrades more, but they seem to more closely track elastic features which tend to
[15:29:25] dcausse: oh, cindy isn't going to pass in the es710 branch
[15:29:38] i only made it as far as this first error :)
[15:30:06] * ebernhardson now remembers that cindy fails total_hits, and that david probably knew this already :)
[15:31:15] ebernhardson: I think we need to decide if we want to move to Elastica
[15:31:19] oops
[15:31:45] dcausse: you mean drop elastica? It's been on my mind but nothing seemed to require it yet
[15:32:00] ebernhardson: take 2: I think we need to decide if we want to move to Elastica 7.1.5 or hack something to keep Elastica at a lower version with a few hacks
[15:33:00] bumping to Elastica 7.1.5 might require a massive change in cirrus that I think could be prepared already in master (still running 6)
[15:33:16] e.g. stop using Type
[15:34:09] keeping Elastica at a lower version means forking it, not sure how many patches we'll have to backport tho
[15:35:51] dcausse: i dunno :S We have to stop using type, but elastica hasn't made a convenient transition path. Ideally we would have transitioned from type->index one location at a time, hmm
[15:36:37] yes I'm unclear what it means to remove the use of Type in master
[15:36:43] if 7.1.5 runs on 6.8 as well, I guess that seems sane. We have to do it sooner or later anyways
[15:37:45] hm.. I meant removing the use of Type still using Elastica 6, but if bumping to Elastica 7.1.5 is an option why not
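A minimal sketch of the hits.total change discussed above (a hypothetical helper, not actual CirrusSearch or Elastica code): Elasticsearch 6 reports hits.total as a plain integer, while Elasticsearch 7 returns an object unless the rest_total_hits_as_int=true query parameter is set, so anything reading the raw response has to handle both shapes.

```php
<?php
// Hypothetical helper illustrating the hits.total change.
// ES 6 raw response:  "total": 123
// ES 7 raw response:  "total": { "value": 123, "relation": "eq"|"gte" }
//                     (unless rest_total_hits_as_int=true is passed)
function extractTotalHits( array $rawResponse ): int {
	$total = $rawResponse['hits']['total'] ?? 0;
	if ( is_array( $total ) ) {
		// ES 7 shape; "relation" === "gte" means the value is only a lower bound
		return (int)$total['value'];
	}
	// ES 6 shape, or ES 7 with rest_total_hits_as_int=true
	return (int)$total;
}
```

As noted in the chat, the rest_total_hits_as_int escape hatch is off the table because Elastica disallows that parameter.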
[15:38:30] i thought elastica 6 didn't have all the appropriate api's? I can't remember what now but i tried before to start transitioning away from Type's in 6.x and couldn't do it
[15:38:50] ah I see...
[15:39:19] we can make helper functions or whatever to bridge the gap i suppose
[15:40:00] we could do an "exploratory" patch removing everything that's obvious and ponder?
[15:40:04] i suppose when i looked earlier i thought i would have to make a bigger mess to remove Type's, when the goal was simplifying
[15:40:06] dcausse: yea
[15:40:59] ok will try this on master and see
[15:42:40] workout, back in ~30
[15:53:57] how can i get stack traces from jetty? We have ' /oauth/check_auth java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms' sporadically in the wcqs-blazegraph journal output, but no stack traces or anything useful in logstash
[15:55:38] it's an odd message because check_auth should only validate the JWT token, which doesn't require the network. A stack trace would make things clearer :)
[15:56:12] oh i guess i didn't paste the whole message, it starts with `:WARN:oejs.HttpChannel:qtp968514068-231008: /oauth/check_auth java.io.IOException: ...`
[15:56:56] ebernhardson: I don't think we configure the logger for jetty properly
[15:57:13] so you might only see this in journalctl :/
[15:58:22] should this exception be captured to force the user to login again?
[15:59:05] dcausse: i mean, we could hack a 500 to remove the user's cookie and redirect to self, but seems hacky :)
[15:59:33] also i dunno if the cookie does it, but seems plausible since the users claimed the 500 state was persistent for some time (and the cookie expires in a few hours)
[16:04:36] random google search turns up sonatype docs which say jetty behind an nginx proxy can have these exceptions, and to turn off proxy_buffering. Although i'm not entirely sure they are the same exceptions :)
[16:04:56] doesn't really line up with the bad cookie idea though
[16:08:14] can JWT throw TimeoutException?
[16:08:45] dcausse: probably, but I would expect a JWT timeout and not a j.u.concurrent timeout
[16:09:21] i mean, the token can timeout and become invalid. But the actual validation shouldn't be doing any concurrency or network requests
[16:11:16] oh qtp968514068-231008 reminds me of something
[16:11:38] well no
[16:12:15] I'm mixing up things, briefly thought that it could be related to the jetty http client we use in blazegraph
[16:12:37] but the error line is tied to /oauth/check_auth which would not make sense
[16:13:10] but I've plenty of "qtp" threads in jstack
[16:13:18] *seen*
[16:13:55] bah... ignore everything I said, this does not help at all
[16:14:06] :)
[16:14:26] looking, seems qtp is the main thread pool for jetty distributing requests to?
[16:14:46] yes so that's "normal" to see plenty of them :)
[16:15:40] well, i guess i'll pull on the 'get stacktraces' thread. It's only more pain in the future to keep trying to debug this system without good logs
[16:16:15] makes sense
[16:20:31] back
[16:38:05] ahh, we include slf4j in the war file instead of in the deployment, so jetty never sees slf4j and ends up using StdErrLog
[16:38:29] in theory can drop slf4j in libs/ and it will work? Never that easy :P
[17:30:35] and logging configuration is painful because... "jetty-runner is for quick testing of your webapp, and is not recommended for production use, purely because it's not setup for customization and configuration just like you are experiencing now."
[17:32:48] haha
[17:34:15] * ebernhardson is pretty sure migrating away from jetty-runner would take significant time and falls fully into yak shaving territory :P
[17:34:50] lunch/errands, back in time for SRE pairing
[17:45:03] dinner
[18:22:27] back
[18:23:49] how odd. `java -cp "lib/slf4j/*" jetty-runner-*.jar ...` doesn't find the slf4j libs. `java -cp "jetty-runner-1.2.3.jar:lib/slf4j/*" org.eclipse.jetty.runner.Runner` finds the slf4j libs
[18:25:13] (but then fails because i didn't bring in janino, probably need some way for a .pom to wrap these deps up)
[19:35:19] quick errand, back in ~30
[20:03:18] Trey314159: nice work on the highlighter doc ! Big thanks ! https://docs.google.com/document/d/1v7KoQX7kc5vvR4Ti7hpFlHs6I6zDi4-E_AxyCoWiXRo/edit?usp=drivesdk
[20:03:43] And thanks to dcausse and ebernhardson who have probably provided a lot of the analysis !
[20:11:42] something about this classpath thing is magical ... `java -cp "jetty-runner-1.2.3.jar:lib/slf4j/*" ...` works, but wrap those deps into a fat jar and call it `java -cp "jetty-runner-1.2.3.jar:my-logging.jar" ...` and it fails. But put the jar in a directory and name the directory with * and it works.
[20:14:57] but anyways, kinda-sorta works. restarting wcqs2001 put ~6k lines into logstash complaining about pre-existing jar-hell
[20:15:49] sorry, been back
[20:15:55] blazegraph war looks ok, but mw-oauth war and jetty-runner.jar have overlapping classes.
[20:24:00] Curious, do all of the ES sites in WMF production run on their own indexes? or all in some big shared cross site indexes?
[20:25:10] addshore: multiple indices per wiki
[20:52:34] how many shards do you have in a single elasticsearch setup then for all the wikis?
[20:52:43] / is there a way I can look at the /_stats output ? :P
[20:53:16] addshore: umm, just issue the http request? :P curl https://search.svc.eqiad.wmnet:9243/wikidatawiki_content/_stats
[20:53:47] addshore: alternatively, in cloud there are public servers at https://cloudelastic.wikimedia.org:8243/ accessible from wmf cloud
[20:53:53] (but that cluster is not same size)
[20:54:42] addshore: additionally there are three clusters per group, ports 9243, 9443, and 9643. but wikidata will be found on :9243.
[20:54:52] only 63 shards total
[20:55:00] aah wait, that's only for wikidatawiki!
[20:55:06] addshore: yea, thats a lot :P
[20:55:33] *goes and gets the snippet from what he is currently looking at elsewhere `"_shards": { "total": 1513,`*
[20:56:35] addshore: for a simple shard count, something like `for port in 9{2,4,6}43; do curl https://search.svc.eqiad.wmnet:$port/_cat/shards; done | wc -l` gives 13.5k shards
[20:57:07] TLDR here is for this thing that runs multiple wikibases we are having some elasticsearch fun
[20:57:21] probably mainly because we are running on as low resources as possible, with multiple wikis,
[20:57:30] so we currently have 1513 shards on something with 2GB heap
[20:57:37] (single node currently)
[20:57:58] addshore: 1500 shards is going to be a major pain on one node. Can you set everything to 1 shard per index?
[20:58:07] there is no reason to have multiple shards per index on a single host anyways
[20:58:30] so multiple hosts I imagine would also help the situation?
[20:58:41] but yes, from my reading, given these are all small wikis, we have too many shards :P
[20:58:52] doc count is at 778069
[20:59:03] addshore: maybe, but 1500 shards is quite a lot in elasticsearch world. We try to keep the production clusters at < 5k shards and they have a few thousand cores and 5+ TB of memory :)
[20:59:03] store in bytes, 1673363952, so 1.6GB
[20:59:18] hehe, indeed :P
[20:59:49] is there an easy way to count the indexes that exist?
[20:59:51] addshore: this sounds like a common misconfiguration in elasticsearch, but at 1.6GB of data you really don't want any index to have more than 1 shard
[20:59:52] i guess I could do this with jq
[21:00:01] addshore: curl http://localhost:9200/_cat/indices | wc -l
[21:00:30] the /_cat/ api's are human readable outputs and almost always 1 line per element
[21:00:36] aaaaah
[21:02:24] 379 indexes
[21:02:36] addshore: if they are all cirrus indices, there is a cirrus config option that can be set and then all the indices would have to be recreated. You should see things run better with 379 shards
[21:02:58] great
[21:03:00] $wgCirrusSearchShardCount = 1;
[21:03:28] it looks like our defaults are 4 :S that only makes sense in the wmf world, and even then not for most of our wikis
[21:03:50] :D
[21:04:08] Right, I think this is perhaps what really caught us out then :D
[21:04:28] looks like there are some other interesting config options in there we might want to look at
[21:04:43] I wonder if you might be up for a short call at some point in the coming days if we want to pick your brain?
[21:05:31] Another interesting thing was seeing a whole bunch of `curl` processes that were being run by the elasticsearch user, no idea what for
[21:06:16] *checks if this channel is logged*
[21:06:29] yes, woo!
[21:07:25] addshore: hmm, not sure why it would be invoking curl. elastic tends to do everything in java-land
[21:07:36] yeah, it surprised me too :P
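For reference, a sketch of the LocalSettings.php change discussed above, for a small single-node wiki farm. The per-index-type array mirrors the shape implied by "our defaults are 4"; treat the exact keys as an assumption to verify against the CirrusSearch documentation, and note (as said in the chat) that existing indices have to be recreated before the new shard count takes effect.

```php
<?php
// LocalSettings.php sketch, not a verified configuration: one shard per Cirrus
// index type, which is plenty for small wikis on a single-node cluster.
// The index-type keys below are an assumption based on the defaults mentioned above.
$wgCirrusSearchShardCount = [
	'content' => 1,
	'general' => 1,
	'archive' => 1,
	'titlesuggest' => 1,
];
// As noted in the chat, indices created with the old shard count must be
// recreated (e.g. by rebuilding them with the extension's maintenance scripts)
// before this setting has any effect.
```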