[08:07:20] dcausse: if you have a few minutes: https://docs.google.com/document/d/1E8TW3CP_TnIDLoHUxceGpFvnZB_mtI3XIZAOa5qNkO4/edit#
[08:08:27] gehel: sure, I had a quick look at it already, was not sure why Wikibase itself was covered
[08:08:59] These are the notes from yesterday's pentest meeting. If you could have a look and add anything that I'm missing, in particular in terms of interesting code repositories
[08:09:30] sure
[08:10:05] I think that the expected scope was wikidata + w[cd]qs. There have been pentests on MediaWiki already, but having a specific focus on Wikidata makes sense; it is different enough from other wikis that the attack vectors might be different
[08:10:34] ok
[08:10:46] I'm also interested if you have other ideas for potential scope (or scope exclusions).
[08:11:32] I think I'd add a link to the puppet code doing the nginx setup
[08:11:53] crashing wdqs is a known problem, but you already mentioned this
[08:13:01] yeah, I'm really not interested in knowing about other convoluted ways to crash the service. We have enough simple ways to do that already :)
[08:13:58] I would leave the Wikidata side of that document to Kara / Lydia. But if you have good ideas of things to add on their side, feel free!
[08:15:33] wdqs should be read-only, but the backend is writable and only nginx is protecting us, so perhaps I'd mention this
[08:20:12] hm, https://github.com/wikimedia/wikidata-query-blazegraph is not synced with https://gerrit.wikimedia.org/r/plugins/gitiles/wikidata/query/blazegraph/+/refs/heads/master
[08:21:05] let me change the link
[08:25:42] I think the "rdf" repo is properly synced, only blazegraph apparently isn't...
[08:27:05] ^ not sure I understand
[08:28:29] sorry, I was looking at a different section
[08:28:40] links to the codebases are at the top as well
[08:28:49] did not see the ones at the bottom
[08:29:21] yep, the first part is the notes from the meeting. I added everything after "Main components" to have a bit more structure
[08:29:46] I'll let Maryum choose how she wants to format all that before sending it to the consultants.
[09:52:31] lunch
[10:44:51] lunch 2
[12:56:04] might be a stupid question, but I wonder what to use in statsd to record the cirrus doc sizes... what I want is something similar to "timing"
[13:07:09] reading https://github.com/statsd/statsd/issues/98 I guess it's fine to use "timing" for something that's not a time
[13:07:46] a histogram would be nice perhaps, but it's not available from MediaWiki
[15:48:18] Spark memory usage is so mysterious... I have run the dump numerous times and it always fails with YARN killing things. The largest output partition is now ~1GB, but I'm up to 18G executors w/ 2 tasks each trying to get it to finish (might actually work this time, but at 12G it failed)
[15:49:03] or maybe it's Python, who knows. YARN should really give a report about the processes involved when it kills things, instead of just a top-level number
[15:56:42] weird that it needs so much mem
[16:00:15] yeah, it seems odd to me too... maybe it's the Python in the middle; most of the RAM is assigned to memoryOverhead, which is non-JVM heap (Python, plus some JVM stuff)
[16:01:20] maybe I could consider redoing this in Java, it's not a big script. It would take more time, I suppose, to handle the Elasticsearch JSON on the Java side though. Or I suppose I should do some experiments with elasticsearch-hadoop, but that always seemed a bit magical, and in some cases (not clear when) it wants port 9300 access, which we can't give
[16:01:47] the thing I don't get is how you populate the RDD?
[16:01:59] (if you're generating an RDD)
[16:02:06] using flatMap
[16:02:10] and Python generators
[16:02:34] will it buffer everything? or create partitions when needed?
[16:02:50] you get a wiki+shard in and many docs out, no?
[16:03:41] I had to divide it up; I pre-process the shard info into a bunch of individual requests that have min/max values to use with search_after. The prod cluster turns into 22,407 partitions with my current settings
[16:04:04] maybe that wasn't necessary, unsure, but that was my first attempt to deal with the memory issues, guessing the problem was trying to pull in the ~50GB partitions
[16:05:22] in the current run, with 13.5k of those complete, the largest partition is 2GB of output data from 20k docs, but there are other partitions that have 250k docs and 553MB
[16:05:35] ok
[16:06:10] I wonder how elasticsearch-hadoop solves this problem though; they might implement a lower-level Spark component
[16:07:01] yeah, they likely (I haven't opened the code) implement a custom Spark data source, which implements the Spark read layer instead of constructing an RDD with one object per partition holding the request
[16:07:39] but the Python side can't (easily, at least; maybe with lots of wrangling) implement that. Maybe I should learn to do more Spark from the Java side :P
[16:07:52] possibly :)
[16:08:00] but I suppose I'm not expecting the elasticsearch-hadoop thing to be better about memory usage. I should at least try it out though
[16:14:38] going offline, have a nice weekend
[17:02:43] managed to complete this time at least
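(Aside: a rough, hypothetical sketch of the parallelize + flatMap-over-generators approach described above, not the actual dump script. ES_HOST, the index names, the page_id sort field, and the output path are placeholders; the real job pre-splits each shard into the ~22k bounded requests mentioned earlier.)

```python
import json

import requests
from pyspark.sql import SparkSession

ES_HOST = "http://localhost:9200"  # assumption: an HTTP search endpoint reachable from executors
PAGE_SIZE = 1000


def fetch_slice(req):
    """Yield documents for one bounded (index, min, max) slice via search_after.

    Because this is a generator, only one page of hits lives in Python at a
    time; Spark consumes it lazily while writing the partition out, instead of
    materializing a whole ~50GB shard in memory.
    """
    index, lo, hi = req["index"], req["min"], req["max"]
    search_after = None
    while True:
        body = {
            "size": PAGE_SIZE,
            "query": {"range": {"page_id": {"gte": lo, "lt": hi}}},
            "sort": [{"page_id": "asc"}],
        }
        if search_after is not None:
            body["search_after"] = search_after
        resp = requests.post(f"{ES_HOST}/{index}/_search", json=body, timeout=120)
        resp.raise_for_status()
        hits = resp.json()["hits"]["hits"]
        if not hits:
            return
        for hit in hits:
            yield json.dumps(hit["_source"])
        search_after = hits[-1]["sort"]


if __name__ == "__main__":
    spark = SparkSession.builder.appName("cirrus-dump-sketch").getOrCreate()
    sc = spark.sparkContext
    # In the real job this list comes from pre-processing the shard info into
    # thousands of bounded requests; two hand-written examples stand in for it.
    slices = [
        {"index": "enwiki_content", "min": 0, "max": 100_000},
        {"index": "enwiki_content", "min": 100_000, "max": 200_000},
    ]
    sc.parallelize(slices, numSlices=len(slices)) \
      .flatMap(fetch_slice) \
      .saveAsTextFile("hdfs:///tmp/cirrus_dump_sketch")
```

The elasticsearch-hadoop alternative discussed above would replace the parallelize/flatMap part with its own Spark data source (its docs describe a `spark.read.format("org.elasticsearch.spark.sql")` reader), so the read planning happens in Spark's source API rather than via one request object per partition.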
[19:36:26] again today the thread pools spiked up to 1k around 18:30 :S 1054 ends up as an outlier. Still not clear from the stats where the extra load comes from
[20:03:12] the per-index statistics make a reasonable claim that it's commonswiki queries. Over the last 12 hours, enwiki_content ramps rather typically from 150 cores/s to 300 cores/s; commonswiki_file spikes from 30 cores/s to 336 cores/s at 14:20, slowly ramps down to normal over 3 hours, then spikes back up to 260 cores/s at 18:30 and is slowly ramping down
[20:05:37] wikidatawiki_content has a similar tiered output, with a baseline of 30 cores/s but spikes up to 100-150 cores/s for multi-hour timespans. Not sure what's appropriate...
[20:08:36] actual qps doesn't change though; elasticsearch_indices_search_query_time_seconds_total is increasing out of lockstep with elasticsearch_indices_search_query_total
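(Aside: a tiny illustrative sketch of how the "cores/s" figures relate to those two counters. The increase rate of query_time_seconds_total is roughly the number of cores kept busy by search, and dividing its delta by the delta of query_total gives mean per-query latency, which is what drifts when the two counters move out of lockstep. The sample numbers below are invented.)

```python
# Interpret two samples of the counters named above.
# Each sample is (unix_timestamp, query_total, query_time_seconds_total).

def search_load(sample_a, sample_b):
    ts_a, count_a, secs_a = sample_a
    ts_b, count_b, secs_b = sample_b
    wall = ts_b - ts_a
    d_count = count_b - count_a
    d_secs = secs_b - secs_a
    cores_busy = d_secs / wall    # ~ rate(elasticsearch_indices_search_query_time_seconds_total)
    qps = d_count / wall          # ~ rate(elasticsearch_indices_search_query_total)
    mean_latency = d_secs / d_count if d_count else float("nan")
    return cores_busy, qps, mean_latency

# Query time growing faster than query count => same qps, higher latency.
print(search_load((0, 10_000, 3_000.0), (300, 40_000, 3_900.0)))
# -> 3.0 cores busy, 100 qps, 0.03 s mean latency
```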