[08:07:20] dcausse: if you have a few minutes: https://docs.google.com/document/d/1E8TW3CP_TnIDLoHUxceGpFvnZB_mtI3XIZAOa5qNkO4/edit#
[08:08:27] gehel: sure, I had a quick look at it already, was not sure why Wikibase itself was covered
[08:08:59] These are the notes from yesterday's pentest meeting. If you could have a look and add anything that I'm missing, in particular in terms of interesting code repositories
[08:09:30] sure
[08:10:05] I think that the expected scope was wikidata + w[cd]qs. There have been pentests on MediaWiki already, but having a specific focus on Wikidata makes sense; it is different enough from other wikis that the attack vectors might be different
[08:10:34] ok
[08:10:46] I'm also interested if you have other ideas for potential scope (or scope exclusions).
[08:11:32] I think I'd add a link to the puppet code doing the nginx setup
[08:11:53] crashing wdqs is a known problem, but you already mentioned this
[08:13:01] yeah, I'm really not interested in knowing about other convoluted ways to crash the service. We have enough simple ways to do that already :)
[08:13:58] I would leave the Wikidata side of that document to Kara / Lydia. But if you have good ideas of things to add on their side, feel free!
[08:15:33] wdqs should be read-only, but the backend is writable and only nginx is protecting us, so perhaps I'd mention this
[08:20:12] hm, https://github.com/wikimedia/wikidata-query-blazegraph is not synced with https://gerrit.wikimedia.org/r/plugins/gitiles/wikidata/query/blazegraph/+/refs/heads/master
[08:21:05] let me change the link
[08:25:42] I think the "rdf" repo is properly synced, only blazegraph apparently isn't...
[08:27:05] ^ not sure I understand
[08:28:29] sorry, I was looking at a different section
[08:28:40] links to the codebases are at the top as well
[08:28:49] did not see the ones at the bottom
[08:29:21] yep, the first part is the notes from the meeting. I added everything after "Main components" to have a bit more structure
[08:29:46] I'll let Maryum choose how she wants to format all that before sending it to the consultants.
[09:52:31] lunch
[10:44:51] lunch 2
[12:56:04] might be a stupid question, but I wonder what to use in statsd to record the cirrus doc sizes... what I want is something similar to "timing"
[13:07:09] reading https://github.com/statsd/statsd/issues/98 I guess it's fine to use "timing" for something that's not a time
[13:07:46] a histogram would be nice perhaps, but it's not available from MediaWiki
[15:48:18] Spark memory usage is so mysterious... I have run the dump numerous times and it always fails with YARN killing things. The largest output partition is now ~1GB, but I'm up to 18G executors w/ 2 tasks each trying to get it to finish (might actually work this time, but at 12G it failed)
[15:49:03] or maybe it's Python, who knows. YARN should really give a report about the processes involved when it kills things, instead of just a top-level number
[15:56:42] weird that it needs so much mem
[16:00:15] yeah, it seems odd to me too... maybe it's the Python in the middle; most of the RAM is assigned to memoryOverhead, which is non-JVM heap (Python, plus some JVM stuff)
[16:01:20] maybe I could consider redoing this in Java, it's not a big script. It would take more time, I suppose, to handle the Elasticsearch JSON on the Java side though. Or I suppose I should do some experiments with elasticsearch-hadoop, but that always seemed a bit magical, and in some cases (not clear when) it wants port 9300 access, which we can't give
[16:01:47] the thing I don't get is how you populate the RDD?
[16:01:59] (if you're generating an RDD)
[16:02:06] using flatMap
[16:02:10] and Python generators
[16:02:34] will it buffer everything? or create partitions when needed?
[16:02:50] you get a wiki+shard in and many docs out, no?
[16:03:41] I had to divide it up; I pre-process the shard info into a bunch of individual requests that have min/max values to use with search_after. The prod cluster turns into 22,407 partitions with my current settings
[16:04:04] maybe that wasn't necessary, unsure, but that was my first attempt to deal with the memory issues, guessing the problem was trying to pull in the ~50GB partitions
[16:05:22] in the current run, with 13.5k of those complete, the largest partition is 2GB of output data from 20k docs, but there are other partitions that have 250k docs and 553MB
[16:05:35] ok
[16:06:10] I wonder how elasticsearch-hadoop solves this problem though; they might implement a lower-level Spark component
[16:07:01] yeah, they likely (I haven't opened the code) implement a custom Spark data source, which implements the Spark read layer instead of constructing an RDD with one object per partition holding the request
[16:07:39] but the Python side can't (easily, at least; maybe with lots of wrangling) implement that. Maybe I should learn to do more Spark from the Java side :P
[16:07:52] possibly :)
[16:08:00] but I suppose I'm not expecting the elasticsearch-hadoop thing to be better about memory usage. I should at least try it out though
[16:14:38] going offline, have a nice weekend
[17:02:43] managed to complete this time at least
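(Aside: a rough, hypothetical sketch of the parallelize + flatMap-over-generators approach described above, not the actual dump script. ES_HOST, the index names, the page_id sort field, and the output path are placeholders; the real job pre-splits each shard into the ~22k bounded requests mentioned earlier.)

```python
import json

import requests
from pyspark.sql import SparkSession

ES_HOST = "http://localhost:9200"  # assumption: an HTTP search endpoint reachable from executors
PAGE_SIZE = 1000


def fetch_slice(req):
    """Yield documents for one bounded (index, min, max) slice via search_after.

    Because this is a generator, only one page of hits lives in Python at a
    time; Spark consumes it lazily while writing the partition out, instead of
    materializing a whole ~50GB shard in memory.
    """
    index, lo, hi = req["index"], req["min"], req["max"]
    search_after = None
    while True:
        body = {
            "size": PAGE_SIZE,
            "query": {"range": {"page_id": {"gte": lo, "lt": hi}}},
            "sort": [{"page_id": "asc"}],
        }
        if search_after is not None:
            body["search_after"] = search_after
        resp = requests.post(f"{ES_HOST}/{index}/_search", json=body, timeout=120)
        resp.raise_for_status()
        hits = resp.json()["hits"]["hits"]
        if not hits:
            return
        for hit in hits:
            yield json.dumps(hit["_source"])
        search_after = hits[-1]["sort"]


if __name__ == "__main__":
    spark = SparkSession.builder.appName("cirrus-dump-sketch").getOrCreate()
    sc = spark.sparkContext
    # In the real job this list comes from pre-processing the shard info into
    # thousands of bounded requests; two hand-written examples stand in for it.
    slices = [
        {"index": "enwiki_content", "min": 0, "max": 100_000},
        {"index": "enwiki_content", "min": 100_000, "max": 200_000},
    ]
    sc.parallelize(slices, numSlices=len(slices)) \
      .flatMap(fetch_slice) \
      .saveAsTextFile("hdfs:///tmp/cirrus_dump_sketch")
```

The elasticsearch-hadoop alternative discussed above would replace the parallelize/flatMap part with its own Spark data source (its docs describe a `spark.read.format("org.elasticsearch.spark.sql")` reader), so the read planning happens in Spark's source API rather than via one request object per partition.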
[19:36:26] again today the thread pools spiked up to 1k around 18:30 :S 1054 ends up as an outlier. Still not clear from the stats where the extra load comes from
[20:03:12] the per-index statistics make a reasonable claim that it's commonswiki queries. Over the last 12 hours, enwiki_content ramps rather typically from 150 cores/s to 300 cores/s; commonswiki_file spikes from 30 cores/s to 336 cores/s at 14:20, slowly ramps down to normal over 3 hours, then spikes back up to 260 cores/s at 18:30 and is slowly ramping down
[20:05:37] wikidatawiki_content has a similar tiered output, with a baseline of 30 cores/s but spikes up to 100-150 cores/s for multi-hour timespans. Not sure what's appropriate...
[20:08:36] actual qps doesn't change though; elasticsearch_indices_search_query_time_seconds_total is increasing out of lockstep with elasticsearch_indices_search_query_total
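(Aside: a tiny illustrative sketch of how the "cores/s" figures relate to those two counters. The increase rate of query_time_seconds_total is roughly the number of cores kept busy by search, and dividing its delta by the delta of query_total gives mean per-query latency, which is what drifts when the two counters move out of lockstep. The sample numbers below are invented.)

```python
# Interpret two samples of the counters named above.
# Each sample is (unix_timestamp, query_total, query_time_seconds_total).

def search_load(sample_a, sample_b):
    ts_a, count_a, secs_a = sample_a
    ts_b, count_b, secs_b = sample_b
    wall = ts_b - ts_a
    d_count = count_b - count_a
    d_secs = secs_b - secs_a
    cores_busy = d_secs / wall    # ~ rate(elasticsearch_indices_search_query_time_seconds_total)
    qps = d_count / wall          # ~ rate(elasticsearch_indices_search_query_total)
    mean_latency = d_secs / d_count if d_count else float("nan")
    return cores_busy, qps, mean_latency

# Query time growing faster than query count => same qps, higher latency.
print(search_load((0, 10_000, 3_000.0), (300, 40_000, 3_900.0)))
# -> 3.0 cores busy, 100 qps, 0.03 s mean latency
```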