[09:57:27] dcausse: I'm not sure how to properly describe my scenario in cucumber - what are those annotations at the beginning of each .feature file?
[09:59:07] zpapierski: they're like before/after junit init/cleanup steps
[09:59:23] they're defined in support/hooks.js
[09:59:43] so for you there'll likely be a hook feeding the data
[10:00:52] e.g. @suggest, it creates a bunch of pages and then calls the cirrus-suggest-index API
[10:02:42] well, you perhaps don't need that actually, since the data you need does not rely on the wiki database... simply initializing it might be enough; all you need is some scripts under tests/jenkins (most likely resetMwv.sh)
[10:04:26] so your scenario might just assume that the data is there
[10:05:33] could be as simple as "When I query completion search for f then foo is the first api result"
[10:07:08] you'll then have to define a support function for this in e.g. step_definitions/search_steps.js
[10:08:05] the tricky part is finding a "sentence" that won't conflict with existing regexes in other step definitions
[10:18:44] I already defined the steps themselves, I understood that far - I just missed that hook thing, thanks
[10:18:56] I think they're not conflicting
[10:22:27] lunch
[11:53:26] meal 3 break
[13:22:45] dcausse: how can I force a cindy run? it didn't run on my latest change
[13:23:02] (or didn't report, not sure)
[13:29:20] zpapierski: I'm debugging cindy at the moment
[13:29:50] sometimes if the build is broken badly cindy won't even run
[13:29:57] ah, ok :)
[13:30:05] that wouldn't be surprising
[13:30:43] might happen if the maint scripts to initialize the env are broken, for instance
[13:31:28] huh, can't log into the cindy host
[13:31:57] cirrus-integ.eqiad.wmflabs
[13:32:13] "Connection closed by UNKNOWN port 65535" - familiar?
[13:32:23] zpapierski: you should rebase your patch, it's unlikely to work from its current base
[13:32:35] ok
[13:33:51] "Connection closed by UNKNOWN port 65535" - what is this, ssh?
[13:33:56] yep
[13:34:26] ssh config issues, bastion for wmflabs might not be set up properly?
[13:34:37] I'm debugging my ssh connection
[13:35:09] but I'm logging into other wmcs instances without issue
[13:36:04] I think I see this error when I wait too long before entering my passphrase into the ssh-agent
[13:36:39] but maybe not...
[13:38:48] Warning: sizeof(): Parameter must be an array or an object that implements Countable in /vagrant/mediawiki/extensions/CirrusSearch/maintenance/UpdateQueryCompletionIndex.php on line 63
[13:38:52] zpapierski: ^
[13:38:57] huh
[13:39:01] thx
[13:39:58] cindy is running, let's see if it reports something (sadly it won't report such errors, only the failed tests)
[13:42:05] I messed something up in the script, don't know yet why (copied from a working source)
[13:42:29] hmm, I probably shouldn't do that per wiki
[13:44:22] ok, there must be a reason why the rest of the resources are provided from the root
[13:44:25] will do the same
[13:55:40] ok, let's see if that helps
[14:13:12] zpapierski: https://www.mediawiki.org/wiki/JetBrains_IDEs#MediaWiki_code_style might help, MW code style loves spaces
[14:13:48] ahh, thanks for that - I started to change them manually
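For reference, a minimal sketch of what the completion-search step discussed above (10:05-10:08) could look like in step_definitions/search_steps.js. The cucumber.js When/Then registration and chai's expect are real APIs, but the exact import style, the step wording, the searchApi.completionSearch helper, and the world-object layout are assumptions for illustration, not the actual CirrusSearch test code. Anchored regexes are one way to avoid the step-conflict problem mentioned at 10:08.

```javascript
// Hypothetical sketch for step_definitions/search_steps.js.
// Helper names (this.searchApi, this.apiResponse) are illustrative only.
const { When, Then } = require( 'cucumber' );
const { expect } = require( 'chai' );

// Anchored regexes (^...$) make collisions with regexes registered by
// other step definition files less likely.
When( /^I query completion search for (.+)$/, async function ( query ) {
	// Assumes the world object exposes an API client; keep the raw
	// response around so later "Then" steps can inspect it.
	this.apiResponse = await this.searchApi.completionSearch( query );
} );

Then( /^(.+) is the first api result$/, function ( title ) {
	// Assumes a list=prefixsearch-shaped response.
	const results = this.apiResponse.query.prefixsearch;
	expect( results ).to.not.be.empty;
	expect( results[ 0 ].title ).to.equal( title );
} );
```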
[14:22:49] zpapierski, ebernhardson1: T259674 and T258738 are ready for dev but assigned to you - is this something you plan to work on soon? happy to help and tackle one of these
[14:22:50] T259674: Ship query completion indices from analytics to prod clusters - https://phabricator.wikimedia.org/T259674
[14:22:50] T258738: Build query-clicks dataset from SearchSatisfaction logging - https://phabricator.wikimedia.org/T258738
[14:24:25] I think that would be awesome
[14:24:47] we should have everything we need for that
[14:24:56] need to relocate, be back for triaging
[14:33:56] break
[15:50:34] * ebernhardson1 will someday figure out why his laptop freezes for ~2 sec whenever starting the hadoop integration environment
[16:24:09] dinner
[16:49:50] ryankemper: is now an okay time to do some reindexing?
[16:50:08] it may run for the whole week...
[16:50:19] Trey314159: fire away
[16:50:28] Cool! Thanks!
[18:48:30] heh, started reindexing and cloudelastic started complaining :) I wouldn't worry about it yet though
[21:50:13] Is the Wikidata Query Service implemented with sharded Blazegraph servers? If not, is there a reason why that wasn't pursued?
[22:54:01] hare: what would the partition function be for sharding wikidata?
[22:54:33] I have not the slightest idea
[22:56:32] classically, "sharding" means separating data by some dimension into separate buckets
[22:57:09] for a multi-tenant app, "customer" is a classic shard discriminator
[22:57:39] but for a thing like wikidata it could be tricky to find a dimension to cut on
[22:57:44] I think Wikidata Query Service is single-tenant?
[22:57:46] Right
[22:58:03] the different parts of a triple (subject, property, object) might be one way
[22:58:17] unless I guess blazegraph has some map/reduce distributed functionality
[22:58:27] That's what I assumed
[22:59:56] I haven't been deep into blazegraph, but my recollection from ~5 years ago was that it did not have any distributed query system. I remember it being a "scale up" app, not a "scale out" app
[23:01:03] If that's how it was five years ago that's probably how it is now
[23:01:11] I'm considering another approach that uses Redis
[23:01:40] I am pretty sure Redis supports sharding, but I don't know if it's as turnkey as I would want it to be
[23:02:11] T206560 is probably relevant generally to blazegraph things
[23:02:11] T206560: [Epic] Evaluate alternatives to Blazegraph - https://phabricator.wikimedia.org/T206560
[23:02:14] The challenge is supporting a query service at the scale of Wikidata but without ultra-large servers
[23:03:07] I really hope they find something; Blazegraph has been in a very sorry state since Amazon poached it
[23:03:25] It really does feel like interacting with a black box
[23:03:35] "the scale of Wikidata but without ultra-large servers" - so an NP-hard problem?
[23:05:10] Perhaps...
[23:06:47] There's the indexing strategy, then there's knowing which shard the relevant data is in...
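To make the partition-function question above concrete, here is a toy sketch of hashing triples by subject. It is purely illustrative and says nothing about how Blazegraph or WDQS actually store data; the shard count, item IDs, and predicates are made up. The point is that the choice of dimension matters: subject-anchored lookups stay on one shard, while object-side lookups and multi-hop traversals have to ask every shard and join the results somewhere.

```javascript
// Toy illustration only: hash-partition triples by their subject IRI.
const crypto = require( 'crypto' );

function shardFor( subject, nShards ) {
	// Stable hash of the subject, e.g. "wd:Q42".
	const digest = crypto.createHash( 'sha1' ).update( subject ).digest();
	return digest.readUInt32BE( 0 ) % nShards;
}

// "wd:Q42 wdt:P31 wd:Q5" lands on the shard owning wd:Q42, so a query
// anchored on that subject hits a single shard...
console.log( shardFor( 'wd:Q42', 100 ) );

// ...but a pattern like "?person wdt:P31 wd:Q5" (all humans) has no fixed
// subject to hash, so every shard must be queried and the partial results
// merged by some coordinator.
```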
[23:06:58] I really don't know of any useful graph database that doesn't hold the graph in RAM in some way to allow full traversal
[23:07:36] so you can have 100 small servers or one huge server, but ultimately you need it all in RAM pretty much all the time
[23:08:31] and 100 small servers adds the need for query coordination servers to aggregate the partial results from the shards
[23:09:13] The main benefit would be if your dataset/index occupies more RAM than can fit on a machine
[23:10:11] Facebook also uses plain MySQL/memcache for their graphs https://engineering.fb.com/2013/06/25/core-data/tao-the-power-of-the-graph/
[23:10:40] facebook also has ~40 engineers who write custom storage engines for mysql :)
[23:10:59] i also built a copy of some facebook graph tech. Scales horizontally on elasticsearch servers, but has significant drawbacks compared to SPARQL: https://wikitech.wikimedia.org/wiki/Tool:Wikibase_Unicorn
[23:11:20] (minimal, POC copy :P)
[23:12:43] * ryankemper takes a note to check that out later, sounds neat
[23:14:39] hare: but yup, it is definitely like interacting w/ a black box. We've basically accepted we have to move off blazegraph long-term on the team, it's just a question of exactly when that happens and what we move to. To your point, it is basically abandonware as far as amazon is concerned
[23:16:40] I don't know enough about the internals to really say if there's a way to partition it out like you're saying, but I think bryan's point about needing to store a bunch of stuff in RAM for efficient traversal is definitely the case
[23:17:05] so from an operational perspective it's definitely a scale-up-not-out service, and given that we know we're not going to stay on it forever, trying to figure out ways to make it more horizontally scalable probably wouldn't have a great return on investment
[23:17:26] (although arguably if there were a way to magically make it scale like elasticsearch does, that would presumably obviate the need to actually move off it... but I digress)
[23:18:16] from my prior research, the problem isn't just storing it in RAM but communication between steps in the database. To have a fast graph database it needs to look up all the edges in memory, not send a network request that comes back 2ms later. This fan-out of computation between nodes ends up dominating the compute time.
[23:20:48] the difference with elasticsearch is all shards are independent. In elasticsearch each shard does its thing and reports to a coordinator. The coordinator gets the responses and does a sort/slice. With a distributed graph, each step in the graph that isn't within the shard has to go over the network - lots of communication
[23:34:14] I think I come here and ask about this every three months or so, so I appreciate your continued consultation
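A rough back-of-the-envelope sketch of the fan-out point made at 23:18-23:20: all of the numbers below are invented, and the model is deliberately crude; it only illustrates why sequential per-hop network round trips dominate a distributed graph traversal while a search-style scatter/gather pays the network cost roughly once.

```javascript
// Invented numbers, purely to illustrate the shape of the problem.
const NETWORK_RTT_MS = 2;      // cross-node round trip ("comes back 2ms later")
const LOCAL_LOOKUP_MS = 0.001; // in-memory edge lookup

// Distributed graph traversal: every edge that leaves the local shard costs
// a network round trip, and the hops happen one after another.
function traversalCostMs( hops, edgesPerHop, remoteFraction ) {
	const remoteEdges = hops * edgesPerHop * remoteFraction;
	const localEdges = hops * edgesPerHop * ( 1 - remoteFraction );
	return remoteEdges * NETWORK_RTT_MS + localEdges * LOCAL_LOOKUP_MS;
}

// Search-style scatter/gather: shards work independently and in parallel,
// so the coordinator pays roughly one round trip plus a cheap merge.
function scatterGatherCostMs( shards, hitsPerShard ) {
	return NETWORK_RTT_MS + shards * hitsPerShard * LOCAL_LOOKUP_MS;
}

console.log( traversalCostMs( 3, 100, 0.99 ).toFixed( 1 ), 'ms for a 3-hop graph traversal' );
console.log( scatterGatherCostMs( 100, 20 ).toFixed( 1 ), 'ms for a 100-shard scatter/gather' );
```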