[09:55:51] I'm pausing the mjolnir dag
[10:15:38] lunch
[13:03:20] o/
[13:22:47] having NickServ issues ATM, brb
[13:24:01] aaand back
[13:35:24] Still having IRC issues, I might be in and out until this gets fixed
[13:47:56] OK, fixed. Not sure why my pw stopped working, but whatever
[15:04:18] mpham won't be at sprint planning, they're at the serviceops mtg
[15:45:38] inflatador: is https://phabricator.wikimedia.org/T321491 in progress for you?
[15:52:30] mpham Yes
[16:05:16] Ugh! Google Calendar has decided to randomly stop giving me alerts. Sorry I missed the triage meeting. I was fighting python and Jenkins.
[16:42:06] hi there. addshore suggested I look here for some suggestions about my own wikidata-query-rdf setup (or, at least, i'm trying...)
[16:42:50] madbob: we can try :)
[16:43:36] well, the problem is very stupid, but after days of different attempts i haven't found a solution yet
[16:44:31] when executing runBlazegraph.sh, it outputs tons of debug logging
[16:46:25] i've appended many different parameters to the java command line, such as -Dcom.bigdata.level=WARN (which should be the log4j parameter), but I always get the same behavior
[16:47:22] i also tried overriding the log4j.properties file, trying to enforce a higher logging level, with no results
[16:48:53] then, when executing loadRestAPI.sh, it produces a stream of debug data. it seems a bit strange to me that this is intended to be the normal, production-ready behavior...
[16:52:15] what produces output like:
[16:52:31] 16:38:36.351 [com.bigdata.journal.Journal.executorService1] DEBUG com.bigdata.sparse.TPS - [... lots of information... ]
[16:52:32] ?
[16:53:36] i figured that enforcing some log4j parameter would let me handle this, but that hasn't worked (or, at least, i'm not doing it in the expected and right way)
[17:00:23] hm
[17:02:35] madbob: i think the problem is we use logback, you need something like https://github.com/wikimedia/puppet/blob/production/modules/query_service/templates/logback.xml.erb which we place in /etc/${deploy_name}/logback-${title}.xml, along with setting LOG_CONFIG=/etc/<%= @deploy_name %>/logback-<%= @title %>.xml in /etc/default/${title}
[17:02:57] essentially, you need the LOG_CONFIG environment variable to point at a logback xml file
[17:03:16] that's picked up by the runBlazegraph.sh script
[17:04:07] tbh i don't fully understand the jvm logging configurations, there are so many ways... but that seems to be how our services are configured
[17:06:17] logback... ok, at least it's a different path to explore, since manipulating log4j parameters failed
[17:06:23] lunch, back in ~1h
[17:18:22] yes, probably the logback configuration file enforced through LOG_CONFIG is the correct way. i just have to figure out how to populate the variables in the sample template you linked, but that's the easy part ;-)
[17:18:25] thank you!
[17:18:54] madbob: np, good luck :) Come back any time, usually someone is around EU and US hours
[18:10:49] back
[18:32:35] CI has been so meh recently :( A patch failed phan in a way that doesn't make a lot of sense, so i had CI run against the master branch. The master branch also failed CI, but in the selenium tests instead of phan :P
[19:11:10] AI=Ambiguous Integration
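(For the logback/LOG_CONFIG approach discussed above around 17:02, here is a minimal sketch of wiring it up around runBlazegraph.sh. The logback.xml contents, the /etc/wdqs path, and the WARN threshold are assumptions for a standalone setup, not the production template.)

```python
#!/usr/bin/env python3
"""Sketch: quiet a standalone Blazegraph by pointing LOG_CONFIG at a minimal
logback config before starting runBlazegraph.sh. Paths and the WARN threshold
are assumptions, not the production configuration."""
import os
import subprocess

LOGBACK_XML = """\
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <!-- Raise the root threshold so DEBUG chatter is dropped -->
  <root level="WARN">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
"""

config_path = "/etc/wdqs/logback.xml"  # hypothetical location
with open(config_path, "w") as f:
    f.write(LOGBACK_XML)

# runBlazegraph.sh picks up LOG_CONFIG and hands it to the JVM as the
# logback configuration file.
env = dict(os.environ, LOG_CONFIG=config_path)
subprocess.run(["./runBlazegraph.sh"], env=env, check=True)
```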
[20:35:26] looks like the data reload finished. There are errors, but not sure if they mean anything: https://phabricator.wikimedia.org/P36109
[21:05:09] hmm, i don't think there are any errors in there, looks ok
[21:05:52] the FAIL line is perhaps sus, but it seems to suggest 0 instances fail'd
[21:06:42] david had said we could check the triple count to see if it looks reasonable, and i suppose we can run the test queries from the rdf repo's queries dir
[21:07:04] heya folks - wondering when your next office hours are coming up. I'm almost done putting all the wiki dumps through a kafka pipeline in avro format. I def have some q's depending on when the next one is coming up
[21:07:41] kristianrickert: first wednesday of each month, i think that should be next week on the 2nd
[21:08:13] avro was probably annoying, our schema isn't all that strict :(
[21:08:30] hahah it wasn't bad, I'm just using a few fields
[21:09:33] I just have title, body, some ids, and a bunch of NLP fields I'm making for fun. I made about 5-6 avro models, but I also made the document have a map for custom fields if I feel like dumping data on it
[21:10:34] ahh, yea that's not going to be as bad
[21:10:51] I dumped all the redirects - maybe my parser isn't catching them, but do documents also carry which keywords redirect to them?
[21:11:21] right now I think I'm going to throw all those in a simple table until the crawl is done, then dump that table into the documents.
[21:11:26] inflatador: oh i totally missed the earlier line: `File /srv/query_service/munged/wikidump-000000803.ttl.gz not found, terminating`. Does sound like it ended early, i think there should be ~1700 files. Should issue a count query to get the # of triples to see if it's bad
[21:12:26] kristianrickert: sadly no, it's something we've pondered extracting in a batch job, but nothing we've gotten around to. Previous analysis was that for the most part all that link text is already found in the redirects field
[21:13:19] generally redirects get highly weighted in our ranking, they also sometimes make search engineers think we have nlp magic when really it's just human editors providing context words :)
[21:13:20] yeah, in the long run it's NBD for me either way. I'm going to try the new dense vector stuff once I have a good pipeline in practice
[21:13:33] hahahha
[21:13:37] ebernhardson can I do that from the WCQS GUI, or what's the best way to do that?
[21:13:58] inflatador: i think you can do `SELECT (COUNT(*) as ?Triples) WHERE { ?s ?p ?o}`
[21:13:58] yeah, so it's not a bad idea to keep that redirect data. I just wish I didn't have to match off the title and it would also have the document id in it
[21:14:54] kristianrickert: hmm, our dumps should have page_id's in them, every other line
[21:15:00] it's parsing the XML from an old java project, but it does a good job
[21:15:03] kristianrickert: oh :)
[21:15:15] ohhh... then I'm missing it, is all
[21:15:34] inflatador: if you can get the UI for the box that will work, otherwise you can issue it directly to the sparql endpoint, but you have to urlencode the query
[21:15:39] that's good to know. it won't be that bad of a task to fix then
[21:16:13] yeah - I was thinking about that, but your servers only let me use three threads per IP address at a time ;)
[21:16:26] ebernhardson cool, what port does the GUI use?
[21:16:56] which GUI?
[21:17:20] kristianrickert: two different convo's :) inflatador is doing something with https://commons-query.wikimedia.org/ which is an RDF triple store (graph db)
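(A small sketch of the triple-count check suggested above: issue the COUNT query to a local sparql endpoint with the query URL-encoded. The endpoint URL is an assumption for a local wdqs-style install; the wcqs :9999 path appears later in the log.)

```python
#!/usr/bin/env python3
"""Sketch: issue the triple-count query against a local SPARQL endpoint.
The endpoint URL is an assumption for a local wdqs-style install."""
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:80/sparql"  # wcqs: http://localhost:9999/bigdata/namespace/wcq/sparql
QUERY = "SELECT (COUNT(*) as ?Triples) WHERE { ?s ?p ?o }"

# urlencode the query, as suggested above, and ask for SPARQL JSON results
params = urllib.parse.urlencode({"query": QUERY})
req = urllib.request.Request(
    f"{ENDPOINT}?{params}",
    headers={"Accept": "application/sparql-results+json"},
)
with urllib.request.urlopen(req) as resp:
    results = json.load(resp)

count = results["results"]["bindings"][0]["Triples"]["value"]
print(f"{int(count):,} triples")
```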
[21:17:24] the entire app I'm making is far from any gui. It's an all-in-one CLI that lets you pick, from a menu or command-line switch, what the process should do
[21:17:30] ahhh
[21:18:06] sorry, I won't bother ya. I'm at a very robotic point of this adventure. It's just typing a lot. May as well be an ETL person at a bank.
[21:18:15] that's OK! didn't mean to interrupt
[21:18:29] inflatador: hmm, actually it's probably tedious to use. The GUI is hosted by microsites and the ATS directs requests to the right instance based on the bits in the URI
[21:18:49] oh, this is just a side project for me so I can learn kafka and use a bunch of search engines at the same time .. just because
[21:18:59] ebernhardson ah, I thought maybe there was a GUI port I could tunnel into directly. No worries, I'll just URL encode
[21:19:04] kristianrickert: no worries :) I can usually follow a few threads at a time. I would still suggest trying our search dumps, they contain much of this already extracted into easily digestible json
[21:19:52] oh... that would be a ton easier. I already have the code written for all the steps, I'm just message-izing and CLI-izing it
[21:20:36] basically, my goal is to allow each step to run as a kubernetes instance, so I can run a bunch when I eventually vectorize each doc
[21:21:46] kristianrickert: makes sense. I wonder, have you considered something like weaviate, which does that kind of thing for you?
[21:22:33] I have! so that's why I'm doing this: I like that lucidworks is moving away from Solr to Weaviate
[21:23:05] and solr's new vector stuff, well, I have no idea how to do that
[21:23:41] one of my coworkers is really good with that stuff, so he's going to show me some things. So far he's got me doing word2vec, but there are a lot better (and faster) things out there
[21:23:49] inflatador: for wdqs instances, you can do http://localhost:80/sparql, for wcqs due to auth easiest is to use the :9999 port with the full path /bigdata/namespace/wcqs/sparql
[21:24:22] the good news is, with kafka in the mix, I can just take the topic and shove it in weaviate once I get there
[21:24:51] yea that should work, i guess i was curious as I haven't had a chance to play with weaviate, but i know they already do the k8s and string-to-vector parts natively
[21:25:01] (BTW - thanks for mentioning that one, I also want to try vespa)
[21:25:10] yea vespa is super interesting
[21:25:35] i like that they have an actual ranking equation you can adjust. It's always bothered me a little that lucene has a ranking equation but it's hidden behind the scenes and tedious to adjust directly
[21:25:52] yeah, that's why I chose this data.. even though you're close to it and know it's not totally structured, it is well edited, has little bias, and can be used for all sorts of joy in the automated world
[21:26:09] yeah, I've had to write lucene rankers
[21:26:23] it's not bad, but it's really primitive
[21:26:52] indeed
[21:27:08] Solr isn't that bad to adjust or extend. They have soooo many tests in that codebase. Everyone is just running away from it for these new cherries all over
[21:28:13] sorta like how nginx is the cool guy for proxies and apache isn't for proxying, I feel like solr is the same way.
[21:28:18] the problem with these new ones, at least for us, is that varied languages often don't get handled well. One problem you might see running enwiki data is that there are dozens of languages and scripts in the enwiki corpus
[21:28:52] scripts?
[21:28:59] latin, arabic, cyrillic, etc.
[21:29:03] ohhh
[21:29:12] you mean the actual character encoding?
[21:29:15] yea
[21:29:25] I wasn't even thinking about that
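(A sketch of consuming the search dumps suggested above, assuming the "page id on every other line" layout mentioned earlier; the file name and field names are illustrative, not verified against a real dump.)

```python
#!/usr/bin/env python3
"""Sketch: walk a cirrussearch dump, which (per the chat above) alternates a
metadata line carrying the page id with the document JSON itself.
File name and field names are assumptions; inspect a real dump to confirm."""
import gzip
import json

DUMP = "enwiki-20221024-cirrussearch-content.json.gz"  # example file name

with gzip.open(DUMP, "rt", encoding="utf-8") as f:
    while True:
        header_line = f.readline()  # assumed: {"index": {"_id": "<page id>", ...}}
        doc_line = f.readline()     # assumed: the document fields themselves
        if not doc_line:
            break
        header = json.loads(header_line)
        doc = json.loads(doc_line)
        page_id = header.get("index", {}).get("_id")
        print(page_id, doc.get("title"), len(doc.get("redirect", [])), "redirects")
```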
[21:29:52] since I'm going nuts with the NLP - maybe I should think of ways I can help massage it with that - it'll be a couple weeks before I get there though
[21:30:15] to take a random page from today's front page, https://en.wikipedia.org/wiki/China?action=cirrusdump has redirects including 中国 and 中國
[21:30:57] oh, they're actually in the title?
[21:31:12] yea, or https://en.wikipedia.org/wiki/Quran has arabic in the content
[21:31:34] that'll be an interesting problem to look at
[21:31:42] there are lots of smaller things, but these are the ones that are easiest to find :)
[21:32:19] aging myself here: just 12 years ago mysql defaulted to a south american character set for the DB. Drove me nuts at times
[21:32:45] i also wonder what nlp will do with https://en.wikipedia.org/wiki/To_be,_or_not_to_be which has old-english sentence construction
[21:32:46] I'd be using mysql all happy and then out of nowhere I'd realize I forgot to change the default
[21:33:12] "Muſt giue vs pauſe, there's the reſpect" :)
[21:33:25] oh, my use of NLP right now is very very .. umm.. crappy
[21:33:44] Right now it's mainly doing NER
[21:33:51] ahh, ok. That will be easier :)
[21:34:13] but it would be fun to see if it accepts that as english
[21:35:18] the bulk of content is pretty typical english, but we have lots of fun edge cases :)
[21:44:36] inflatador: dunno if you figured that out, apparently for wcqs the url to fetch is http://localhost:9999/bigdata/namespace/wcq/sparql?query=SELECT+%28COUNT%28%2A%29+as+%3FTriples%29+WHERE+%7B+%3Fs+%3Fp+%3Fo%7D
[21:45:22] inflatador: also there is a second ui on http://localhost:9999/bigdata/ ...but i've rarely used it. Looks like you can paste a query into it
[21:58:05] inflatador: for the specific error message, 'File not found, terminating', that looks like "expected". By default it tries to keep loading munge splits up to 10k, stopping whenever it gets to a missing file. Files on disk stop at 802, so the error for file 803 is expected and "correct"
[21:59:08] the triples count on wcqs2001 is ~10M lower than on the other instances, but i think that's expected. The whole reason for the reload is that things weren't being deleted properly before
[21:59:41] it's about -0.2%, seems plausible
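(And a sketch for pulling the redirects of a single page via the ?action=cirrusdump trick mentioned above, e.g. to see the 中国 / 中國 redirects on the China article. The response envelope and field names are assumptions; check the real output and adjust.)

```python
#!/usr/bin/env python3
"""Sketch: fetch the cirrus document for one page and list its redirects,
using the ?action=cirrusdump URL mentioned above. The response envelope and
field names are assumptions; inspect the actual output and adjust."""
import json
import urllib.request

URL = "https://en.wikipedia.org/wiki/China?action=cirrusdump"
req = urllib.request.Request(URL, headers={"User-Agent": "cirrusdump-sketch/0.1"})
with urllib.request.urlopen(req) as resp:
    dump = json.load(resp)

# Tolerate either a bare document or a list of entries wrapping a "_source"
for entry in (dump if isinstance(dump, list) else [dump]):
    source = entry.get("_source", entry)
    for redirect in source.get("redirect", []):
        print(redirect.get("title"))
```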