[07:56:59] o/
[08:10:33] \o
[08:36:41] o/
[09:48:51] dcausse: ever come across this? https://www.irccloud.com/pastebin/IfYqzlAw/
[09:49:14] I'm trying to run bootstrap based on the rev map I generated for WCQS and hit a slight issue
[09:49:53] This is the script I'm using: https://www.irccloud.com/pastebin/MxpxgOSn/
[09:50:24] hm.. at a glance it should work
[09:50:41] something with my user on stat1007 perhaps
[09:50:51] or the flink config, but I did copy your flink dir
[09:51:36] not sure how this works, but it says: analytics-search/stat1004.eqiad.wmnet@WIKIMEDIA
[09:51:44] aah
[09:51:53] and you use stat1007
[09:52:01] and herein lies the issue, flink was copied from stat1004 to stat1007
[09:52:04] thanks, missed that
[09:52:43] ok, now it logged in
[09:52:48] cool
[09:52:49] still fails on something
[09:53:08] familiar? https://www.irccloud.com/pastebin/boYGpPEL/
[09:53:26] I copied the whole dir, so I assumed the classpath should match
[09:53:54] yes, you need to copy the state-processor-api jar to opt iirc
[09:54:18] you mean to lib from opt?
[09:54:36] you copied flink-1.12.1-wdqs from my home folder on stat1004?
[09:54:43] nope, 1.13.1
[09:54:47] sorry, 1.13.2
[09:54:54] but that helped with that
[09:56:14] did you compile the job against 1.13.2?
[09:56:42] it had issues with this version, esp. with the state-processor-api IIRC
[09:56:55] ah, ok - so we still use 1.12?
[09:57:00] I thought we migrated
[09:57:07] yes, for now we're still on 1.12
[09:57:14] we'll skip 1.13
[09:57:16] ok then, stepping down
[09:58:42] also it complains about a missing param - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater#First_run_(bootstrap) - this doesn't mention --checkpoint_dir but that's what it complains about
[09:58:49] why do we need both params, anyway?
[09:59:53] no reason really, it's a weird requirement of this API (fixed in 1.14: https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/719456)
[10:00:33] I see, so this parameter is ignored here, then?
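[Editor's note] For context on the principal mismatch above: the flink dir copied from stat1004 still carried a flink-conf.yaml pointing at the stat1004 principal. A hedged sketch of the relevant Flink kerberos options, adjusted for stat1007 (the keytab path is hypothetical, for illustration only):

```yaml
# flink-conf.yaml -- kerberos settings carried over from stat1004;
# the principal must match the host the job is submitted from.
# Keytab path below is a guess, not the real deployment path.
security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: /etc/security/keytabs/analytics-search.keytab
security.kerberos.login.principal: analytics-search/stat1007.eqiad.wmnet@WIKIMEDIA
```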
[10:00:47] it's passed to flink but ignored, yes
[10:01:23] I think I removed it from the doc because I thought we had migrated to 1.14 in the meantime
[10:01:25] ok, cool
[10:02:06] I wish we made this testing a bit easier - can we just leave a Flink cluster running in yarn?
[10:02:22] it shouldn't consume much in the way of resources while jobless, I guess
[10:04:01] we could indeed, we need to find a common place to start it and run the job; flink uses files from /tmp to detect the target session cluster when submitting a job
[10:04:17] there might be better ways to force a particular session cluster tho
[10:04:30] I'd ask Andrew about the recommendations
[10:04:36] I mean, I will
[10:09:17] errand, I'll be back in 30mins
[10:52:35] zpapierski: you want to resume?
[10:57:55] ejoseph: I'm just out of my meeting with zpapierski. He has to take a break but will be back later.
[10:57:56] ejoseph: need to consume a meal that's called "meal 2" (some people call it a second breakfast)
[10:58:06] I'll ping once I'm stuffed
[10:58:22] ejoseph: If you have some free time, it might make sense to iterate on that refactoring kata
[10:58:50] I was looking into git
[10:58:54] But ok
[10:59:01] git is just as good!
[10:59:09] whatever works for you!
[10:59:24] We can take some time tomorrow to debrief the git stuff if you want
[11:08:22] lunch
[11:52:27] break
[12:17:53] dcausse, zpapierski: I'm working on T258834. I need a reliable way to find out all the possible keys the wikidata and commons json dumps can use. I looked through the code that extracts the json, but I couldn't find a nice schema. Is it documented anywhere?
[12:17:54] T258834: Create a Commons equivalent of the wikidata_entity table in the Data Lake - https://phabricator.wikimedia.org/T258834
[12:32:00] tanny411: I'm not very knowledgeable here, but is this really different from the wikidata json dump?
[12:32:53] late lunch
[12:33:01] as far as I could find out, it is mostly similar.
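[Editor's note] On the /tmp detection mentioned above: Flink's YARN CLI records the session it started in a small properties file (named `.yarn-properties-<user>` in the temp dir) and reads it back on the next `flink run` to find the target session cluster. A minimal sketch of parsing that file, assuming Flink 1.12's layout (the file name and the `applicationID` key are assumptions worth verifying against the deployed version):

```python
from pathlib import Path

def read_yarn_application_id(props_path):
    """Parse a Flink .yarn-properties-<user> file (simple key=value
    lines, '#' comments) and return the recorded YARN applicationID,
    or None if the key is absent."""
    for line in Path(props_path).read_text().splitlines():
        line = line.strip()
        if line.startswith("applicationID="):
            return line.split("=", 1)[1]
    return None

# Demo with a fabricated file; on a real host this would be
# /tmp/.yarn-properties-$USER written by yarn-session.sh.
import tempfile, os
sample = "#Generated YARN properties file\napplicationID=application_1630000000000_12345\n"
with tempfile.NamedTemporaryFile("w", suffix=".properties", delete=False) as f:
    f.write(sample)
print(read_yarn_application_id(f.name))  # application_1630000000000_12345
os.unlink(f.name)
```

A wrapper script could use this to pin job submission to a known, long-lived session cluster instead of relying on whatever was started last.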
commons just does not have aliases or sitelinks, and claims are called statements
[12:33:44] tanny411: I think the SD team (esp. Cormac Parle) might know more
[12:35:46] dcausse: which channel is he on?
[12:43:03] I'll ping him on slack.
[14:05:39] ejoseph: are you available?
[14:06:14] yes
[14:06:29] https://meet.google.com/ajv-zzdk-mre
[14:39:38] zpapierski: looking at https://phabricator.wikimedia.org/T280485#7275149 it seems that there is a unit discrepancy. I assume you meant that Commons is 2.8B triples. Is that correct?
[14:41:40] yeah, that's what I meant
[14:42:04] It is also not super clear whether our strategy is going to increase the number of Flink workers, and in the end whether that has an impact on the number of instances that we'll deploy.
[14:42:26] we assumed the same usage as with WDQS Flink
[14:42:45] My assumption, looking at your previous comments, is that we expect a ~20% increase in RAM + CPU + local disk storage
[14:43:09] same usage => we double the amount of resources that we'll consume on the k8s cluster?
[14:43:23] If that's the case, I don't think it is very clear in the ticket.
[14:43:24] ah, sorry - storage-wise, yes
[14:44:17] In the end, from the service ops perspective, what they need to know is our estimate in terms of the resources they provide: CPU, RAM, number of instances. Local storage is probably negligible.
[14:44:40] Also, in terms of deployment, is it going to be the same cluster with multiple jobs? Or a different cluster?
[14:44:40] and here is where we just went with the same as we had for WDQS
[14:45:11] "the same" meaning we reuse the same cluster and don't allocate any additional resources?
[14:45:14] we decided on a separate cluster - the WDQS streaming updater is known to sometimes kill the job manager, no need to make them dependent this way
[14:45:48] same, as in a duplicate of what we scheduled for the WDQS streaming updater
[14:46:20] I thought we'd be using the same session cluster but I'm fine with a separate one
[14:46:37] we'll have to adapt the naming if we run a second session cluster
[14:47:36] dcausse: we had a miscommunication then, I assumed a separate cluster - which option would you consider better?
[14:48:15] if occasional crashing (it does happen once a week) isn't an issue, we can go with a single cluster
[14:48:19] separate is certainly safer but requires renaming the charts
[14:49:31] let's have a short discussion on this in the meeting today, I'm not leaning any particular way
[14:49:32] separate is 1 more pod at 1.6G, cpu: 500m
[14:50:37] + 3 pods (2.1G ram, cpu: 1000m) that we will have to allocate regardless of the decision we take
[14:55:26] dcausse, zpapierski: I've added what I understand on T280485. Feel free to correct me or add additional context.
[14:55:26] T280485: Additional capacity on the k8s Flink cluster for WCQS updater - https://phabricator.wikimedia.org/T280485
[14:59:09] sorry for the confusion with the type, billion is one unit that doesn't translate neatly into Polish...
[14:59:23] s/type/typo
[14:59:33] I made a typo in the word "typo"...
[15:02:50] storage capacity is not really linear with the number of triples but with the number of entities; the density of triples per entity is probably very different between wikidata & commons
[15:03:09] but k8s-wise we're more than fine with the 10G temp space allowed per pod
[16:44:07] hello hello
[16:44:47] I have a question that maybe dcausse can answer (about elastic learning to rank logging)
[16:45:06] nuria_: hey!
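[Editor's note] Totalling the pod figures quoted above (a separate session cluster adds one jobmanager pod at 1.6G/500m on top of the 3 pods at 2.1G RAM / 1000m CPU that are needed either way), a quick back-of-the-envelope sketch:

```python
# Pod sizes quoted in the discussion above.
jobmanager = {"ram_gb": 1.6, "cpu_millicores": 500}    # extra pod only if separate cluster
taskmanagers = {"ram_gb": 2.1, "cpu_millicores": 1000, "count": 3}  # needed regardless

# Shared cluster: just the 3 extra worker pods.
shared_ram = taskmanagers["ram_gb"] * taskmanagers["count"]
shared_cpu = taskmanagers["cpu_millicores"] * taskmanagers["count"]

# Separate cluster: workers plus one more jobmanager pod.
separate_ram = shared_ram + jobmanager["ram_gb"]
separate_cpu = shared_cpu + jobmanager["cpu_millicores"]

print(f"shared cluster:   {shared_ram:.1f}G RAM, {shared_cpu}m CPU")
print(f"separate cluster: {separate_ram:.1f}G RAM, {separate_cpu}m CPU")
```

So the delta between the two options is only the jobmanager pod, which matches the "separate is 1 more pod" remark.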
[16:45:17] dcausse: hola hola
[16:45:50] dcausse: I am moving up from being in elastic kindergarten, and now at outschool.com we are thinking about learning to rank
[16:46:10] cool!
[16:46:31] dcausse: at wikipedia, did we ever log (using LTR) feature scores?
[16:47:23] nuria_: we do collect feature scores offline using the plugin
[16:47:25] dcausse: I think the plugin allows you to define features, and you can log for each feature its relevance score on a given query
[16:47:44] dcausse: ah, and is that stored in hadoop?
[16:48:00] it's returned in the elasticsearch response
[16:48:12] dcausse: to the user?
[16:48:49] dcausse: or do we harvest it from the response and log it elsewhere?
[16:49:08] we replay the queries offline from the hadoop cluster
[16:49:29] doing feature logging in realtime is possible but might be slow
[16:49:57] dcausse: oohh I see, the feature logging happens at a different time
[16:50:15] yes, for us we do it offline
[16:50:20] https://elasticsearch-learning-to-rank.readthedocs.io/en/latest/logging-features.html
[16:50:28] dcausse: so we talk from our hadoop job to elastic, get responses and store those in hadoop?
[16:50:34] yes
[16:51:09] dcausse: and when mappings for an index change we tweak the config for feature scoring
[16:52:30] doing feature logging offline allows you to explore new features
[16:52:50] doing feature logging in realtime, you only collect scores for your current model
[16:53:23] dcausse: nice, SUPER THANKS!!!!!!
[16:53:30] yw!
[17:00:14] dcausse: one more though: given logging is async, the data changes slightly when the queries are replayed - is that in any way a problem?
[17:01:02] nuria_: for us it's not, but it certainly depends on your features
[17:03:02] dcausse: thanks again
[18:32:46] dinner
[19:02:18] After discussing "guys" and "y'all" and "you guys" in the unmeeting, I said "see y'all guys later" at the end of the meeting. I swear it wasn't a joke... I guess I actually talk like that!
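[Editor's note] The offline replay discussed above amounts to re-sending each query with an `sltr` filter plus the plugin's `ltr_log` extension, following the readthedocs page linked in the conversation. A sketch of building that request body; the doc ids, featureset name, and `keywords` param are placeholders:

```python
import json

def build_ltr_logging_query(doc_ids, featureset, keywords):
    """Build an Elasticsearch request body that scores `featureset`
    against the given docs and logs per-feature scores via the
    learning-to-rank plugin's ltr_log extension. The sltr query sits
    in a bool filter so it does not affect ranking, only logging."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"terms": {"_id": doc_ids}},
                    {
                        "sltr": {
                            "_name": "logged_featureset",
                            "featureset": featureset,
                            "params": {"keywords": keywords},
                        }
                    },
                ]
            }
        },
        "ext": {
            "ltr_log": {
                "log_specs": {
                    "name": "log_entry1",
                    "named_query": "logged_featureset",
                }
            }
        },
    }

body = build_ltr_logging_query(["7555", "1370"], "more_movie_features", "rambo")
print(json.dumps(body, indent=2))
```

A replay job would POST this body to `<index>/_search` for each recorded query and read the per-feature scores back out of the `fields._ltrlog` entries in each hit.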
[19:33:03] Perfectly grammatical in the Southern United States
[19:34:30] > see y'all guys later
[19:34:33] * ryankemper explodes
[19:34:58] The regional dialects are fusing together
[20:40:18] the g'all..
[20:42:20] youse-all guyses
[20:46:55] don't forget yinz
[21:10:19] dcausse, ebernhardson: Carly is looking for someone to represent Search in a "data architecture meeting for the next iteration of image suggestions with the Search, SD, API Platform, Data Platform, and Growth teams", making sure we are aligned at a technical level. You are the 2 most likely candidates. Any preference? Both of you?
[21:20:37] gehel: I can, I suppose; couldn't hurt to have David, he's probably more familiar anyway :)