[06:58:32] addshore: o/ mind doing a review on changes for wdqs microsite for ebernhardson? you're subscribed to the ticket, but just in case
[06:58:32] - https://phabricator.wikimedia.org/T280247
[06:58:45] I added you to the reviews, probably way too many of them :)
[06:59:09] feel free to ignore/remove yourself from ones you don't care about
[08:19:15] dcausse: is streaming updater in yarn still running?
[08:19:26] (asking out of curiosity)
[08:19:27] zpapierski: no
[08:19:38] I'm planning to launch my own, for WCQS data
[08:20:59] do you have some code modifications already? if not, before you start I wanted to know if Erik would be interested in doing some flink
[08:22:43] no, nothing yet. I just wanted to test out config changes, I'm not really sure what code changes are required
[08:22:55] (at least for producer)
[08:23:51] it'll reject all events because we do some filtering on the page title (starts with Q, P or L)
[08:24:21] right, M-entities need to be added
[08:24:23] so I guess the pipeline will work but will do basically nothing
[08:24:40] M-entities are not visible in the events
[08:25:01] what do you mean?
[08:25:17] we'll get File:XYZ.png and from that we need to know if there is structured data attached to it
[08:25:31] I see
[08:25:44] still unclear how we will do this
[08:26:45] there is perhaps metadata attached to the event? overall I think this is the main difference, for wikidata we knew for sure that structured data was there but for commons we don't
[08:26:51] Erik's proposal last week (or a week before that) was for him to finish work on local integration environment, similar to what he set up for dags. I proposed we wait for you so that we can talk about stuff you already did for that
[08:27:22] (which reminds me that I forgot to add that to the wednesday meeting)
[08:28:35] wdyt?
[08:29:04] for the local env that'd be great, we need to see what it involves
[08:29:16] yarn has been my test env tbh
[08:29:52] flink provides some integration env but I agree this is not covering the full env
[08:30:17] ok, I've added that to today's meeting then
[08:30:38] here we need kafka with some events, swift, flink and the MW apis (or a way to mock them)
[08:30:44] I'll research that M entity data
[08:30:55] swift can probably be replaced as well
[08:31:36] I'd leave kafka, since its functionality is paramount to the process, but Swift is just a glorified FS in our case
[08:32:57] I think the "hard" part will be to mock the MW apis (hard as in tedious esp. to match API calls with events present in kafka)
[08:32:58] I'll research how to get the data for M-entities, but I guess I probably won't make any code changes yet
[08:33:06] sure
[08:33:07] yeah, I agree
[08:33:24] unless we the original ones
[08:33:31] all of this is read-only, after all
[08:33:49] I know original events straight from the source will be difficult
[08:34:00] but I simply used a kafkacat output in the past
[08:34:13] simply script feeding the local env with that would be nice
[08:34:20] s/simply/simple
[08:36:48] yes, we'd have to capture the output of the MW Special:EntityData because sadly it's not idempotent (regarding deleted entities)
[08:37:50] what happened for deleted entities? they're completely gone, even the old revs?
[08:38:50] they're not visible so the API returns a 404 even when asking for a particular revision
[08:41:16] this makes things like T279698 particularly difficult (impossible) to do right
[08:41:17] T279698: WDQS should retry when getting 404s - https://phabricator.wikimedia.org/T279698
[08:43:42] and will be even more difficult because I think Special:EntityData returns a 404 for a File that does not have structured data :)
[08:43:58] streaming updater too fast? :D
[08:44:17] addshore: no, kafka is too fast :)
[08:45:14] addshore: I wonder if we could vary some headers to Special:EntityData to give some hints: revision not found, page deleted or things like that
[08:45:35] that would help consumers to decide what to do I think
[08:45:55] that doesn't sound too unreasonable
[08:46:21] ok thanks! I might file a task, will ping you in there
[08:47:23] *looks at list of status codes*
[08:48:56] meh, no, probably nothing makes sense
[08:52:16] yes I once looked at 410 but we rejected the idea in T14345
[08:52:17] T14345: Server should return a 410 HTTP status code for deleted pages - https://phabricator.wikimedia.org/T14345
[09:07:01] in any case, given the relative rarity of deleted events, we'd most probably be fine without capturing MW Special:EntityData
[10:27:44] lunch
[10:42:47] break
[12:21:33] followup suggestion: https://github.com/nomoa/exo-train/commit/c2faebb976452a72e7877ea404386da3e3ce4833
[19:37:58] stepping out to run some errands
[19:38:38] mpham: great email on wdqs scaling! Thanks!
[19:39:31] no problem!