[08:58:01] pfischer_: fyi, filed T322186, might be a good next step
[08:58:02] T322186: Consume revision based changes from the mediawiki.page-state stream - https://phabricator.wikimedia.org/T322186
[08:59:28] this stream is still in its infancy, so it's a good time to provide feedback
[11:02:38] lunch
[12:14:46] lunch
[13:01:49] dcausse: ping me when you have a few minutes...
[13:31:00] gehel: I'm around
[13:34:29] dcausse: meet.google.com/bjq-uqix-uds
[14:32:10] inflatador: around for our 1:1?
[14:41:52] gehel: day off today, sorry for not declining
[14:42:14] inflatador: yes, I see that now. Sorry for the ping. Please stay off IRC!
[17:35:37] dinner
[19:26:37] well, I guess this works, mostly. 467.38M pages are within 20% of the stored incoming_links value, in a dataset of 494.91M pages. Will poke at the ones that fall outside that range just to make sure nothing crazy is happening
[19:26:50] (to be fair, more than half of the dataset has 1 incoming link :P)
[20:55:02] how fun... a tiny bit of rain (potentially unrelated) and now I'm regularly dropping packets :S
[22:02:25] curious... but whatever. The largest outliers for incoming_links in the _source docs are all basically wrong. Issuing fresh queries against elastic agrees with my batch computation, so for example https://en.wikipedia.org/wiki/Emergency_management has 17k in the _source doc, but batch and a fresh query both report 1.4k. No clue why it's not getting updated; it was edited a couple of days ago
[22:02:27] (after the dump was taken)
[22:03:18] still working out what thresholding would be appropriate; popularity_score has 188M pages in the latest dump, and 99M of those overlap with the incoming_links computation
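
(Editor's sketch of the spot check described in the 19:26 and 22:02 messages: compare the incoming_links count stored in a document's _source against a fresh count query, and flag values that drift more than 20%. This is a hedged illustration, not the actual batch job from the conversation; the cluster URL, index name, document id, and the use of an outgoing_link field holding linking page titles are all assumptions introduced here.)

import requests

ES_URL = "http://localhost:9200"   # hypothetical cluster endpoint
INDEX = "enwiki_content"           # hypothetical index name

def stored_incoming_links(doc_id: str) -> int:
    """Read the incoming_links count persisted in the document's _source."""
    resp = requests.get(f"{ES_URL}/{INDEX}/_doc/{doc_id}")
    resp.raise_for_status()
    return resp.json()["_source"].get("incoming_links", 0)

def fresh_incoming_links(title: str) -> int:
    """Recount incoming links with a fresh count query (assumes an
    outgoing_link field listing the titles each page links to)."""
    query = {"query": {"term": {"outgoing_link": title}}}
    resp = requests.post(f"{ES_URL}/{INDEX}/_count", json=query)
    resp.raise_for_status()
    return resp.json()["count"]

def within_tolerance(stored: int, fresh: int, tolerance: float = 0.20) -> bool:
    """True if the stored count is within 20% of the fresh count."""
    if fresh == 0:
        return stored == 0
    return abs(stored - fresh) / fresh <= tolerance

if __name__ == "__main__":
    doc_id, title = "12345", "Emergency management"  # hypothetical example
    stored = stored_incoming_links(doc_id)
    fresh = fresh_incoming_links(title)
    print(doc_id, title, stored, fresh, within_tolerance(stored, fresh))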