[00:44:54] 10Analytics, 10Event-Platform, 10Internet-Archive, 10The-Wikipedia-Library: Page-links-change stream doesn't capture duplicated links - https://phabricator.wikimedia.org/T216492 (10Krinkle) Instrumentation for the `mediawiki.page-links-change` event lives in the EventBus extension, not WikimediaEvents.
[00:46:46] 10Analytics, 10Event-Platform, 10Internet-Archive, 10The-Wikipedia-Library, and 2 others: page-links-change stream is assigning template propagation events to the wrong edits - https://phabricator.wikimedia.org/T216504 (10Krinkle) See also: {T216492}. Instrumentation for the `mediawiki.page-links-change`...
[00:46:48] 10Analytics, 10Event-Platform, 10Internet-Archive, 10The-Wikipedia-Library: Store page-links-change data in a database table and make available through a Special page - https://phabricator.wikimedia.org/T221397 (10Krinkle) Instrumentation for the `mediawiki.page-links-change` event lives in the EventBus ex...
[00:48:32] wow, nice stuff addshore, with spark?
[00:57:40] 10Analytics-Radar, 10Event-Platform, 10Growth-Team, 10Growth-Team-Filtering, 10The-Wikipedia-Library: Edits to Flow pages result in a page-links-change event with no performer - https://phabricator.wikimedia.org/T216726 (10Krinkle) Instrumentation for the mediawiki.page-links-change event lives in the Ev...
[07:28:24] yup
[12:49:06] * addshore now investigates the easiest way to get a JSON file, munge it into some form, and load it into a table in Hadoop
[12:49:15] I'm guessing the answer, as always, is spark / pyspark :P
[12:49:33] Hey addshore - I apologize for yesterday - I completely missed pinging you back :S
[12:50:39] addshore: spark can read JSON natively if it's correctly formatted (1 object per line)
[12:51:14] joal: no problem! It ended up being a bug in Hive that I was hitting xD
[12:51:26] all switched over to spark for that bit, and it's working now!
[12:51:27] But the JSON format needs to be table-oriented, otherwise you end up with a table that's not really usable
[12:51:43] right, so I could do the munging as a step before, using jq or something
[12:51:45] Yeah, spark should be your go-to solution for working on the cluster :)
[12:52:10] got a pointer to loading JSON lines directly into a table, by chance?
[12:52:18] addshore: Depends on the munging you're willing to do - spark can help
[12:52:48] addshore: you'll need pyspark - and then 'spark.read.json("PATH")'
[12:52:58] I'm just about to do a live 1-hour thing at WikidataCon :/
[12:53:06] otherwise I'd put some examples here now :D
[12:53:11] to be more precise: "df = spark.read.json("PATH")"
[12:53:19] ehehe
[12:53:29] addshore: we can spend a minute on a meet if you wish
[12:53:30] aaah right, so does `spark.read.json` already need the input to be in JSON-lines format?
[12:53:36] maybe that's where I got stuck!
[12:53:40] correct addshore
[12:53:48] I was trying to load things like this https://usercontent.irccloud-cdn.com/file/FvIWcEPp/image.png
[12:53:50] when using that, spark expects 1 object per line
[12:53:56] cool, okay, so that's the next thing for me to fix in an hour :)
[12:54:06] :)
[12:55:15] Interesting addshore: spark.read.json("PATH")
[12:55:17] oops
[12:55:22] https://sparkbyexamples.com/spark/spark-read-json-from-multiline
[12:55:25] addshore: --^
[12:55:33] Maybe it can help
[12:55:33] will read that :) thanks!
[12:58:52] Enjoy WikidataCon addshore :)
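A minimal pyspark sketch of the pattern joal describes, assuming a file that is already in JSON-lines format (one object per line); the path is a hypothetical placeholder:

    # Inside a pyspark shell, `spark` is the prebuilt SparkSession.
    # Each line of the input must be one complete JSON object.
    df = spark.read.json("/user/addshore/input.jsonl")  # hypothetical path
    df.printSchema()  # Spark infers the schema from the objects
    df.show(5)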
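For pretty-printed JSON like the screenshot addshore was trying to load, the page joal links covers Spark's `multiLine` reader option; a sketch under the same hypothetical-path assumption:

    # multiLine lets Spark parse records that span several lines,
    # e.g. an indented object or a top-level JSON array.
    df = spark.read.option("multiLine", "true").json("/user/addshore/input.json")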
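If the pre-munging step addshore mentions is preferred instead, a pretty-printed JSON array can be flattened to JSON lines in plain Python (filenames hypothetical); the jq equivalent would be `jq -c '.[]' input.json > input.jsonl`:

    import json

    # Load the whole pretty-printed array, then write one compact
    # object per line so spark.read.json can parse it line by line.
    with open("input.json") as f:
        records = json.load(f)

    with open("input.jsonl", "w") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")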
[14:00:37] ohia MichaelG_WMDE =o
[14:00:53] 👋
[14:01:01] not on Mattermost? :P
[14:01:06] I have something to show you
[14:01:25] but I can also send it in PM here!
[14:01:27] * MichaelG_WMDE opens work profile of Firefox
[14:19:00] joal: In case you're interested, this is the transformation I'm going for now :) https://usercontent.irccloud-cdn.com/file/01wEdE1B/image.png
[14:21:20] Then in CSV format maybe I could even use LOAD DATA
[14:23:57] and then I realize it's not even JSON, it's missing quotes, hahaha
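A sketch of the CSV + LOAD DATA idea, with hypothetical paths and table names; the target Hive table would have to exist already and be declared as comma-delimited text for the files to parse, so `df.write.saveAsTable` is usually the simpler route from Spark:

    # `df` is the transformed DataFrame from the earlier reads.
    # Write it out as CSV, then hand the files over to Hive.
    df.write.mode("overwrite").csv("/tmp/addshore_munged")  # hypothetical path

    # Works when the SparkSession was built with Hive support.
    spark.sql("""
        LOAD DATA INPATH '/tmp/addshore_munged'
        OVERWRITE INTO TABLE addshore.munged_links
    """)

    # Or skip the CSV round trip entirely:
    # df.write.mode("overwrite").saveAsTable("addshore.munged_links")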