[00:44:54] 10Analytics, 10Event-Platform, 10Internet-Archive, 10The-Wikipedia-Library: Page-links-change stream doesn't capture duplicated links - https://phabricator.wikimedia.org/T216492 (10Krinkle) Instrumentation for the `mediawiki.page-links-change` event lives in the EventBus extension, not WikimediaEvents.
[00:46:46] 10Analytics, 10Event-Platform, 10Internet-Archive, 10The-Wikipedia-Library, and 2 others: page-links-change stream is assigning template propagation events to the wrong edits - https://phabricator.wikimedia.org/T216504 (10Krinkle) See also: {T216492}. Instrumentation for the `mediawiki.page-links-change`...
[00:46:48] 10Analytics, 10Event-Platform, 10Internet-Archive, 10The-Wikipedia-Library: Store page-links-change data in a database table and make available through a Special page - https://phabricator.wikimedia.org/T221397 (10Krinkle) Instrumentation for the `mediawiki.page-links-change` event lives in the EventBus ex...
[00:48:32] wow, nice stuff addshore, with spark?
[00:57:40] 10Analytics-Radar, 10Event-Platform, 10Growth-Team, 10Growth-Team-Filtering, 10The-Wikipedia-Library: Edits to Flow pages result in a page-links-change event with no performer - https://phabricator.wikimedia.org/T216726 (10Krinkle) Instrumentation for the mediawiki.page-links-change event lives in the Ev...
[07:28:24] yup
[12:49:06] * addshore now investigates the easiest way to get a JSON file, munge it into some form, and load it into a table in Hadoop
[12:49:15] I'm guessing the answer, as always, is spark / pyspark :P
[12:49:33] Hey addshore - I apologize for yesterday - I completely missed pinging you back :S
[12:50:39] addshore: spark can read JSON natively if it's correctly formatted (1 object per line)
[12:51:14] joal: no problem! It ended up being a bug in Hive that I was hitting xD
[12:51:26] all switched over to spark for that bit, and it's working now!
[12:51:27] But the JSON format needs to be table-oriented, otherwise you end up with a table that's not really usable
[12:51:43] right, so I could do the munging as a step before, using jq or something
[12:51:45] Yeah, spark should be your go-to solution for working on the cluster :)
[12:52:10] got a pointer to loading JSON lines directly into a table, by chance?
[12:52:18] addshore: Depends on the munging you're willing to do - spark can help
[12:52:48] addshore: you'll need pyspark - and then 'spark.read.json("PATH")'
[12:52:58] I'm just about to do a live 1-hour thing at WikidataCon :/
[12:53:06] otherwise I'd put some examples here now :D
[12:53:11] to be more precise: "df = spark.read.json("PATH")"
[12:53:19] ehehe
[12:53:29] addshore: we can spend a minute on a meet if you wish
[12:53:30] aaah right, so does `spark.read.json` already need the input to be in JSON-lines format?
[12:53:36] maybe that's where I got stuck!
[12:53:40] correct addshore
[12:53:48] I was trying to load things like this https://usercontent.irccloud-cdn.com/file/FvIWcEPp/image.png
[12:53:50] when using that, spark expects 1 object per line
[12:53:56] cool, okay, so that's the next thing for me to fix in an hour :)
[12:54:06] :)
[12:55:15] Interesting addshore: spark.read.json("PATH")
[12:55:17] oops
[12:55:22] https://sparkbyexamples.com/spark/spark-read-json-from-multiline
[12:55:25] addshore: --^
[12:55:33] Maybe it can help
[12:55:33] will read that :) thanks!
[12:58:52] Enjoy WikidataCon addshore :)
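A minimal pyspark sketch of the pattern joal describes, assuming a file that is already in JSON-lines format (one object per line); the path is a hypothetical placeholder:

    # Inside a pyspark shell, `spark` is the prebuilt SparkSession.
    # Each line of the input must be one complete JSON object.
    df = spark.read.json("/user/addshore/input.jsonl")  # hypothetical path
    df.printSchema()  # Spark infers the schema from the objects
    df.show(5)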
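For pretty-printed JSON like the screenshot addshore was trying to load, the page joal links covers Spark's `multiLine` reader option; a sketch under the same hypothetical-path assumption:

    # multiLine lets Spark parse records that span several lines,
    # e.g. an indented object or a top-level JSON array.
    df = spark.read.option("multiLine", "true").json("/user/addshore/input.json")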
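If the pre-munging step addshore mentions is preferred instead, a pretty-printed JSON array can be flattened to JSON lines in plain Python (filenames hypothetical); the jq equivalent would be `jq -c '.[]' input.json > input.jsonl`:

    import json

    # Load the whole pretty-printed array, then write one compact
    # object per line so spark.read.json can parse it line by line.
    with open("input.json") as f:
        records = json.load(f)

    with open("input.jsonl", "w") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")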
[14:00:37] ohia MichaelG_WMDE =o
[14:00:53] 👋
[14:01:01] not on Mattermost? :P
[14:01:06] I have something to show you
[14:01:25] but I can also send it in PM here!
[14:01:27] * MichaelG_WMDE opens work profile of Firefox
[14:19:00] joal: In case you're interested, this is the transformation I'm going for now :) https://usercontent.irccloud-cdn.com/file/01wEdE1B/image.png
[14:21:20] Then in CSV format maybe I could even use LOAD DATA
[14:23:57] and then I realize it's not even JSON, it's missing quotes, hahaha
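A sketch of the CSV + LOAD DATA idea, with hypothetical paths and table names; the target Hive table would have to exist already and be declared as comma-delimited text for the files to parse, so `df.write.saveAsTable` is usually the simpler route from Spark:

    # `df` is the transformed DataFrame from the earlier reads.
    # Write it out as CSV, then hand the files over to Hive.
    df.write.mode("overwrite").csv("/tmp/addshore_munged")  # hypothetical path

    # Works when the SparkSession was built with Hive support.
    spark.sql("""
        LOAD DATA INPATH '/tmp/addshore_munged'
        OVERWRITE INTO TABLE addshore.munged_links
    """)

    # Or skip the CSV round trip entirely:
    # df.write.mode("overwrite").saveAsTable("addshore.munged_links")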