[08:49:21] errand + lunch
[09:49:27] Lunch
[10:39:48] https://usercontent.irccloud-cdn.com/file/Svz3MU2U/image.png
[12:44:39] btullis: thanks for the heads-up, I was out and missed your ping, I quickly checked the search airflow instance and it seems to run well
[12:48:50] dcausse: would you have some time for a hand-over meeting for the search update pipeline today?
[12:49:08] pfischer: sure
[12:53:56] dcausse: Thanks for the feedback. All of the airflow-scheduler processes crashed on all instances, so I had to restart them, but hopefully no jobs were impacted. I haven't heard anything back yet and I'll look into whether the scheduler could be configured to reconnect automatically.
[12:54:42] ouch, thanks for taking of this!
[12:54:53] *taking care
[13:03:13] til, if you do df1.join(df2, ['join_col1', 'join_col2']).select("df2.*") the join columns are not selected with "df2.*" :/
[14:01:54] inflatador, ryankemper: I'll skip the SRE pairing session today, conflicting ERC meeting
[14:35:51] \o
[14:42:44] o/
[14:42:48] quick errand
[15:14:17] on a first run only 6k wikitext pages are inconsistent (cirrus vs mysql) across all the wikis, seems low
[15:14:34] wow, indeed that's much closer than expected
[15:14:56] hm.. but 24k wikibase-lexeme, seems relatively high
[15:16:25] given how low the numbers are I wonder if a simple report global to all wikis might be better than a fine-grained per-wiki report
[15:17:32] hmm, could we have a global report and perhaps break out any outliers?
[15:19:27] sure I'll see what I can do, something like: if > XX% of the total number of pages
[15:19:41] yea, or if it has a rate 10x the mean wiki or something like that
[15:19:54] sure
[15:20:26] checking one wikitext example I see: https://hr.wikipedia.org/w/index.php?curid=624643 so I guess cirrus is not really to blame
[15:24:12] yea, i'm sure there are some broken pages.
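The two outlier heuristics floated above (flag a wiki if it exceeds some fraction of its total pages, or is far above the mean across wikis) could be sketched roughly like this; the function name, thresholds, and sample numbers are all invented for illustration:

```python
def find_outlier_wikis(inconsistent, total_pages,
                       rate_threshold=0.01, mean_factor=10):
    """Hypothetical sketch of the outlier heuristics discussed above.

    inconsistent / total_pages: dicts of wiki -> page counts.
    Flags a wiki when its inconsistent-page count is more than
    mean_factor x the mean across wikis, or when its inconsistency
    rate exceeds rate_threshold of its own pages.
    """
    mean = sum(inconsistent.values()) / len(inconsistent)
    outliers = {}
    for wiki, count in inconsistent.items():
        rate = count / total_pages[wiki]
        if count > mean_factor * mean or rate > rate_threshold:
            outliers[wiki] = (count, rate)
    return outliers

# Made-up numbers: only "testwiki" crosses the rate threshold (~6.7%).
counts = {"hrwiki": 50, "enwiki": 120, "testwiki": 4000}
totals = {"hrwiki": 250_000, "enwiki": 6_000_000, "testwiki": 60_000}
print(find_outlier_wikis(counts, totals))
```

The global counts could still go to prometheus as discussed later; a helper like this would only decide which wikis deserve a per-wiki breakdown.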
We had pondered before if we should have some way to recognize and ignore broken pages, still not sure
[15:24:30] at least some way to flag them differently in the stats would be nice though
[15:26:04] yes... will have to think about how to do that without calling mediawiki
[15:26:49] if we are already maintaining flink state, we could probably keep a sequential-failures counter per (wiki, page_id)
[15:29:16] ah via flink, I guess since we will report unrecoverable errors to a side-output we could try to do some analysis on this output
[15:29:49] ideally proper error identification would be great, to not mix up transient network/mw failures
[15:29:54] yea that might work too
[15:46:33] I wonder how to report outliers... global counts per content_model would make sense as prometheus timeseries but outliers (offending wikis) I'm not sure... a simple sheet exported somewhere might be better but not sure how to do that
[15:51:13] hmm, not sure either... will have to ponder
[17:36:01] dinner
[18:08:17] always mysteries... the IT is failing with the Kryo deserializer trying to put something in an UnmodifiableMap :( somehow serialization of the pyspark things always seemed mostly magic :P
[18:27:03] doesn't make sense... pausing the debugger shows it's the lat/lon fields of a coordinate in this UnmodifiableMap, meaning it came from the cirrus doc and was deserialized by flink's output schema
[19:32:41] if something is interpreted as a "map" from the json schema it might get a Map<> object, otherwise it should be a Row object
[19:33:18] coordinates.coord only has additionalProperties: type: number, so I suspect it's interpreted as a Map
[19:39:47] it's org.wikimedia.eventutilities.flink.formats.json.JsonRowDeserializationSchema#createMapConverter using unmodifiableMap but I think RowTypeInfo should take care of this, I suspect that the proper TypeInformation is not being passed and kryo is taking over
[19:40:53] dcausse: ahh, ok i didn't realize kryo is only the fallback.
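The sequential-failures counter mentioned above could look roughly like this toy model; in the real pipeline the counts would live in Flink keyed state rather than a plain dict, and the class name and threshold are hypothetical:

```python
class FailureTracker:
    """Toy model of a per-(wiki, page_id) consecutive-failure counter.

    In the actual job this would be Flink keyed state; a dict stands in
    for it here. A page is only flagged as probably broken after several
    failures in a row, so one-off transient network/mw errors don't count.
    """

    def __init__(self, broken_threshold=3):
        self.broken_threshold = broken_threshold
        self._counts = {}  # (wiki, page_id) -> consecutive failures

    def record_failure(self, wiki, page_id):
        key = (wiki, page_id)
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]

    def record_success(self, wiki, page_id):
        # Any success resets the streak for that page.
        self._counts.pop((wiki, page_id), None)

    def is_probably_broken(self, wiki, page_id):
        return self._counts.get((wiki, page_id), 0) >= self.broken_threshold

tracker = FailureTracker(broken_threshold=3)
for _ in range(3):
    tracker.record_failure("hrwiki", 624643)
print(tracker.is_probably_broken("hrwiki", 624643))
```

The same per-key counting could equally be done offline over the unrecoverable-errors side-output, as suggested just after.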
For the moment i registered a deserializer (flink already has an UnmodifiedCollectionDeserializer, it's just not registered) but will see where i missed passing the type info
[19:42:00] sure, if we have to rely on kryo it probably means we missed something, either passing a TypeInfo while building the streamGraph or something not working as expected in eventutilities-flink
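The Map-vs-Row distinction discussed above comes down to whether the JSON (sub)schema has fixed field names: an object with only `additionalProperties` has arbitrary keys of one value type, so it can only be decoded as a generic map, while one with named `properties` can become a fixed-arity Row. A toy sketch of that decision, purely illustrative and not the actual eventutilities-flink logic:

```python
def interpret(schema):
    """Toy version of how a JSON-schema node maps to a decoded type.

    Mimics the general idea only: named properties -> row-like type,
    additionalProperties -> map type. Not the real implementation.
    """
    if schema.get("type") != "object":
        return schema.get("type")
    if "properties" in schema:
        # Known field names -> fixed-arity row, one slot per field.
        return ("row", tuple(sorted(schema["properties"])))
    if "additionalProperties" in schema:
        # Arbitrary keys sharing one value type -> map.
        return ("map", schema["additionalProperties"].get("type"))
    return ("map", None)

# Shaped like the coordinates.coord case mentioned above:
coord_schema = {"type": "object", "additionalProperties": {"type": "number"}}
print(interpret(coord_schema))  # -> ('map', 'number')
```

Under this reading, giving `coord` explicit `lat`/`lon` properties in the schema would be what lets it come out as a Row instead of an (unmodifiable) Map.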