[14:12:02] \o
[14:44:38] .o/
[15:29:56] hey all.. I'm planning to attend the Wikipedia 25 meeting today instead of the Wednesday meeting, especially since our new CEO should make an appearance.
[15:32:45] kk
[15:38:00] still pondering how the profiles/contexts should work for semantic... not entirely sure. The problem is that the context (full text vs semantic) selects the default profile, but the profiles themselves don't know or care about context, so anyone can just use the semantic profile in a non-semantic context
[15:50:35] Trey314159: Sure, I’ll be there too.
[15:52:44] ebernhardson: Do we also use profiles across prefix *and* full text, or have we not encountered a need for context yet?
[15:53:59] pfischer: No, different entry points have different ways of doing things. But in the case of semantic there is nothing special about semantic, it's a typical full text query builder / profile, uses all the default code. So the profile is a generic FT_QUERY_BUILDER profile, but the default gets selected based on context
[15:54:35] but that still allows anyone to just request (via api params) the semantic profile in a non-semantic context.
[15:54:43] or vice versa
[15:55:19] As long as we control the API we could perform some minimal validation.
[15:55:26] basically there is a dispatcher, the dispatcher looks at things and decides it's CONTEXT_SEMANTIC, we then use the default FT_QUERY_BUILDER profile assigned to semantic
[15:56:13] the validation exists, but what it does is provide the list of known FT_QUERY_BUILDER profiles, and only allows users to provide those
[15:57:39] i was pondering making a SEMANTIC_QUERY_BUILDER profile that we flip between, but it would be spread out a bit between files and feels hacky.
[15:58:03] and then no clue how the default fulltext apis would handle that, they have to provide allowed params to the api layer and probably don't know the context
[15:58:23] Okay. I guess that’s the price to pay for the flexibility of the Action API. There are tons of parameters that can be combined in wild ways. So as long as we document it, I would be fine. I don’t see the need for a hard bounce in case of a non-working combination.
[15:58:55] i guess my worry is we can support 5-10 qps, we need those for the mobile app, but if the api just accepts any profile we have no control
[15:59:23] right now mobile would be behind an "undocumented" debug param, someone could still see and use it but at least the api isn't announcing it
[15:59:33] see it in the code/patch history i mean
[16:03:36] Understood. But unless we introduce some shared secret we cannot protect against requests from sources other than the apps.
[16:04:16] of course, they will be able to trigger it with the debug param, it's all open source, but i guess i feel like that's different than the api announcing it has semantic search capabilities.
[16:04:43] maybe it doesn't matter, can ignore
[16:06:31] Well, let’s ponder a bit longer, I think it’s a legit worry if we expose such an expensive API
[16:09:46] i suppose also the semantic profile is not entirely mixable, right now i have the dispatcher for semantic rejecting any custom profile requests, and forcing the rescore profile to empty (neural queries don't support rescore). I guess i should check what happens when it tries to mix them
[16:30:02] So let’s see if I get this: CONTEXT_DEFAULT would use the `default` profile for `FT_QUERY_BUILDER` *and* that can be overridden using a `cirrusFTQBProfile` (which is where users could specify any profile). Would a dedicated CONTEXT_SEMANTIC allow us to change what can be overridden via URI?
[16:30:28] pfischer: CONTEXT_SEMANTIC only really changes the defaults, it doesn't impose any limits
[16:31:17] perhaps david will have ideas though, i've never fully grok'd the dispatch/profile/etc. bits
[16:31:48] i suppose the only limit imposed by CONTEXT_SEMANTIC is what i put inside the dispatch route impl around when it will be selected
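To illustrate the dispatch behavior discussed above, a minimal sketch in Python (this is not CirrusSearch's actual implementation; the profile tables and the resolve_profile function are hypothetical): the context only selects the *default* FT_QUERY_BUILDER profile, while the override validation merely checks that a requested profile name is known, so any known profile can be requested in any context.

# Hypothetical sketch of the dispatch described above -- not CirrusSearch's
# actual code. Contexts only pick a *default* profile; the override check
# only verifies the requested profile name is registered.

# All registered FT_QUERY_BUILDER profiles (names are made up).
FT_QUERY_BUILDER_PROFILES = {
    "default": {"builder": "classic"},
    "semantic": {"builder": "neural", "rescore": None},
}

# Per-context defaults: CONTEXT_SEMANTIC only changes which default is used.
CONTEXT_DEFAULTS = {
    "CONTEXT_DEFAULT": "default",
    "CONTEXT_SEMANTIC": "semantic",
}

def resolve_profile(context, requested=None):
    """Pick the FT_QUERY_BUILDER profile for a request.

    `requested` models an API override such as cirrusFTQBProfile: validation
    only checks that the name exists, so nothing stops a caller from asking
    for the semantic profile in CONTEXT_DEFAULT (or vice versa).
    """
    if requested is not None:
        if requested not in FT_QUERY_BUILDER_PROFILES:
            raise ValueError("unknown profile: " + requested)
        return FT_QUERY_BUILDER_PROFILES[requested]
    return FT_QUERY_BUILDER_PROFILES[CONTEXT_DEFAULTS[context]]

# The mismatch the chat worries about: an expensive profile reachable from
# any context, because the context never constrains the override.
print(resolve_profile("CONTEXT_DEFAULT", requested="semantic"))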
[19:00:05] taking a break from semantic to look at glent, always fun to ponder the way through spark code :P
[19:02:31] initial suspect is that the NPE is because something that collected partitions returned an empty list, which would imply missing data somewhere
[19:36:07] * ebernhardson somehow has an old python notebook that does all the reflection needed to run glent from a notebook
[19:43:57] i guess i should clean that up a bit and stuff it in the glent repo as a debugging tool
[19:51:39] * ebernhardson finds it slightly annoying that the easiest way to debug spark in java is to re-create it in python, and invoke java where you must... there is probably a better way :P
[19:53:58] at least this doesn't look too bad, all of the data is being thrown away at the step: df.where(col("ts").geq(earliestLegalTs.getEpochSecond()))
[19:54:13] because in the data... ts is all NaN
[20:10:46] searching suggests it's a spark 3.x upgrade issue? In the old spark it would have thrown away milliseconds, but the new timestamp parser is supposed to be stricter
[20:11:12] i suppose the bigger failure is that all the data became null, but nothing noticed because spark is designed for "messy" data and doesn't fail fast
[20:15:38] then since glent unions in the old data and does a date filter, it took months for the dataset to go empty and start failing the pipeline
[20:17:54] (at which point, i can no longer verify if the problem was that the timestamps added milliseconds, or that the old spark accepted milliseconds and threw them away)
[20:43:02] i suppose the best guess would be that the test suite has timestamps without milliseconds, and maybe they were added at some point?
[21:05:06] * ebernhardson sighs at `release:prepare release:perform` using https and asking for a username/password... when the git repo is configured with ssh. But i was the last person to adjust the glent release process, so i know who to blame :P
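A minimal, standalone PySpark sketch of the Spark 3.x parsing change suspected at 20:10:46 (the sample data, column name, and pattern below are illustrative, not glent's actual code): the legacy SimpleDateFormat-based parser silently truncates a trailing fraction, while the stricter Spark 3 parser rejects it, so parsing yields null and a downstream ts >= earliestLegalTs filter drops every row.

# Reproduction sketch of the suspected Spark 3.x timestamp-parsing change.
# Assumption: timestamps are parsed with a seconds-only pattern while the
# incoming data carries milliseconds.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, unix_timestamp

spark = SparkSession.builder.master("local[1]").appName("ts-parse-demo").getOrCreate()

# Input carries milliseconds, but the parse pattern only covers seconds.
df = spark.createDataFrame([("2024-06-01 12:00:00.123",)], ["raw_ts"])
pattern = "yyyy-MM-dd HH:mm:ss"

# Legacy (Spark 2.x / SimpleDateFormat) behavior: the trailing ".123" is
# silently ignored and the row parses to a valid epoch value.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df.select(unix_timestamp(col("raw_ts"), pattern).alias("ts")).show()

# New (Spark 3.x / DateTimeFormatter) behavior: the stricter parser rejects
# the trailing fraction; with policy CORRECTED the result is null, so a later
# filter like df.where(col("ts").geq(earliest_ts)) keeps nothing.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")
df.select(unix_timestamp(col("raw_ts"), pattern).alias("ts")).show()

spark.stop()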