[00:54:56] I was wondering, not urgent, what constitutes a "user" in the context of this patch? (cc ebernhardson) https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/709781
[03:02:27] Krinkle: md5(ip + UA)
[03:04:18] ebernhardson: I see. from the internal perspective, so a user here would be a particular non-MW service calling Elastic, and its server IP / service UA?
[03:04:38] or is this passed on from the external web request all the way through?
[03:04:49] (honours xff?)
[03:05:36] Krinkle: in terms of impl, php land hashes together the ip+ua to get a weak user fingerprint, the only thing provided to elastic is the resulting hash. A user here is generally intended to be an end user on the public internet making requests
[03:05:57] Krinkle: not sure if it uses xff, this is an old identity hash we've used for ~5 years, sec
[03:06:19] i guess it's md5(ip, XFF, UA)
[03:06:22] ah this is something custom to CirrusSearch
[03:06:31] I looked up generateIdentToken now, I thought it came from the Elastic lib
[03:06:50] yea, elastic provides the "use some value as the seed" part, and this is generating a per-user seed
[03:07:24] It calls WebRequest::getIP, which resolves XFF already, so adding XFF to the hash is probably not needed.
[03:08:17] oh cool, can drop that part then yea
[03:08:56] origin https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/226466
[03:09:06] doesn't look like it mentions a special reason for including xff
[03:09:32] no big deal either way, but just looking for "unusual"/possibly-intended patterns and this stood out just now
[03:09:39] unintended*
[03:10:26] yea makes sense, i don't remember why in particular that was included, i imagine mw core has been resolving xff for some time
[03:10:34] indeed
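A minimal Python sketch of the fingerprint discussed above, for illustration only. The real implementation is CirrusSearch's generateIdentToken() in PHP, and the exact concatenation and encoding are assumptions here; the log confirms only that the inputs are IP, XFF, and UA, and that Elastic receives nothing but the resulting hash (usable as a random_score seed).

```python
import hashlib

def generate_ident_token(ip: str, user_agent: str, xff: str = "") -> str:
    # Weak per-user fingerprint: md5 over IP + XFF + User-Agent.
    # If `ip` already comes from WebRequest::getIP(), XFF has been
    # resolved into it, so the separate xff component is redundant
    # (the point raised at 03:07:24).
    return hashlib.md5(f"{ip}{xff}{user_agent}".encode("utf-8")).hexdigest()

# Elastic never sees the raw IP/UA, only the hash, which can serve as
# the seed of a random_score function so per-user ordering stays stable:
seed = generate_ident_token("203.0.113.7", "Mozilla/5.0")
random_score_clause = {"random_score": {"seed": seed, "field": "_seq_no"}}
```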
[09:42:35] Hi, there seems to be some old dump in '/wmf/data/discovery/wikidata/rdf/' from February. These don't conform to the recent data (which have a partition called wiki='wikidata' or wiki='commons'). Wanted to point it out, in case these need cleaning or something.
[09:42:54] * tanny411 had sent the msg to the analytics channel by mistake
[13:46:04] o/ Does anyone have any pointers on the mechanism for indexing new entities on wikidata as fast as possible, and how it differs from the regular index process?
[13:46:20] and is this specific to wikidata? or would it also happen in wikibase?
[13:46:32] and where does the mechanism actually live?
[14:06:24] o/
[14:06:46] can't help you here, but the west coast is waking up soon, ebernhardson might be able to provide insight
[14:23:01] Though it would still be nice to know, we found our issue now so we're not blocked on figuring this out :)
[14:23:50] cool, I love helping without doing anything!
[14:27:48] we still didn't reorganize our board for new priorities
[14:28:21] mpham: I'm going to assume priorities as discussed during kick-off for WCQS. Which means I'm going to start working on authorization for the service
[14:30:29] zpapierski: sounds good to me! thanks
[14:58:46] addshore: it's called instant index, the short of it is that a minimal version of the document is sent in the web request process that created the page
[17:44:01] addshore: actually, this was removed in Ide5c74eb92cd4. It was initially added as a hack to help the use case of create property -> use property seconds later. But iirc this was replaced with a method using sql that guaranteed the user finds the property, instead of this which is only an attempt
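A rough Python sketch of the (since-removed) "instant index" idea as described at 14:58:46: index a bare-bones document from within the web request that created the page, so the entity is findable seconds later. Everything here — the helper name, index name, and field set — is hypothetical; the real implementation was CirrusSearch PHP code, removed in Ide5c74eb92cd4.

```python
from elasticsearch import Elasticsearch  # official client, 8.x API assumed

es = Elasticsearch("http://localhost:9200")

def instant_index(page_id: int, title: str, namespace: int) -> None:
    # Hypothetical: push a minimal document straight from the request
    # that created the page. The full document is still built and
    # indexed later by the regular job-queue process; this was only a
    # best-effort attempt, later replaced by an SQL-based lookup that
    # guarantees the user finds the new property.
    es.index(
        index="wikidatawiki_content",  # hypothetical index name
        id=str(page_id),
        document={"title": title, "namespace": namespace},
    )
```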
[18:43:03] hmm, not clear what to do with search/highlighter. It seems david updated it to elastic 7.5.1 in feb 2020, but we need 6.8.18 first. hmm....
[20:15:32] * ebernhardson will probably create a side-branch or something i guess
[20:16:57] ebernhardson: Do you know the answer to Morten's question? "In the task description, "Search Engagement" is defined as a dwell time of over 10 seconds. The first ping event on a page happens at 10 seconds. I've taken this to mean that at least two pings are needed, meaning 20 seconds, for a session to count towards search engagement. I'm wondering if I've misinterpreted that and we should change it? The way the data is stored makes changing that easy, so it's not a significant change." T279105#7263573
[20:16:58] T279105: Create/revive Search Platform team metrics dashboard - https://phabricator.wikimedia.org/T279105
[20:17:29] mpham: hmm, i can probably find a reference from when it was previously defined. few mins
[20:25:49] mpham: hmm, some analysis/background here but not a reference for that particular q https://meta.wikimedia.org/wiki/Research:Measuring_User_Search_Satisfaction#Survival_analysis_approach
[20:25:56] have another idea though.. looking
[20:27:37] mpham: i think that limit would have been implemented from this line: https://github.com/wikimedia/wikimedia-discovery-golden/blob/master/modules/metrics/search/search_threshold_pass_rate.R#L77
[20:27:48] so whatever the ortiz::dwell_time package in R does when dwell_threshold=10
[20:28:22] documented as 'the value (in seconds) to use to indicate a "successful" session'
[20:28:32] so, it sounds like getting a 10 second dwell is 'success'
[20:28:46] i guess the question is whether 10s dwell is actually 10s or really 20s
[20:29:06] it sounds like this package accepts anything >= 10, so 10 would trigger it
[20:30:00] i guess as well though, the old metric doesn't seem to be using the actual checkin events, only using those as additional sources of timestamps within a session
[20:30:14] (hard to follow entirely, never written anything in R)
[20:31:08] Nettrom: does this answer the question you had on T279105#7263573?
[20:31:08] T279105: Create/revive Search Platform team metrics dashboard - https://phabricator.wikimedia.org/T279105
[20:32:19] tbh reading this code doesn't entirely make sense to me :P i was expecting dwell by page, but this is doing something over session_id that i don't entirely follow
[20:33:59] I guess it could be https://github.com/wikimedia/wikimedia-discovery-golden/blob/master/modules/metrics/search/srp_survtime.R instead, this is dealing with checkins and LD50 directly
[20:34:29] hmm, somewhere in this repo or its history is where it should be defined at least :P
[20:38:04] Trey314159: are there any es reindexes going on that you're aware of? need to do rolling restarts this week to apply some security updates
[20:38:12] I guess the threshold is the right one, here are the docs from the old dashboards on the kpi: https://github.com/wikimedia/discovery-dashboard-search/blob/master/tab_documentation/kpi_augmented_clickthroughs.md
[20:40:47] * ebernhardson is also thankful someone imported that as dashboards-search, in gerrit it's discovery-rainbow and a little harder to find :)
[20:43:53] random other fun page i turned up, and a reminder that things live in etherpad longer than we might expect :) https://etherpad.wikimedia.org/p/search-metrics
[20:44:07] I guess it'd just be weird if a 10s threshold actually meant 20s of real dwell time for a session to count. I'm not sure how the pings work, but I would assume that first ping at 10s is checked against some earlier timestamp?
[20:45:15] mpham: if i'm reading the old KPI code correctly, mostly it's doing `dwell = max(timestamp) - min(timestamp)` over all events with the same session_id; it never looks at the action 10, 20, etc. numbers, it just looks at the timestamps on the events along with all the other events like performing a search
[20:46:00] checkin events end up increasing the dwell time, because they keep coming in, but the exact dwell number sent isn't referenced
[20:47:38] under these conditions, a single checkin event with dwell=10 should trigger it
[20:49:20] so would it be fair to say that Nettrom should change how he's counting from using 2 pings/checkin events/20s to using 1 ping/checkin event/10s to calculate dwell time?
[20:50:24] that would more closely match the old KPI, i suppose
[20:51:36] i don't necessarily know the value in trying to match the old KPI, it's hard to say exactly which parts were the important parts.
[20:54:18] ok thanks! I think using the old KPI will at least make things more consistent
[20:57:14] thanks for digging into this, mpham and ebernhardson! I'll take another look at the R code tomorrow and test it out just to confirm that it uses `>=`, and I'll also see if we have other code that's done similar things
[20:57:58] I don't have strong opinions about what is the "right" way, having it be ">= 10s" will simplify things, though
[20:58:48] and lastly, based on the proportions in our data, 20s of dwell time might be a little too strict
[20:59:23] makes sense, the old data looked something like this: https://meta.wikimedia.org/wiki/Research:Measuring_User_Search_Satisfaction#/media/File:Per_session_intertime_density_log.png
[20:59:45] err, hmm. Actually that might be something slightly different
[22:00:52] ryankemper: dcausse has his giant collection of 800+ reindexes running. It's up to sh.. you can see the logs on mwmaint2002 in ~dcausse/reindex/cirrus_log/ .. looks like there are still ~200 to go
[22:01:41] I can probably tell it to stop after it finishes the current one
[22:02:11] Trey314159: got it, thanks. we can let those run their course. off the top of your head, do you remember if relforge is part of that? (I need to start w/ restarts of relforge first anyway)
[22:02:42] I would imagine since we reindex by talking to mediawiki that it only impacts eqiad/codfw?
[22:03:11] I forget where cloudelastic lives w.r.t. relforge. It's reindexing eqiad, codfw, and cloudelastic.
[22:03:28] okay good, that means no relforge
[22:04:06] cloudelastic lives in eqiad. I believe it runs in ganeti and is not an actual physical host, although I should go check up on that to be sure
[22:04:44] cloudelastic has real hosts :)
[22:05:08] I didn't think it had any relation to relforge, but lotsa things have moved around since the last time I thought about it too much
[22:05:10] thanks, wonder what I was thinking of then
[22:05:20] (re cloudelastic)
[22:05:21] hmm, an-airflow and search-loader are ganeti
[22:05:35] maybe others, it's not always clear :)
[22:06:15] ah yeah I might have been thinking of the MLR stuff
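Tying together the KPI discussion above (20:45:15 onward), a minimal Python sketch of the old dwell computation: dwell = max(timestamp) - min(timestamp) over all events sharing a session_id, compared against a >= 10s threshold. The authoritative version is the R code in wikimedia-discovery-golden (ortiz::dwell_time); the event representation below is an assumption.

```python
from collections import defaultdict

DWELL_THRESHOLD = 10  # seconds, mirroring dwell_threshold=10 in the R code

def engaged_sessions(events):
    # events: iterable of (session_id, timestamp) pairs covering every
    # event in a session (searches, clicks, checkins, ...). The checkin's
    # own 10/20/... payload is never consulted; checkins matter only
    # because they contribute extra timestamps that stretch the span.
    spans = defaultdict(list)
    for session_id, ts in events:
        spans[session_id].append(ts)
    return {sid for sid, ts_list in spans.items()
            if max(ts_list) - min(ts_list) >= DWELL_THRESHOLD}

# A search at t=0 followed by a single checkin at t=10 already meets the
# threshold, i.e. one ping (not two) is enough to count as engagement:
assert engaged_sessions([("s1", 0), ("s1", 10)]) == {"s1"}
```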