[00:54:56] I was wondering, not urgent, what constitutes a "user" in the context of this patch? (cc ebernhardson) https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/709781
[03:02:27] Krinkle: md5(ip + UA)
[03:04:18] ebernhardson: I see. from the internal perspective, so a user here would be a particular non-MW service calling Elastic, and its server IP / service UA?
[03:04:38] or is this passed on from the external web request all the way through?
[03:04:49] (honours xff?)
[03:05:36] Krinkle: in terms of impl, php land hashes together the ip+ua to get a weak user fingerprint, the only thing provided to elastic is the resulting hash. A user here is generally intended to be an end user on the public internet making requests
[03:05:57] Krinkle: not sure if it uses xff, this is an old identity hash we've used for ~5 years, sec
[03:06:19] i guess it's md5(ip, XFF, UA)
[03:06:22] ah this is something custom to CirrusSearch
[03:06:31] I looked up generateIdentToken now, I thought it came from the Elastic lib
[03:06:50] yea, elastic provides the "use some value as the seed" part, and this is generating a per-user seed
[03:07:24] It calls WebRequest::getIP, which resolves XFF already, so adding XFF to the hash is probably not needed.
[03:08:17] oh cool, can drop that part then yea
[03:08:56] origin https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/226466
[03:09:06] doesn't look like it mentions a special reason for including xff
[03:09:32] no big deal either way, but just looking for "unusual"/possibly-intended patterns and this stood out just now
[03:09:39] unintended*
[03:10:26] yea makes sense, i don't remember why in particular that was included, i imagine mw core has been resolving xff for some time
[03:10:34] indeed
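A minimal Python sketch of the fingerprint discussed above, for illustration only. The real implementation is CirrusSearch's generateIdentToken() in PHP, and the exact concatenation and encoding are assumptions here; the log confirms only that the inputs are IP, XFF, and UA, and that Elastic receives nothing but the resulting hash (usable as a random_score seed).

```python
import hashlib

def generate_ident_token(ip: str, user_agent: str, xff: str = "") -> str:
    # Weak per-user fingerprint: md5 over IP + XFF + User-Agent.
    # If `ip` already comes from WebRequest::getIP(), XFF has been
    # resolved into it, so the separate xff component is redundant
    # (the point raised at 03:07:24).
    return hashlib.md5(f"{ip}{xff}{user_agent}".encode("utf-8")).hexdigest()

# Elastic never sees the raw IP/UA, only the hash, which can serve as
# the seed of a random_score function so per-user ordering stays stable:
seed = generate_ident_token("203.0.113.7", "Mozilla/5.0")
random_score_clause = {"random_score": {"seed": seed, "field": "_seq_no"}}
```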
[09:42:35] Hi, there seems to be some old dump in '/wmf/data/discovery/wikidata/rdf/' from February. These don't conform to the recent data (which have a partition called wiki='wikidata' or wiki='commons'). Wanted to point it out, in case these need cleaning or something.
[09:42:54] * tanny411 had sent the msg to the analytics channel by mistake
[13:46:04] o/ Does anyone have any pointers on the mechanism for indexing new entities on wikidata as fast as possible, and how it differs from the regular index process?
[13:46:20] and is this specific to wikidata? or would it also happen in wikibase?
[13:46:32] and where does the mechanism actually live?
[14:06:24] o/
[14:06:46] can't help you here, but the west coast is waking up soon, ebernhardson might be able to provide insight
[14:23:01] Though it would still be nice to know, we found our issue now so we're not blocked on figuring this out :)
[14:23:50] cool, I love helping without doing anything!
[14:27:48] we still didn't reorganize our board for new priorities
[14:28:21] mpham: I'm going to assume priorities as discussed during kick-off for WCQS. Which means I'm going to start working on authorization for the service
[14:30:29] zpapierski: sounds good to me! thanks
[14:58:46] addshore: it's called instant index, the short of it is that a minimal version of the document is sent in the web request process that created the page
[17:44:01] addshore: actually, this was removed in Ide5c74eb92cd4. It was initially added as a hack to help the use case of create property -> use property seconds later. But iirc this was replaced with a method using sql that guaranteed the user finds the property, instead of this which is only an attempt
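A rough Python sketch of the (since-removed) "instant index" idea as described at 14:58:46: index a bare-bones document from within the web request that created the page, so the entity is findable seconds later. Everything here — the helper name, index name, and field set — is hypothetical; the real implementation was CirrusSearch PHP code, removed in Ide5c74eb92cd4.

```python
from elasticsearch import Elasticsearch  # official client, 8.x API assumed

es = Elasticsearch("http://localhost:9200")

def instant_index(page_id: int, title: str, namespace: int) -> None:
    # Hypothetical: push a minimal document straight from the request
    # that created the page. The full document is still built and
    # indexed later by the regular job-queue process; this was only a
    # best-effort attempt, later replaced by an SQL-based lookup that
    # guarantees the user finds the new property.
    es.index(
        index="wikidatawiki_content",  # hypothetical index name
        id=str(page_id),
        document={"title": title, "namespace": namespace},
    )
```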
[18:43:03] hmm, not clear what to do with search/highlighter. It seems david updated it to elastic 7.5.1 in feb 2020, but we need 6.8.18 first. hmm....
[20:15:32] * ebernhardson will probably create a side-branch or something i guess
[20:16:57] ebernhardson: Do you know the answer to Morten's question? "In the task description, "Search Engagement" is defined as a dwell time of over 10 seconds. The first ping event on a page happens at 10 seconds. I've taken this to mean that at least two pings are needed, meaning 20 seconds, for a session to count towards search engagement. I'm wondering if I've misinterpreted that and we should change it? The way the data is stored makes changing that easy, so it's not a significant change." T279105#7263573
[20:16:58] T279105: Create/revive Search Platform team metrics dashboard - https://phabricator.wikimedia.org/T279105
[20:17:29] mpham: hmm, i can probably find a reference from when it was previously defined. few mins
[20:25:49] mpham: hmm, some analysis/background here but not a reference for that particular q https://meta.wikimedia.org/wiki/Research:Measuring_User_Search_Satisfaction#Survival_analysis_approach
[20:25:56] have another idea though.. looking
[20:27:37] mpham: i think that limit would have been implemented from this line: https://github.com/wikimedia/wikimedia-discovery-golden/blob/master/modules/metrics/search/search_threshold_pass_rate.R#L77
[20:27:48] so whatever the ortiz::dwell_time package in R does when dwell_threshold=10
[20:28:22] documented as 'the value (in seconds) to use to indicate a "successful" session'
[20:28:32] so, it sounds like getting a 10 second dwell is 'success'
[20:28:46] i guess the question is whether 10s dwell is actually 10s or really 20s
[20:29:06] it sounds like this package accepts anything >= 10, so 10 would trigger it
[20:30:00] i guess as well though, the old metric doesn't seem to be using the actual checkin events, only using those as additional sources of timestamps within a session
[20:30:14] (hard to follow entirely, never written anything in R)
[20:31:08] Nettrom: does this answer the question you had on T279105#7263573?
[20:31:08] T279105: Create/revive Search Platform team metrics dashboard - https://phabricator.wikimedia.org/T279105
[20:32:19] tbh reading this code doesn't entirely make sense to me :P i was expecting dwell by page, but this is doing something over session_id that i don't entirely follow
[20:33:59] I guess it could be https://github.com/wikimedia/wikimedia-discovery-golden/blob/master/modules/metrics/search/srp_survtime.R instead, this is dealing with checkins and LD50 directly
[20:34:29] hmm, somewhere in this repo or its history is where it should be defined at least :P
[20:38:04] Trey314159: are there any es reindexes going on that you're aware of? need to do rolling restarts this week to apply some security updates
[20:38:12] I guess the threshold is the right one, here are the docs from the old dashboards on the kpi: https://github.com/wikimedia/discovery-dashboard-search/blob/master/tab_documentation/kpi_augmented_clickthroughs.md
[20:40:47] * ebernhardson is also thankful someone imported that as dashboards-search, in gerrit it's discovery-rainbow and a little harder to find :)
[20:43:53] random other fun page i turned up, and a reminder that things live in etherpad longer than we might expect :) https://etherpad.wikimedia.org/p/search-metrics
[20:44:07] I guess it'd just be weird if a 10s threshold actually meant 20s of real dwell time for a session to count. I'm not sure how the pings work, but I would assume that first ping at 10s is checked against some earlier timestamp?
[20:45:15] mpham: if i'm reading the old KPI code correctly, mostly it's doing `dwell = max(timestamp) - min(timestamp)` over all events with the same session_id; it never looks at the action 10, 20, etc. numbers, it just looks at the timestamps on the events along with all the other events like performing a search
[20:46:00] checkin events end up increasing the dwell time, because they keep coming in, but the exact dwell number sent isn't referenced
[20:47:38] under these conditions, a single checkin event with dwell=10 should trigger it
[20:49:20] so would it be fair to say that Nettrom should change how he's counting from using 2 pings/checkin events/20s to using 1 ping/checkin event/10s to calculate dwell time?
[20:50:24] that would more closely match the old KPI, i suppose
[20:51:36] i don't necessarily know the value in trying to match the old KPI, it's hard to say exactly which parts were the important parts.
[20:54:18] ok thanks! I think using the old KPI will at least make things more consistent
[20:57:14] thanks for digging into this, mpham and ebernhardson! I'll take another look at the R code tomorrow and test it out just to confirm that it uses `>=`, and I'll also see if we have other code that's done similar things
[20:57:58] I don't have strong opinions about what is the "right" way, having it be ">= 10s" will simplify things, though
[20:58:48] and lastly, based on the proportions in our data, 20s of dwell time might be a little too strict
[20:59:23] makes sense, the old data looked something like this: https://meta.wikimedia.org/wiki/Research:Measuring_User_Search_Satisfaction#/media/File:Per_session_intertime_density_log.png
[20:59:45] err, hmm. Actually that might be something slightly different
[22:00:52] ryankemper: dcausse has his giant collection of 800+ reindexes running. It's up to sh.. you can see the logs on mwmaint2002 in ~dcausse/reindex/cirrus_log/ .. looks like there are still ~200 to go
[22:01:41] I can probably tell it to stop after it finishes the current one
[22:02:11] Trey314159: got it, thanks. we can let those run their course. off the top of your head, do you remember if relforge is part of that? (I need to start w/ restarts of relforge first anyway)
[22:02:42] I would imagine since we reindex by talking to mediawiki that it only impacts eqiad/codfw?
[22:03:11] I forget where cloudelastic lives w.r.t. relforge. It's reindexing eqiad, codfw, and cloudelastic.
[22:03:28] okay good, that means no relforge
[22:04:06] cloudelastic lives in eqiad. I believe it runs in ganeti and is not an actual physical host, although I should go check up on that to be sure
[22:04:44] cloudelastic has real hosts :)
[22:05:08] I didn't think it had any relation to relforge, but lotsa things have moved around since the last time I thought about it too much
[22:05:10] thanks, wonder what I was thinking of then
[22:05:20] (re cloudelastic)
[22:05:21] hmm, an-airflow and search-loader are ganeti
[22:05:35] maybe others, it's not always clear :)
[22:06:15] ah yeah I might have been thinking of the MLR stuff
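Tying together the KPI discussion above (20:45:15 onward), a minimal Python sketch of the old dwell computation: dwell = max(timestamp) - min(timestamp) over all events sharing a session_id, compared against a >= 10s threshold. The authoritative version is the R code in wikimedia-discovery-golden (ortiz::dwell_time); the event representation below is an assumption.

```python
from collections import defaultdict

DWELL_THRESHOLD = 10  # seconds, mirroring dwell_threshold=10 in the R code

def engaged_sessions(events):
    # events: iterable of (session_id, timestamp) pairs covering every
    # event in a session (searches, clicks, checkins, ...). The checkin's
    # own 10/20/... payload is never consulted; checkins matter only
    # because they contribute extra timestamps that stretch the span.
    spans = defaultdict(list)
    for session_id, ts in events:
        spans[session_id].append(ts)
    return {sid for sid, ts_list in spans.items()
            if max(ts_list) - min(ts_list) >= DWELL_THRESHOLD}

# A search at t=0 followed by a single checkin at t=10 already meets the
# threshold, i.e. one ping (not two) is enough to count as engagement:
assert engaged_sessions([("s1", 0), ("s1", 10)]) == {"s1"}
```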