[10:16:53] lunch
[11:10:35] errand
[13:13:53] o/
[13:36:18] o/
[14:08:03] o/
[14:08:15] 🫢
[14:15:53] :)
[14:30:09] inflatador, ryankemper: any update on https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1016855 ?
[14:32:40] volans not ATM... ryankemper can you check it out once you get in?
[14:33:01] headed to the cowork space, back in ~30
[14:34:32] thx
[14:34:44] np, sorry it's not moving faster
[15:01:19] back
[15:05:14] \o
[15:05:25] dcausse: looking for code review on anything wdqs? i saw the attention set change on some patches, but wanted to make sure it's okay. looking for general feedback? also, anything needing merge that needs review?
[15:06:52] dr0ptp4kt: yes please, everything that's not WIP or -1 should be ready for general feedback, but no rush to get any of these merged
[15:06:56] o/
[15:07:10] thx dcausse!
[15:08:44] dr0ptp4kt: i realized while making numbers for T358349: do we really want a ratio of, for example, autocomplete requests to pageviews? I can make those numbers, desktop had perhaps 25M ac requests from users and 233M page views on 4/4, but i'm not sure the ~10% number means anything
[15:08:45] T358349: Search Metrics - Number of Searches - https://phabricator.wikimedia.org/T358349
[15:12:21] building dataframes programmatically, I wonder when spark will start to scream...
[15:14:03] ebernhardson: ack, pondering
[15:18:27] A question for (I think) ebernhardson, but maybe others: I've been trying to establish some direction for WMF recommender systems next year, in particular the expansion of the sorts of topic filters and task filters that we provide to editors. For example, so we can surface articles with an image recommendation available that are about sports in Newcomer Tasks, and hopefully Content Translation or other tools (example API call: https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=hasrecommendation:image%20articletopic:sports&format=json&srnamespace=0 ). This approach has been working quite well from my perspective and I'd like to scale it up with more tags, but wanted to talk with you all first to see if there are any unforeseen challenges. I put a bunch of my thoughts into this Etherpad (https://etherpad.wikimedia.org/p/recsys-search-tags-future). If it can be done async, I'm happy to chat; if it'd be better in a meeting or something else, just let me know. Thanks!
[15:20:27] taking a look
[15:22:41] ebernhardson: a couple parts here, i think. it sounds like due to edge caching the number of backend-observed autocomplete "prefix searches" is lower than the number of pageviews. so, i think what we really want here is both edge cache-observed requests (which is approximately one request per character entered, or maybe slightly fewer with debounce) and backend-observed requests (which will be smaller, like you say)
[15:22:42] but
[15:23:20] i think you may have also noted some challenges because of i18n. but i think the i18n challenge for webrequests is for special:search (unless you do an array of the i18n in queries)
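A minimal PySpark sketch of the edge-vs-backend split described above, assuming access to wmf.webrequest from a stat box. The date, the path, and treating any hit-* cache_status as "served from the edge" are illustrative assumptions, not the actual notebook code:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ac-edge-vs-backend").getOrCreate()

prefix = (
    spark.table("wmf.webrequest")
    .where("year = 2024 AND month = 4 AND day = 4")   # same day as the numbers above
    .where(F.col("agent_type") == "user")             # the agent == user filtering
    .where(F.col("uri_path") == "/w/rest.php/v1/search/title")
)

# hit-* requests were answered by the edge caches and never reached the
# search backend; everything else ("miss", "pass", ...) did reach it
counts = prefix.groupBy(
    F.col("cache_status").startswith("hit").alias("served_from_edge")
).count()

counts.show()
```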
[15:23:22] however
[15:23:34] for prefix search, i think it may be the case that the format is known:
[15:23:40] like if i'm looking for a cat in french:
[15:23:42] dr0ptp4kt: these are counted from webrequests, so it's using the same agent == user filtering
[15:23:42] https://fr.wikipedia.org/w/rest.php/v1/search/title?q=chat&limit=10
[15:23:49] ebernhardson: thanks -- no urgency, but trying to check in while I'm still early in this planning
[15:24:09] or english: https://en.wikipedia.org/w/rest.php/v1/search/title?q=cat&limit=10
[15:24:16] those seem to be predictable
[15:24:28] that's the desktop case
[15:24:42] dr0ptp4kt: for i18n i decided to pretend any request to index.php with the 'search' query string is probably search. For the longest time there was a hardcoding in mw that sent all pages to Special:Search if they had that token. That has been fixed, but afaik there are still no other users
[15:25:53] i might not be counting the restbase ones, not sure. restbase has to make a query into mediawiki, do those end up in the webrequest logs too?
[15:26:07] i'm guessing no, and that it skips all the caching layers
[15:26:23] hmm, i think they do show, but lemme go check quickly
[15:26:54] that's rest.php, which i think is mw-mediated on an app server layer and still routes via varnish, so it should show in the logs. moment...
[15:27:15] i'm currently detecting them by decoding the mw api.php query string and counting various requests
[15:28:39] yah, i really need to go through the notebook (thx btw for pointing to the appropriate stat box in one of those tix, as well as noting the thing about dividing by uniques that we discussed; also the search-metrics etherpad has the notes, but i owe you ticket updates...)
[15:31:31] hmm, also realized that for counting uniques i guess it really has to do the whole month in one spark job, or i would need some fancy data structures, but i'm going to try throwing a bunch of compute at it and hope it doesn't die on 30d of data. Do we care which 30 days? I would probably run mar 1-31, but it could be mar 11 - apr 11 or whatever
[15:33:34] yay, looks like those rest.php ones show in webrequest: select cache_status, uri_query from webrequest where year = 2024 and month = 4 and day = 4 and hour = 4 and uri_host='fr.wikipedia.org' and uri_path = '/w/rest.php/v1/search/title' limit 10;
[15:35:02] dr0ptp4kt: what i was wondering though is, restbase then goes on to make an api.php request. Do those api.php requests end up in webrequests?
[15:35:38] oh, hang on, i haven't looked at the code in a long time. i know it was built within the last few years, but it feels like 30 years ago, pandemic time vortex
[15:52:43] still getting alerts from elastic2090, which was accidentally added to both psi and omega. Cleared out the psi stuff and running puppet, let's see if that fixes anything
[15:52:51] * ebernhardson realizes while looking at the restbase recommendations code that they lose the order of morelike recommendations
[15:53:05] okay, looking at https://gerrit.wikimedia.org/g/mediawiki/core/+/2d80faff2d13c9f2be8a011df5aded602c105aad/includes/Rest/Handler/SearchHandler.php it seems like this does an internal search without dispatching to api.php
[15:53:15] but wait, does it?
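A hedged Spark SQL version of the check being run here, widening the exact-path query above with a LIKE so any other rest.php search endpoints (e.g. /page alongside /title) show up too; the hour and host are just carried over from the example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("restphp-in-webrequest").getOrCreate()

spark.sql("""
    SELECT cache_status, uri_path, COUNT(*) AS requests
    FROM wmf.webrequest
    WHERE year = 2024 AND month = 4 AND day = 4 AND hour = 4
      AND uri_host = 'fr.wikipedia.org'
      AND uri_path LIKE '/w/rest.php/v1/search/%'
    GROUP BY cache_status, uri_path
    ORDER BY requests DESC
""").show(truncate=False)
```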
[15:53:23] dr0ptp4kt: doh, for some reason i'm mixing up rest.php and restbase :P
[15:53:44] if it's mediawiki php code it should certainly be invoking search directly
[15:53:51] the new autocomplete widget is using rest.php and this SearchHandler.php code
[15:53:55] i'm trying a LIKE query on webrequest to make sure i'm not missing something
[15:54:13] but not all the wikis are using the new widget, so opensearch is still being used
[15:54:36] yeah, to mediawiki-config i suppose :)
[15:54:45] yea, opensearch was the 25M ac per day (which does seem super low, we probably serve ~80M from the backend iirc). It's counting again now with restbase included
[15:54:56] i was initially assuming that was the bot/user filter
[15:55:37] * ebernhardson should just give this thing more executors... on one hand a TB of memory seems excessive... but there are 13TB still idle :P
[15:55:42] mobile apps might be using restbase as an intermediate layer to some search apis I suppose (morelike for sure)
[15:56:16] mobile apps do an insane number of queries
[15:56:17] i did have a look at https://github.com/wikimedia/apps-android-wikipedia/blob/af8db81f535a1a5de25fa07a8ab126c71f733aca/app/src/main/java/org/wikipedia/dataclient/Service.kt#L65 last night to see what android was doing
[15:56:21] it works out to something like 4 search requests per page view
[15:56:31] and then there are of course the calls that are restbase-mediated and not action-api-mediated
[15:56:55] iirc they do a mix of autocomplete and fulltext
[15:57:08] yea, the two endpoints have similar counts
[15:57:12] mobile app search is really nice from a ux perspective for sure, because it just does "what you mean"
[15:58:14] the invoker on android is i think https://github.com/wikimedia/apps-android-wikipedia/blob/4f57145c099fdeb60f3dc88d36df7a1ac52275e3/app/src/main/java/org/wikipedia/search/SearchResultsViewModel.kt#L101 but dbrant could verify on slack if need be
[16:01:53] from a search perspective it's a bit awkward, we have a strict AND between all words in the full text query, and they send partially typed queries
[16:02:58] the AND would be implicit, is that right?
[16:03:10] yes, it's always there if not specified
[16:03:44] a more intelligent thing would have to treat the final token differently, perhaps as a weight but not a filter
[16:03:59] yes, it always bothered me that fulltext is receiving partially typed queries, but I guess they can't really know
[16:04:13] but that would be weird for normal use cases, so they would need some dedicated flag to flip
[16:04:37] even still... as a weight it would give weird results i suspect
[16:05:47] yes... seems difficult to address without the user saying "I'm done typing"
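A rough sketch of the "final token as a weight, not a filter" idea from this exchange. This is not how CirrusSearch is implemented; the index, field names, and endpoint are hypothetical, and it assumes a plain Elasticsearch query-DSL target:

```python
import requests

def partial_query(text: str) -> dict:
    # assumes at least one token has been typed
    *done, last = text.split()
    query = {"bool": {"should": [
        # the possibly half-typed final token only boosts scoring
        {"match_phrase_prefix": {"title": {"query": last}}},
    ]}}
    if done:
        # completed words keep the strict (implicit) AND semantics
        query["bool"]["must"] = [
            {"match": {"text": {"query": " ".join(done), "operator": "and"}}}
        ]
    return {"query": query}

# e.g. against a local test index, not the production cluster:
resp = requests.get("http://localhost:9200/enwiki/_search",
                    json=partial_query("cute ca"))
print(resp.json())
```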
[16:07:15] i imagine google et al. are doing some combination of word and query stats to first complete the query, then search for 1-3 of the top couple completions
[16:07:33] with a completion smart enough to know when a query is "done"
[16:07:57] or maybe it's all ml magic :P
[16:08:53] it's a little different though, because google completes queries, we complete titles
[16:08:54] I might be wrong, but I feel that you always have to press enter? so it's clearly separating suggestions from autocomplete and actual search results
[16:09:25] hmm, wow my age is showing :P instant search was a big deal for a while, but indeed they turned it back off at some point. probably more expense than it's worth?
[16:10:14] yes, I remember this indeed, no clue why they stopped; in my memory the search results appeared with a small delay after typing the last character
[16:10:16] apparently 2010-2017
[16:10:48] for some reason my memory is that it would show results while i was typing for completed searches... hmm
[16:11:18] perhaps on some search boxes on android?
[16:11:25] wiki page says instant was turned off because it didn't work well on mobile
[16:11:41] maybe
[16:12:20] * ebernhardson apparently needs to start writing these dataframes to hdfs and loading parquet into pandas... they keep dying while collecting
[16:12:31] :/
[16:20:59] when google wanted users to grade queries for them: https://googleblog.blogspot.com/2008/11/searchwiki-make-search-your-own.html
[16:24:04] heh. Sounds very corporate, but then again i regularly see people asking for basically this feature (customized per-user re-ranking)
[16:24:26] they want to uprank certain domains, downrank others
[16:39:33] yes, I can understand why this sounds appealing, but I could see it becoming quite tedious to maintain from a user perspective; if it's only domains, perhaps that's manageable tho
[16:46:54] oh, regarding which dates for the month, i don't think it matters. it may be nice to have it be march 1 - march 31 so you can say 'for the month of march'. spark will probably play nicer if things are calculated by hour; also that fits more closely with what might be a session
[16:47:59] session length is usually thought of as 30 minutes without interaction, but it seems like too much work to have a sliding window per session
[16:50:23] with folks on roaming ip addresses, behind nat'd or proxied ip addresses, etc., we wouldn't want to consider a unique spanning a full month i think. this does make it a bit complicated. one way to possibly sidestep this is
[16:50:55] hmm, if you don't do the unique deduplication over the full month of data, then you have double counts?
[16:52:22] hmm, thinking. so there's actor_signature
[16:52:42] in wmf.pageview_actor
[16:54:57] hmm. hmm hmm hmm.
[16:55:13] yes, that's what i'm using. Although i just reimplemented the udf in spark
[16:55:42] did try to load the jar, but spark was complaining about missing handlers, not worth the trouble
[16:55:59] when in doubt!
[16:56:08] copy-pasta!
[16:57:32] spark is so much nicer... compare https://gerrit.wikimedia.org/g/analytics/refinery/source/+/fb0f7dbf7451b9258770f3aa683b53fff909b4fb/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/ActorSignatureGenerator.java to https://phabricator.wikimedia.org/P60469
[16:59:15] heh. and python.
[17:13:31] so, i think we would want a window that is less than a month for defining a unique user. if we do this as a time series by day, that will be a bit closer to the daily uniques notion. the output table or time series graph could show what the estimated sum(uniques_estimate) is for a given day-domain pair from unique_devices_per_domain_daily. i guess either way we'd want to make sure to specify the notion of a "session"
[17:14:33] as being on a day boundary; it has limitations and weaknesses being counted this way, and so on.
[17:21:27] i don't love it, but i think a month is too long and probably too prone to monoculture UAs and shared IPs, whereas a day is a somewhat balanced compromise. of course if we showed it from both the perspective of the day and the perspective of the month it would be interesting. there is also unique_devices_per_domain_monthly, if we're looking to include a benchmark column against which to see how many hashes there are per domain per month.
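A hedged PySpark sketch of the day-window idea with that benchmark column: distinct actor_signature hashes per domain per day next to the daily unique-devices estimate. The column names in wmf.pageview_actor and wmf.unique_devices_per_domain_daily are assumptions from memory, not verified against the schemas:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-uniques-benchmark").getOrCreate()

day = "year = 2024 AND month = 3 AND day = 1"

# one day of distinct hashes per domain, so a "unique" never spans more
# than a day boundary
hashes = (
    spark.table("wmf.pageview_actor")
    .where(day)
    .where(F.col("agent_type") == "user")
    .groupBy(F.col("uri_host").alias("domain"))
    .agg(F.countDistinct("actor_signature").alias("distinct_hashes"))
)

# the published estimate, as the benchmark column
benchmark = (
    spark.table("wmf.unique_devices_per_domain_daily")
    .where(day)
    .groupBy("domain")
    .agg(F.sum("uniques_estimate").alias("uniques_estimate"))
)

hashes.join(benchmark, "domain", "left").orderBy(F.desc("distinct_hashes")).show(20)
```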
[18:31:57] following up on a thing i had mentioned, there is this as well, which appears to be sampled, but it does include bots in its formulation, so it's not viable for comparison: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/SessionLength . if mediawiki_client_session_tick had the info for constructing the hash, it could be ascertained, but it's not that specific.
[19:35:35] ryankemper: looks like elastic2088 is fixed per https://phabricator.wikimedia.org/T361525#9710654
[19:53:04] ack
[20:49:12] volans: (for monday) thanks for following up on https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1016855 (and my bad for losing track of it). got the linting errors fixed, so we should be all set. come next week we can get that patch merged and take the new logic out for a spin
[20:49:57] great, let's sync next week!
[21:46:52] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1018360 is ready for review. See ya Monday!
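Since the actor hash comes up twice in this log (the Spark reimplementation of ActorSignatureGenerator, and whether mediawiki_client_session_tick carries enough info to construct it), here is a hedged sketch of that style of signature; the field list and hash choice are assumptions, and the real refinery code may well differ:

```python
import hashlib

def actor_signature(ip: str, user_agent: str,
                    accept_language: str, uri_host: str) -> str:
    # concatenate the request attributes that together approximate "one device"
    raw = "".join([ip, user_agent, accept_language, uri_host])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

# registered as a Spark UDF this can stand in for the refinery jar that
# refused to load:
# from pyspark.sql import functions as F, types as T
# sig_udf = F.udf(actor_signature, T.StringType())
```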