[15:59:59] hi Search team! I tried creating a custom CirrusSearch rescore function for T301096 (when providing a feed of articles matching the hasrecommendation:link filter to new users, prioritize the ones with a low number of links relative to article size) and I've hit a wall.
[15:59:59] T301096: Add a link: prioritize suggestions of underlinked articles - https://phabricator.wikimedia.org/T301096
[16:02:21] The outgoing_link field has the links, but Elasticsearch doesn't seem to support "number of values" in any of its queries; Google tells me scripting is the only way to do that. But even though the field is a list of URLs, the mapping is type:text, and that effectively disables scripting for that field. Do you know of any way to get around that?
[16:03:30] The idea would be to provide a custom scoring function along the lines of text_bytes / count(outgoing_link), and use it in our API that makes a hasrecommendation:link search to provide new users with link recommendation tasks.
[16:04:41] (We think editors would much prefer link recommendations on underlinked articles rather than on already well-developed ones, where many consider them more annoying than useful.)
[16:07:48] As far as my very limited understanding of Elasticsearch goes, the only solutions would be to 1) add a new index which contains the number of outgoing links specifically (which seems like a lot of work for such a minor feature), or 2) set fielddata=true on outgoing_link in the mapping, which would probably cause a performance degradation since it increases memory usage.
[16:12:48] My plan B would be to calculate an "underlinkedness score" in a PHP job and store it in the hasrecommendation:link tag weight (and then change or clone the HasRecommendation feature to take weight into account).
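(A rough sketch of what the scoring idea above would look like as a generic Elasticsearch function_score query with a Painless script, expressed here as a Python dict. This is illustrative only: it is not CirrusSearch's actual rescore mechanism, and the script would currently fail precisely because outgoing_link is mapped as text without doc values.)

```python
# Hypothetical sketch: boost underlinked articles via text_bytes / count(outgoing_link).
# Assumes outgoing_link were scriptable (it is not, per the discussion above);
# the match_all stands in for the real hasrecommendation:link query.
underlinked_boost = {
    "query": {
        "function_score": {
            "query": {"match_all": {}},
            "script_score": {
                "script": {
                    "lang": "painless",
                    # Bigger articles with fewer links score higher;
                    # Math.max guards against division by zero.
                    "source": "doc['text_bytes'].value / Math.max(1, doc['outgoing_link'].size())",
                }
            },
        }
    }
}
```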
[16:13:35] That's not as good as a rescore function (more complicated architecture, plus changing the exact mechanics of the underlinkedness boost would require recomputing all tag values; we were hoping to make it configurable how much a given wiki cares about underlinkedness), but obviously better than nothing.
[16:14:01] Does that sound reasonable?
[17:21:44] tgr: thanks for the thorough explanation of the different options you're looking at. I don't have a ton of knowledge in this area but will try to do some poking around later today. ebernhardson and dcausse would know the most here, but note that they're both in Ireland this week for the search team offsite, so it might take until next week to get an answer
[17:23:13] tgr: one brief question though: for `solution 1`, you wrote `add a new index which contains the number of outgoing links specifically`, but did you perhaps mean to add a new *field* rather than an index?
[17:26:33] thanks ryankemper! it's not particularly urgent.
[17:28:34] I might be mixing Elasticsearch and SQL terminology. The raw data (the list of URLs) is already there, but some derived data (the number of URLs) would have to be added. I think the correct term is a subfield? I know very little about ES mappings, so I might be getting the terminology wrong.
[17:46:53] I was wondering if a multi-field could be used (https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html#types-multi-fields / https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html)
[17:47:32] However, it looks more geared towards interpreting the same data as two different types, whereas we need to go one step further and tell Elasticsearch something like "store an int called `outgoing_link_count` whose value is equal to the size of the array `outgoing_link`"
[17:57:49] Anyway, at first glance I'd probably lean towards adding a new field rather than setting `fielddata=true`, since we just need `_source.outgoing_link.size()`.
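(A sketch of the "new field" option discussed above. As noted, a multi-field only reinterprets the same value under another type; it cannot store a derived value such as an array's length, so the count would have to be computed by the indexing pipeline and written as a field of its own. The field name `outgoing_link_count` and the `build_doc` helper are assumptions for illustration.)

```python
# Hypothetical mapping addition: a dedicated integer field for the link count.
mapping_update = {
    "properties": {
        "outgoing_link_count": {"type": "integer"},
    }
}

def build_doc(page):
    """Illustrative indexer step: derive the count from the existing list
    before the document is sent to Elasticsearch."""
    doc = dict(page)
    doc["outgoing_link_count"] = len(page.get("outgoing_link", []))
    return doc
```

An integer field like this has doc values by default, so it would be directly usable in a scoring script without fielddata.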
That would involve needing to change the mapping though, AFAICT
[18:20:43] Mapping the field as both text and keyword would probably also work, since scripting is available for keyword fields. I imagine it would have the same disadvantage as fielddata=true with respect to memory usage, though.
[20:31:19] Working on reloading the Apaches for https://phabricator.wikimedia.org/T301461#7996226. A bit confused though; I'm realizing I don't know much about our usage of Apache in the query service
[20:32:01] The config file's here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/799297/6/modules/profile/templates/query_service/httpd.erb but I don't actually see an httpd process running, nor do I see any `httpd` entries via `sudo netstat -ltp`
[21:46:53] Tried both reloading and restarting nginx on `wdqs1006`, but tests are still failing (`ryankemper@cumin1001:~$ httpbb /srv/deployment/httpbb-tests/query_service/test_wdqs.yaml --hosts wdqs1006.eqiad.wmnet`)
[21:47:12] https://www.irccloud.com/pastebin/lWzJSNQ4/access%20log
[21:47:23] https://www.irccloud.com/pastebin/SyuSsc4m/
[21:47:45] I'm missing something obvious here; gonna grab lunch and see if the answer becomes obvious to me
[22:20:42] ryankemper: the frontend files are not served from the wdqs nodes, but from the general-purpose static webservers
[22:30:15] gehel: ah, of course, thanks! I'll look at `miscweb*`
[22:31:13] that previous CR should probably be reverted as it is useless
[22:31:22] (and somewhat misleading)
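(For reference, a sketch of the text-plus-keyword idea from the 18:20:43 message: a multi-field mapping adds a keyword sub-field, which, unlike its text parent, has doc values and is therefore scriptable. Expressed as a Python dict; the sub-field name is an assumption for illustration.)

```python
# Hypothetical multi-field mapping: same data indexed as both text and keyword.
multi_field_mapping = {
    "properties": {
        "outgoing_link": {
            "type": "text",
            "fields": {
                "keyword": {"type": "keyword"},
            },
        }
    }
}

# A Painless script could then count the values via the keyword sub-field:
count_script = "doc['outgoing_link.keyword'].size()"
```

As noted above, the trade-off is extra index and memory footprint for storing the same values twice.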