[09:04:24] dcausse: are you around
[09:07:18] oh
[09:07:41] kostajh: looks like ejoseph is around if you have questions about integration testing on ES + Cirrus
[09:23:19] \o
[09:24:27] ejoseph: the first thing I'm trying to solve is to get elasticsearch working on my mac again (using docker). I have an apple m1 system. I was using this image https://gitlab.wikimedia.org/kharlan/wmf-elasticsearch-arm64/-/blob/main/Dockerfile . It worked fine but updates stopped processing (recently?) and now I get cryptic errors in elastic when I try to rebuild the search index
[09:38:27] ryankemper / inflatador: a minor follow up on your puppet change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/789104
[09:42:20] kostajh: good morning
[09:42:36] Can you send screenshots of the error you are getting
[09:46:27] ejoseph: hi! https://gitlab.wikimedia.org/-/snippets/23
[09:46:49] the error comes after I run `php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --startOver`
[09:47:28] I put the stack trace for the script in the snippet too
[09:48:48] ok checking
[09:51:06] that's really weird! `class java.lang.Boolean cannot be cast to class java.util.Map`
[09:53:11] I was wondering too
[09:53:37] Didn't come across that when setting up
[09:54:06] kostajh: can you jump on a meet?
[09:54:41] ejoseph: sure!
[09:55:19] ejoseph: https://meet.google.com/ohb-gtda-kgh
[09:55:35] kostajh: We've moved our main branch to 6.8. So if you're using a recent CirrusSearch, you might want to upgrade to 6.8 (I don't see how this could be related, but I don't know enough)
[09:55:37] sorry have to attend to something
[09:55:45] oh
[09:55:56] give me 10 mins
[09:56:10] ejoseph: np. ping me when you're ready
[09:56:33] gehel: the "oh" was about 6.8. I had been using 6.5. So that's probably why things broke when I updated Cirrus
[10:00:05] let me try to rebuild my arm64 image using https://github.com/elyalvarado/elasticsearch-docker-arm64 as the base, then, to see if 6.8 fixes things
[10:08:11] kostajh: I am here now
[10:08:55] ejoseph: let me see if rebuilding my arm64 image with elastic 6.8 fixes the issue, because I have tried so many other things already; if that doesn't work I can ping you to do a debug meeting, does that sound ok?
[10:09:11] Yes sure
[10:21:03] is it easy for me to build the plugins for 6.8.24? they are built for 6.8.23. And the arm64 image I've found is only for 6.8.24
[10:31:30] I suspect it isn't, anyway, I'm trying to build a 6.8.23 arm64 image now 😰
[12:17:46] it works 😅
[12:18:04] gehel: thank you for the 6.5 -> 6.8 clue, that is what I needed
[12:28:23] I built elasticsearch-oss-6.8.23 for arm64 architecture; is there some semi-random place I can keep this 65M build in case others want to build this image in the future?
[12:28:51] I updated https://www.mediawiki.org/w/index.php?title=MediaWiki-Docker/Configuration_recipes/ElasticSearch&type=revision&diff=5196632&oldid=5196561&diffmode=source with a reference to the image I built and pushed to docker hub
[12:42:55] kostajh: my intuition would be on archiva. It's a bit of an abuse, but we are already doing it for some other stuff.
[12:43:08] I put it on people.wikimedia.org for now
[12:43:09] Or people.wm.o
[12:43:18] ;)
[12:44:44] last question (for now!), are there other version bumps like this anticipated in the near future, and how can I hear about them ahead of time?
[12:59:50] kostajh our phab board is the best place to learn about our upcoming projects: https://phabricator.wikimedia.org/project/view/1849/
[13:00:13] We are planning a bump to Elastic 7.10 (the last OSS version) which will probably happen in the next few months
[13:12:58] gehel :eyes on your PR
[13:13:26] want me to merge it?
[13:29:05] inflatador: please do!
[13:51:46] merged
[13:56:14] quick break, back in ~15
[14:20:05] back
[14:58:10] Come Talk to the Search Platform Team in a few minutes! https://meet.google.com/vgj-bbeb-uyi
[15:01:41] ebernhardson, ryankemper, inflatador, ejoseph: office hours: https://meet.google.com/vgj-bbeb-uyi
[15:38:58] does anyone know where that documentation about authentication on WCQS might be?
[15:52:32] Working out, back in ~35. If you have time to check https://gerrit.wikimedia.org/r/c/operations/puppet/+/788815 it would be appreciated ;)
[16:14:07] gehel: if you are available, can you come back to the meeting?
[16:14:33] if not, no worries!
[16:45:11] sorry, been back
[17:00:56] unmeeting?
[17:35:57] Trey314159 or ebernhardson or anyone else: there were some questions about exactly how/when snippets and section titles are being rendered in search results -- it seems a little unpredictable and it'd be helpful if you could provide more detail about what the process is: https://wikimedia.slack.com/archives/C030Q2LL63T/p1651148609504579
[17:47:20] mpham, I don't know exactly what's going on in the highlighter; I've never worked in that code. David did reply earlier and it seems like there's a rough explanation for what's going on. Do you want us to add to the explanation/supposition, or dig into the code to try to figure out exact details?
[17:47:59] (I originally saw that thread but didn't chime in because I didn't have anything concrete to add.)
[18:05:30] lunch. back in ~45
[18:17:54] who wants to code review wbsearchentities fix? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseCirrusSearch/+/789227/
[18:18:22] if you have a local install with wikibase+cirrus it should be easy to test
[18:49:00] Trey314159: I think nobody really seems to understand when a section's title will be highlighted in a search result, or how snippets are chosen and displayed. It is making it difficult to design around. I guess on the other hand, we could try to write some requirements around what is desired, but this has gotten us into problems in the past in making assumptions that certain changes would be easier than they actually are (i.e. indexing sections)
[18:51:23] I think there's difficulty in understanding how the current search works as we try to improve it (this includes my own process of getting up to speed on things). I'm not sure we have documentation or explanations of how search features work, and I think that has tripped up users more than once with different expectations of functionality
[18:53:51] mpham: i suppose you could think of it as two groups of highlighting, there is the title/redirect/category/section title highlighting, and then there is the text/aux text/file text highlighting. Each of those groups will try the first in the list, if it doesn't highlight anything go to the next in the list
[18:55:05] a user should get a section highlight any time one of their search terms matches a section title and the highlight isn't superseded by title/redirect/category
[18:55:28] mpham: yeah, that's fair. There are a *lot* of complex backend pieces, but also too many entry points into search for users. We want to give people powerful tools, which can be hard to square with an intuitive UI/UX
[18:55:49] (and isn't that just the wordiest *shrug* ever?)
[18:57:29] * ebernhardson wonders why category comes before section title in the highlighting, but maybe i'm biased against categories :P
[18:59:22] back
[18:59:23] ebernhardson: how good of a match does it have to be? I deleted "great" from sneha's mining on the great barrier reef example, and the results were still the same. And what's text vs aux text vs file text?
[19:00:27] Trey314159: makes sense, but i'd argue without good entry points and/or UI/UX, nobody is going to be able to use the powerful tools -- or may actually be inadvertently destructive with them. So I do think it's in our interest to make sure things are as intuitive as possible
[19:00:30] mpham: have a link?
[19:01:40] mpham: text is the cleaned up html of the page minus "non-content" parts such as headings, captions, tables, etc and minus the html. aux_text is anything we removed from text as non-content, file_text is mostly a commons thing, but it's content from .pdf's, .djvu, etc.
[19:03:20] ebernhardson & mpham: this is sneha's original link: https://en.wikipedia.org/w/index.php?search=mining+on+the+barrier+reef&title=Special:Search&profile=default&fulltext=1
[19:03:46] The article in question is "Biodiversity of New Caledonia"
[19:03:59] in the first 6 results those all get title highlights, so no section highlighting will be attempted
[19:04:22] Yeah, it's not a top hit
[19:04:25] was also looking at this example: https://en.wikipedia.org/w/index.php?search=mining+on+the++barrier+reef+%22mining+on+the+great+barrier+reef%22&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%22fields%22%3A%7B%22phrase%22%3A%22%5C%22mining+on+the+great+barrier+reef%5C%22%22%7D%7D&ns0=1
[19:04:30] the next 2 get category highlights, back to title highlights, one section highlight, etc.
[19:05:25] I will try to synthesize a reply for Slack. It will not be particularly short, though.
[19:05:35] on my results `ibodiversity of new caledonia` gets a section mtach on `coral reef fish`
[19:05:54] thanks trey. I can also help out if you want (was going to do the synthesis myself if you didn't want to)
[19:05:56] * ebernhardson obviously can't type...
[19:06:40] the highlight has no effect on the relevancy/results position, right?
[19:06:51] mpham: right, the highlighting comes after result list is decided
[19:07:12] so there could be very weird cases in which lower results have more highlighting than higher results?
[19:07:57] mpham: i suppose yes, although i would consider that normal and not weird :)
[19:09:01] highlighting is a totally separate process that is tangential to relevance, it uses some of the same index statistics to determine which words are important but that's about it
[19:10:05] in terms of the short-circuiting, there is no reason we couldn't always check section titles and always show a match if one exists, it just doesn't today because that's how the current UI works
[19:10:22] mpham: i'll put it in a google doc and we can review and clarify there. (I think I might be being tricked into writing a blog post.)
[19:11:06] thanks. i was thinking a user might think it's weird that something that superficially looks like a better match because of highlighting is lower in the list, since they don't necessarily understand that highlighting and relevancy are distinct processes. But agreed it's not weird once I get that distinction
[19:11:35] Trey314159: sounds good. thanks
[19:15:23] it seems like the intent behind the current highlighting implementation is to attempt to justify the pre-existing ranking condition, so first off it goes looking in the title, if no justification in the title try the redirects, then categories, then section headings. While it might be nice if those could somehow know which field had the largest contribution to the ranking lucene/elastic
[19:15:24] don't share that kind of information between ranking and highlighting stages of execution
[19:22:28] ebernhardson: are you free around 2:30 PT to discuss some elasticsearch hw related stuff?
[19:31:31] ryankemper: can do
[19:32:54] excellent
[19:58:13] ebernhardson: I'm looking at T265056 with Mike. Do we really need those dumps?
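(Editor's sketch of the highlighting fallback described above at 18:53:51, 18:55:05 and 19:15:23. This is a toy model, not the CirrusSearch or Elasticsearch highlighter: the field labels, the `first_highlight` and `highlight` helpers, and the whole-word matching are all assumptions made for illustration. It only shows the short-circuit order, where each group stops at the first field that produces a match.)

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Field order as described in the chat. The labels are illustrative,
# not necessarily the real index field names.
TITLE_GROUP = ["title", "redirect", "category", "section_title"]
TEXT_GROUP = ["text", "aux_text", "file_text"]

Doc = Dict[str, List[str]]


def first_highlight(fields: Iterable[str], doc: Doc,
                    terms: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return (field, value) for the first field, in order, that matches a term.

    Matching is naive whole-word comparison (an assumption); the real
    highlighter works on analyzed text and index statistics.
    """
    wanted = {t.lower() for t in terms}
    for field in fields:
        for value in doc.get(field, []):
            if wanted & set(value.lower().split()):
                return field, value  # short-circuit: later fields are not tried
    return None


def highlight(doc: Doc, terms: Iterable[str]) -> Dict[str, Optional[Tuple[str, str]]]:
    """Run both groups independently, after ranking has already been decided."""
    terms = list(terms)
    return {
        "title_group": first_highlight(TITLE_GROUP, doc, terms),
        "text_group": first_highlight(TEXT_GROUP, doc, terms),
    }
```

(Under these assumptions, a page whose title matches never gets a section highlight, which is the behaviour noted at 19:03:59; a page like "Biodiversity of New Caledonia" only reaches the section title because title, redirect and category found nothing first.)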
[19:58:13] T265056: Make Cirrus Search dump script more resilient to failures (elasticsearch restarts) - https://phabricator.wikimedia.org/T265056
[19:58:51] I don't know if anyone downloads those dumps externally, but I don't think we've ever used them ourselves
[20:16:15] Turnilo, that's a new one on me. Looks cool
[20:17:46] gehel: i import them into hadoop and use them for various things, although not recently. Someone asked about them on our talk page yesterday: https://www.mediawiki.org/wiki/Topic:Wuuc4b74blrcc8h1
[20:18:10] gehel: conceptually, i think these are about 10x better than the xml dumps, just involving xml is a giant pain, but i'm probably biased :)
[20:18:39] he, he, he ...
[20:18:55] looking at turnilo, it seems that almost no one is downloading them
[20:19:03] I'll ask you more about that in our next 1:1
[20:19:09] would we know? We have the slowest download of our own dumps available on the net :P
[20:19:21] yeah, that might be right
[20:19:58] i mean, i do imagine they are rarely used. But i also think they are quite useful when they are, and much more accessible than other things that are available.
[20:20:03] sure, for 1:1 :)
[20:24:17] if we're talking about https://dumps.wikimedia.your.org/other/cirrussearch/current/ , I might be the one ;)
[20:24:37] I used one of those dumps to set up a test ES cluster
[20:25:14] oh right, use them occasionally to import a sampling of prod-like pages into a local index
[20:35:26] brb
[20:59:03] hmm, in apt we have elasticsearch-oss 7.10.0 in thirdparty/elastic710, but component/elastic710 has the wmf-plugins for 7.10.2
[20:59:20] at apt.wikimedia.org. I suppose we have to pull a newer elasticsearch-oss
[21:15:54] * ebernhardson waits to see how badly cindy fails on 7.10
[21:20:28] thanks for the code review g-ehel
[21:31:05] ebernhardson: brian and I are in https://meet.google.com/qdk-mwwv-aoo
[21:32:57] mpham & ebernhardson: I've sent you links to the draft write up on highlighting. It's 4 pages! Anyone else who wants to read the draft, let me know.
[21:33:34] ^^ mpham: I typed your name instead of tabbing it, so I'm not sure you will get a notification. Sorry for the spam.
[21:33:54] thanks trey! I got it
[21:58:50] ryankemper: totally sorry, here now :) one sec