[06:42:18] aarora: o/
[06:44:48] aarora: so you have questions about CirrusDumps? I might not be the best person to answer. dcausse might know more. Or ebernhardson (but he is in the US and will be around later)
[06:45:58] Thanks Guillaume! I will post my question here anyway!
[06:53:28] Hi,
[06:53:28] I am Akhil Arora, an external research collaborator of WMF Research, and we (Martin Gerlach and I) are currently trying to use morelike to assess the efficacy of a link-recommendation tool that we are building. To ensure a fair comparison, we would like to use a historical dump and not use the API as it will return the results from the indices in their current state.
[06:53:28] Thus, I am trying to load the cirrussearch indices into my own local installation of ElasticSearch, and I was following the steps described in this blog: https://www.elastic.co/blog/loading-wikipedia. Of course, this is for an older version, and I adapted the instructions for the recommended version, i.e., ElasticSearch 6.8.23. To set up the index, I fetch the settings using: 'https://{lang}wiki/w/api.php?action=cirrus-settings-dump&format=json'. However, when I pass this json to elasticsearch in order to initialize the index, I get errors about missing tokenizers or filters. A few examples below:
[06:53:28] * Custom Analyzer [text] failed to find filter under name [homoglyph_norm]
[06:53:28] * Custom Analyzer [text] failed to find tokenizer under name [hebrew]
[06:53:29] Indeed, there's no reference to these tokenizers/filters in the obtained json from the settings-dump API, and I am sure these are not standard/built-in tokenizers/filters. How do I get past these issues and obtain such dependencies/prerequisites?
[06:53:29] Sorry for the long post, however, it would be great if anyone in this channel can provide more insights into my queries.
[06:54:22] aarora: you probably need to install the same set of plugins that we use
[06:54:49] is there any documentation around this?
[06:54:58] those are available as a .deb package on our APT repository if you are using Debian (or a derivative)
[06:55:38] yes, I am using debian. You mean the apt-repository for elasticsearch?
[06:55:55] nope, the one from WMF
[06:56:19] https://wikitech.wikimedia.org/wiki/APT_repository
[06:57:45] the .deb packages should be there: https://apt.wikimedia.org/wikimedia/pool/component/elastic68/w/wmf-elasticsearch-search-plugins/
[06:58:55] I found somewhere a reference to an "extra" plugin (org.wikimedia.search:extra:6.8.23-wmf1), that I installed, but I think there are more dependencies on the tokenizers/filters
[06:59:36] I can't find any good documentation on which plugins we use. This is very internal and usually no one cares outside of our team.
[06:59:44] since the documentation is so scattered/sparse, it becomes difficult to identify these dependencies on one's own :)
[06:59:47] yeah, I understand.
[07:00:07] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/elasticsearch/plugins/+/refs/heads/master/debian/plugin_urls.lst
[07:00:42] That's the list we use to build our debian package. I don't think this counts as documentation, but at least you have the full list of plugins and where to download them
[07:00:56] sounds good, thanks!
[07:01:13] let me try this out, and get back if I still face any issues/errors. Thanks again!
[07:01:20] Good luck!
[09:30:18] a quick follow-up question. I was working with elasticsearch 6.8.23 as this was recommended with the mediawiki-1.3.8 and the cirrussearch extension. However, based on the aforementioned gerrit link, I noticed some plugins are compatible with a higher version of elasticsearch, specifically, the hebrew plugin (with 7.10). Do you recommend upgrading es to 7.10?
[09:31:18] Also, going back to my eventual goal, which is to query a historical cirrussearch index using the morelike api, what's your recommendation, as in, what's the best way to proceed?
[09:35:16] aarora: we're in the process of upgrading to 7.10 but the CirrusSearch master branch is not yet compatible with it
[09:36:16] so then, given the goal is to use morelike, should I stick with 6.8.23?
[09:37:18] if you want to use CirrusSearch yes, you can also ship more_like queries without CirrusSearch, these should be pretty simple to replicate
[09:37:47] e.g. https://en.wikipedia.org/w/index.php?search=morelike%3ATest&title=Special:Search&profile=advanced&fulltext=1&ns0=1&cirrusDumpQuery
[09:38:18] simply add '&cirrusDumpQuery' to any search results URL to see how we query elastic
[09:39:37] yes, I tried that..
[09:40:07] basically, your suggestion is that I construct my own ES query using what I get from the cirrusDumpQuery output?
[09:41:10] aarora: yes it's a possibility
[09:41:46] okay, I will try that..
[09:42:29] but using es 6.8.23 + CirrusSearch is doable but will require some hacks as you won't have the mysql db in sync with elastic
[09:44:11] ahan, I was under the impression that once I successfully install the CirrusSearch extension on a local machine, configure ES, it should work seamlessly
[09:44:53] sadly there are still some checks done against the mysql db (e.g. check that the displayed article exists)
[09:45:29] which can be disabled by some config hacks IIRC
[09:45:45] okay, what sort of hacks would be needed? Can you guide me or are these too specific: need to be taken on a case-by-case basis, once I start facing issues?
[09:47:56] sure! in a meeting atm but will get back to you in ~15mins
[09:48:39] sg, thanks!
[09:48:41] Regardless, if I want to stick to 6.8.23, what are the recommended plugin versions (along with links) for elasticsearch-learning-to-rank, analysis-hebrew?
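[editor's note] The suggestion above, building the query by hand from what cirrusDumpQuery shows, could be sketched roughly as follows. This is only a minimal illustration of Elasticsearch's more_like_this query body, not the query CirrusSearch actually sends: the function name, the index name "enwiki_content", the field "text", and every tuning value are assumptions, and the real values should be copied from the cirrusDumpQuery output.

```python
import json


def build_more_like_query(index, doc_ids, fields, size=20,
                          min_term_freq=2, max_query_terms=25):
    """Build an Elasticsearch more_like_this search body as a plain dict.

    All defaults are illustrative placeholders (assumptions), not the
    values CirrusSearch uses; take the real ones from cirrusDumpQuery.
    """
    if not doc_ids:
        raise ValueError("need at least one seed document id")
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": list(fields),
                # Seed documents are referenced by index name and id.
                "like": [{"_index": index, "_id": str(d)} for d in doc_ids],
                "min_term_freq": min_term_freq,
                "max_query_terms": max_query_terms,
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical index name, page id and field, for illustration only.
    body = build_more_like_query("enwiki_content", [12345], ["text"])
    print(json.dumps(body, indent=2))
```

The resulting JSON can be posted to the _search endpoint of a local index; keeping the builder free of network calls makes it easy to diff its output against the dumped query.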
[09:48:53] also, you can respond later about the above as well
[10:01:52] aarora: this should be the version of the above file for 6.8.23: https://github.com/wikimedia/operations-software-elasticsearch-plugins/blob/ae339937693e824145925604c973941045290387/debian/plugin_urls.lst
[10:02:00] lunch!
[10:02:38] aarora: the list of plugins for 6.8.23 is here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/elasticsearch/plugins/+/af639cd9206c1994d907f0b185b90533bdc4e797/debian/plugin_urls.lst (simply replace $ELASTICSEARCH_VERSION with 6.8.23), beware that the hebrew plugin is under AGPL (https://github.com/synhershko/elasticsearch-analysis-hebrew/)
[10:06:54] for making CirrusSearch work (esp morelike) with an empty mysql db you have to set wgCirrusSearchDevelOptions (https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/CirrusSearch/+/refs/heads/master/docs/settings.txt#1472)
[10:07:31] something along those lines: $wgCirrusSearchDevelOptions = [ "morelike_collect_titles_from_elastic" => true, "ignore_missing_rev" => true ];
[10:07:47] lunch
[10:08:48] https://bintray.com/synhershko/elasticsearch-analysis-hebrew/download_file?file_path=elasticsearch-analysis-hebrew-5.3.0.zip
[10:09:00] this link is broken on the https://github.com/synhershko/elasticsearch-analysis-hebrew/ git repo
[10:10:12] we made our own build here: https://people.wikimedia.org/~ejoseph/analysis-hebrew-6.8.23.zip
[10:10:28] yes, found it in the links shared by Guillaume. Thanks again!
[13:06:10] greetings
[13:25:42] o/
[14:12:52] \o
[14:16:03] o/
[14:35:56] dcausse I removed the kubernetes operation section from https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Flink_On_Kubernetes , what page did you say it was supposed to go on?
[14:38:33] inflatador: it should go under Runbooks at https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater I think
[14:39:37] Running to pharmacy, will be at weds meeting @ 30’ after
[14:40:44] ebernhardson: I must confess that I might need some help to understand your OAuth change :)
[14:40:57] hopefully we can use some of the wed meeting for this
[14:43:11] dcausse: sure!
[14:43:24] anyone know if the tech meeting is being recorded?
[14:44:45] I don't know was wondering the same
[14:55:00] i suppose i'll start in the tech meeting then, if it's being recorded will switch over to wed's meeting, otherwise joining wed meeting late
[15:01:22] slack response was that it will be recorded
[15:21:15] damn, lost track of time
[17:42:10] back
[18:04:53] lunch, back in ~30
[18:05:16] ryankemper if you have time, might look at https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/825874 . I fixed the linting errors and jenkins seems happy
[18:07:44] * ryankemper hates how the linter doesn't allow `##` as a comment
[18:08:06] Fixed one small thing
[18:43:41] back
[18:43:41] :eyes
[19:10:33] ryankemper I gave the ol' +1 to https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/825874 . After we merge, do we need to run puppet on cumin hosts?
[19:10:51] (just wondering how we get the new cookbooks)
[19:11:42] inflatador: yeah puppet runs just pull the latest master, so puppet just needs to be run on the relevant cumin host before we run the cookbook
[19:11:52] if you merge now it'll have automatically run puppet by the time we end up using it
[19:12:35] ACK, will do
[19:45:42] quick break, back in ~15-20
[20:06:59] and back
[20:25:50] ryankemper ebernhardson up at https://meet.google.com/tfe-wgyh-xqu if y'all wanna try the cookbook on cloudelastic
[20:31:48] sure, sec
[22:35:54] cloudelastic-chi back to green, all happy
[22:46:14] inflatador: We've got two patches that should avoid the condition we hit (starting the es 7 units, which allocates all the memory that java wants and leads to the OOMKills on cloudelastic since cloudelastic can't afford the double memory requirements). One to avoid starting the es 7 units until an arbitrary file is in place, and then the cookbook change to make it actually touch that file during the upgrade:
[22:46:15] https://gerrit.wikimedia.org/r/c/operations/puppet/+/826396 && https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/825874
[22:46:25] We can re-try the upgrade with those two changes merged tomorrow
[22:47:53] (To be explicit, the es 7 units can't actually get healthy until es 6 stops running because of the node locks, but the problem is that we're still starting the elasticsearch process in order for it to be able to realize that, so it's still creating the jvm process and allocating the big chunk of memory)
[23:16:39] Nice detective work! And I see we're back to green
[23:43:21] Haven't looked at the patches but we might want to use systemd Conflicts https://www.freedesktop.org/software/systemd/man/systemd.unit.html
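[editor's note] The marker-file idea described at 22:46:14 (do not start the new units until a file exists, and have the upgrade step create that file) can be illustrated with a small sketch. This is not the content of either patch, which the log does not show; the function names and the marker path are hypothetical and chosen only for the example.

```python
import tempfile
from pathlib import Path


def allow_start(marker: Path) -> None:
    """Create the marker file, as the upgrade step would (hypothetical helper)."""
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch(exist_ok=True)


def should_start(marker: Path) -> bool:
    """Gate: the new service may only be started once the marker exists."""
    return marker.is_file()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical marker location, used for illustration only.
        marker = Path(tmp) / "elasticsearch" / "es7-upgrade-ready"
        print("before upgrade step:", should_start(marker))
        allow_start(marker)
        print("after upgrade step:", should_start(marker))
```

The point of the gate is that the check is cheap and happens before any JVM is launched, so the second process never allocates memory while the old one is still running.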