[09:10:13] Errand, back in a few
[12:21:14] inflatador / ryankemper: I see that you've reverted https://gerrit.wikimedia.org/r/c/operations/puppet/+/950136, but I don't see any notes on T342361. Do you know what was wrong?
[12:21:14] T342361: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361
[12:23:27] inflatador / ryankemper: You've probably seen the email from _moritzm, looks like we did not actually reboot elastic and wdqs servers :/ T344587
[13:20:27] gehel working on the reboots and will get you the startup script error msg from wdqs1004 shortly
[13:26:48] meant to put that in the ticket, sorry
[13:27:05] inflatador: thanks !
[13:34:21] gehel NP, error is here: https://phabricator.wikimedia.org/T342361#9136740
[13:34:36] I'm looking for the number of full text searches per day. Looking at https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&refresh=1m&viewPanel=41&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1, it seems that we have ~100/s in codfw + ~300/s in eqiad
[13:35:51] (100+300)*60*60*24 = 34'560'000 full text searches per day. That seems more than I expected. Could someone check my math?
[13:42:04] Context: T341227, I'd like to know what percentage of search requests get duplicate results. If my numbers are correct, we're at ~0.002%. That seems really insignificant.
[13:42:04] T341227: Make local_sites_with_dupe filter configurable and count duplicates - https://phabricator.wikimedia.org/T341227
[13:42:18] At least not enough to spend any kind of time on this.
[13:43:10] Sorry, that was ~0.001%
[13:44:18] we get ~300 duplicates per day, so I was very much expecting a very small percentage. I just did not realize we have >30M queries per day.
[13:44:28] Large numbers never make much sense...
[14:14:15] Weekly update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2023-09-01
[14:20:45] is there a pointer to the source of the fulltext query metric informing the data? i was looking around the puppet and icinga log repos to try to quickly pinpoint the source of the fulltext query log metric, and then i codesearch'd, then i realized maybe someone has it handy or just knows!
[14:21:39] boo, cloudelastic chi is red
[14:23:31] it seems like, assuming 15b pageviews per month, that fulltext qps would be like 6% of pageviews or something like that. i was thinking, why might that be. it's not completely implausible that it's Special:Search, but then again, it feels like maybe it's some combination of UAs making fulltext calls - the apps do this, which is what makes their search so awesome, and i believe there are some other fulltext api callers..but maybe
[14:23:56] it isn't api callers at all, or it is this in concert with special:search callers or something
[14:27:17] looks like cloudelastic1003 is having problems...hard rebooting it now
[14:29:31] (btw, i was looking at cirrus_cluster_checks.pp in the icinga repo and prometheus-wmf-elasticsearch-exporter.py in the puppet repo and then tried a \.(php|py|pp)$ regex for search string of full_text in codesearch, but maybe need to be more patient and exploratory)
[14:36:38] OK, back to yellow. Kinda disturbing that losing a single host puts us in the red, but probably just some aliases
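(For reference on the red/yellow status above: red means at least one primary shard is unassigned, yellow means only replicas are missing. Below is a minimal sketch of how one might list the unassigned shards behind a red status; it assumes direct HTTP access to the cluster on a placeholder localhost:9200 endpoint and Python on the host, and it is just an illustration, not the team's cookbooks or alerting.)

    #!/usr/bin/env python3
    # Sketch: summarize why an Elasticsearch cluster reports red by printing
    # the overall status and any unassigned shards with their reasons.
    # 'localhost:9200' is a placeholder endpoint, not a real production host.
    import json
    import urllib.request

    BASE = 'http://localhost:9200'

    def get(path):
        with urllib.request.urlopen(BASE + path) as resp:
            return resp.read().decode('utf-8')

    health = json.loads(get('/_cluster/health'))
    print('status:', health['status'], '| unassigned shards:', health['unassigned_shards'])

    # The cat API can show which shards are unassigned and why.
    for line in get('/_cat/shards?h=index,shard,prirep,state,unassigned.reason').splitlines():
        if 'UNASSIGNED' in line:
            print(line)

(As noted further down, the actual alerting keys off the percentage of unallocated shards rather than the raw red/green cluster status.)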
[14:41:21] dr0ptp4kt: the graph I used as a basis for my calculation is https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&refresh=1m&viewPanel=41&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1, which shows the QPS as measured by CirrusSearch. So it should include all full text search queries, including those coming from bots, apps, etc...
[14:43:01] I think this is what I want in this context, the duplications are just as unexpected for API calls as they are for actual display
[14:44:24] The measure of duplications is from https://grafana-rw.wikimedia.org/d/hOONJnkIz/cirrus-search-deduplication?orgId=1 which is measured from https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/944269/10/includes/Search/FullTextResultsType.php#143
[14:44:40] As I understand it, this also covers all queries, including bots
[14:46:46] I don't think I ever thought about full text search in terms of % of page views (I really should have). But if it is 6%, even including API calls, that makes Search even more important than I thought!
[14:47:25] I should triple check that number and use it to advocate for the value of our team!
[14:47:55] Not too bad for a team of 6 engineers!
[14:52:10] inflatador: there are a few cases where a cluster might go red during normal operation. For example when creating an index, the cluster will be red for a short time. Probably not what happened here, but just to mention that elasticsearch reporting red isn't always unexpected.
[14:52:54] That's why we don't have an alert on the cluster being red, but only on a percentage of shards being unallocated.
[14:53:05] gehel Oh yeah, I got that, but it still makes my heart race a bit ;)
[15:00:27] looks like we have a few indices with 0 replicas on cloudelastic. That does not seem right!
[15:00:34] https://www.irccloud.com/pastebin/T7kirgWP/
[15:01:22] we also have one on the production cluster:
[15:01:27] https://www.irccloud.com/pastebin/JjP6T9SV/
[15:02:21] \o
[15:03:07] interesting. I don't have any alerts on that
[15:03:17] or if I do, they may have been eaten by a filter ;(
[15:04:30] regarding frequency of searches, while we get peaks of some 400-500/s, and 30M-ish per day, queries on the web are only ~100-120 iirc, the rest is api traffic of bots, apps, etc.
[15:04:43] that's for full-text; autocomplete is more
[15:05:25] I guess the solution for missing shards is to run the force-shard-allocation cookbook?
[15:05:43] although that says "all shards"?
[15:05:47] https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/elasticsearch/force-shard-allocation.py
[15:06:41] hmmm
[15:06:50] indices with 0 replicas, do they have aliases?
[15:06:59] indices with no replicas and no aliases are failed reindexes
[15:07:03] checking...that's the usual explanation
[15:07:30] inflatador: no, the shard allocation cookbook will try to reallocate shards that are not currently allocated to a server for various reasons.
[15:07:55] if the number of expected replicas is 0, then it will not try to allocate more shards
[15:08:55] it's gotta be reindex related, that's the only time we have 0 replicas. We index into a 0 replica index and then let it copy out when complete
[15:09:09] we need to check if those indices are supposed to exist (maybe there was a failed delete after a reindex). And then either delete those, or change the replica count.
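(A minimal sketch of that kind of check, listing indices that have zero configured replicas and no aliases, is below. It assumes direct HTTP access to the cluster on a placeholder localhost:9200 endpoint; it is an illustration of the idea, not the check_indices.py script mentioned later.)

    #!/usr/bin/env python3
    # Sketch: flag indices with 0 replicas and no aliases, the usual signature
    # of leftovers from a failed or interrupted reindex.
    # 'localhost:9200' is a placeholder endpoint, not a real production host.
    import json
    import urllib.request

    BASE = 'http://localhost:9200'

    def cat(path):
        with urllib.request.urlopen(BASE + path) as resp:
            return json.loads(resp.read().decode('utf-8'))

    # All indices with their configured replica counts.
    indices = cat('/_cat/indices?format=json&h=index,rep')
    # Indices that are the target of at least one alias.
    aliased = {row['index'] for row in cat('/_cat/aliases?format=json&h=alias,index')}

    for row in indices:
        if row.get('rep') == '0' and row['index'] not in aliased:
            print('suspect (0 replicas, no aliases):', row['index'])

(Anything it flags would still need a human to confirm before deletion, which matches the concern raised later about making an automated cleanup option safe.)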
[15:09:25] yeah, that's usually what makes a cluster go red
[15:09:54] at least when we do these maintenance operations
[15:10:12] zero replicas still means that there is a primary, so the cluster should not be red in that case
[15:10:40] but if we have failed reindex leftovers, those could be giving us false positives on alerts
[15:10:49] and eat some disk space for no reason
[15:12:05] dr0ptp4kt: i dunno if you ended up finding it, but the request metrics are logged in CirrusSearch\ElasticsearchIntermediary::finishRequest
[15:18:14] going to work out, but the first 3 shards I checked have no aliases and no replicas, so will be deleting them when I get back
[15:34:25] I'll start my weekend! Have fun!
[16:00:02] back
[16:00:17] Maybe we should add an alias check/delete to the rolling operation cookbooks
[16:03:53] there is a script that can look for these, just takes a couple minutes to run. sec
[16:06:29] from mwmaint1002 (or any mw maint host): python3 /srv/mediawiki/php/extensions/CirrusSearch/scripts/check_indices.py --run-cache-path /tmp/check_indices.cache
[16:06:51] The run cache path is about caching info from mediawiki, part of the reason it takes a while is it has to run a maint script for each of 1k wikis
[16:10:39] nice, is that part of the CirrusSearch extension repo?
[16:10:49] yes
[16:11:03] thanks ebernhardson! sorry i missed the unhangout y'all, i ended up drilling into some wmcs stuff with arturo
[16:16:00] i do wonder how we have these though, i thought we fixed up the reindexer to clean up...apparently doesn't fully work :P
[16:28:56] it happens pretty much every time we do a cluster restart or reboot...not a huge deal but I think it would save time to detect/delete first
[16:29:23] https://phabricator.wikimedia.org/T345449 created a ticket for it, verbiage might not be 100% correct, feel free to edit
[16:29:37] we could also ponder making that check_indices.py script have a cleanup option. It's super easy to add, the hard part is being confident it won't delete anything you still need :)
[16:30:10] But we could specifically detect 0 replicas, no aliases and allow deletes there for example
[16:34:41] If that's appropriate for anyone using the CirrusSearch extension, that works for me. I guess you could just make it an option even if that might not be desired behavior for everyone
[16:50:45] Thanks also gehel - I just realized I needed to scroll up more.
[16:54:34] OK, rebooting codfw now
[16:54:47] codfw elastic hosts, that is
[17:32:32] lunch, back in ~45
[18:19:36] trying to figure out why we can't publish the cirrussearch-streaming-updater images, but it doesn't make sense. I don't know where it comes from, but based on the output there must be a KOKKURI_JWT env var, and it seems like it has outdated credentials
[20:41:33] the answer was naming restrictions on the images: we can publish either `repos/search-platform/cirrus-streaming-updater` or `repos/search-platform/cirrus-streaming-updater/foo`, but not `repos/search-platform/cirrus-streaming-updater-foo`