[09:10:13] Errand, back in a few
[12:21:14] inflatador / ryankemper: I see that you've reverted https://gerrit.wikimedia.org/r/c/operations/puppet/+/950136, but I don't see any notes on T342361. Do you know what was wrong?
[12:21:14] T342361: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361
[12:23:27] inflatador / ryankemper: You've probably seen the email from _moritzm, looks like we did not actually reboot elastic and wdqs servers :/ T344587
[13:20:27] gehel working on the reboots and will get you the startup script error msg from wdqs1004 shortly
[13:26:48] meant to put that in the ticket, sorry
[13:27:05] inflatador: thanks !
[13:34:21] gehel NP, error is here: https://phabricator.wikimedia.org/T342361#9136740
[13:34:36] I'm looking for the number of full text searches per day. Looking at https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&refresh=1m&viewPanel=41&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1, it seems that we have ~100/s in codfw + ~300/s in eqiad
[13:35:51] (100+300)*60*60*24 = 34'560'000 full text searches per day. That seems more than I expected. Could someone check my math?
[13:42:04] Context: T341227, I'd like to know what percentage of search requests get duplicate results. If my numbers are correct, we're at ~0.002%. That seems really insignificant.
[13:42:04] T341227: Make local_sites_with_dupe filter configurable and count duplicates - https://phabricator.wikimedia.org/T341227
[13:42:18] At least not enough to spend any kind of time on this.
[13:43:10] Sorry, that was ~0.001%
[13:44:18] we get ~300 duplicates per day, so I was very much expecting a very small percentage. I just did not realize we have >30M queries per day.
[13:44:28] Large numbers never make much sense...
[14:14:15] Weekly update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2023-09-01
[14:20:45] is there a pointer to the source of the fulltext query metric informing the data? i was looking around the puppet and icinga log repos to try to quickly pinpoint the source of the fulltext query log metric, and then i codesearch'd, then i realized maybe someone has it handy or just knows!
[14:21:39] boo, cloudelastic chi is red
[14:23:31] it seems like, assuming 15b pageviews per month, that fulltext qps would be like 6% of pageviews or something like that. i was thinking, why might that be. it's not completely implausible that it's Special:Search, but then again, it feels like maybe it's some combination of UAs making fulltext calls - the apps do this, which is what makes their search so awesome, and i believe there are some other fulltext api callers..but maybe
[14:23:56] it isn't api callers at all, or it is this in concert with special:search callers or something
[14:27:17] looks like cloudelastic1003 is having problems...hard rebooting it now
[14:29:31] (btw, i was looking at cirrus_cluster_checks.pp in the icinga repo and prometheus-wmf-elasticsearch-exporter.py in the puppet repo and then tried a \.(php|py|pp)$ regex for search string of full_text in codesearch, but maybe need to be more patient and exploratory)
[14:36:38] OK, back to yellow. Kinda disturbing that losing a single host puts us in the red, but probably just some aliases
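(For reference on the red/yellow status above: red means at least one primary shard is unassigned, yellow means only replicas are missing. Below is a minimal sketch of how one might list the unassigned shards behind a red status; it assumes direct HTTP access to the cluster on a placeholder localhost:9200 endpoint and Python on the host, and it is just an illustration, not the team's cookbooks or alerting.)

    #!/usr/bin/env python3
    # Sketch: summarize why an Elasticsearch cluster reports red by printing
    # the overall status and any unassigned shards with their reasons.
    # 'localhost:9200' is a placeholder endpoint, not a real production host.
    import json
    import urllib.request

    BASE = 'http://localhost:9200'

    def get(path):
        with urllib.request.urlopen(BASE + path) as resp:
            return resp.read().decode('utf-8')

    health = json.loads(get('/_cluster/health'))
    print('status:', health['status'], '| unassigned shards:', health['unassigned_shards'])

    # The cat API can show which shards are unassigned and why.
    for line in get('/_cat/shards?h=index,shard,prirep,state,unassigned.reason').splitlines():
        if 'UNASSIGNED' in line:
            print(line)

(As noted further down, the actual alerting keys off the percentage of unallocated shards rather than the raw red/green cluster status.)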
[14:41:21] dr0ptp4kt: the graph I used as a basis for my calculation is https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&refresh=1m&viewPanel=41&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1, which shows the QPS as measured by CirrusSearch. So it should include all full text search queries, including those coming from bots, apps, etc...
[14:43:01] I think this is what I want in this context, the duplications are just as unexpected for API calls as they are for actual display
[14:44:24] The measure of duplications is from https://grafana-rw.wikimedia.org/d/hOONJnkIz/cirrus-search-deduplication?orgId=1 which is measured from https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/944269/10/includes/Search/FullTextResultsType.php#143
[14:44:40] As I understand it, this also covers all queries, including bots
[14:46:46] I don't think I ever thought about full text search in terms of % of page views (I really should have). But if it is 6%, even including API calls, that makes Search even more important than I thought!
[14:47:25] I should triple check that number and use it to advocate for the value of our team!
[14:47:55] Not too bad for a team of 6 engineers!
[14:52:10] inflatador: there are a few cases where a cluster might go red during normal operation. For example when creating an index, the cluster will be red for a short time. Probably not what happened here, but just to mention that elasticsearch reporting red isn't always unexpected.
[14:52:54] That's why we don't have an alert on the cluster being red, but only on a percentage of shards being unallocated.
[14:53:05] gehel Oh yeah, I got that, but it still makes my heart race a bit ;)
[15:00:27] looks like we have a few indices with 0 replicas on cloudelastic. That does not seem right!
[15:00:34] https://www.irccloud.com/pastebin/T7kirgWP/
[15:01:22] we also have one on the production cluster:
[15:01:27] https://www.irccloud.com/pastebin/JjP6T9SV/
[15:02:21] \o
[15:03:07] interesting. I don't have any alerts on that
[15:03:17] or if I do, they may have been eaten by a filter ;(
[15:04:30] regarding frequency of searches, while we get peaks of some 400-500/s, and 30M-ish per day, queries on the web are only ~100-120 iirc, the rest is api traffic of bots, apps, etc.
[15:04:43] that's for full-text; autocomplete is more
[15:05:25] I guess the solution for missing shards is to run the force-shard-allocation cookbook?
[15:05:43] although that says "all shards"?
[15:05:47] https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/elasticsearch/force-shard-allocation.py
[15:06:41] hmmm
[15:06:50] indices with 0 replicas, do they have aliases?
[15:06:59] indices with no replicas and no aliases are failed reindexes
[15:07:03] checking...that's the usual explanation
[15:07:30] inflatador: no, the shard allocation cookbook will try to reallocate shards that are not currently allocated to a server for various reasons.
[15:07:55] if the number of expected replicas is 0, then it will not try to allocate more shards
[15:08:55] it's gotta be reindex related, that's the only time we have 0 replicas. We index into a 0 replica index and then let it copy out when complete
[15:09:09] we need to check if those indices are supposed to exist (maybe there was a failed delete after a reindex). And then either delete those, or change the replica count.
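(A minimal sketch of that kind of check, listing indices that have zero configured replicas and no aliases, is below. It assumes direct HTTP access to the cluster on a placeholder localhost:9200 endpoint; it is an illustration of the idea, not the check_indices.py script mentioned later.)

    #!/usr/bin/env python3
    # Sketch: flag indices with 0 replicas and no aliases, the usual signature
    # of leftovers from a failed or interrupted reindex.
    # 'localhost:9200' is a placeholder endpoint, not a real production host.
    import json
    import urllib.request

    BASE = 'http://localhost:9200'

    def cat(path):
        with urllib.request.urlopen(BASE + path) as resp:
            return json.loads(resp.read().decode('utf-8'))

    # All indices with their configured replica counts.
    indices = cat('/_cat/indices?format=json&h=index,rep')
    # Indices that are the target of at least one alias.
    aliased = {row['index'] for row in cat('/_cat/aliases?format=json&h=alias,index')}

    for row in indices:
        if row.get('rep') == '0' and row['index'] not in aliased:
            print('suspect (0 replicas, no aliases):', row['index'])

(Anything it flags would still need a human to confirm before deletion, which matches the concern raised later about making an automated cleanup option safe.)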
[15:09:25] yeah, that's usually what makes a cluster go red
[15:09:54] at least when we do these maintenance operations
[15:10:12] zero replicas still means that there is a primary, so the cluster should not be red in that case
[15:10:40] but if we have failed reindex leftovers, those could be giving us false positives on alerts
[15:10:49] and eat some disk space for no reason
[15:12:05] dr0ptp4kt: i dunno if you ended up finding it, but the request metrics are logged in CirrusSearch\ElasticsearchIntermediary::finishRequest
[15:18:14] going to work out, but the first 3 shards I checked have no aliases and no replicas, so will be deleting them when I get back
[15:34:25] I'll start my weekend! Have fun!
[16:00:02] back
[16:00:17] Maybe we should add an alias check/delete to the rolling operation cookbooks
[16:03:53] there is a script that can look for these, just takes a couple minutes to run. sec
[16:06:29] from mwmaint1002 (or any mw maint host): python3 /srv/mediawiki/php/extensions/CirrusSearch/scripts/check_indices.py --run-cache-path /tmp/check_indices.cache
[16:06:51] The run cache path is about caching info from mediawiki, part of the reason it takes a while is it has to run a maint script for each of 1k wikis
[16:10:39] nice, is that part of the CirrusSearch extension repo?
[16:10:49] yes
[16:11:03] thanks ebernhardson! sorry i missed the unhangout y'all, i ended up drilling into some wmcs stuff with arturo
[16:16:00] i do wonder how we have these though, i thought we fixed up the reindexer to clean up...apparently doesn't fully work :P
[16:28:56] it happens pretty much every time we do a cluster restart or reboot...not a huge deal but I think it would save time to detect/delete first
[16:29:23] https://phabricator.wikimedia.org/T345449 created a ticket for it, verbiage might not be 100% correct, feel free to edit
[16:29:37] we could also ponder making that check_indices.py script have a cleanup option. It's super easy to add, the hard part is being confident it won't delete anything you still need :)
[16:30:10] But we could specifically detect 0 replicas, no aliases and allow deletes there for example
[16:34:41] If that's appropriate for anyone using the CirrusSearch extension, that works for me. I guess you could just make it an option even if that might not be desired behavior for everyone
[16:50:45] Thanks also gehel - I just realized I needed to scroll up more.
[16:54:34] OK, rebooting codfw now
[16:54:47] codfw elastic hosts, that is
[17:32:32] lunch, back in ~45
[18:19:36] trying to figure out why we can't publish the cirrussearch-streaming-updater images, but it doesn't make sense. I don't know where it comes from, but based on the output there must be a KOKKURI_JWT env var, and it seems like it has outdated credentials
[20:41:33] the answer was naming restrictions on the images: we can publish either `repos/search-platform/cirrus-streaming-updater` or `repos/search-platform/cirrus-streaming-updater/foo`, but not `repos/search-platform/cirrus-streaming-updater-foo`