[07:18:43] I think replication from os -> es can happen successfully if you don't have any writes on os, such that no new segment with the new lucene versions got written down
[07:41:57] https://people.wikimedia.org/~jiawang/web/empty_recommendation_analysis_report.html
[08:17:13] Do we have an estimate for the number of searches per day that we serve? I don't think that https://superset.wikimedia.org/superset/dashboard/search/ has that.
[08:18:15] I should be able to use https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1&viewPanel=34 as a basis
[08:18:46] but that probably does not take caching into account
[08:22:40] 40M req/day? that seems high...
[08:25:33] cached at which layer? :D
[08:26:53] varnish / ATS.
[08:28:18] gehel: looking, but Special:Search is the second most viewed page on the wikis (after the Main Page)
[08:28:20] so I don't know all the ways I can hit search, but looking at the two most obvious ones (title=Special:Search and /w/rest.php/v1/search/title) it looks like it's close to that 40M req/day you mentioned
[08:28:37] https://w.wiki/DpB$ + https://w.wiki/DpC2
[08:29:57] close enough!
[08:33:12] /w/rest.php/v1/search/title should be autocomplete
[08:33:52] yes, aren't those counted as searches :)
[08:34:41] they should be different from the graph that Guillaume pointed out initially
[08:36:06] but I suspect the numbers we get for "fulltext" searches in Grafana come mostly from searches via the api (list=search or generator=search)
[08:43:31] I wonder how turnilo handles high cardinality on "Uri query", esp. when given a regex
[08:44:07] yes, depends what you want to measure.
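The back-of-the-envelope conversion behind that "40M req/day" sanity check can be sketched as follows; the 463 req/s input is illustrative, not a measured value from the dashboard:

```python
# Sanity check on the "40M req/day" figure: a sustained request rate in
# requests/second (as read off a Grafana panel) converts to a daily volume
# by multiplying by the number of seconds in a day.
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def daily_volume(req_per_sec):
    """Convert a sustained request rate to requests per day."""
    return req_per_sec * SECONDS_PER_DAY


# ~463 req/s sustained works out to roughly 40M requests/day.
print(round(daily_volume(463)))  # 40003200
```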
I used contains, not regex, but AFAIK it should work ™ (famous last words)
[08:44:39] ok :)
[08:53:53] ryankemper: I merged https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/1133918 which should now simplify wdqs/wcqs deploys, no need to manually restart any other services like the updater or the categories blazegraph instance
[10:20:03] lunch
[13:14:40] o/
[13:24:36] o/
[13:47:29] I'm sure I've complained about this already, but 7d for an airflow sensor to time out is quite long, esp. when you know the data it's waiting for is never going to arrive...
[13:52:59] ryankemper I assigned T389859 to you, let me know if you're not able to finish that one off
[13:52:59] T389859: WDQS: Alert on high thread count or no lag metrics reported - https://phabricator.wikimedia.org/T389859
[13:53:30] heading out, back later tonight
[13:53:40] .o/
[13:55:47] \o
[14:02:58] the recommendations report is nifty. They also included a sankey diagram, makes me feel lazy for not figuring out how to use them :P
[15:12:45] also I didn't realize today's the day that the graph split is finalized. Congrats!
[15:58:15] ryankemper heading to workout for the next 40m. cirrussearch2059 is in flight along with elastic2098 and elastic2102
[16:19:59] Stepping out to take the dog out. Checked in on cirrussearch2059, it's waiting for reboot right now so no action required for the time being
[16:41:45] ryankemper ACK, back now
[17:50:34] still waiting for ES/OS to give up on INITIALIZING shards and change them to UNASSIGNED so the reimages can continue.
I've been running `_cluster/reroute?retry_failed=true` after every reimage, but it's kind of a double-edged sword b/c AFAIK it resets the clock on the scheduling timeout
[17:51:59] meh... comparing diffs of exact execution arguments in airflow fixtures is a bit tedious... tried spending some time to see if I could pretty-print the command line arguments with one per line... but the yaml library really doesn't want you to customize that effectively :S
[17:55:14] damn, that sucks
[17:55:22] I guess you can't convert it to json?
[17:55:52] wouldn't really help, the problem is a spark execution has like 20 arguments, and if one in the middle changes then the rest of the lines shift around
[17:56:05] actually json would be worse, because you can't have newlines in json
[17:57:30] the thing is the source commandline doesn't have newlines either, it's just one 400+ char line. I was hoping to convince the yaml output to reformat it to have `--foo bar` on a line, etc.
[17:57:39] of course, that's all convention, there is no rule about how command line arguments work
[18:05:53] "one 400+ char line" sounds like a nightmare ;(
[18:08:16] that's why it's generated by automation and not written by hand :P
[18:15:23] unsure if applicable or related to the problem you have, but I remember research presenting something about how they handle script args (https://gitlab.wikimedia.org/repos/research/research-datasets)
[18:15:28] ACK.
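One stdlib-only way to get the "one `--foo bar` per line" behavior described above, sidestepping the yaml representer fight entirely: normalize the flat command line before diffing, so a change to one argument in the middle no longer shifts every line after it. This is a minimal sketch, not the actual fixture code; the spark-submit strings are made up.

```python
import difflib
import re


def one_arg_per_line(cmdline):
    """Break a flat spark-submit style command line at each `--flag` boundary."""
    return re.sub(r" (?=--)", "\n", cmdline)


old = "spark-submit --master yarn --num-executors 8 --conf spark.x=1 job.py"
new = "spark-submit --master yarn --num-executors 16 --conf spark.x=1 job.py"

# With one argument per line, the unified diff pinpoints the single change.
diff = difflib.unified_diff(
    one_arg_per_line(old).splitlines(),
    one_arg_per_line(new).splitlines(),
    lineterm="",
)
print("\n".join(diff))
```

The same normalized string could also be fed to yaml as a multi-line scalar, which most dumpers will then emit in block (`|`) style without any representer customization.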
Lunch time, back in ~40m
[18:16:17] ryankemper I tried forcing a reroute, it helped but we still have a few it won't reassign, we'll probably have to check the shard status and do manual reimages for a couple hosts
[18:18:36] we should add a flag that makes the cookbook output the rename/reimage commands for the whole batch
[18:18:39] and wait for confirmation
[18:18:43] that way the operator can just run 3 in parallel
[19:04:27] back
[19:04:43] might have found a way... hope it was worth several hours :P
[19:04:53] I suspect this will help many more people than just me though
[19:05:10] {◕ ◡ ◕}
[19:07:19] ryankemper the cookbook currently doesn't try to batch until it's happy, so we might want to just manually create batches
[19:08:29] that's what I'm saying, basically we change this loop to just print the commands instead of actually running them https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/elasticsearch/rolling-operation.py#289
[19:08:33] the good news is, it gave up on those shards while I was at lunch. Starting a new batch of elastic2082, 2103, 2112
[19:26:10] inflatador: btw any reason the forced puppet run after reimage was removed in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1135826 ?
[19:29:49] ryankemper yes, it's in the comment but not in the first line ;( . The method we were calling used the old name, and since we have to run puppet immediately after reimage anyway, I just took it out
[19:31:13] ah I see
[19:31:52] it's not a big deal either way, but IIRC we put that in because historically we've seen that our elasticsearch puppet dependency logic isn't quite how it should be, so freshly reimaged hosts can take two puppet runs to be in the right state
[19:32:04] i.e. the first run does like half the stuff but some stuff fails the first time
[19:39:40] ryankemper yeah, that's a concern as well.
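For reference, the forced reroute being run after each reimage hits a real Elasticsearch/OpenSearch API: `POST /_cluster/reroute?retry_failed=true` asks the cluster to retry allocating shards that previously failed too many times. A minimal sketch, assuming a hypothetical endpoint at localhost:9200:

```python
import json
import urllib.request


def reroute_url(base_url):
    """Build the forced-retry reroute endpoint URL."""
    return f"{base_url}/_cluster/reroute?retry_failed=true"


def retry_failed_allocations(base_url="http://localhost:9200"):
    """POST to the reroute endpoint; returns the cluster's JSON response.

    Caveat from the discussion above: forcing this after every reimage may
    reset the allocation retry clock, so it's a double-edged sword.
    """
    req = urllib.request.Request(reroute_url(base_url), method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


print(reroute_url("http://localhost:9200"))
```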
Since the ferm stuff seems to fail already, I've been running puppet + the ferm start command after every reimage. Notes are at https://docs.google.com/document/d/1S4p03N_kJAF-tr4qDWi23ZG3LKiSDFLcMSogF02L9L8/edit?pli=1&tab=t.su4z8ieihrn3 if you wanna look over/change anything
[19:43:12] sounds like it's time to implement https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/elasticsearch/rolling-operation.py#320 then
[19:43:31] I'm working on a small refactor patch, then I can re-add the puppet run and also add the ferm command
[19:45:14] ryankemper ACK, ping me when you need a review
[19:49:04] ryankemper if you're adding the ferm command, make sure it hits all the hosts, not just the one we're reimaging
[19:49:46] ofc
[19:50:12] inflatador: first things first, here's the patch to refactor cookbook invocations https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1136796
[19:50:41] once we verify it works the same way, we can trivially add a dry_run variable that just makes that method print out the command rather than actually run it (and add wait_for_confirmation between batches)
[19:51:02] once that's done we add the puppet run back and the ferm invocation and we should be in a good spot
[19:52:47] dcausse: oh I just saw last night's ping about https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/1133918. that is an awesome feature, will save us lots of headaches in future deploys :)
[19:56:33] ryankemper can you put more details in the commit message for ^^? I guess I don't understand what's changing
[20:00:41] done
[20:05:05] cool, conditional +1 added
[20:05:20] inflatador: I fixed up https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136386, does it look good to merge?
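The dry_run idea described above can be sketched roughly as follows. This is purely illustrative: `run_batch` and its parameters are hypothetical names, not the real spicerack/cookbook API.

```python
import subprocess


def run_batch(commands, dry_run=False, confirm=input):
    """Execute a batch of shell commands, or in dry_run mode just print them.

    In dry_run mode the whole batch is printed so the operator can copy the
    commands and run several in parallel by hand, then we block on
    confirmation before the cookbook moves to the next batch.
    Returns the list of commands that were printed (empty when not dry_run).
    """
    printed = []
    for cmd in commands:
        if dry_run:
            print(cmd)
            printed.append(cmd)
        else:
            subprocess.run(cmd, shell=True, check=True)
    if dry_run and printed:
        confirm("Run the commands above in parallel, then press enter: ")
    return printed
```

Usage in dry_run mode, with confirmation stubbed out: `run_batch(["echo reimage hostA", "echo reimage hostB"], dry_run=True, confirm=lambda _: "")` prints both commands and returns them without executing anything.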
[20:12:22] ryankemper I think so, double checking https://www.elastic.co/docs/deploy-manage/maintenance/add-and-remove-elasticsearch-nodes and https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Replacing_master-eligibles
[20:13:11] since we're just adding, not replacing, it will be fairly simple
[20:13:40] yup, looks good, let me merge now
[20:14:26] thanks for fixing the master info BTW
[20:14:27] we'll likely need to manually restart the existing masters so they accept the new guy
[20:14:31] one at a time ofc
[20:15:11] Elastic docs say `If you wish to add some nodes to your cluster, simply configure the new nodes to find the existing cluster and start them up. Elasticsearch adds the new nodes to the voting configuration if it is appropriate to do so.`
[20:15:25] "if it is appropriate to do so"?
[20:15:57] appropriate to do so means that elasticsearch makes the decision by itself, without the operator needing to tell it
[20:16:03] unfortunately these docs are written for the latest elasticsearch
[20:16:15] I'd bet they don't apply to 7.10.2
[20:17:01] the main uncertainty I have is whether the worker nodes need to be restarted for them to accept the new master as well. but my guess is that worst case scenario they won't route requests to the new master, but it won't necessarily cause issues
[20:17:33] yeah, it looks like they deliberately removed docs for the older versions and are pushing you to https://www.elastic.co/guide/index.html
[20:17:36] inflatador: oh, if cirrussearch2115 has any sole primaries on it, we won't be able to restart it regardless
[20:20:53] let's see what the opensearch docs say, if anything.
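The "sole primaries" check mentioned above can be done by parsing `_cat/shards` output: a node holds a sole primary when a shard's primary lives there and no STARTED replica exists elsewhere. A hedged sketch, assuming the default column layout (`index shard prirep state ... node`); the sample data in the test is made up:

```python
def sole_primaries(cat_shards, node):
    """Return (index, shard) pairs whose only STARTED copy lives on `node`.

    Restarting that node would take those shards fully offline, so a rolling
    restart should wait until they have a started replica somewhere else.
    """
    started = {}     # (index, shard) -> set of nodes holding a STARTED copy
    primary_on = {}  # (index, shard) -> node holding the primary
    for line in cat_shards.strip().splitlines():
        fields = line.split()
        index, shard, prirep, state = fields[0], fields[1], fields[2], fields[3]
        if state == "STARTED":
            started.setdefault((index, shard), set()).add(fields[-1])
        if prirep == "p":
            primary_on[(index, shard)] = fields[-1]
    return [
        key for key, host in primary_on.items()
        if host == node and started.get(key, set()) <= {node}
    ]
```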
based on the ES docs it sounds like the master will join OK, but we probably want to wait until the close of the migration, or at least green status
[20:22:13] https://docs.opensearch.org/docs/1.3/install-and-configure/upgrade-opensearch/rolling-upgrade/
[20:23:52] for now we can just keep forging ahead with the upgrade and not do any manual restarts. we're almost at the point of a third row being done, at which point we'll be back to being able to do rolling restarts without worrying about yellow status
[20:24:25] agreed. And it's too bad about the Elastic docs, I guess we're going to have to start hitting archive.org
[20:24:46] or maybe just using the OpenSearch docs
[20:25:46] FWIW I don't remember having any problems running mixed masters in Relforge or Cloudelastic
[20:27:14] The new Elastic docs have a section called "Anatomy of an analyzer"... I've seen it done better. 🤣
[20:27:23] lol
[20:27:27] Trey314159: :P
[20:28:07] inflatador: yeah afaik mixed masters shouldn't be a problem. the only theoretical concern I have (not pertaining to mixed masters, but pertaining to this scenario of adding cirrus2115 as another unicast host) is if a worker node gets restarted and sees the updated `unicast_hosts`, but the old masters and/or new master haven't been restarted yet, it might choose `2115` to talk to as its master
[20:28:50] but I forget if data nodes even talk to a random master from the unicast host list, or if they only talk to the one that's actually been elected (i.e. has `dimr *` in its `_cat/nodes` output)
[20:29:39] tldr we can just forge ahead and stuff should be fine. we're actually in the middle of the third row right now, so I think the cluster should be able to get fully back to green between batches FWIW
[20:32:32] yup, agreed
[20:38:48] I'm not sure how much more we want to overload this poor cookbook ;), but maybe we should add a forced reroute command at the end too
[20:42:58] cirrussearch2082 is having trouble joining psi, could be that master patch.
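The `dimr *` check mentioned above refers to `_cat/nodes` output, where (with the default 7.x columns) the second-to-last field is the master marker: `*` for the elected master, `-` for everyone else, with the node name last. A small sketch for spotting the elected master; the sample data is made up and the column layout is an assumption about the default format:

```python
def elected_master(cat_nodes):
    """Return the name of the elected master from `_cat/nodes` output, or None.

    Assumes default columns: ip heap ram cpu load_1m load_5m load_15m
    node.role master name -- so fields[-2] is the master marker.
    """
    for line in cat_nodes.strip().splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[-2] == "*":
            return fields[-1]
    return None


sample = """\
10.0.0.1 45 90 3 0.5 0.6 0.7 dimr - cirrussearch2081
10.0.0.2 50 88 4 0.4 0.5 0.6 dimr * cirrussearch2082
10.0.0.3 40 85 2 0.3 0.4 0.5 dir  - cirrussearch2083
"""
print(elected_master(sample))  # cirrussearch2082
```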
Checking...
[20:46:45] confirmed. If I manually remove `cirrussearch2115.codfw.wmnet` from its psi config, it can join
[20:50:12] so I think the safest thing to do is roll back, but we could also try restarting omega in place on 2115 and seeing if it is recognized as a master
[20:50:46] sorry, I meant psi. Psi is green now, so I'm game for giving it a shot
[20:55:55] inflatador: let's discuss in the pairing meeting. joining in 1 min
[21:00:00] brt