[10:33:21] lunch
[12:31:17] Hi! The cloudelastic nodes have been triggering an alert about shards not being assigned, and it seems to be tagged for the wmcs team (my guess is the name of the host matches whatever regex is there), can someone take a look?
[12:31:27] https://alerts.wikimedia.org/?q=team%3Dwmcs&q=%40state%3Dactive&q=alertname%3DElasticSearch%20unassigned%20shard%20check%20-%209200
[12:57:02] dcaro ACL, looking now
[12:57:08] or should I say "ACK"
[12:57:43] xd, thanks!
[13:19:39] also, greetings all
[13:20:04] looks like we're hitting an upper limit on how many shards from the same index are allowed on a particular instance
[13:34:26] o/
[13:35:47] inflatador: if this is for the commons index I remember a message from Erik saying that index.routing.allocation.total_shards_per_node might have been kept during snapshot recovery and thus might have to be adjusted manually
[13:37:36] dcausse that sounds familiar. Currently it's at 8, do you know what it should be? Haven't checked total number of shards for that index yet
[13:37:47] ah, 32
[13:38:49] 6 nodes total, 32 shards, 1 replica/shard ... guess we need it to be 16 at the least?
[13:39:27] lemme check the config
[13:40:38] I think we don't constrain this on cloudelastic, finding a link to the source code
[13:41:20] the explain API says "too many shards [8] allocated to this node for index [commonswiki_file_1647921177], index setting [index.routing.allocation.total_shards_per_node=8]"
[13:41:42] yes it's not set for this index: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/InitialiseSettings.php#24926
[13:42:00] so updating this setting with null might drop the constraint
[13:42:39] Do we change that at cluster or index level?
[13:42:50] this is at the index level
[13:43:24] or -1 instead of null sorry: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/allocation-total-shards.html
[13:43:35] Ah OK
[13:43:55] well I don't know actually, the -1 is for the cluster level setting
[13:44:17] I don't see the string "total_shards_per_node" in puppet, do we just set it in the API?
[13:44:49] this is set from the mediawiki config because this is a mediawiki maintenance script that creates the indices
[13:45:07] (the gerrit link I pasted)
[13:45:33] InitialiseSettings.php is a giant config file for the wikis
[13:46:03] ah oops, somehow missed that link ;(
[13:46:07] np!
[13:47:05] weird that it doesn't have a setting, I guess the default is 8?
[13:47:30] no the 8 comes from the snapshot that was restored
[13:47:52] when taking a snapshot it must be storing the index settings in there and will restore them
[13:48:36] how do we get to 8 then? is it 'file' * 'general' from that settings file or something?
[13:49:36] oh indeed it should be 4 from the mediawiki-config, hm...
[13:49:51] * inflatador wonders why this wasn't a problem before the index was lost
[13:49:54] not sure, perhaps it was manually updated?
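A quick sanity check of the shard math discussed above (a hypothetical sketch; the node and shard counts are taken from the conversation, the helper name is made up): with 32 primary shards plus 1 replica each spread over 6 nodes, there are 64 shard copies, so even perfectly even spreading needs at least ceil(64/6) = 11 shards per node, and the restored total_shards_per_node of 8 can never place everything.

```python
import math

def min_total_shards_per_node(primaries: int, replicas: int, nodes: int) -> int:
    """Smallest index.routing.allocation.total_shards_per_node that still
    lets every shard copy be assigned, assuming copies spread evenly."""
    total_copies = primaries * (1 + replicas)
    return math.ceil(total_copies / nodes)

# Numbers from the cloudelastic commonswiki_file thread:
# 32 primaries, 1 replica each, 6 nodes -> 64 copies / 6 nodes -> 11
print(min_total_shards_per_node(32, 1, 6))  # -> 11, so 8 is too low
```

This also suggests the "16 at the least" guess above is safe but more than strictly needed; anything at or above 11 (or removing the constraint entirely) would let the cluster assign all copies.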
[13:50:22] I mean cloudelastic commonswiki_file was bumped from 4 to 8 after the recovery
[13:50:48] or the production settings are not in line with the mediawiki-config, looking
[13:51:02] np, need to step out for a quick errand, back in ~20m
[13:53:40] production has 4 on both eqiad & codfw so it must have been manually adjusted after the recovery
[13:54:27] we should drop this constraint on commonswiki_file@cloudelastic to be in line with the mw-config
[14:14:32] \o
[14:14:54] i was totally not thinking about replicas when setting that to 8 :( yea i suspect in the past it was null'd and we let the cluster do whatever works
[14:15:45] i gotta switch locations this morning, probably miss unmeeting and back shortly after
[14:19:24] back
[14:36:53] o/
[14:37:58] inflatador: do you want to take care of this total_shards_per_node setting on cloudelastic?
[14:38:38] dcausse was gonna run this API call on the index, but you said it's a cluster-wide setting? https://paste.opendev.org/show/815457/
[14:39:07] sorry I was not very clear, it should be the per-index setting
[14:39:54] inflatador: I think you can set "index.routing.allocation.total_shards_per_node": null instead of nesting all these json properties
[14:40:26] elastic should support both formats but the dot based format is more convenient I think
[14:41:37] dcausse ACK, will give it a try
[14:41:43] thanks!
[14:45:16] dcausse OK, it's set, I can see shards in INITIALIZING so I think it worked
[14:45:24] nice!
[14:57:43] OK, back to green for cloudelastic-chi
[15:14:24] \o/
[15:40:55] sorry, was distracted and forgot to join unmtg. Looks like no one's here anyway (which is fine).
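For reference, the per-index settings update described above (the dotted key set to null) might look roughly like this. This is a sketch, not the exact call that was run: the host/port are assumptions, and the index name is taken from the allocation-explain message earlier in the log. The snippet only builds and prints the request rather than sending it.

```python
import json

# Hypothetical sketch of the per-index settings update discussed above.
# Setting the dotted key to null removes the restored constraint.
# Index name is from the explain API output; host/port are assumptions.
index = "commonswiki_file_1647921177"
url = f"http://localhost:9200/{index}/_settings"
body = json.dumps({"index.routing.allocation.total_shards_per_node": None})

print("PUT", url)
print(body)
# To actually send it (needs the `requests` package and a reachable cluster):
#   import requests
#   requests.put(url, data=body, headers={"Content-Type": "application/json"})
```

As noted in the conversation, Elasticsearch accepts both the dotted form above and the equivalent nested JSON (`{"index": {"routing": {"allocation": {"total_shards_per_node": null}}}}`); the dotted form is just less to type.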
Dropping out for now
[15:59:39] workout, back in ~40
[16:30:26] back
[16:32:38] back
[16:33:45] ebernhardson and/or ryankemper LMK if you wanna start the rolling operation again, we can hop on a meet if that works for y'all
[16:36:02] inflatador: it's friday, let's hold off till mon
[16:36:42] ryankemper ACK, will hold off then
[16:37:30] inflatador: I am down to do an audit of the 10G vs 1G nics/ports and see what needs to be done to get true cluster-wide 10Gbps though
[16:37:58] ryankemper cool, give me 5-10m and I'll set up a meet
[16:42:02] i'm not in a great location for meet, but have fun :)
[16:55:18] no worries, ryankemper I'm up at https://meet.google.com/gtj-thiu-nqg whenever you're ready
[17:05:53] meh, kinda tedious... my mobile internet gives me ipv6 but can't seem to connect to anything. I disabled ipv6 in /proc/sys/net/ipv6/conf/all/disable_ipv6 and now irc/ssh works fine. But neither firefox nor chrome want to open any connections. Had to also override the DNS servers to use 1.1.1.1. I dunno how normal people get by :P
[17:13:42] * ebernhardson is also surprised it took 30+ minutes to realize the web browser didn't work, almost feels like a record :P
[18:38:34] forgot to hit "enter" on my "going to lunch" msg, but I'm back!
[19:38:27] * ebernhardson wonders if cindy could run a re-index after most of the test is complete and somehow verify that it works appropriately