[08:49:20] dcausse / inflatador: do we have a ticket tracking wdqs1013 needing a data transfer? I can't seem to find it...
[08:49:50] I see T360993 about the parallel issue with max lag propagation
[08:49:50] T360993: WDQS lag propagation to wikidata not working as intended - https://phabricator.wikimedia.org/T360993
[08:50:07] gehel: not really, ^ is more about the root cause
[08:50:52] I have a pairing session with Brian this afternoon, we'll make one
[08:51:12] thanks!
[11:11:22] lunch
[13:15:49] o/
[13:21:31] o/
[13:22:39] dcausse any objections to me starting the data transfer now?
[13:23:17] inflatador: yes, we can't until T360993 is fixed sadly
[13:23:18] T360993: WDQS lag propagation to wikidata not working as intended - https://phabricator.wikimedia.org/T360993
[13:23:54] once the transfer is done the catchup phase will trigger the maxlag protection and bots will be stopped
[13:24:35] dcausse ah, so we need it to stay broken until we apply your puppet CRs?
[13:25:05] or we temporarily disable the lag propagation (a systemd timer on mwmaint) during the transfer
[13:25:32] It's no problem, we can leave it broken
[15:56:48] workout, back in ~40
[17:41:53] sorry, been back awhile, but going to lunch now
[17:49:00] dinner
[18:17:50] back
[21:12:28] ebernhardson this CR's showing a post-merge build failure... do we need to care about that? https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/1014566 . We were just about to deploy WDQS
[21:14:03] inflatador: looking
[21:15:17] inflatador: you can ignore the site publish, that artifact shouldn't prevent your deploy from working. It does suggest something is wrong with the pom though, it's failing in check-forbidden-apis, and that should have been enforced pre-merge
[21:15:39] ebernhardson ACK, thanks for checking
[22:30:46] codfw omega cluster is not happy, master not discovered
[22:30:53] :S
[22:35:42] looks like it started about 21:59
[22:35:46] probably decommed a master prematurely. static config is correct, so the plan is to restart hosts one-at-a-time https://etherpad.wikimedia.org/p/omega-restarts ebernhardson if you have suggestions let us know
[22:37:13] no clue right now, can't say i've seen it fail that way before
[22:43:21] I think we didn't roll-restart the cluster after we decommed some masters, so the running config is off... verifying the restart date now
[22:45:01] i suspect also that puppet hadn't run across the fleet?
[22:45:31] From logstash i suspect 2052 was the master previously, but now isn't in the unicast list
[22:45:48] i'm still not sure why it would then fail master election though :S
[22:46:13] the unicast hosts list is a bootstrap, not a limit. I would have thought it would find the cluster anyways as long as it bootstrapped from nodes in the cluster
[22:46:18] elastic2052 is still up too
[22:47:41] I think Puppet must not have been run before the roll restart then
[22:54:32] hmm, my cumin history looks like I actually did run it right before the roll restart
[22:55:41] hmm, `node.master: true` is set in 2052:/etc/elasticsearch/production-search-omega-codfw/elasticsearch.yml, but i don't see it set in hiera. curious. That shouldn't cause the whole problem anyways, just curious
[22:58:01] maybe it wasn't puppet-merged when we ran puppet? Still confused
[23:00:57] We've restarted all the masters. Working on restarting all the data nodes, have most done but not all
[23:01:03] It still hasn't fixed things, very perplexing
[23:01:19] i suspect what happened is we set firewalls with cluster_hosts. When we removed the current master from cluster_hosts then things went wonky? Maybe
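A quick way to probe the firewall-split theory above would be a plain TCP check against each master-eligible node's transport port. This is only a sketch, not something from the log: the hostnames are the nodes named later in the discussion (unqualified here; in production they would be fully qualified), and 9500 is the omega transport port that shows up in the discovery message further down.

```python
# Sketch of a connectivity check for the "firewall generated from
# cluster_hosts is splitting the cluster" theory. Hostnames and the
# transport port are taken from the conversation, not from live config.
import socket

HOSTS = ["elastic2052", "elastic2073", "elastic2092", "elastic2100", "elastic2106"]
TRANSPORT_PORT = 9500

for host in HOSTS:
    try:
        # A bare TCP connect is enough to distinguish an open transport
        # port from one a new firewall rule is dropping.
        with socket.create_connection((host, TRANSPORT_PORT), timeout=3):
            print(f"{host}:{TRANSPORT_PORT} reachable")
    except OSError as exc:
        print(f"{host}:{TRANSPORT_PORT} not reachable ({exc})")
```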
[23:02:37] if things aren't getting going, i would suggest putting all the hosts back into the cluster, but leave the new master-capable list
[23:04:11] Unfortunately we've already run the decom cookbook so adding hosts back isn't an option
[23:04:26] Or do you mean just adding them to cirrus.yaml so that the firewall rules are up but the actual hosts themselves don't need to be there?
[23:04:34] I guess that wouldn't make sense tho cause they still wouldn't be able to talk
[23:04:51] we stopped at 2049
[23:05:36] let me try the firewall rules angle, 1sec
[23:05:55] hmm, not sure if it would matter. I suppose i was guessing that the firewall is splitting the cluster into multiple pieces, with the restarted instances being unable to talk to everything in the old cluster
[23:06:37] so putting the instances back in cluster_hosts would let everyone that's expected to talk to each other talk. I was partially guessing because poking through the logs of 2052 there are plenty of complaints about failed network connections
[23:07:03] although i'm still not 100% sure 2052 was the master, although i found logs from other instances that suggested they thought it was
[23:14:15] From poking the logs of all the new master-capable instances, they are all waiting on basically the same thing: master not discovered or elected yet, an election requires at least 4 nodes with ids from [...]
[23:14:25] Curiously the # of nodes it's waiting for differs by host
[23:16:53] yeah, I was also thinking about dropping the number to 3
[23:18:46] f**k me, I hit abort on the cookbook and that actually caused it to go ahead and wipe 2052
[23:19:55] The logs are very confusing, I feel like it lists enough masters for it to be able to do an election, but I might be misreading
[23:19:57] for example:
[23:20:04] https://www.irccloud.com/pastebin/j0wuVfQT/2086_log.txt
[23:20:17] yea i was trying to parse out that same message and resolve into names
[23:21:11] It sees A8A, -HOu, and XyI, which should be the 3 it's asking for
[23:24:11] wrt `discovery will continue using [10.192.16.228:9500, [2620:0:860:102:10:192:16:228]:9500, 10.192.32.219:9500, [2620:0:860:103:10:192:32:219]:9500, 10.192.48.88:9500, [2620:0:860:104:10:192:48:88]:9500, 10.192.0.28:9500, [2620:0:860:101:10:192:0:28]:9500]`, 10.192.16.228 is elastic2092, 10.192.32.219 is elastic2100, 10.192.48.88 is elastic2106, 10.192.0.28 is elastic2073
[23:24:28] So that's all the other masters by itself... maybe I'm reading it wrong though and that's just telling me what we have configured as master eligible but doesn't mean anything beyond that
[23:24:37] other masters besides itself*
[23:26:27] the number of masters elastic requires being able to see isn't actually config, it's based on the number of masters in the cluster state
[23:26:59] that's why the process for removing master nodes is to remove them from the config and reboot them, so they come up as non-master and get removed from the state
[23:29:49] That makes sense
[23:30:06] ebernhardson: are you aware of any way we could clear out that cluster state without needing to fully tear down and bring back up the cluster?
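For context on that question: Elasticsearch 7 does ship an API for taking master-eligible nodes out of the voting configuration before they are removed, but the call is itself a cluster-state update and needs a live elected master, so it is a preventative step rather than a rescue for a cluster that has already lost quorum. A sketch only, assuming Elasticsearch 7.8+ (for the node_names parameter), an assumed HTTP port of 9400 for the omega cluster, and illustrative node names:

```python
# Sketch of the voting-config exclusions flow (preventative, not a fix for
# a master-less cluster): exclude the outgoing masters, decommission them,
# then clear the exclusions. Port and node names are assumptions.
import requests

ES = "http://localhost:9400"

# Ask the elected master to drop the outgoing nodes from the voting config.
requests.post(
    f"{ES}/_cluster/voting_config_exclusions",
    params={"node_names": "elastic2047,elastic2052"},
    timeout=60,
).raise_for_status()

# ... decommission / shut down the excluded nodes here ...

# Exclusions are persistent, so clear them once the nodes are gone.
requests.delete(f"{ES}/_cluster/voting_config_exclusions", timeout=60).raise_for_status()
```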
[23:30:58] ryankemper: not yet, i was looking for a way to override it
[23:31:20] Still weird because even with 2052 and 2047 stuck in the cluster state I'd think we still have enough of the new hosts for it to establish quorum... hmm
[23:32:51] The comment at the top of the election scheduling class is amusing: "It's provably impossible to guarantee that any leader election algorithm ever elects a leader, but they generally work"
[23:34:11] lmao
[23:34:12] ebernhardson: we're in https://meet.google.com/fde-tbpf-wqh if you wanna hop in
[23:52:41] i.nflatador, e.bernhardson and I are merge-deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1015152 to shift small-index (psi & omega) traffic away from codfw towards eqiad. Plan is to reconvene after retro time tomorrow, bootstrap the cluster, then restore the indices from the db
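The "enough of the new hosts for quorum" puzzle above comes down to majority arithmetic over the last committed voting configuration, not over the nodes that happen to be running. A toy sketch; the voting-config sizes are guesses for illustration, not values read from the cluster:

```python
# Toy illustration of the election quorum arithmetic discussed above.
# Elasticsearch 7 needs a majority of the last committed voting
# configuration, which can still list decommissioned masters.
def votes_needed(voting_config_size: int) -> int:
    """Majority of the committed voting configuration."""
    return voting_config_size // 2 + 1

# If the committed config still held ~7 master-eligible ids (old + new),
# an election would need 4 of them, consistent with the "requires at
# least 4 nodes" log line; with only the 4 new masters it would need 3.
print(votes_needed(7))  # 4
print(votes_needed(4))  # 3
```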