[09:35:50] errand+lunch
[13:13:22] well, how 'bout that...the chi cluster went back to green over the weekend
[13:27:56] o/
[13:27:58] nice!
[13:31:12] time to fix that ;P
[13:33:27] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136026 patch to remove non-existent masters I accidentally reimaged...follow-up patch to add row A masters in progress
[13:59:08] \o
[14:00:31] o/
[14:09:16] .o/
[15:58:58] inflatador: a puppet patch for you, this creates some new envoy proxies that will be used for read-only cirrus traffic (most of it): https://gerrit.wikimedia.org/r/c/operations/puppet/+/838182
[15:59:31] the main thing it brings is that it uses discovery endpoints instead of specific datacenters, so in the future traffic can be shifted with normal SRE tooling instead of the one-off cirrus config vars
[16:00:14] one thing i'm unsure of though: last time i made an envoy hieradata patch it broke mediawiki deploys because my envoy config was invalid. Not sure how to verify
[16:01:52] inflatador: one other thing, i notice in codfw most hosts have ltr 1.5.4-wmf1, but 2055 and 2056 still have 1.5.4. If we get those restarted with the updated plugins we should be able to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134268 and put that back to normal
[16:02:18] assuming we have enough hosts now that restarting some instances isn't a problem anymore
[16:18:53] ebernhardson ACK, those were the first two hosts we reimaged, I guess they didn't make the cut ;(
[16:20:42] inflatador: yea, they were the hosts that let us know there was a problem :)
[16:21:33] clusters are all green, so will restart shortly
[16:26:30] ebernhardson OK, both hosts restarted, LMK if I missed anything
[16:27:09] inflatador: hmm, `curl https://search.svc.codfw.wmnet:9243/_cat/plugins | grep ltr` still shows 1.5.4-os1.3.20, needs an apt command to update the .deb?
[16:32:33] inflatador: if you get to restart opensearch hosts, it might be a good time to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135441 as well :)
[16:32:58] this would only be two hosts and not all of them though. But perhaps reasonable to start it moving along
[16:33:57] ah ok, please ignore if that's a burden
[16:34:33] i suspect for that one we will want to be able to do a full cluster restart, not sure how the tooling is working but assuming that will be easier when it's all homogeneous
[16:35:00] curious it needed LD_LIBRARY_PATH, i guess upstream does similar?
[16:36:25] yes, upstream does this but only in their docker entrypoints :/
[16:37:02] i guess i would have expected them to `dlopen` more directly, but maybe that's hard from java
[16:38:23] well, the issue is that these .so files reference other .so files in that same folder, so setting java.library.path alone is not working
[16:38:52] ahh, yea i guess that makes sense
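A quick sketch of the java.library.path vs LD_LIBRARY_PATH point above: java.library.path only tells the JVM where to find the library it loads directly, while that library's own .so dependencies are resolved by the dynamic linker through its usual search path (LD_LIBRARY_PATH, rpath, ld.so.conf). The plugin directory and filename below are hypothetical, just to illustrate:

```sh
# Hypothetical plugin directory; substitute the real one.
PLUGIN_LIB=/usr/share/opensearch/plugins/ltr/lib

# ldd lists the transitive .so dependencies; any "not found" entries are
# libraries the dynamic linker (not the JVM) has to be able to locate.
ldd "$PLUGIN_LIB/libexample-native.so"

# Exporting LD_LIBRARY_PATH before the JVM starts lets the linker find the
# sibling .so files in the same folder, which java.library.path alone can't do.
export LD_LIBRARY_PATH="$PLUGIN_LIB${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```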
[16:44:56] aww, summary from the ios app test: Making the search more prominent in article view means more people will use it in article view, but it does not necessarily result in a net increase in browsing Wikipedia articles (no increase in daily pageviews/user).
[16:45:05] understandable i guess, but not what i was hoping for
[16:50:09] am i right in thinking that this curl command doesn't seem related to the titlesuggest indices? https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old
[16:51:18] ebernhardson: ouch, yes totally unrelated
[16:51:33] i think i found the right one on the old wikitech:Search page, will update :)
[16:52:33] ebernhardson: fyi we disabled the completion index rebuild in codfw to ease the migration
[16:52:43] oh, ok so that's why the alert is firing
[17:01:20] OK, we should be good on the LTR plugin, LMK if not
[17:02:46] inflatador: looks right now, thanks! That means we should be able to merge this revert in puppet: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136414
[17:04:14] ebernhardson ACK, on it now
[17:06:50] inflatador: thanks!
[17:09:22] dcausse have you tested https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135441 in relforge? Just wanna make sure it won't break anything when we deploy to prod
[17:09:50] inflatador: yes, it's tested and working as expected on relforge
[17:19:22] dcausse ACK, just merged the above
[17:19:38] ebernhardson https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136414 is merged/deployed as well
[17:19:49] well..merged anyway
[17:20:01] I can run puppet on search-loaders if need be
[17:21:01] inflatador: shouldn't matter, mjolnir only runs once a week for ~24 hours
[17:21:02] inflatador: thanks!
[17:21:15] dinner
[17:21:35] next run should start on the 17th
[17:21:52] https://gerrit.wikimedia.org/r/c/operations/puppet/+/838182 looks like a winner, but I need some time to check it over and probably get another SRE who knows envoy better than me
[17:22:10] thanks! I'm more cautious since i broke envoy last time :P
[17:22:30] that was on me, too ;(
[17:22:31] maybe there is some way to tell by running puppet on an mwmaint server after merging or something
[17:34:34] ryankemper heading to lunch, back in 1h. cirrussearch2067 is the only host reimaging, it's pretty far along and no problems so far
[17:41:57] * ebernhardson kinda wishes grafana would invalidate all the graphs on screen when i change a variable, hard to know when they've been updated
[17:43:24] unrelatedly, wondering what to do with the CirrusSearchJVMGCYoungPoolInsufficient alerts...i expect what's happening there is the instances are idle and don't need much of a young pool
[17:43:37] maybe kill the alert? Trying to remember the last time it was useful
[17:44:03] the updates in elastic 7 to better handle memory seem to have been effective
[17:44:46] i think i'll do that...the alerts on high frequency of old gc running should be sufficient
[18:42:15] back
[18:43:41] ebernhardson thanks for looking into those GC alerts, happy to review a patch whenever
[18:59:52] inflatador: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1136426 removes the alert
[19:00:29] i've been reviewing some tickets on the discovery-search backlog to look for something to put in the APP spreadsheet...but mostly i'm finding old tickets and declining them because it hasn't been a problem in 5-ish years :P
[19:07:26] Not the sexiest work, but it's good to wrap that stuff up
[19:39:43] ryankemper we're on the last non-master in row D. Once that's done we should be able to start doing multiple hosts again
[20:35:38] finding curious statements in phab too...in https://phabricator.wikimedia.org/T379938 i marked that moving a shard from opensearch to elastic worked fine :S I wonder what i did differently there
[20:43:55] we've seen some unexpected behavior over the last week or so. We were pre-banning the cirrussearch hosts so they didn't get any shards, but the cluster seemed to prefer sending primaries to OS regardless
[20:44:17] doesn't really hurt anything, but we stopped pre-banning since it didn't seem to work anyway
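For context on the pre-banning above: keeping shards off named nodes is normally done through the cluster settings allocation-exclude filter. The actual bans were presumably applied via the SRE cookbooks rather than raw curl, so treat this as a sketch of the underlying API; the node name pattern is a made-up example:

```sh
# Tell the cluster not to allocate shards to nodes matching this name pattern
# (hypothetical host name; not one of the hosts that were actually banned).
curl -s -XPUT 'https://search.svc.codfw.wmnet:9243/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.exclude._name": "cirrussearch2099*"}}'

# Confirm the exclusion took effect.
curl -s 'https://search.svc.codfw.wmnet:9243/_cluster/settings?flat_settings=true' | grep exclude
```

Worth noting that this filter only controls which nodes shards may live on; it has no say in which copy gets chosen as primary.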
[20:54:52] we've also seen shards that have primaries on OS and replicas on ES, like https://paste.opendev.org/show/827449/
[20:57:18] maybe that's what i saw; suggests i should have put more details of how the testing was actually done instead of just marking this as ok. IIRC we decided the primary-on-OS, replica-on-ES case happens when the replica was already assigned before the OS shard became the primary; the problem is we can't create a new replica from an OS primary
[21:03:42] OK, that makes sense. So there was an OS replica, one of the ES shards went away and the OS shard was promoted to primary. There's still an ES replica...is that one safe? What would happen if we lost the OS primary in that situation, I wonder?
[21:11:19] yes, i think that's what happens. As for the OS primary going away, i would hope elastic can promote any replica to primary.
[21:15:10] yeah, the lucene incompatibility is what throws me. If it's not OK to replicate from OS->ES because of lucene versions, then how does a primary on OS replicate to ES? Maybe it deliberately holds back the lucene version on the primary?
[21:15:56] replication, to borrow an sql analogy, is statement-based rather than row-based
[21:16:17] essentially the source document is sent to all primaries and replicas, and they all repeat the indexing procedure
[21:17:45] i did see that newer versions of elasticsearch (or maybe it was opensearch? i forget...) have the ability to do segment-based replication, which would be like sql row-based replication: only primaries do the indexing work and then they ship flushed segments to all replicas. That would fail in the mixed env
[21:18:44] our indexing load probably isn't heavy enough for that to be much of a win for us anyway, i imagine it's more important on logging clusters that index tens or hundreds of thousands of docs in a 30s segment
[21:19:11] ah OK, that makes sense
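On the mixed primary/replica placement above, one quick way to see which copies live where is the _cat/shards API; the index name in the grep is just an example:

```sh
# List every shard with its primary/replica flag (prirep: p or r), its state,
# and the node it is allocated to, then narrow to a single index of interest.
curl -s 'https://search.svc.codfw.wmnet:9243/_cat/shards?h=index,shard,prirep,state,node' \
  | grep 'enwiki_content' | sort -k2,2n -k3,3
```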