[09:34:57] errand+lunch
[13:30:18] o/
[13:31:50] o/
[13:50:54] \o
[13:54:09] o/
[14:05:43] o/
[14:10:28] BTW: I’ve bothered multiple image generators with increasingly complex prompts to come up with a t-shirt design for our team. Here’s one of the more promising outputs (refinement of texts/logos TBD). Please let me know what you think: https://docs.google.com/drawings/d/1Hsh58VyhyKNwAvBpSOuAJpCsBg2iqwXCMJIGV4zfz5w/edit?usp=sharing (you should be able to comment on the drawing)
[14:12:12] nice! :)
[14:24:48] yeah, that looks cool
[14:25:26] Also, it looks like our rolling-operation cookbook is ignoring its `allow-yellow` flag...checking
[14:26:49] the SUP seems a bit fragile with elastic: java.lang.IllegalStateException: Unsupported Content-Type: text/plain
[14:27:30] ES responds with text/plain?
[14:27:32] might be envoy when there are some connectivity issues, I suppose?
[14:28:00] we target https://search.svc.codfw.wmnet:9243 tho...
[14:28:47] it lines up with when one node was restarted
[14:29:29] status line [HTTP/1.1 504 Gateway Timeout]
[14:29:52] Ah, okay. I’ll have a look
[14:29:58] we still use nginx I think
[14:31:59] actually it looks like the allow-yellow check is too strict for the migration (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/spicerack/elasticsearch_cluster.py#488). We'll have to accept no moving replica shards
[14:32:34] having some issues w/ codfw cluster, need to roll back a patch
[14:33:43] if shards are stuck because of a version mismatch, it's very likely that shard recovery will be constantly starting & failing
[14:35:52] I merged a patch that removed the old hostname, that's causing the firewall to block elastic2056
[14:36:01] I should have banned the node first
[14:36:24] banned & depooled?
[14:36:25] I reverted the patch and I'm running puppet now, should fix it once it's done
[14:37:00] yeah, the cookbook should ban it, but the cookbook got stuck for other reasons
[14:37:29] anyway, we are back to yellow
[14:38:20] and yeah, depooling as well. I probably should stop Puppet across the fleet too
[14:39:02] dcausse: Is that IllegalArgumentException the root cause? If so, is this wrapped as ElasticsearchStatusException?
[14:39:35] pfischer: it's a suppressed exception, the root cause is the 503 from nginx I think
[14:48:35] dcausse: Hm, because we already handle 503 responses gracefully, which works (see org.wikimedia.discovery.cirrus.updater.consumer.graph.AsyncElasticsearchWriterTest#failsWithoutRetry), so it would be an easy fix to accept 504s too
[14:51:47] * inflatador wonders if you can mark an ES/OS node as replica-only. Probably impossible due to the way it operates, but would make things easier ;(
[14:52:29] inflatador: yes... I looked for this but apparently no...
[14:52:33] pfischer: sure
[14:54:10] they claim there are no good reasons for this feature, but in case of a version conflict that could have helped I think
[14:54:50] dcausse: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/184
[14:56:18] I guess we could pre-ban the cirrussearch* hosts in a particular row to keep them from getting primary shards, then unban once we're done with that row
[15:00:37] yea sounds reasonable
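An aside on the pre-ban idea above: if the ban amounts to a cluster-level allocation exclude (which is roughly what the existing ban tooling appears to do), it can be sketched along these lines. This is a hedged illustration, not the spicerack or cookbook code: the endpoint comes from earlier in the log, the node names are placeholders, and a real ban might filter on `_host` or `_ip` rather than `_name`.

```python
# Rough sketch only: pre-ban a row's worth of new hosts by excluding them from shard
# allocation, then clear the exclude once the whole row has been migrated.
# Node names are placeholders; the real cookbooks go through spicerack instead.
from elasticsearch import Elasticsearch

client = Elasticsearch("https://search.svc.codfw.wmnet:9243")

def pre_ban(node_names):
    """Keep the listed nodes from receiving any shards (primary or replica)."""
    # body= works on the 7.x python client; newer clients take transient= directly
    client.cluster.put_settings(body={
        "transient": {"cluster.routing.allocation.exclude._name": ",".join(node_names)}
    })

def un_ban():
    """Reset the exclude list once the row is done."""
    client.cluster.put_settings(body={
        "transient": {"cluster.routing.allocation.exclude._name": None}
    })

pre_ban(["cirrussearch2055-example", "cirrussearch2056-example"])  # placeholder names
```

Note that an allocation exclude keeps a node from holding any shard, not just primaries; that is a bit stronger than the "no primary shards" goal stated above, but it seems to be what the existing ban machinery provides.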
[15:08:32] pfischer: potentially relatedly, i'm setting up the envoy proxies to auto-retry 503 errors. Mostly for the mediawiki side, but i think that will affect SUP as well
[15:08:43] afaict that applies to all http methods, not just GET
[15:09:49] i suppose it could also retry 502 and 504, i was just trying to avoid retrying 500 since those come back for things like bad regexes
[15:12:37] there might also be headers we can use to turn that off from the SUP side if we don't want envoy doing retries and want to control it ourselves
[15:17:33] anyone know the machine-friendly API version of /_cat/shards? I'm guessing it's somewhere under /_cluster?
[15:17:39] I think we do indeed but unsure if that applies to the client targeting elastic
[15:18:11] inflatador: you can just ask for json from _cat
[15:18:16] sec
[15:18:32] i guess it's still a bit annoying, it's a very direct conversion
[15:19:03] send an 'Accept: application/json' header to get json
[15:21:14] cool, will check docs to see if that's supported by the python library
[15:21:25] you could get it from _cluster/state, but that would be much more tedious
[15:21:27] and indeed it is
[15:21:45] I may be paying too much attention to the "don't use this for machines, it's only for humans" warnings
[15:22:35] inflatador: i think those warnings have existed since before there was a json formatted output.
[15:24:23] the other option is to parse out /_cluster/state/routing_table,routing_nodes. It's not terrible, but it's a more annoying format
[15:24:35] and somehow i suspect the _cat output will stay more consistent than the routing table
[15:26:56] `client.cat.shards(format='json')` does the trick
[15:31:02] hmm, i suppose another option around retries: i'm setting up a new envoy "cluster" that will use the dns-discovery endpoint instead of the direct-to-cluster one. I could perhaps only configure the dns-discovery endpoint to auto-retry, as I'm mostly looking to have read requests auto-retry and those will all flow through dnsdisc
[15:50:37] pondering...perhaps that's a better option? Keep the auto-retries to the read-path of queries, and let writes manage things themselves?
[16:07:08] trying to figure out how to get `unassigned.reason` from the json output of client.cat.shards(format='json')...doesn't look like it's there
[16:08:12] yea that's not part of the _cat/shards output afaik, i imagine you have to hit the explain shard allocations endpoint?
[16:08:26] _cluster/allocation/explain
[16:08:45] looks like it's there under `_cat/shards?h=index,shard,prirep,state,unassigned.reason` if you curl
[16:09:01] ah, that's the problem, I guess I need to specify the headers I want
[16:09:24] oh interesting
[16:10:07] yup, that did it
[16:24:52] ebernhardson: agreed, retries on writes should be decided by the client
[16:25:43] dcausse: ok, sounds like a plan.
[16:26:32] the SUP is explicitly disabling envoy retries with envoy headers, but it's not the sole writer we have
[16:26:49] right, cirrus is still doing archive writes, and mjolnir is doing some things
[16:27:02] * ebernhardson sometimes wonders if the daemons should have had a not-mjolnir name...oh well
[16:27:08] :)
[16:27:16] maybe i should just call it search-loader after the host names
[16:28:03] i'll blame it on python package management of the era, at the time it was tedious to set up python packages and re-use code from multiple places so it all got put in the mjolnir repo
[16:30:14] OK, https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1133967 is up for the spicerack changes
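Pulling the _cat/shards thread from above together: asking for JSON output plus explicitly requesting the `unassigned.reason` column via `h=` gives the answer straight from the Python client. A minimal sketch under those assumptions (the endpoint is the one mentioned earlier in the log, the ALLOCATION_FAILED filter is illustrative, and the opensearch-py client exposes the same call):

```python
# Sketch of the approach worked out above: _cat/shards returns JSON when asked for it,
# but extra columns such as unassigned.reason only appear if requested via h=.
from elasticsearch import Elasticsearch

client = Elasticsearch("https://search.svc.codfw.wmnet:9243")

shards = client.cat.shards(
    format="json",
    h="index,shard,prirep,state,unassigned.reason",
)

# Flag shards that failed allocation, e.g. replicas stuck on a version mismatch.
for s in shards:
    if s.get("unassigned.reason") == "ALLOCATION_FAILED":
        print(f"{s['index']}[{s['shard']}] {s['prirep']}: {s['state']} ({s['unassigned.reason']})")
```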
[16:30:32] ebernhardson and ryankemper can you take a look if you have time? I'm sure it's broken, just wanted to get the general idea out there
[16:30:37] sure
[16:31:53] dinner
[17:39:16] inflatador: hmm, the control flow there is a little odd, the Exception that is raised feels like it's in the wrong place.
[17:40:13] ebernhardson yeah, it's terrible. maybe we could pair on it if/when you have time? I need someone with real programmer chops
[17:40:31] no rush, tomorrow/next week is fine if that's better for you
[17:41:19] inflatador: lol, yea i can almost always make time. Today or tomorrow is fine, the only things on my schedule are my normal school runs at 2:45
[17:41:41] otherwise i'm working on some bits related to envoy, but that can happen any time
[17:50:43] inflatador: I might not be following, but why does the default wait-for-yellow behavior not work for our purposes?
[17:51:13] ryankemper: the issue is that we may have allocation errors that, due to the lack of hosts in multiple rows, prevent replica shards from being allocated
[17:51:29] i wonder if this means we should migrate some number of hosts in each row, with them banned, then un-ban them at the same time
[17:52:33] Ah yeah I was thinking the existing behavior of `yellow with no initializing or relocating shards`, but I guess the cluster will be showing initialization failures
[17:52:45] existing behavior would be sufficient*
[17:53:27] wait now i'm going back to my original thought
[17:53:34] don't those shards show as unassigned, not allocation failed?
[17:53:46] hmm, i'm not sure which way they will report :S
[17:54:01] sounds like time to make a small docker cluster and find out :P Not sure how tedious that is
[17:54:51] i had a docker-compose thing awhile ago that could stand up a multi-node cluster, of course always with the same image, wonder if i have that somewhere still...
[17:55:58] i think i used this in the past: https://phabricator.wikimedia.org/P74594
[17:56:10] 6.5.4, so it was awhile ago :P
[17:57:12] The shards show as `state: UNASSIGNED` and `unassigned.reason: ALLOCATION_FAILED`
[17:58:38] I see
[17:58:42] actually here is a more recent one that sets up a mixed cluster with elastic and opensearch. Although sounds like maybe unnecessary: https://phabricator.wikimedia.org/P74595
[17:59:09] I do think we should pre-ban
[18:00:01] Spicerack only cares if shards are initializing or relocating
[18:00:12] if they're simply unassigned it should accept it fine
[18:00:14] anyway, I'm up in https://meet.google.com/ozu-gdro-zxg?authuser=0 if y'all wanna talk it over
[18:00:22] haven't tested it yet tho
[18:00:23] brt
[18:01:54] going to refill my water then join, just a sec
[18:01:54] The cookbook stalled earlier when I tried running as-is, looked like it was waiting for yellow
[18:03:18] `"relocating_shards":0,"initializing_shards":0,"unassigned_shards":4,"delayed_unassigned_shards":0` is what cirrussearch2055 looks like currently
[18:11:37] ok to summarize: we found the issue is with the cookbook logic that requires `self.allow_yellow and groups_restarted == 1:`. basically, if the run already started from yellow status the cookbook will refuse, so a simple patch on the cookbook side should address it
[18:32:42] * ebernhardson realizes laptop felt slow because for some reason `/sys/firmware/acpi/platform_profile` was in low-power mode, even though I'm always plugged in. +1GHz to all cores by changing it
[18:47:54] so much better...it was just sitting at 800MHz on all cores, now it properly varies up to the 3GHz range
[19:07:38] damn, it's like getting a whole new laptop!
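Coming back to the cookbook thread: the 18:03 health output and the 18:11 summary boil down to one condition, roughly "yellow is acceptable as long as nothing is initializing or relocating". A hedged sketch of that relaxed check follows; this is not the spicerack implementation, just the condition the patch needs to express, with the endpoint taken from earlier in the log.

```python
# Sketch of the relaxed check discussed above: during the row-by-row migration, yellow
# with unassigned replicas is fine; only actively moving shards should block the cookbook.
from elasticsearch import Elasticsearch

client = Elasticsearch("https://search.svc.codfw.wmnet:9243")

def safe_to_continue() -> bool:
    health = client.cluster.health()
    # e.g. {"status": "yellow", "relocating_shards": 0, "initializing_shards": 0,
    #       "unassigned_shards": 4, "delayed_unassigned_shards": 0, ...}
    if health["status"] == "red":
        return False
    return health["initializing_shards"] == 0 and health["relocating_shards"] == 0
```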
[19:08:24] looks like the ban cookbook might need some work too, I don't think it can ban hosts it doesn't know about, and I don't think that's an Elastic limitation
[19:15:08] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/elasticsearch/ban.py#91 hmm..
[19:15:39] #TODO: implement
[19:16:33] not sure if that's what you are referring to, but usually the right thing there is `raise NotImplementedError('...')`
[19:20:09] no, but thanks for the tip...I actually should rip that out completely as we don't need to implement our own logic, we should be able to do that with existing libraries
[19:39:36] reimage worked, but we're getting the same `Nagios_host resource with title cirrussearch2056 not found yet`, which I think means our role isn't on Puppet 7. Not sure why that's the default, but I'll take a look
[19:51:37] the hieradata looks right, not sure why it isn't being applied: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/cirrus/opensearch.yaml#89
[19:54:46] hmm
[19:58:54] it's set to false in common/profile/puppet/agent.yaml, but it doesn't seem likely that's overriding it
[20:05:34] we fixed this yesterday by setting the host hieradata directly for 2055, but that shouldn't be necessary for every host (or at least, I hope not ;P )
[20:09:23] inflatador: hmm, in current merged puppet i don't see 2056 set to cirrus::opensearch? Although maybe i'm missing something, i don't find it in site.pp at all
[20:09:55] searching site.pp for cirrussearch2.*56 only finds commented-out lines
[20:10:29] should be covered by https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/manifests/site.pp#1113
[20:10:55] * ebernhardson failed to pull the latest puppet :P
[20:11:11] indeed i see that now, should be fine...
[20:17:37] * ebernhardson is finding nothing :(
[20:19:44] no worries, I've got j-hathaway helping me in #sre-foundations
[20:41:59] OK, looks like I forgot to add some opensearch hiera for the new hosts in regex.yaml. Fixing now
[21:08:27] I'm running the reimage again with no-pxe after adding some missing stuff to regex.yaml. I'm pretty sure it will fail for the puppet 7 stuff again though
[21:32:19] * ebernhardson notes that HivePartition in mjolnir and discolytics have diverged, making it more tedious to move...mjolnir has options for verifying schema equivalence and for re-loading direct from parquet files to get the real schema and not the hive schema (which downgrades VectorUDT, even still in 3.1.2)
[21:33:35] also it didn't use the from_spec(str), we hadn't defined that yet :P
[21:34:10] debating if it's really worthwhile to migrate this...it's perhaps fine as is
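One footnote on the `#TODO: implement` spotted in the ban cookbook at 19:15: the suggestion in the log is to make such gaps fail loudly rather than silently pass. A tiny illustrative example with a hypothetical function name, not the cookbook's actual code:

```python
# Illustrative only: a placeholder that raises is easier to notice than a bare TODO comment.
def unban_missing_nodes(nodes):
    raise NotImplementedError("unbanning unknown nodes is not implemented yet")
```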