[10:43:48] errand+lunch
[14:17:26] o/
[14:48:17] \o
[14:50:55] o/
[14:54:48] realized o11y never set up the security plugin, so now working out how ssl certs work... always fun :)
[14:55:21] i suppose in theory we could continue without the security plugin and bring the tlsproxy back, but seems improper
[14:58:09] o11y uses built-in tls support without the sec plugin?
[14:58:50] they have the plugin fully disabled
[14:58:57] i don't think there is tls
[14:59:04] ah
[14:59:31] i guess there is the tls to the outside world, but that isn't being terminated at opensearch
[14:59:43] ok
[15:00:27] and you can't enable the sec-plugin but only get tls on the http ports?
[15:01:03] with the plugin enabled you need at least certs for the transport layer; that tls can only be turned off by disabling the plugin. Then there is also the api port, which can be plain http if you want
[15:01:51] also it makes our ports a bit awkward: 9200 will be empty, since keeping tls on 9243 seems most sane
[15:02:45] ah, with tls on the transport port we can't do a rolling upgrade I suppose
[15:03:23] i've been wondering about that too, the docs are weird. They claim elastic->opensearch can be done with security enabled, but if you have an opensearch cluster with security disabled you have to full-restart the cluster to turn it on
[15:03:29] i don't know what to believe... will have to construct a test
[15:03:58] perhaps they mean elastic with their sec pack enabled?
[15:04:07] the transition docs only suggest turning off the security plugin for the migration, they don't require it
[15:04:09] hmm, maybe
[15:04:29] hm... seems obscure
[15:05:10] seems like the kind of thing i need to just test and see what happens, but indeed adding tls to inter-node transport seems like it would break things unless specially handled
[15:05:48] yes...
[15:07:26] this is also going to be a very long upgrade... if a reimage takes 45 min, that's enough for data on disk to become stale, meaning probably 2-3 hours per set of servers migrated
[15:09:54] :/
[15:11:12] indeed
[15:12:46] not sure there's a better alternative? I mean I'm fine going slow and being safe and sure that opensearch nodes are "clean", but I'm not the one doing the re-images so...
[15:13:59] I suppose the main concern is that we might be in a mixed cluster for at least a week
[15:14:25] perhaps a cold restart is faster then?
[15:14:40] oh nvm
[15:14:45] we can't keep the data...
[15:14:45] ya, some open questions about things like index creation; iirc there are limits about shards only moving from older to newer versions but not the other way, but maybe that's only on major version changes
[15:15:13] oh right
[15:15:31] yea, we would have to snapshot the whole cluster into s3 and try to bring it back before kafka runs out? But that's not 100% of writes and would be... fun :)
[15:15:43] yes...
[15:27:45] dcausse: I'm not exactly sure what you mean in https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/1090885/4..5/eventutilities/src/main/java/org/wikimedia/eventutilities/core/event/EventUtilitiesConfig.java#b130. Could you elaborate please?
[15:31:49] gehel: done
[15:36:06] dcausse: thanks! While you're in there, another look at https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/1090885/4..5/eventutilities/src/main/java/org/wikimedia/eventutilities/core/event/EventUtilitiesConfig.java#b34 would be welcome
[15:36:19] sure!
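
[ed: a minimal sketch of the port/TLS layout discussed above. The hostname is a placeholder, not an actual WMF node, and the opensearch.yml keys shown are the stock security-plugin settings, not necessarily the exact deployed config:]

```
# Placeholder host; quick checks of where TLS actually terminates.
# Transport (9300) must be TLS once the security plugin is enabled;
# the HTTP API can stay plain (9200) or be TLS-terminated (9243).
openssl s_client -connect search-node.example:9300 </dev/null | head -n 5  # fails if plaintext
curl -s  http://search-node.example:9200/_cluster/health?pretty
curl -sk https://search-node.example:9243/_cluster/health?pretty

# Stock opensearch.yml keys controlling this (sketch only):
#   plugins.security.disabled: false
#   plugins.security.ssl.transport.pemcert_filepath: node.pem
#   plugins.security.ssl.transport.pemkey_filepath: node-key.pem
#   plugins.security.ssl.transport.pemtrustedcas_filepath: ca.pem
#   plugins.security.ssl.http.enabled: false   # api port may stay plain http
```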
[16:23:29] hmm, maybe it does require x-pack to transition with tls :S My test cluster reports "SSL/TLS request received but SSL/TLS is not enabled on this node" when an opensearch instance tries to join
[16:26:29] yes, it would have been surprising if it could do the elastic-no-ssl -> opensearch-with-sec transition directly while still requiring a full restart just to enable the sec-plugin
[16:27:06] indeed, so we have to keep the tlsproxy for the initial transition
[16:30:49] I'm happy to report that https://airflow-search.wikimedia.org/home is now publicly available (and OIDC authenticated)
[16:32:05] brouberol: nice, thank you!
[16:32:43] indeed, that's awesome!
[16:33:15] i thought about setting that up for about 5 seconds when initially deploying airflow, it seemed like a lot of work :P
[16:33:16] the full announcement is here https://wikimedia.slack.com/archives/CSV483812/p1731596077852129 but TLDR this is just another way to get to the web UI. For the moment, the tasks still execute as subprocesses of the scheduler, running on the an-airflow host. But still, it's a start!
[16:34:10] and the scheduler migration to Kubernetes should start during the quarter, if possible
[16:50:18] when doing cluster upgrades, do we ever make special considerations for master-capable nodes? Migrate them first/last?
[16:52:23] I don't think so. We need special handling for changing masters (something I've screwed up in the past). Definitely worth reviewing ES/OS docs for recommendations
[16:53:01] err... we need special handling for changing masters, but not for upgrading the ES version AFAIK
[16:54:33] ok, good that we haven't in the past. I was just working up a list of tests to run in the mixed-cluster state so we understand how that's going to work. I couldn't remember having any issues with master-capable nodes on different versions in the past.
[17:30:48] dinner
[18:04:57] ran some tests with a mixed cluster and i can't find any obvious limitations; things (replicas, primaries, cluster master) move back and forth between elastic and opensearch without issue
[18:05:49] {◕ ◡ ◕}
[18:45:51] * ebernhardson realizes while writing the commit message for hot-threads that hot-threads is really just a bad substitute for flame graphs
[18:50:49] lunch/appointment, back in ~2h
[21:41:33] back
[22:04:45] CODFW morelike latency alerts are flapping again. Not really a problem, but I'm just trying to remember why that was happening
[22:09:04] last time it was high traffic to one wiki, can check if it's the same again
[22:10:22] oh yeah! it was that ceb wiki
[22:10:33] T379002
[22:10:34] T379002: Consider resharding cebwiki_content - https://phabricator.wikimedia.org/T379002
[22:10:53] from grafana there are 12 instances with elevated p95 for morelike in codfw
[22:13:20] and indeed those are the 12 nodes holding cebwiki_content shards
[22:16:23] interesting... the alert has flapped back off
[22:16:46] or not... it just fired again
[22:20:08] that per-node percentiles dashboard is pretty cool
[22:30:43] it's some custom metrics collection from our plugin; sadly elastic/opensearch doesn't offer those directly
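
[ed: for reference, a sketch of the checks behind the mixed-cluster test and the cebwiki_content latency correlation above, using the standard _cat APIs. The host is a placeholder, not the actual cluster endpoint:]

```
# Placeholder host. Mixed-cluster check: per-node version and which node is master,
# to confirm primaries/replicas/master can sit on either elastic or opensearch.
curl -sk 'https://search.example:9243/_cat/nodes?v&h=name,version,node.role,master'

# Which nodes hold cebwiki_content shards, to correlate with the
# per-node morelike p95 latencies seen in grafana.
curl -sk 'https://search.example:9243/_cat/shards/cebwiki_content?v&h=index,shard,prirep,state,node' | sort -k5
```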