[09:08:25] the upgrade to opensearch rest client 2.19 did the trick, writes are flowing again to cloudelastic, will deploy to the main clusters after a quick errand
[09:08:27] errand
[10:42:13] hm... not there yet, getting timeouts from https://cloudelastic.wikimedia.org:9643
[13:06:26] o/
[13:07:05] dcausse are you able to reach 8643 by any chance? I noticed that I can't reach the 9*** ports from cumin anymore for some reason. I wonder if there was a FW change somewhere
[13:07:38] \o
[13:07:41] o/
[13:07:57] inflatador: what is 8643? the read-only port for cloudelastic?
[13:08:49] yes seems like it, works fine from deployment.eqiad.wmnet
[13:09:06] both 9[246]43 and 8* ones
[13:09:28] still puzzled by the frequent timeouts
[13:09:54] is this opensearch2 simply slower on cloudelastic or something that changed in the network stack..
[13:10:25] we send bulk requests with 120000ms timeout but the envoy listener times out at 50s
[13:10:33] :S
[13:10:43] but that does not seem new...
[13:10:45] I don't know how many times we have re-aligned timeouts... i wonder how to do better
[13:11:17] yes...
[13:11:28] I would like to have a better idea of perf changes, I opened T424852 but haven't had time to look at it much
[13:11:28] T424852: Investigate performance issues in cloudelastic - https://phabricator.wikimedia.org/T424852
[13:11:38] will try to lower the bulk request size to see if that helps
[13:11:52] yea perf is my first random guess, but no proof
[13:12:57] Might also be a good time to switch from nginx to envoy TLS term and add ourselves to the service proxy
[13:15:31] unrelated question, are y'all done with the semantic-search clusters? I set a date of May 21st to undeploy them, LMK if y'all need any help getting data off or anything
[13:18:08] inflatador: I don't think it's being used, the data is in hdfs, no need to back up anything
[13:18:20] hmm, feel like maybe i'm getting distracted. I'm looking at how the trigram regexp extractor turns `Clover.*West Virginia` into the accelerated query "(wev AND evw)". Although curiously, only via the integration test and not the direct n-gram extractor test
[13:18:33] tempted to set up a fuzzer that generates regexes and matching strings, and finds variances... but maybe too much
[13:19:05] my suspicion is the rewrite of .* to exclude the utf-8 reserved characters for anchor matching is triggering something silly
[13:20:38] maybe fuzzing just sounds like fun because i read about it but never have a great use case :P
[13:21:15] :)
[13:22:04] what's the shape of the regex after the handling of the EOS char?
[13:26:12] after rewriting it's `Clover[^﷐﷑]*West Virginia` (with the utf8 non-characters), as a lucene regex it reports as: `"clover"((.&~((\﷐|\﷑))))*"west virginia"`
[13:26:25] (after lowercasing elsewhere)
[13:26:48] afaict the quotes are just how lucene prints regexes, and .&~ is how it represents the negation
[13:29:03] wow that's a nasty representation :/
[13:29:44] yea at first i thought that was the problem, until looking into how the RegExp.toString works
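That quoted form is simply what Lucene's RegExp class prints for a parsed pattern. A minimal sketch of reproducing it, not taken from the log: the class and constructor are real Lucene APIs (org.apache.lucene.util.automaton.RegExp), but the exact rendering varies between Lucene versions, and the two non-characters are written here via their codepoints U+FDD0/U+FDD1.

    import org.apache.lucene.util.automaton.RegExp;

    public class PrintLuceneRegex {
        public static void main(String[] args) {
            // Lucene prints string literals in double quotes and renders a negated
            // character class as an intersection with a complement, hence the .&~ form.
            RegExp re = new RegExp("clover[^\uFDD0\uFDD1]*west virginia");
            System.out.println(re);
        }
    }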
[13:36:21] capping at 5 actions per bulk; it was at 16 if I read the code right, did not seem a lot :/
[13:36:37] hmm, yea that is not many :S
[13:43:24] still failing sometimes...
[13:43:36] just timing out?
[13:44:35] yes: Caused by: org.opensearch.client.ResponseException: method [POST], host [https://cloudelastic.wikimedia.org:9243], URI [/_bulk?timeout=120000ms], status line [HTTP/1.1 503 Service Unavailable] upstream connect error or disconnect/reset before headers. reset reason: connection timeout
[13:44:48] sigh
[13:45:44] lowering to 2
[13:46:14] could be that the limits do not work as expected and my new settings have no impact
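For illustration only (this is not the actual SUP consumer code): a sketch of the kind of request involved, using the low-level OpenSearch REST client that the stack trace above comes from. The index name and documents are made up; the point is that the ?timeout parameter only bounds work inside OpenSearch, so a ~50s proxy timeout in front of it can still reset the connection and surface as the 503 ResponseException quoted above.

    import java.io.IOException;
    import org.apache.http.HttpHost;
    import org.opensearch.client.Request;
    import org.opensearch.client.Response;
    import org.opensearch.client.ResponseException;
    import org.opensearch.client.RestClient;

    public class SmallBulkProbe {
        public static void main(String[] args) throws IOException {
            try (RestClient client = RestClient.builder(
                    new HttpHost("cloudelastic.wikimedia.org", 9243, "https")).build()) {
                // Two tiny actions per bulk, mirroring the experiment above.
                String ndjson =
                    "{\"index\":{\"_index\":\"some_index\",\"_id\":\"1\"}}\n" +
                    "{\"field\":\"value\"}\n" +
                    "{\"index\":{\"_index\":\"some_index\",\"_id\":\"2\"}}\n" +
                    "{\"field\":\"value\"}\n";
                Request bulk = new Request("POST", "/_bulk");
                // This only caps how long OpenSearch itself waits; an envoy/nginx
                // proxy in front can reset the connection much earlier.
                bulk.addParameter("timeout", "120000ms");
                bulk.setJsonEntity(ndjson);
                try {
                    Response resp = client.performRequest(bulk);
                    System.out.println(resp.getStatusLine());
                } catch (ResponseException e) {
                    // A proxy-generated 503 surfaces here, like the trace above.
                    System.err.println(e.getMessage());
                }
            }
        }
    }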
[13:48:16] this ngram extractor does not do what i expect.... `abcde.*West` -> TRUE, `abcdef.*West` -> ((wes AND est) OR weu OR (wes AND esu) OR (weu AND euw) OR (wes AND esu AND suw) OR wev OR (wes AND esv) OR (wev AND evw) OR (wes AND esv AND svw))
[13:48:33] Whatever is going on, clearly i'm going to have to, after 10+ years, finally figure out how this thing works :P
[13:49:23] :)
[14:10:06] I'll create a ticket for the services proxy/TLS stuff, it is also long overdue ;)
[14:14:49] meh... no index mapper found for field: [_type] returning default postings format
[14:15:16] how can we possibly still send a _type from an opensearch 2.19 client
[14:16:49] Could nginx modify the data in flight or something?
[14:17:39] hmm, indeed that shouldn't be possible. :S
[14:18:17] oh, I thought you meant you wanted to add a _type
[14:19:08] the hot thread logs mention "shard_indexing_pressure_enabled=true" but reading the doc it's supposed to be false by default
[14:19:37] I mean for 2.x, true by default for 3.x
[14:19:52] disabling just to see
[14:39:11] also just tagged y'all on T424860
[14:39:11] T424860: Consider managing OpenSearch cluster dynamic settings with terraform - https://phabricator.wikimedia.org/T424860
[14:43:09] thanks!
[14:43:45] sigh... a bit clueless tbh... not seeing much on opensearch nodes, they're pretty much idle
[14:44:39] if I read the envoy telemetry correctly the timeout is sent by nginx
[14:46:43] On the topic of services proxy, I added it for semantic search last week, ref https://gerrit.wikimedia.org/r/c/operations/puppet/+/1264739/9/hieradata/common/profile/services_proxy/envoy.yaml#536 . If you are able to test semantic search (by changing the mwconfig connection string to localhost:6044 I think) that could possibly help us with the cloudelastic envoy setup
[14:47:25] If there's a way to test by running scripts in k8s or something LMK, I've never tried that
[14:48:34] could be possible with a mwscript shell session but perhaps easier to just change the mw-config settings
[14:52:21] Maybe we just need an internal URL for cloudelastic
[16:08:14] inflatador: I got a notification that the work requiring relforge is now complete, so feel free to re-image it whenever you want
[16:12:11] dcausse thanks for the heads-up, will create a ticket. I've been thinking of using containers there too, something like https://www.thelinuxvault.net/blog/how-to-run-podman-containers-under-systemd-with-quadlet/
[16:12:42] * inflatador applauds Jeff's speech
[16:13:10] oops, that was meant for elsewhere ;P
[16:52:39] entering the container and running python urllib.request.urlopen("http://localhost:6107") I get HTTP Error 503: Service Unavailable
[16:53:13] the sup container? is it immediate, or after some time?
[16:53:21] almost immediate
[16:53:27] hmm
[16:54:07] sup logs are [2026-04-30T16:53:28.364Z] "POST /_bulk?timeout=120000ms HTTP/1.1" 503 UF 3415 91 252 - "127.0.0.1" "Apache-HttpAsyncClient/4.1.5 (Java/17.0.18)" "ee1aa6e5-739f-44d9-9271-c0bd0d498111" "localhost:6107" "208.80.154.241:9643"
[16:54:18] mine is [2026-04-30T16:53:37.710Z] "GET / HTTP/1.1" 503 UF 0 91 250 - "-" "Python-urllib/3.11" "66a8e6f3-9689-430b-9ef8-4f6024c8344c" "localhost:6106" "208.80.154.241:9443"
[16:54:45] trying the mw endpoint from python
[16:56:42] yes it works with resp = urllib.request.urlopen("http://localhost:6500")
[16:56:54] so definitely something between envoy and cloudelastic
[16:57:21] maybe one host is unreachable from another host? Could try the servers one at a time, if it can query them directly
[16:57:59] seems pretty consistent
[16:59:09] I can query the hosts directly, would have to hop in the envoy container I guess?
[16:59:15] can't*
[17:00:25] seems plausible, but poking the envoy container it's pretty slim
[17:01:33] oh right it won't even have python :/
[17:02:28] it's a bit silly, but they have openssl. so `echo | kubectl exec ... -- openssl s_client -brief -connect host:port` should work
[17:02:33] there is perl :P
[17:02:53] maybe needs the --stdin and --tty flags for kubectl (i always have those)
[17:03:37] hmm, can only reach cloudelastic.wikimedia.org, but not the hosts directly. So i guess can't really test
[17:04:59] trying openssl s_client -brief -connect 208.80.154.241:9643 it does not respond
[17:06:01] that part feels intentional to me, like egress is only to LVS. i guess i should read the k8s bits that define that
[17:06:09] because i could connect to any of 1008-1012
[17:06:22] could not*
[17:07:54] must be something that changed today and slowly degraded because I'm sure I saw traffic flowing this morning
[17:13:18] another curiosity, openssl from the tls-proxy container seems to work for 9443 and 9643, but i'm not getting a connect on 9243
[17:13:27] against cloudelastic.wikimedia.org
[17:13:58] Does it work on 8243?
[17:14:47] inflatador: 8243 looks to hang too, but that's more expected as it's not in the egress rules
[17:15:07] ah. I asked in #security about what might've changed but haven't gotten a response yet
[17:15:56] I think we might need a `cloudelastic.svc.wmnet` and/or a proxy to get around whatever's changed
[17:16:29] example would be port 9243 having 208.80.154.241/32 in the egress rules, but trying to connect times out. Same req works w/ curl from the deploy host
[17:16:38] clearly something in the service proxy layers, but lost on what
[17:21:03] this kinda feels like... do we promote a ticket to unbreak now and escalate to someone in networks? I always feel lost trying to dig into this end of things
[17:23:41] I don't think cloudelastic is important enough for UBN but I really have no idea. The impact is that it's not getting updated, right?
[17:24:10] +1 hopefully it's something that changed today and somebody will rapidly know where to look
[17:24:11] i would have to double check, i think retention is 7 days for updates? After that we have to re-initialize the cluster
[17:24:40] well, maybe not a full re-initialize, but it will lose updates and we have to do something. In the past we copied over indexes from prod with snapshots
[17:24:45] yes or we could let the sanitizer do its job
[17:25:00] i guess in theory, after 2 weeks it would have found all the missing revisions at least
[17:25:20] weighted tags would be out of sync, but maybe ok
[17:27:12] yes :/
[17:27:13] I'll get an incident doc started regardless
[17:27:18] hopefully it's used that much there
[17:27:23] *not
[17:27:53] yea i'm not sure, global-search doesn't have anything. But i have no clue (would be nice to know) if anything else uses cloudelastic
[17:27:55] stopped working completely around 12:25 today
[17:27:59] https://grafana-rw.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&from=now-12h&to=now&timezone=browser&var-datasource=000000026&var-cluster=main-eqiad&var-topic=$__all&var-consumer_group=cirrus-streaming-updater-consumer-cloudelastic-eqiad
[17:29:13] have to go afk
[17:29:42] filling out https://wikitech.wikimedia.org/wiki/Incidents/2026-04-30_cloudelastic_unavailable with what I know, will ping y'all once I reach the limit of my knowledge ;)
[17:33:27] Tomorrow is a bank holiday for most of Europe too. Let me ping in #traffic as well
[17:44:11] And now the updater alerts are clearing? I have a feeling we are being rate-limited by requestctl or something
[17:45:08] inflatador: hmm, indeed the graph david linked above started committing offsets again, it ran for about 1.5 hours earlier but then stopped for 7 hours
[17:45:59] * inflatador wonders where 429s would be logged
[17:46:01] don't know how relevant it is, but the taskmanager pods are fairly new. 8m, 13m and 57m old
[17:46:29] could be, not sure either
[17:46:32] per the graph updates started committing ~6 minutes ago, so after the most recent one restarted
[17:47:20] can confirm my earlier openssl tests that would time out on the other pods now do not time out on the new pods
[17:50:49] Have to step out for the next 2.5h but will take a look once I get back. cc ryankemper for awareness on cloudelastic weirdness
[18:24:09] think i'm finally getting the hang of how this state machine works, but now i suspect the problem is it treats short spans in the regex automata as being characters that have to be in the output, but that is not always the case. Still trying to figure out the exact conditions that generate those variances in the automata.
[18:26:06] it's also reminding me how terrible .* is in state machine format :P `a.*xyz` has to have a transition from every character after the .* back to the .*
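A small exploration sketch along those lines, not from the log: it uses Lucene's automaton utilities (org.apache.lucene.util.automaton); the class and method names are real, though exact signatures shift a bit between Lucene versions, and the work limit of 10000 is an arbitrary value chosen here.

    import org.apache.lucene.util.automaton.Automaton;
    import org.apache.lucene.util.automaton.Operations;
    import org.apache.lucene.util.automaton.RegExp;

    public class DumpAutomaton {
        public static void main(String[] args) {
            // Build the automaton for a.*xyz and determinize it; the Graphviz
            // dump makes the back-transitions from each character after .* to
            // the .* self-loop visible, which is what makes these machines messy.
            Automaton nfa = new RegExp("a.*xyz").toAutomaton();
            Automaton dfa = Operations.determinize(nfa, 10000);
            System.out.println(dfa.toDot());
            System.out.println("states: " + dfa.getNumStates());
        }
    }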
[18:27:47] re cloudelastic: weird that it started to work again... restarting with bigger batches (was running at 2 actions per bulk)
[18:40:49] seems to work fine... anyways going afk for real, forgot to mention that I'll be out tomorrow (labor day), have a nice weekend
[18:41:03] sounds nice! enjoy
[19:03:12] i'm going to have to be shocked. I explained what i understood of the problem to claude (web ui) and gave it two java classes, and it found the problem. Or at least, its fix works on the example queries i've used so far.
[19:03:15] "That [t..v] slice is the smoking gun. It's logically part of the same [^﷐﷑] self-loop, but the determinizer punched a hole at s (continues wesz) and another at w (restarts wesz), and the gap left between them happens to be three codepoints wide. Your transition.max - transition.min >= maxExpand test then mis-classifies it as a "useful" required-character transition, and t, u, v get
[19:03:17] materialized as required edges. Same thing at S_D, between w and z, where the [x..y] slice gives you the esx, esy, sxw, syw you're seeing in the output."
[19:03:47] and it seems sensible and aligned with what i've seen stepping through the debugger
[19:20:47] it aligns with something i was worried about reading this though... there isn't actually a guarantee that the characters we pull are required characters, it's heuristic based on transitions over small character ranges.
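To make that last point concrete, a paraphrased sketch of the shape of the heuristic being discussed — this is not the actual extractor code, and the class and method names are made up for illustration; only Lucene's Transition type (with its public min/max codepoint fields) is a real API.

    import org.apache.lucene.util.automaton.Transition;

    // Illustrative only: a transition contributes a "required" character when its
    // codepoint range is narrow enough to enumerate.
    final class RequiredCharHeuristic {
        private final int maxExpand;

        RequiredCharHeuristic(int maxExpand) {
            this.maxExpand = maxExpand;
        }

        boolean treatAsRequired(Transition t) {
            // Failure mode described above: determinization can slice a broad
            // [^...]-style self-loop into narrow leftover ranges (e.g. [t..v]),
            // and those slices pass this width test even though the characters
            // in them are not actually required by the original regex.
            return t.max - t.min < maxExpand;
        }
    }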
[19:27:59] That's not the root cause for the cloudelastic stuff, right? That sounded more like FW stuff
[19:31:57] right, this is for the intitle regexps that are not running correctly
[19:34:17] OK, just making sure. Stranger things have happened ;)
[20:09:36] Cloudelastic SUP alerting again ;(
[20:12:12] Resolved... probably due to a MW deploy
[21:01:32] * ebernhardson is less convinced after working through all this that i have the full answer :S
[21:02:28] i mean the bit i have does fix it, but i suspect something might still be off
[22:06:21] cloudelastic looks to be backfilling, not sure how quickly we expect it to go though
[22:07:03] I made https://w.wiki/MS5H and when I get back Monday I'm going to trim off some of the broken stuff from the `OpenSearch Node Comparison` dashboard, that should be a decent enough replacement for the generic server metrics