[00:03:52] doesn't appear to come from the other hosts, collected tcpdump for 9(2|4|6)00 on all the nodes over a restart, all recorded http codes are 200's (and wireshark is tediously slow with a ~1G merged pcap)
[00:21:41] i dunno :( Running out of time today, it doesn't seem fatal. But we might want to not roll too quickly with opensearch until we know why
[00:21:58] (the pipeline regularly gets canceled and redeployed by the taskmanager)
[01:19:58] Thanks for taking a look! I'm reimaging cloudelastic1008, but we can stop there until we figure out what's causing these errors
[08:27:37] has been stable since 2:30 am, wondering if it's not simply because we hit nginx while elastic is not ready, which then returns a text/plain 503 that the client does not understand
[08:28:11] perhaps not new in the end but simply visible now that we pull some hosts out of the cluster
[09:12:47] last error is at 2:29 with "upstream connect error or disconnect/reset before headers. reset reason: connection failure" with a 503
[09:21:25] and every time after this error flink blows up with a metaspace OOM ...
[09:28:09] just failed
[09:28:23] from envoy POV: [2025-03-06T09:25:01.839Z] "POST /_bulk?timeout=120000ms HTTP/1.1" 503 UF 1052672 91 248 - "127.0.0.1" "Apache-HttpAsyncClient/4.1.4 (Java/11.0.25)" "51f2241c-5702-4e11-81b1-bef58ff3f254" "localhost:6106" "208.80.154.241:9443"
[09:28:46] (unrelated but we should change that UA)
[09:29:46] UF is upstream connection failure, so I suspect that the 503 is returned by envoy
[09:31:25] perhaps it's trying cloudelastic1008 which is down?
[09:40:27] seems to be always "omega"...
[09:41:28] ah no scratch that
[09:42:34] from cloudelastic1007 nginx logs I don't see any errors since 18:00 yesterday
[09:44:15] wondering how I can see lvs pooled hosts
[09:46:52] seems like it's with confctl but no clue where/how to run this
[09:53:14] thanks for the f/up on T385972 Trey314159
[09:53:14] T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[10:05:56] does not seem like it's cloudelastic1008 being accessed, from https://grafana-rw.wikimedia.org/d/000000421/pybal?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-server=All&var-service=cloudelasticlb_9443&var-service=cloudelasticlb_9643&var-service=cloudelasticlb_9243&from=now-12h&to=now it's properly considered down
[11:13:39] does not seem specifically related to /_bulk
[11:15:11] kubectl exec -ti flink-app-consumer-cloudelastic-taskmanager-1-16 -c flink-main-container -- curl localhost:6106 -> "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
[11:28:27] we might want to depool cloudelastic1007 from lvs to confirm but I'm not sure that cloudelastic1007 is reachable via lvs
[11:28:47] lunch
[12:25:37] I just realized that we enable mlr-1024rs for hewiki under wmgCirrusSearchMLRModel https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/5c733beabce2d251e489e0763f4b9583988d7c14/wmf-config/ext-CirrusSearch.php#935
[12:26:35] but the wiki is not configured to use mlr in wgCirrusSearchRescoreProfile: https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/5c733beabce2d251e489e0763f4b9583988d7c14/wmf-config/ext-CirrusSearch.php#935
[12:26:59] is that intentional?
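For the confctl question above (09:44/09:46): a minimal sketch of checking and changing pooled state with confctl, which is normally run from a cluster-management (cumin) host. The selector below is an assumption; the actual conftool tags for cloudelastic may differ.

```bash
# Hedged sketch: inspect and toggle LVS pooled state with confctl (selector is an assumption)
sudo confctl select 'name=cloudelastic1007.*' get
# depool / repool a single host
sudo confctl select 'name=cloudelastic1007.*' set/pooled=no
sudo confctl select 'name=cloudelastic1007.*' set/pooled=yes
```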
[12:28:48] dcausse AFAIK this does not look like a regression introduced by the recent deployment (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1118782/4/wmf-config/ext-CirrusSearch.php)
[12:30:14] lunch
[12:57:19] inflatador dcausse https://flink.apache.org/2025/03/03/apache-flink-kubernetes-operator-1.11.0-release-announcement/
[12:57:35] > The Flink Kubernetes Operator and Autoscaler 1.11.0 brings support for Flink 2.0 preview version. This should help users to try out and verify the latest features in Flink planned for the 2.0 release.
[13:30:42] gmodena: interesting... wondering if it was one of the two latest A/B tests
[13:32:06] https://people.wikimedia.org/~ebernhardson/T377128/T377128-AB-Test-Metrics-WIKI=hewiki.html
[13:33:11] https://people.wikimedia.org/~gmodena/search/mlr/ab/2025-02/T385972-AB-Test-Metrics-WIKI=hewiki-EXPERIMENT=mlr-2025-02.html
[13:42:56] I need to check past A/B tests but it's possible mlr on hewiki was not better and we did not enable it, Erik might remember
[13:43:31] very possible that this wiki does not have enough data for mlr to train properly, increasing retention might help there hopefully
[13:44:05] in Erik's last A/B test control is still better, but it's not so clear in yours
[13:53:29] traceroute 10.64.164.18 (lvs1018) is one hop from cloudelastic1012 but 4 hops from cloudelastic1007 (starting from ae1-1017.cr1-eqiad.wikimedia.org)
[14:12:36] I don't see /etc/default/wikimedia-lvs-realserver on cloudelastic1007
[14:12:39] o/
[14:13:30] sounds like maybe I didn't add the ipip config in cloudelastic1007's profile...checking
[14:14:19] yeah, that looks to be it...sorry, will get a puppet patch up
[14:14:55] FWIW, the new LVS config uses IP-over-IP tunneling instead of layer 2 LB
[14:15:04] so you'll see an ipip interface if you run `ip link`
[14:15:43] dcausse ack. I am on the fence about the results of the most recent A/B test.
[14:16:33] I'll depool cloudelastic1007 in the meantime
[14:17:04] or not...no LVS ;(
[14:20:05] yes, no depool/pool tool on the host, but perhaps it's doable with confctl directly?
[14:20:37] This can also wait for LVS to be properly set up if that fixes the whole issue
[14:21:50] oh good call, yes it can
[14:22:33] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1125165 should fix it though, just waiting for PCC to come back
[14:23:05] oops, need to include one more profile...sec
[14:23:09] see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1043302/5/modules/role/manifests/elasticsearch/cloudelastic.pp for comparison
[14:23:36] inflatador: should you add `include profile::lvs::realserver` as well?
[14:24:36] dcausse Y, just fixed
[14:33:31] dcausse OK, I merged/activated changes and repooled the host. LMK if you're still seeing errors
[14:33:49] inflatador: seems like it fixed the issue, thanks!
[14:34:01] NP, sorry for not catching that earlier ;(
[14:34:06] np!
[14:39:48] speaking of ipip migrations...Valentín is starting to migrate the wdqs hosts now. We've been working w/him to migrate services all week and it's been seamless so far. But I'll be keeping an eye out
[14:40:09] inflatador: thanks!
[14:44:27] ryankemper: can T384422 be moved to [Done] ?
[14:44:27] T384422: Provide a low availability / scalability full graph endpoint to ease the transition to a split graph - https://phabricator.wikimedia.org/T384422
[14:47:48] gehel have we reached out to scolia so they can test?
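A quick way to confirm the realserver/IPIP setup discussed at 14:12–14:15 once the puppet patch has applied — a minimal sketch; exact interface and address layout depend on the host's config.

```bash
# on the realserver (e.g. cloudelastic1007), check that the LVS realserver config exists
cat /etc/default/wikimedia-lvs-realserver
# with the IPIP-based LVS setup there should be a tunnel interface
ip link show type ipip
# and the service VIP(s) should be bound locally (addresses are setup-dependent)
ip -d addr show
```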
that'd be the only thing i'd do before resolving
[15:02:15] hmm, we just got an "alert lint problem" for our CirrusSearchJobQueueLagTooHigh alert. That's a new one on me
[15:02:30] \o
[15:02:36] .o/
[15:02:56] o/
[15:03:06] inflatador: seems like a linting alert?
[15:04:01] yeah, looks like there are a lot of 'em. Guessing they just turned that on. https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[15:04:58] indeed, for hewiki we never turned it on in prod; when sorted by available training samples it was the cutoff where we stopped deploying, since the A/B test didn't show it was better
[15:05:18] optimistically, longer retention = more samples = can turn on more wikis
[15:05:18] iirc flink is reporting counters as gauges and alertmanager probably doesn't like a counter function being used on a gauge, looking
[15:06:24] OSError: You do not have Kerberos credentials. Authenticate using `kinit` or run your script as a keytab-enabled user.
[15:06:32] oof... i hate when this happens in a for loop
[15:06:51] ebernhardson ack - thanks
[15:07:12] gmodena: you can use `sudo -u analytics-search kerberos-run-command analytics-search`
[15:07:22] gmodena: at least, that's how i avoid those in long running things
[15:07:30] good point
[15:07:35] i forgot :)
[15:07:57] i use it so often i added an alias :) alias krc='sudo -u analytics-search kerberos-run-command analytics-search'
[15:10:19] for the saneitizer thing...i'm having a terrible time reproducing locally. I think today i will just fix the param handling so it bounds at the appropriate time and undeploy/redeploy the prod SUPs to get saneitizer going again from an empty state
[15:11:56] sounds good
[15:14:10] inflatador: quick cleanup patch whenever you have time: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1113461
[15:14:30] dcausse: i guess i already wrote the patch, just had to clean out some extra bits and submit: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/177
[15:15:01] looking
[15:16:10] dcausse ACK, just merged it
[15:16:15] thx!
[15:20:56] ebernhardson: lgtm
[15:22:09] I just added a thread pool rejection graph to the percentiles dashboard...feel free to tweak it if necessary. Working on an alert next... https://grafana.wikimedia.org/goto/EHCgwUpNg?orgId=1
[15:22:15] tx!
[15:23:04] inflatador: thanks!
[15:50:14] I know this is gonna fail CI, but if y'all have any idea on the unit tests let me know. I'm not so good at writing 'em ;( https://gerrit.wikimedia.org/r/c/operations/alerts/+/1125180
[15:59:50] i don't know if anyone is good at writing those tests :P half the time it's about reverse engineering the alert
[16:01:49] quick errand
[16:12:18] yeah, I wish it would give me a hint like "your alert conditions won't ever trigger" vs "you missed a comma on the alert verbiage"
[16:17:33] inflatador: submitted a new version that should pass
[16:29:55] ebernhardson ACK, looks much better. I'll fix the less important error (missing description)
[16:47:56] hmm, kokkuri:setup-variables failed in SUP ci :S Job failed (system failure): prepare environment: setting up build pod: Internal error occurred: failed calling webhook "namespace.sidecar-injector.istio.io": failed to call webhook: Post "https://istiod.istio-system.svc:443/inject?timeout=10s": context deadline exceeded.
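As a companion to the thread pool rejection graph mentioned at 15:22, a minimal sketch for spot-checking rejections directly against a cluster; the endpoint below is just the one already quoted in this log, adjust per cluster.

```bash
# Hedged sketch: list active/queued/rejected counts for the write and search thread pools
curl -s 'https://search.svc.eqiad.wmnet:9243/_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected'
```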
[16:48:00] not entirely sure what to do with that
[16:48:22] since it's .svc that's at least something internal and not a random web thing being called
[16:48:26] hopefully :P
[16:51:01] I vaguely remember seeing a ticket about changing the `cluster.local` suffix from k8s CoreDNS records
[16:55:00] https://phabricator.wikimedia.org/T376762 ... seems unrelated
[17:02:32] seems intermittent, will ignore
[17:18:41] gehel I've blocked the cloudelastic migration, contingent on T388150 (cloudelastic1008 HW errors). This probably means we won't finish the migration this week ;(
[17:18:41] T388150: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150
[17:20:48] :(
[17:23:07] seen this happen a few times before...I think it's just the disks coming unseated
[17:23:41] inflatador: notes!
[17:24:07] all the more reason to go to a large hypervisor infra with network storage ;)
[17:24:14] s/notes/noted/
[17:25:43] we **could** technically forge ahead with 5 hosts, but it seems a bit risky
[17:30:27] sounds reasonable. There is some level of risk in maintaining a mixed cluster (shard moves primary) but we should probably be ok
[17:33:22] Anyone else have a strong feeling whether or not we should forge ahead with 5 hosts? I'm hesitant, but I could be convinced
[17:35:38] lunch, back in ~90m
[17:36:04] errand+dinner back later tonight
[18:04:58] hmm, for some reason i thought we had more docs about helmfile + SUP on wikitech but not seeing much. I can probably remember how to destroy and restart with a kafka timestamp (or find in bash history) but thought i'd double check the docs
[18:13:15] saneitizer looks to be initializing in cloudelastic now, confirms (wasn't much doubt though) that something in the state was making it not start
[18:16:08] * ebernhardson didn't realize it takes so long to start up ... few hundred ms per wiki to get the starting state * 1000 wikis
[18:16:18] sometimes even 1-2s per wiki
[18:16:31] i guess it doesn't matter, it's all held in state
[18:22:39] finally finished starting, metrics coming in. will repeat for consumer-search in eqiad/codfw
[18:43:28] all running now, things seem back to normal
[18:57:14] ebernhardson maybe the docs were for rdf-streaming-updater?
[18:57:18] back
[19:19:04] re: cloudelastic, there's some risk in staying where we're at, too...like we only have 1 opensearch host and its shards can't live anywhere else
[19:23:45] one other thing, turns out the bits i wrote to handle plugin name changes don't work in mixed-cluster mode. Easily fixable, but it means cebwiki_content is reindexing in eqiad/codfw right now but not cloudelastic
[19:23:53] although maybe it's for the best to not run reindexes in mixed cluster mode anyways
[19:25:03] * ebernhardson separately wondered if there was a reason the cebwiki shard change was merged in dec but never reindexed...but couldn't find a reason so went forward with it
[19:44:46] ebernhardson: my bad, I kept punting this one because the cirrus-reindexer was in a bad shape due to my work on mwscript-k8s... made a "stable" branch of this project (added a quick note in the readme)
[19:45:25] dcausse: no worries, and yea i saw that but since you updated the docs it was still just copy/paste and things seem to work. thanks!
[19:45:28] dcausse do you have an opinion whether we should move fwd with the cloudelastic migration or wait for cloudelastic1008 to be fixed?
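For the destroy/restart flow mentioned at 18:04 — a rough sketch following the usual deployment-charts layout; the release name, environment, and how the Kafka start timestamp is supplied are assumptions, so check the chart's values and any SUP docs before relying on it.

```bash
# Hedged sketch: tear down and redeploy a SUP consumer release with helmfile
# (release/environment names are assumptions; the Kafka start position is configured
# in the release values and is intentionally not shown here)
cd /srv/deployment-charts/helmfile.d/services/cirrus-streaming-updater
helmfile -e eqiad --selector name=consumer-cloudelastic destroy
helmfile -e eqiad --selector name=consumer-cloudelastic apply
```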
I'm starting to lean towards moving fwd
[19:45:54] ryankemper ^^
[19:46:13] inflatador: if cloudelastic can survive with 4 nodes during the re-image why not?
[19:47:54] dcausse I'm mainly paranoid about losing another host when we try to reimage, but that's very unlikely
[19:48:00] as long as we verify that we're not gonna throw the cluster into red status i'm fine with it
[19:48:06] and even so, we could live with 3 hosts if we had to
[19:48:18] since the yellow shards are all shards where the primary is on an already reimaged host, we should be ok
[19:48:31] ACK, was just about to say that
[19:48:34] i'd vote we roll forward and see how the next reimage goes
[19:48:35] inflatador: perhaps to be cautious we should wait for all the shards to move off the banned node before rebooting?
[19:49:19] dcausse agreed...and should check for orphaned indices beforehand as well
[19:49:50] sure, and we should probably have better automation for those
[20:07:02] how old were they? I wonder if they somehow predated cirrus doing the cleanups. That was written in May 2024
[20:07:26] it seems like we probably would have cleaned them up, but maybe forgot
[20:08:01] I dunno...I see them pretty frequently when I do service restarts, but since the clusters only go red for about a minute or so, I've mostly ignored them
[20:08:42] hmm, suggests the cleanup doesn't work, at least not entirely :(
[20:16:23] been using `curl -s https://search.svc.eqiad.wmnet:9243/_cat/indices | awk '$6 == 0 { print $0 }'` to find shards with zero replicas, still need to compare 'em against the alias though
[20:17:25] or indices with zero replicas, that is
[20:20:09] i suppose that will also find reindexes in progress, which you won't have great visibility on. I suppose if we were automating it, it could look at the timestamps and consider that any reindex operation should finish in a few days, so an index > 5 days old or some such with 0 replicas is probably stale
[20:22:46] Yeah, I'm mainly thinking of keeping the cluster from going into red state for the entire time it takes to reimage a host. Not that it means all that much for these empty indices, but it could mask other problems
[20:22:57] true
[20:38:58] ebernhardson: haven't checked the dates yet but I have a copy of check_indices output in deployment.eqiad.wmnet:~dcausse/check_indices.json
[20:41:38] seems like they're all older than May 28 2024
[20:41:58] nice! so it does work, we just forgot to finish cleaning up
[21:04:27] we don't have any zero-replica indices on cloudelastic, so let's get started! https://gerrit.wikimedia.org/r/c/operations/puppet/+/1125226
[21:04:56] I'll make a chain for the rest of the hosts
[21:35:42] Thanks for the +1! I've banned cloudelastic1009, waiting for all shards to drain
[21:41:24] hmm, unexpected problems of changing desks...my monitor is now much more in front of the west-facing window, early afternoon it makes viewing annoying :P Might have to relocate the desk against another wall
[21:41:42] or give in and open/close the blinds daily
[21:45:44] I have the same problem at this time of day. My floors are reflective too, which doesn't help
[21:47:46] hmm, i've been thinking about replacing the flooring in this room too...i guess i have to keep reflectivity in mind :)
[21:52:51] I really like our stained concrete floors, except for that one thing
[21:56:06] i suppose i hadn't considered that, might be an option. Office is bottom floor so there is certainly a concrete pad under this carpet
[21:56:26] well, probably.
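Extending the 20:16 check to the "compare against the alias" step — a minimal sketch that lists zero-replica indices not reachable through any alias (likely orphans from old reindexes). The endpoint is the one quoted above; the /tmp paths are just scratch files.

```bash
# Hedged sketch: zero-replica indices that are not behind any alias (likely orphaned reindexes)
curl -s 'https://search.svc.eqiad.wmnet:9243/_cat/indices?h=index,rep' \
  | awk '$2 == 0 { print $1 }' | sort > /tmp/zero_replica_indices
curl -s 'https://search.svc.eqiad.wmnet:9243/_cat/aliases?h=index' \
  | awk '{ print $1 }' | sort -u > /tmp/aliased_indices
# lines only in the first file: zero-replica indices with no alias pointing at them
comm -23 /tmp/zero_replica_indices /tmp/aliased_indices
```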
I guess it could be raised
[21:57:07] * ebernhardson is separately failing to understand how this gradle magic in elasticsearch-sudachi works...packaging is always fun
[22:28:13] on the one hand...it's impressive they have a single repo that can compile for various versions of elastic 7, elastic 8, and opensearch 2. On the other hand, it's much more complicated :P
[22:31:20] also it's all in kotlin, which is java-ish but i've never used it before
[23:54:05] well, I banned cloudelastic1009, but it seems to be having trouble moving off all its shards. I'll wait till tomorrow for the reimage
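If the drain at 23:54 is still stuck, a minimal sketch for asking the cluster why — the cloudelastic endpoint and the placeholder index/shard are assumptions to fill in from whatever is still sitting on the banned node.

```bash
# Hedged sketch: see what is still allocated on the banned node
curl -s 'https://cloudelastic.wikimedia.org:9243/_cat/shards?v' | grep cloudelastic1009
# then ask for the allocation decision for one of those shards
# (replace SOME_INDEX / shard number / primary with values from the output above)
curl -s -XGET 'https://cloudelastic.wikimedia.org:9243/_cluster/allocation/explain?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"index": "SOME_INDEX", "shard": 0, "primary": false}'
```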