[07:04:18] <dcausse>	 o/
[07:24:48] <pfischer>	 o/
[07:43:39] <dcausse>	 weird... looking at the logs when master could not be elected: seeing [cirrussearch1094-production-search-eqiad] failed to join {cirrussearch1094-production-search-eqiad}, why is it trying to join itself?
[07:45:36] <dcausse>	 seeing this on 1081 too, perhaps expected?
[08:36:01] <pfischer>	 Those are the open search cluster members?
[08:36:40] <pfischer>	 Can they have multiple roles, like leader and worker?
[08:43:47] <dcausse>	 pfischer: yes for us we set master eligible nodes as data-nodes too
[08:46:33] <dcausse>	 looking at the various logs I have the impression that there are many competing master elections happening
[08:53:01] <dcausse>	 something I don't regarding our logs... on cirrussearch1074 production-search-eqiad.log.2.gz last log is at 2025-08-11T23:52:06,159, production-search-eqiad.log.1 first line is at 2025-08-12T21:18:26,304...
[08:53:12] <dcausse>	 *don't get
[09:05:11] <pfischer>	 dcausse: is that related to the SUP cloud elastic backfill constantly restarting? 
[09:05:11] <pfischer>	 https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=000000017&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic-backfill&orgId=1&from=now-6h&to=now&timezone=utc&var-flink_job_name=cirrus_streaming_updater_consumer_cloudelastic_backfill_eqiad&var-operator_name=$__all&var-Filters=&var-s3_prefix=s3:%2F%2F
[09:05:40] <dcausse>	 pfischer: yes I think we need to adjust the alter to ignore those
[09:06:05] <dcausse>	 the backfill jobs is not "restarting" per-se but running multiple times for multiple wikis
[09:06:42] <dcausse>	 pfischer: https://gerrit.wikimedia.org/r/1178483
[09:08:00] <dcausse>	 perhaps it's misbehaving (looking) but imo it should not be the role of alertmanager to check those
[09:08:26] <pfischer>	 Sure, +2ed
[09:08:46] <dcausse>	 thx!
[09:08:53] <pfischer>	 But other parts of the SUP are in that same restart-loop: consumer-search is also not running stably
[09:09:10] <dcausse>	 :/
[09:09:40] <pfischer>	 https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=000000017&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search&orgId=1&from=now-6h&to=now&timezone=utc&var-flink_job_name=cirrus_streaming_updater_consumer_search_eqiad&var-operator_name=$__all&var-Filters=&var-s3_prefix=s3:%2F%2F
[09:11:46] <pfischer>	 https://logstash.wikimedia.org/app/dashboards#/view/7b67aa70-7e57-11ee-93ea-a57242f792cd?_g=h@c823129&_a=h@8c6eedf
[09:12:39] <pfischer>	 Failed to get metadata for topics [eqiad.cirrussearch.update_pipeline.update.private.v1, codfw.cirrussearch.update_pipeline.update.private.v1]. This server does not host this topic-partition.
[09:13:40] <dcausse>	 pfischer: I think the grafana dashboard is not filtering backfill jobs properly and gives the impression that containers are restarting
[09:15:06] <dcausse>	 I think it's because we use regexes like ${job-name}-.* and this captures the -backfill prefix
[09:15:47] <dcausse>	 the consumer-search job does consume&produce things AFAICS
[09:15:50] <pfischer>	 Ah, you are right, looking closely, the consumer-search is running all the time
[09:16:14] <pfischer>	 It’s just the backfill jobs that produce noise
[09:18:25] <dcausse>	 yes... annoyingly I have to write a more complex regex like: flink-app-${helm_release}-[a-z0-9]+-.[a-z0-9]+ to match pod's name ...
[09:19:05] <dcausse>	 looking if there are more meaningful labels
[09:22:36] <dcausse>	 not seeing anything we could use in container_cpu_cfs_throttled_seconds_total ...
[09:26:37] <dcausse>	 updated
[09:53:11] <dcausse>	 lunch
[11:07:19] * cormacparle waves
[11:08:40] <cormacparle>	 if I'm looking at `event.mediawiki_cirrussearch_request` when does `params.action == opensearch` mean? 
[11:09:10] <cormacparle>	 is the new title suggester?
[12:07:20] <dcausse>	 cormacparle: there are requests param so like meaning that the search request was initiated using the "opensearch" API (https://en.wikipedia.org/w/api.php?action=help&modules=opensearch), it's a completion API, it does not necessary tell you what particular backend engine it's using
[12:08:17] <dcausse>	 there are hopefully other bits in the event that tells you what backend components it's using
[12:28:41] <cormacparle>	 👍
[13:15:00] <inflatador_>	 <o/
[13:17:48] <inflatador_>	 dcausse pfischer the messages are from a failover test we did yesterday w/eqiad depooled (see Ryan's comment above). We are still having the quorum problems in eqiad. We didn't have time to look too closely at the logs yesterday, but we'll do it today
[13:20:29] <dcausse>	 o/
[13:21:20] <dcausse>	 inflatador_: I looked over them a bit, could not find a root cause but there seems to be a lot of concurrent election happening
[13:21:40] <dcausse>	 inflatador_: what was the procedure you applied?
[13:22:46] <inflatador_>	 dcausse we just restarted the active master. Sadly it breaks the cluster every time
[13:24:04] <dcausse>	 you can see in the logs when cirrussearch1074 finally succeeds to be a master that it discards all previous election with plenty of failures like "Caused by: org.opensearch.cluster.NotMasterException: Higher term encountered (current: 221018 > used: 221012), there is a newer master"
[13:24:07] <inflatador_>	 We can fix per this article https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Cluster_Quorum_Loss_Recovery_Procedure
[13:24:30] <dcausse>	 here a single restart happened?
[13:25:04] <inflatador_>	 That's what triggered it. We started/stopped a few masters until the cluster recovered (never more than 2 at a time)
[13:25:36] <dcausse>	 you had to rely on "unsafe-bootstrap"?
[13:25:38] <inflatador_>	 there are some settings we can tweak WRT to master elections ( https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/discovery-cluster-formation-settings#_expert_settings )
[13:25:52] <inflatador_>	 dcausse no, sorry for not being more clear. Just the first part
[13:26:10] <inflatador_>	 `Verify that the openseearch service is started on all master-eligibles. If it is, stop the service on the master-eligible that started its service most recently.`
[13:27:38] <dcausse>	 so restarting the current master causes the mess? what if the current master is simply stopped and not restarted, did you test this as well?
[13:27:56] <inflatador_>	 No, but we should
[13:28:40] <dcausse>	 inflatador_: how much time did you give to the cluster to stabilize before actually trying to restart other masters?
[13:28:49] <inflatador_>	 dcausse about 5 minutes
[13:29:16] <dcausse>	 in other words does restarting other masters have an effect or simply waiting more than 5min would have solved the mess?
[13:29:27] <inflatador_>	 However, one of the times it lost quorum in the past it did recover on its own. I'll have to check the alerts but waiting might be enough
[13:29:37] <dcausse>	 ok
[13:30:49] <dcausse>	 in https://discuss.elastic.co/t/can-not-elect-master-when-restarting-cluster-from-7-3-upgrade/197846/8 they mention a similar situation (competing election) where the root cause was because the cluster state took too long to persist to disk
[13:32:37] <dcausse>	 haven't spotted a similar "publication cancelled before committing: timed out after 30s" error in the logs but perhaps something remotely related to a "slow" process exacerbating some races in the cluster elections
[13:33:08] <inflatador_>	 Perhaps related...I noticed that I forgot to set `node concurrent recoveries` back down to 4 after the migration. That could've added more cluster state to keep track of
[13:33:18] <inflatador_>	 it's currently set to 4, but I'm fixing it now
[13:33:22] <inflatador_>	 sorry...set to 20
[13:34:53] <dcausse>	 those discovery settings are scary I'd rather not touch them unless we understand what's going on
[13:37:41] <inflatador_>	 I agree. The big question to me is why this only happens on chi eqiad
[13:38:09] <dcausse>	 eqiad has 180 more shards than codfw but I doubt that's enough to explain it...
[13:38:26] <ebernhardson>	 \o
[13:38:28] <inflatador_>	 anyway, I just set the incoming and concurrent recoveries back down to their normal values
[13:38:47] <inflatador_>	 4 for incoming recoveries and 2 for concurrent recoveries
[13:39:00] <dcausse>	 o/
[13:40:09] <ebernhardson>	 re SUP backfills, i am running the reindexer script right now, it should be kicking them off
[13:41:36] <dcausse>	 ack
[13:42:55] <ebernhardson>	 scipt is almost done, one remaining question is when i get the pod list from k8s to reconcile known pods vs pods in k8s it doesn't list freshly submitted pods (maybe they have to be assigned first?) and i need to separate out cloudelastic vs eqiad somehow
[13:43:26] <dcausse>	 you mean after submitting the flink job?
[13:43:44] <dcausse>	 or mwsript pods?
[13:43:45] <ebernhardson>	 nah, the mwscript parts. If i submit 8 mwscript executions, then ask k8s for the pod list, it will be empty
[13:43:51] <ebernhardson>	 then in a minute they will be there
[13:44:18] <ebernhardson>	 so far in the script i just print a missing pods list, not sure what to do with it yet :P
[13:44:22] <inflatador_>	 So CODFW had recovery settings of 8, I just set back down to 4. I should probably set up monitors for that
[13:45:15] <dcausse>	 surprising, mwscript should give us back the pod name? perhaps having a pod name does not mean it's available in the list of pods?
[13:45:48] <ebernhardson>	 dcausse: it does give us the pod name, i added a step that gets the pods from k8s, the pods from state, and compares what should exist to what does. It only prints for now, but i was tihnking it could eventually reconcile the two
[13:45:56] <ebernhardson>	 (thats kinda/sorta how the k8s controller pattern would work, iiuc)
[13:46:38] <dcausse>	 but this discrepancy resolves by itself after a couple seconds?
[13:46:57] <ebernhardson>	 after a minute or so yea,  it seems like they dont come back in the pot list until they've actually been started
[13:47:11] <dcausse>	 wow a minute :/
[13:47:20] <ebernhardson>	 it's not a big deal, but it means i can't use that as a signal that some external thing killed our pods
[13:48:13] <ebernhardson>	 or maybe i have to remember the first time it's seen, to know that it's been fully created
[13:48:21] <dcausse>	 and if you query the pod directly instead of listing?
[13:49:00] <ebernhardson>	 hmm, not sure. I just used the pod listing function with label set to the comment we use.  maybe, lemme see
[13:49:52] <ebernhardson>	 it shows up in `kubectl get pods | grep abc123` immediately
[13:51:33] <ebernhardson>	 looks like i also have some error with cloudelastic :S the backfiller keeps trying to start up and fails due to a backfill it thinks wasn't issued by orchestration...so getting there but still problems :)
[13:51:44] <ebernhardson>	 this is only reindxing the small wikis (--exclude-dblist @cirrusearch-big-indices)
[13:52:39] <dcausse>	 ok
[13:54:06] <ebernhardson>	 does anything else issue backfills? I'm realizing now that eqiad and cloudelastic will also recongnize each others backfills, i suppose i need to separate them somehow via the configmaps
[13:55:37] <dcausse>	 nothing else show issue backfills, but possibly there's some delay before the flink-k8s-operator actually decides to start some pods
[13:56:46] <dcausse>	 cloudelastic & eqiad should have different releases?
[13:57:29] <ebernhardson>	 oh silly me, yes the releases are separate and are enough.  I guess the question is `kubectl get cm -o yaml flink-app-consumer-cloudelastic-backfill-flink-app-config` returns a configmap without our `index-aliases-faux-param`  and i'm not sure why
[13:57:34] <ebernhardson>	 any backfill from the reindexer should have it
[13:58:16] <ebernhardson>	 thats the param the reindexer uses to pick up a backfill and finish it, to know what it was working on
[13:59:25] <dcausse>	 weird, this should be in the values files of deployment-chart?
[14:00:06] <ebernhardson>	 we set it from the command line when submitting, with --set
[14:01:22] <dcausse>	 ah, stumbled on some problems when setting params like that but was mainly params with . in them or some serialization issues (string vs int or the like)
[14:01:45] <ebernhardson>	 oh..sigh.  none of them have it anymore, it's an optional argument that i left out :P  It just that we only use that information if the backfiller died and we have to start it back up. So the others are still working
[14:03:59] <ebernhardson>	 i can't decide, but i think this is also a decent bit slower for the tiny wikis than before.   It's probably fine, but since before they were all in threads it could be starting and finishing multiple things at a time if they run quick, now it's all sequentially done. Maybe better and easier to reason about that way.
[14:04:34] <ebernhardson>	 i could still fire them off in threads (or asyncio), but that seems like unnecessary complications
[14:05:38] <ebernhardson>	 looks like they've done ~1500 indexes in the last 10-ish hours, not terrible
[14:12:41] <ebernhardson>	 hmm, actually the missing pods must be something else...codfw is currently printing that it's missing a pod that kubectl says has been running for 15m.  i guess i have to look closer into what the pod listing is giving us 
[14:19:49] * ebernhardson should also write something that reads in a state.json and reports some stats
[14:20:56] <inflatador_>	 this might be a silly suggestion, but would it be worth it to register some kind of external service that the pods report back to periodically?
[14:22:43] <ebernhardson>	 inflatador_: hmm, i suspect that just makes this more complicated. Hopefully we can trust the k8s apis 
[14:24:54] <inflatador_>	 ACK, thanks for indulging me ;)
[14:26:26] <ebernhardson>	 suprisingly not terrible, gave claude.ai the state.py and asked for a report that reads in and reports on it.  This is in-progress eqiad (not perfect, but not the worst): https://phabricator.wikimedia.org/P81269
[14:27:04] <ebernhardson>	 considering it took 2 minutes to write...i kinda like it :)
[14:27:53] <ebernhardson>	 i think the timing analysis must be wrong though
[14:28:31] <inflatador_>	 {◕ ◡ ◕}
[14:38:34] <ebernhardson>	 and it was even able to fix the timing analysis with just a printout of the bad content...surprisingly not terrible
[14:55:50] <Trey314159>	 Wednesday Meeting or P&T Staff meeting?
[14:56:02] <dcausse>	 wed?
[14:56:19] <Trey314159>	 I'm okay with watching the recording of the P&T meeting
[14:56:30] <ebernhardson>	 ya, wed
[16:03:28] <inflatador_>	 Here's the ticket where we tried to get rid of local logstash: T324335 . And it does look like Observability has a log4 file that would probably work for us: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/opensearch/templates/log4j2_1.properties.erb
[16:03:28] <stashbot>	 T324335: Remove logstash from the Search Elasticsearch servers - https://phabricator.wikimedia.org/T324335
[16:14:56] <inflatador_>	 workout, back in ~40
[17:00:33] <dcausse>	 dinner
[17:11:16] <inflatador_>	 back
[17:19:41] <inflatador_>	 e
[17:36:06] <inflatador_>	 So we're already using the puppet template above, and the changes Erik match what we used on Elastic previously. So I think I'm gonna try to mix and match
[18:09:37] <inflatador_>	 Looks like we removed `EnvironmentFile=-/etc/default/opensearch` from our systemd units. We'll need to add the java security options argument somewhere so opensearch can talk to rsyslog
[18:10:02] <inflatador_>	 anyway, lunch...back in ~45 or so
[18:58:02] <inflatador_>	 back
[19:34:47] <ebernhardson>	 oh i'm a dumb dumb...the reason my thing constantly complains about missing pods is i'm looking for a label thats not a label :P
[19:35:08] <ebernhardson>	 for some reason i was sure i saw the comment in labels before...but we can filter it out manually
[19:37:26] <ebernhardson>	 it's an annotation and not a label, apparently
[19:55:24] * ebernhardson sighs ... still have shutdown conditions wrong :P  It doesn't wait for in-progress reindexes after ctrl-c, it just waits for the backfill
[20:24:23] <inflatador_>	 So the log4j template is defined here https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/opensearch/manifests/instance.pp#242 . If I try and redefine it in a more specific place (like `modules/profile/manifests/opensearch/cirrus/server.pp`), will puppet complain?
[20:24:47] <inflatador_>	 I'm trying to avoid messing with the code observability uses if possible
[20:25:02] <ebernhardson>	 hmm
[20:25:48] <ebernhardson>	 inflatador_: you would probably have to acceot the template name as an argument to opensearch::instance?
[20:26:27] <ebernhardson>	 puppet wont like multiple File definitions for the same file
[20:27:22] <inflatador_>	 ACK, I'll check it out
[20:28:14] <ebernhardson>	 i'd be tempted to check with o18y, either their logs work and ours don't which is curious, or they would plausibly be interested
[20:28:26] <ebernhardson>	 might not need a variation
[20:28:51] <inflatador_>	 worth a look
[20:30:42] <ebernhardson>	 being bold, kicked off the reindex on all three clusters against cirrussearch-big-indices now
[20:31:06] <ebernhardson>	 the not-cirrussearch-big-indices were already run last night
[20:33:02] <inflatador_>	 I'll modify the existing file instead and CC 0lly on the patch. I already asked 'em for suggestions earlier today, so they may be interested
[20:35:48] <inflatador_>	 oh, now that I think of it they're using OpenSearch 2. So I don't think that template would even touch their servers. The only other place that would be affected would be the datahubsearch cluster
[20:53:43] <inflatador_>	 actually...that's true for the log4j template, but not for the jvm options file. I'm not sure we could avoid touching the JVM options file. Maybe we could add an environment file, but then we'd have to mess with the systemd unit files
[20:53:48] <inflatador_>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178613 WIP
[20:54:27] <inflatador_>	 ryankemper ^^ if you have ideas on this one LMK, it's for fixing the logstash config like we talked about last week
[20:55:45] <inflatador_>	 I don't love changing the jvm options file for the other roles, even if it's just a comment. If y'all have better ideas LMK
[21:02:57] <ryankemper>	 inflatador_: will take a look. hopping on pairing in ~5m
[21:05:34] <inflatador_>	 np, I think I'm just gonna modify the main opensearch puppet plan, having extra JVM options seems like it'd be useful regardless of role