[09:33:22] <dcausse> errand+lunch
[13:12:10] <inflatador> <o/
[13:15:37] <inflatador> dcausse looks like we forgot this one: https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/1110833 Are we still OK to merge?
[13:15:45] <dcausse> o/
[13:15:50] <dcausse> inflatador: looking
[13:16:50] <dcausse> inflatador: I think so? I might have to rebase perhaps, looking
[13:17:06] <dcausse> no it rebased cleanly
[13:17:40] <inflatador> cool, I think it requires a grizzly deploy, will work on that with Ryan at our pairing today
[13:24:25] <dcausse> sure, thanks!
[13:55:31] <dcausse> pfischer: when you have a moment could you take a look at https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/1120205 ? it worked well when I backfilled articlecountry, it's not ideal but it's the easiest solution I found at the time
[14:09:56] <ebernhardson> \o
[14:12:55] * ebernhardson sometimes wishes opensearch would say "I see you've tried to use lucene syntax and i've switched to using it" instead of "Bad human, no lucene for you!"
[14:13:41] <ebernhardson> there is a toggle, it could, but instead it just complains :P
[14:38:17] <dcausse> o/
[14:44:41] <pfischer> dcausse: looking
[14:44:46] <dcausse> thx!
[14:45:48] <pfischer> dcausse: +2 done
[14:46:29] <dcausse> thanks!
[14:50:06] <inflatador> .o/
[14:54:23] <inflatador> https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133429 yet another puppet patch for fixing var paths
[14:54:35] <dcausse> ebernhardson: you mean the query_string failing to parse?
[14:56:23] <ebernhardson> dcausse: i mean when you use lucene syntax it pops up a box in the bottom right that says you tried to use lucene syntax but it's set to dashboard query language, and links the DQL docs
[14:56:47] <ebernhardson> but it never remembers that i set it to lucene syntax, it reverts back to DQL every time i visit
[14:57:04] <dcausse> oh in opensearch dashboard, yes I stumble on this frequently
[14:57:31] <ebernhardson> i just feel like better UI would be switching it for you, instead of pushing DQL
[14:57:56] <dcausse> yes, tbh I'm often lost in opensearch dashboard...
[14:58:36] <ebernhardson> i suppose i haven't looked closely enough, maybe DQL has real bools or some such that makes it better
[15:44:40] <ebernhardson> gehel: anything interesting at https://app.asana.com/0/0/1209864449777614 ? I don't have access in asana, but it's related to making search bar more prominent
[15:57:00] <inflatador> workout, back in ~40
[15:59:21] <inflatador> also, on the plugin errors d-causse and I were seeing, there are differences in `plugin.mandatory` vs relforge. For example `extra-analysis-esperanto` is listed as mandatory on cirrussearch, but not on relforge. We have `opensearch-extra-analysis-esperanto` instead
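
For context on the `plugin.mandatory` mismatch above: OpenSearch refuses to start a node when a plugin named in `plugin.mandatory` is not installed, so the rename to `opensearch-extra-analysis-esperanto` has to be reflected in that setting on each cluster. A rough way to compare the two on a given host might look like the sketch below (the `/etc/opensearch/` config path is an assumption about where the per-cluster opensearch.yml lives; port 9200 matches the curl used later in this log):

```
# list the plugins actually installed on this node
curl -s 'http://localhost:9200/_cat/plugins?v'
# show what the node's config declares as mandatory (path is a guess at the
# usual per-cluster layout; adjust to wherever opensearch.yml actually lives)
grep -r 'plugin.mandatory' /etc/opensearch/
```
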
[16:53:50] <inflatador> ryankemper another CR to fix some vars: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133481
[17:05:36] <inflatador> ^ self-merged that guy
[17:11:35] <inflatador> opensearch is up and running, but not joining cluster. Probably due to firewall rules...checking
[17:30:31] <ebernhardson> nice! any luck with joining cluster? I can help look if not
[17:32:26] <inflatador> feast your eyes on this! https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133487
[17:33:16] <ebernhardson> makes sense
[17:34:54] <ebernhardson> sadly, it doesn't look like the envoy changes made much difference in the connection failures :( They aren't high rate, but they continue
[17:35:04] <inflatador> ryankemper did you end up reimaging cirrussearch2055 last night? I just wanna make sure we do at least one complete teardown/rebuild before starting the migration
[17:35:41] <inflatador> ah, that's too bad...do you know if they deployed mw yet? Maybe it hasn't taken effect?
[17:36:01] <ebernhardson> i suppose i was assuming they had but hadn't double checked. They should have had a deployment window for EU
[17:36:43] <ebernhardson> yea there was a scap sync-world at 10:53 UTC today
[17:37:15] <ebernhardson> we also had a very large spike, ~12k over 30s. Lines up with Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs2013.*,lvs1019.*} and A:lvs
[17:37:26] <ebernhardson> so sounds like that restart isn't too graceful :(
[17:37:34] <inflatador> LOL
[17:38:13] <ebernhardson> i suppose i'll also poke at the cirrus retry logic. This isn't super important, but it bugs me that we fail user requests :P
[17:39:14] <inflatador> OK, looks like the firewall rules will work...running puppet in codfw
[17:39:24] <inflatador> this reminds me that we need new cumin aliases for cirrussearch
[17:39:38] <ebernhardson> i kinda wish there was a nice way to remove spikes in the opensearch dashboard, it's hard to tell if there was a reduction between yesterday and today since that spike causes the rest of the graph to be almost nothing
[17:42:02] <inflatador> ryankemper come to think of it, we probably don't want to tear down cirrussearch2055...at least not if it has primary shards. Might have to do another one
[17:42:55] <inflatador> `bking@cirrussearch2055:~$ curl -s http://0:9200/_cat/nodes | grep cirrus`
[17:42:55] <inflatador> `10.192.23.21 1 21 6 1.12 0.50 0.34 dir - cirrussearch2055-production-search-codfw`
[17:44:44] <ebernhardson> actually the rate might be reduced, looking at 08:00-18:00: the 30th had ~900, the 31st had ~2k, the 1st had ~1k, the 2nd (today) had ~270. But with that kind of variance it's hard to say
[17:46:38] <ebernhardson> 27th had 500, 26th (last wednesday) had 400. So all over the place. Today is lowest, but maybe let it run for a week and hope it stays lower
[17:49:23] <inflatador> yeah, it doesn't seem to hurt anything at least
[17:50:30] <inflatador> lunch, back in ~45
[18:03:55] <ryankemper> inflatador: didn’t reimage
[18:37:16] <inflatador> back
[18:40:00] <gehel> just fyi: we're getting more traffic on WDQS again... https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wdqs&var-graph_type=%289102%7C919%5B35%5D%29&viewPanel=44
[18:41:29] <inflatador> gehel did I miss an alert, or what tipped you off to this?
[18:42:13] <ebernhardson> btw also noticed (via sukhe) that cloudelastic1008 is back up and joined the cluster/has shards, but isn't pooled
[18:42:51] <ebernhardson> it's basically doing 90%+ of the expected work if it's in the cluster, being pooled is minor
[18:44:31] <inflatador> yeah, I just saw that. DC Ops worked some magic
[18:44:44] <inflatador> 2.5 TB of data in chi, so definitely in use
[18:51:19] <inflatador> ryankemper ACK, let's find another canary we can use to test the cookbook
[18:59:20] <ebernhardson> relatedly, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133495 is the patch that should configure envoy to auto-retry connection issues
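
One hedged way to check that patch once it is deployed would be to pull the live retry policy out of the local Envoy admin interface, roughly as sketched below (the 9631 admin port and the exact jq filter are assumptions, not something stated in this log or the patch):

```
# dump the running Envoy configuration and extract any route retry policies;
# once the change is live we'd expect retry_on to cover connection-level
# failures (e.g. connect-failure / reset)
curl -s http://localhost:9631/config_dump | jq '.. | .retry_policy? // empty'
```
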
[19:05:27] <ebernhardson> perhaps interesting paper - https://irrj.org/article/view/19625 - "Don't Use LLMs to Make Relevance Judgements", which presents the keynote from ACM SIGIR 2024 in paper form
[19:06:27] <inflatador> :eyes
[19:10:16] <inflatador> and...merged
[19:13:39] <ebernhardson> thanks!
[19:14:30] <gehel> inflatador: we were looking at those graphs with ryankemper. I don't think anything is broken yet, so I don't think we should have had an alert. But we might soon if the trend continues :)
[19:19:41] <inflatador> ACK
[20:35:46] <ebernhardson> yea it's interesting, also probably relevant to keep in mind that he keeps comparing it to TREC (and others), but TREC is a bit unique in that they hire a team of people to do the grading there