[09:33:22] <dcausse> errand+lunch
[13:12:10] <inflatador> <o/
[13:15:37] <inflatador> dcausse looks like we forgot this one: https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/1110833 Are we still OK to merge?
[13:15:45] <dcausse> o/
[13:15:50] <dcausse> inflatador: looking
[13:16:50] <dcausse> inflatador: I think so? I might have to rebase perhaps, looking
[13:17:06] <dcausse> no it rebased cleanly
[13:17:40] <inflatador> cool, I think it requires a grizzly deploy, will work on that with Ryan at our pairing today
[13:24:25] <dcausse> sure, thanks!
[13:55:31] <dcausse> pfischer: when you have a moment could you take a look at https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/1120205 ? it worked well when I backfilled articlecountry, it's not ideal but it's the easiest solution I found at the time
[14:09:56] <ebernhardson> \o
[14:12:55] * ebernhardson sometimes wishes opensearch would say "I see you've tried to use lucene syntax and i've switched to using it" instead of "Bad human, no lucene for you!"
[14:13:41] <ebernhardson> there is a toggle, it could, but instead it just complains :P
[14:38:17] <dcausse> o/
[14:44:41] <pfischer> dcausse: looking
[14:44:46] <dcausse> thx!
[14:45:48] <pfischer> dcausse: +2 done
[14:46:29] <dcausse> thanks!
[14:50:06] <inflatador> .o/
[14:54:23] <inflatador> https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133429 yet another puppet patch for fixing var paths
[14:54:35] <dcausse> ebernhardson: you mean the query_string failing to parse?
[14:56:23] <ebernhardson> dcausse: i mean when you use lucene syntax it pops up a box in the bottom right that says you tried to use lucene syntax but it's set to dashboard query language, and links the DQL docs
[14:56:47] <ebernhardson> but it never remembers that i set it to lucene syntax, it reverts back to DQL every time i visit
[14:57:04] <dcausse> oh in opensearch dashboard, yes I stumble on this frequently
[14:57:31] <ebernhardson> i just feel like better UI would be switching it for you, instead of pushing DQL
[14:57:56] <dcausse> yes, tbh I'm often lost in opensearch dashboard...
[14:58:36] <ebernhardson> i suppose i haven't looked closely enough, maybe DQL has real bools or some such that makes it better
[15:44:40] <ebernhardson> gehel: anything interesting at https://app.asana.com/0/0/1209864449777614 ? I don't have access in asana, but it's related to making search bar more prominent
[15:57:00] <inflatador> workout, back in ~40
[15:59:21] <inflatador> also, on the plugin errors d-causse and I were seeing, there are differences in `plugin.mandatory` vs relforge. For example `extra-analysis-esperanto` is listed as mandatory on cirrussearch, but not on relforge. We have `opensearch-extra-analysis-esperanto` instead
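
For context on the `plugin.mandatory` mismatch above: OpenSearch refuses to start a node when a plugin named in `plugin.mandatory` is not installed, so the rename to `opensearch-extra-analysis-esperanto` has to be reflected in that setting on each cluster. A rough way to compare the two on a given host might look like the sketch below (the `/etc/opensearch/` config path is an assumption about where the per-cluster opensearch.yml lives; port 9200 matches the curl used later in this log):

```
# list the plugins actually installed on this node
curl -s 'http://localhost:9200/_cat/plugins?v'
# show what the node's config declares as mandatory (path is a guess at the
# usual per-cluster layout; adjust to wherever opensearch.yml actually lives)
grep -r 'plugin.mandatory' /etc/opensearch/
```
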
[16:53:50] <inflatador> ryankemper another CR to fix some vars: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133481
[17:05:36] <inflatador> ^ self-merged that guy
[17:11:35] <inflatador> opensearch is up and running, but not joining cluster. Probably due to firewall rules...checking
[17:30:31] <ebernhardson> nice! any luck with joining cluster? I can help look if not
[17:32:26] <inflatador> feast your eyes on this! https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133487
[17:33:16] <ebernhardson> makes sense
[17:34:54] <ebernhardson> sadly, it doesn't look like the envoy changes made much difference in the connection failures :( They aren't high rate, but they continue
[17:35:04] <inflatador> ryankemper did you end up reimaging cirrussearch2055 last night? I just wanna make sure we do at least one complete teardown/rebuild before starting the migration
[17:35:41] <inflatador> ah, that's too bad...do you know if they deployed mw yet? Maybe it hasn't taken effect?
[17:36:01] <ebernhardson> i suppose i was assuming they had but hadn't double checked. They should have had a deployment window for EU
[17:36:43] <ebernhardson> yea there was a scap sync-world at 10:53 UTC today
[17:37:15] <ebernhardson> we also had a very large spike, ~12k over 30s. Lines up with Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs2013.*,lvs1019.*} and A:lvs
[17:37:26] <ebernhardson> so sounds like that restart isn't too graceful :(
[17:37:34] <inflatador> LOL
[17:38:13] <ebernhardson> i suppose i'll also poke at the cirrus retry logic. This isn't super important, but it bugs me that we fail user requests :P
[17:39:14] <inflatador> OK, looks like the firewall rules will work...running puppet in codfw
[17:39:24] <inflatador> this reminds me that we need new cumin aliases for cirrussearch
[17:39:38] <ebernhardson> i kinda wish there was a nice way to remove spikes in the opensearch dashboard, it's hard to tell if there was a reduction between yesterday and today since that spike causes the rest of the graph to be almost nothing
[17:42:02] <inflatador> ryankemper come to think of it, we probably don't want to tear down cirrussearch2055...at least not if it has primary shards. Might have to do another one
[17:42:55] <inflatador> `bking@cirrussearch2055:~$ curl -s http://0:9200/_cat/nodes | grep cirrus`
[17:42:55] <inflatador> `10.192.23.21 1 21 6 1.12 0.50 0.34 dir - cirrussearch2055-production-search-codfw`
[17:44:44] <ebernhardson> actually the rate might be reduced, looking at 08:00-18:00: the 30th had ~900, the 31st had ~2k, the 1st had ~1k, the 2nd (today) had ~270. But with that kind of variance it's hard to say
[17:46:38] <ebernhardson> 27th had 500, 26th (last wednesday) had 400. So all over the place. Today is lowest, but maybe let it run for a week and hope it stays lower
[17:49:23] <inflatador> yeah, it doesn't seem to hurt anything at least
[17:50:30] <inflatador> lunch, back in ~45
[18:03:55] <ryankemper> inflatador: didn’t reimage
[18:37:16] <inflatador> back
[18:40:00] <gehel> just fyi: we're getting more traffic on WDQS again... https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wdqs&var-graph_type=%289102%7C919%5B35%5D%29&viewPanel=44
[18:41:29] <inflatador> gehel did I miss an alert, or what tipped you off to this?
[18:42:13] <ebernhardson> btw also noticed (via sukhe) that cloudelastic1008 is back up and joined the cluster/has shards, but isn't pooled
[18:42:51] <ebernhardson> it's basically doing 90%+ of the expected work if it's in the cluster, being pooled is minor
[18:44:31] <inflatador> yeah, I just saw that. DC Ops worked some magic
[18:44:44] <inflatador> 2.5 TB of data in chi, so definitely in use
[18:51:19] <inflatador> ryankemper ACK, let's find another canary we can use to test the cookbook
[18:59:20] <ebernhardson> relatedly, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133495 is the patch that should configure envoy to auto-retry connection issues
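
One hedged way to check that patch once it is deployed would be to pull the live retry policy out of the local Envoy admin interface, roughly as sketched below (the 9631 admin port and the exact jq filter are assumptions, not something stated in this log or the patch):

```
# dump the running Envoy configuration and extract any route retry policies;
# once the change is live we'd expect retry_on to cover connection-level
# failures (e.g. connect-failure / reset)
curl -s http://localhost:9631/config_dump | jq '.. | .retry_policy? // empty'
```
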
[19:05:27] <ebernhardson> perhaps interesting paper - https://irrj.org/article/view/19625 - "Don't Use LLMs to Make Relevance Judgements", which presents the keynote from ACM SIGIR 2024 in paper form
[19:06:27] <inflatador> :eyes
[19:10:16] <inflatador> and...merged
[19:13:39] <ebernhardson> thanks!
[19:14:30] <gehel> inflatador: we were looking at those graphs with ryankemper. I don't think anything is broken yet, so I don't think we should have had an alert. But we might soon if the trend continues :)
[19:19:41] <inflatador> ACK
[20:35:46] <ebernhardson> yea it's interesting, also probably relevant to keep in mind that he keeps comparing it to TREC (and others), but TREC is a bit unique in that they hire a team of people to do the grading there