[10:09:05] I wanted to release mjolnir 2.5.0, but CI failed https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/jobs/464300
[10:09:23] remote: HTTP Basic: Access denied. If a password was provided for Git authentication, the password was incorrect or you're required to use a token instead of a password. If a token was provided, it was either incorrect, expired, or improperly scoped. See https://gitlab.wikimedia.org/help/topics/git/troubleshooting_git.md#error-on-git-fetch-http-basic-access-denied
[10:09:23] fatal: Authentication failed for 'https://gitlab.wikimedia.org/repos/search-platform/mjolnir/'
[10:11:26] perhaps the ci token expired, looking
[10:11:57] yes, no access token
[10:12:11] creating one
[10:12:27] ack
[10:12:41] i see that we only have CI_PROJECT_PASSWORD
[10:12:42] dcausse thanks!
[10:17:58] the default expiration for tokens is quite short, which might explain why these tokens go away frequently
[10:18:41] dcausse yeah, it's not the first time I've bumped into this
[10:18:43] I re-ran the trigger_release job, hopefully this time it passes
[10:19:13] thanks!
[10:47:40] errand+lunch
[11:19:09] lunch + errand
[11:48:11] mmm, 2.5.0 was released, but the conda env failed publishing: https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/jobs/464333#L1448
[11:48:18] errand + lunch, then I'll take a look
[13:12:39] I just shared my (extremely WIP) Elastic -> Opensearch migration plan with y'all. I'll keep working on it, but feel free to comment in the doc or ask questions here...or add stuff yourself, if you're feeling bold https://docs.google.com/document/d/1S4p03N_kJAF-tr4qDWi23ZG3LKiSDFLcMSogF02L9L8/edit?tab=t.0
[13:32:37] o/
[13:43:40] o/
[13:44:18] inflatador i don't have meaningful comments, but thanks for sharing. Super interested in learning how this works :)
[13:44:53] I created a repo to track experimental vector search code https://gitlab.wikimedia.org/gmodena/vector_search
[13:46:10] nothing fancy at all, but I wanted to validate the direction re mappings: https://gitlab.wikimedia.org/gmodena/vector_search/-/blob/main/index.py?ref_type=heads#L35
[13:46:29] now context-switching back to the conda env issue
[13:48:52] thanks!
[14:00:18] gmodena: we have a meeting about "MediaWiki Domains: Search & AI/ML Review" right now. You don't need to be there, but if you're interested, feel free to join!
[14:02:29] \o
[14:23:55] .o/
[14:50:57] I'm going to miss the Search standup today due to a conflicting meeting, but I'm curious to get y'all's thoughts about whether or not we should remove rack anti-affinity during the elastic->opensearch migration
[14:53:17] o/
[14:53:24] thanks for the heads up gehel
[14:53:26] o/
[15:58:25] thanks for merging that alerts CR ebernhardson !
[16:02:26] dcausse: we're in https://meet.google.com/eki-rafx-cxi
[16:02:32] oops
[16:48:23] * ebernhardson is mildly impressed the video quality was high enough to read "tallest leprechaun"
[17:16:59] is there any reason we haven't merged https://gitlab.wikimedia.org/repos/search-platform/discolytics/-/merge_requests/48 yet?
[17:18:36] ebernhardson: no, I think I just missed it...
[17:19:35] looks simple enough, i'll merge if it passes CI
[17:29:09] I'm looking at T387569 ...basically we need the code that creates the LVS pools ( https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/elasticsearch/cirrus.yaml#7 ) to gate on cluster membership. I'm just starting to dig, but if anyone has ideas LMK
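A minimal sketch of the T387569 mismatch inflatador describes above: the role-level hiera puts every cirrus host into all three LVS pools, while each host really only runs two of the three clusters. The dict shapes, host names, and pool-to-cluster mapping below are illustrative only, not the real hiera schema; you'd build them from the yaml files linked above.

```python
# pools the role-level cirrus.yaml would declare for every host (illustrative names)
role_pools = {"search-chi-eqiad", "search-omega-eqiad", "search-psi-eqiad"}

# which clusters each host actually runs, per the per-DC hieradata (hypothetical hosts)
host_clusters = {
    "elastic1100": {"chi", "omega"},
    "elastic1101": {"chi", "psi"},
}

# which cluster backs which LVS pool (illustrative)
pool_to_cluster = {
    "search-chi-eqiad": "chi",
    "search-omega-eqiad": "omega",
    "search-psi-eqiad": "psi",
}

# flag hosts that would be pooled for a cluster they don't actually run
for host, clusters in sorted(host_clusters.items()):
    bogus = {p for p in role_pools if pool_to_cluster[p] not in clusters}
    if bogus:
        print(f"{host} would be pooled for {sorted(bogus)} but does not run that cluster")
```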
[17:29:10] T387569: Update Elastic puppet code to filter LVS config based on cluster membership - https://phabricator.wikimedia.org/T387569
[17:29:56] inflatador: is the problem that all servers are in all 3 according to yaml, but only 2 in reality?
[17:30:08] ebernhardson exactly
[17:30:42] * ebernhardson tries to find where we decide what clusters are on what servers
[17:32:01] hmm, it seems we list them out explicitly in hieradata/role/eqiad/elasticsearch/cirrus.yaml and similar for codfw...makes it harder :S
[17:32:15] * ebernhardson also didn't realize eqiad is 6 rows now
[17:32:28] I've been looking at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/opensearch/cirrus/server.pp#50
[17:33:08] but yeah, looks like that's sourced from the YAML you linked
[17:33:26] hmm, i suppose it depends on how the lvs puppet works, but my hope was that we could do something in hieradata/regex.yaml that provides the right thing in the right place...but the regex would have to be a silly thing that selects all the right hosts. quite error prone
[17:33:30] * inflatador wishes we had real service discovery
[17:34:02] maybe we can use the cluster yaml file, or whatever it's called? Checking
[17:36:05] nope...was thinking of hieradata/common/service.yaml , not the right place though
[17:36:58] stupid idea: simplify by putting all the clusters everywhere
[17:37:22] it's not the best use of memory, that's like 12-ish GB of "wasted" memory, * 100 servers
[17:37:44] would rather not i suppose, but would work
[17:38:44] It's not a bad idea, given our constraints
[17:39:30] it's a significant amount of wasted memory though, >1TB
[17:39:42] Y'all split the clusters mainly because it was hard to maintain cluster state past a certain number of shards? Or what prompted it?
[17:40:15] the idea of splitting was that the small clusters contain a bunch of tiny wikis and don't need significant amounts of resources
[17:40:39] we wanted 3 clusters to split the state up enough, but we only ran 1 small cluster per server to avoid unnecessary extra resource usage
[17:42:12] yeah, these are the tough choices we have to make sans a larger VM platform
[17:42:47] yea ideally the small clusters would be indepdenant and not care about details like number of servers, we just spin up the right amount of resources...but thats not how we work :P
[17:42:53] ideally i could spell too :P
[17:45:52] based on how puppet works, and assuming SRE won't want us installing custom magic into profile::lvs::realserver, we have to define the correct set of pools on a per-server basis via hieradata/hosts/{hostname}.yaml
[17:46:37] or do it with regex and match all the right hosts
[17:46:42] both seem error prone :(
[17:48:01] yeah, I think Ben came to a similar conclusion. I do see lookups in other hieradata/role/common YAML files (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/grafana.yaml#6 ), but not sure if that gets us what we need
[17:51:24] honestly, adding all clusters to all hosts is sounding like the least bad option. It would simplify a lot of other things as well
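For reference, the rough arithmetic behind the "12-ish GB * 100 servers" and ">1TB" figures above (and the percentage that comes up next). These are the chat's own estimates, not measurements.

```python
# Back-of-envelope for "run all clusters everywhere", using the chat's rough numbers:
# ~12 GB of extra heap per host for the additional small instance, ~100 hosts,
# ~30 TB of RAM fleet-wide.
extra_gb_per_host = 12
hosts = 100
fleet_tb = 30

wasted_tb = extra_gb_per_host * hosts / 1024
print(f"~{wasted_tb:.1f} TB of heap that would otherwise be OS page cache")
print(f"~{wasted_tb / fleet_tb:.1%} of total fleet memory")
# -> ~1.2 TB, ~3.9% (the 3.33% quoted next assumes an even 1 TB)
```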
[17:52:19] i guess i just feel bad, a TB of memory sounds like a lot :P We have ~30TB in total so it's like 3.33%, but still
[17:56:09] I don't love it, but we're hardly the worst offender when it comes to underutilized resources ;(
[17:57:23] we actually do use that memory :P It's shrinking the disk cache, meaning we will have 1TB less space available to keep indices hot. When that runs thin we get into latency trouble. I wonder how full we are these days...i think i have some old notes about how to check
[17:57:59] it might require the kernel page-types tool though, which isn't always available
[17:59:01] Let me know if you find anything, I'm def interested
[18:00:15] we could use cgroups to fence off memory for each instance, not sure that would apply to the file cache though?
[18:01:01] it's not about limiting particular memory, it's that the primary purpose of memory in the search servers is to cache the disk in memory. The question of how much memory is actually free is about how much of the disk cache is being regularly referenced
[18:01:46] we've had a few rounds of having perf hits and debugging to find we weren't using the memory effectively enough, but i think that last occurred when we had 128G per host (maybe)
[18:02:43] hmm, and indeed we don't have the linux-tools-* packages for the kernels we run in prod. iirc we don't typically build them because they aren't usually used
[18:03:38] would we need to install and boot into a debug kernel to get what we need?
[18:03:51] dinner
[18:04:13] no, the tool works against a plain kernel, it just has to be built. plausibly could build it locally if we find the right source packages
[18:04:52] i think the last time we dealt with it was https://phabricator.wikimedia.org/T264053
[18:05:24] ah yes, one of my favorite troubleshooting tickets of all time ;)
[18:05:56] actually you are probably referring to an earlier version of that, https://phabricator.wikimedia.org/T169498
[18:06:07] the second ticket just re-applied the learnings from the first one
[18:06:41] ACK
[18:07:18] i think what we would want to do to check how much memory is actually free is to replicate https://phabricator.wikimedia.org/T264053#6519064 , which indeed needs that page-types tool
[18:08:10] but also that confirms the last time we had trouble was on 128G servers, now we are 256G everywhere iirc and maybe we have space to spare. Maybe we could poke moritz about building the linux-tools-5.10.0-30 package?
[18:09:51] i suppose an "easier" way that is a little more foolproof would be starting a python script on all the servers that allocates a specified amount of memory and then sleeps. Could verify through changes in IO usage when it starts to matter
[18:10:29] ACK, let me get a ticket started. I'll tag you and feel free to up AC
[18:10:38] err..update AC, that is
[18:10:45] sure
[18:16:30] OK, we have T389109 ...heading to lunch
[18:16:31] T389109: Determine memory needs of production Elasticsearch/Opensearch processes - https://phabricator.wikimedia.org/T389109
[19:48:14] ryankemper: curious thing, https://gerrit.wikimedia.org/r/c/operations/software/opensearch/plugins/+/1126663 isn't merged but the .deb got built? about to merge now but not sure if it requires a build anywhere
[19:52:56] * ebernhardson also can't seem to decide what debian-glue-non-voting was supposed to do ... it failed but on a thing i'm not really understanding: publishing the xUnit test result report
[19:55:33] I'm tapping out on conda for today. I can repro mjolnir's CI failure locally. There might be a mismatch because in gitlab ci we declare PACKAGE_NAME=wmf-mjolnir, but in setup.py we name it wmf_mjolnir. Regardless of naming, the package is installed at build time.... but conda / wmf_airflow_utils is not able to re-discover it in the `package_version` codepath
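Stepping back to the memory probe suggested at 18:09:51 (now T389109): a minimal sketch of "allocate a specified amount of memory and sleep", then watch iostat / latency dashboards to see when the shrunken page cache starts to matter. The script name and flags here are hypothetical, not an existing tool.

```python
#!/usr/bin/env python3
"""Hold N GiB of RAM on a host and sleep, per the T389109 idea (sketch only)."""
import argparse
import time


def main():
    parser = argparse.ArgumentParser(description="hold N GiB of RAM and sleep")
    parser.add_argument("gib", type=int, help="amount of memory to pin, in GiB")
    parser.add_argument("--sleep", type=int, default=3600, help="seconds to hold it")
    args = parser.parse_args()

    buf = bytearray(args.gib * 1024**3)
    # touch every page so the allocation is actually backed by physical RAM,
    # otherwise a freshly zeroed buffer may not really evict any page cache
    for off in range(0, len(buf), 4096):
        buf[off] = 1

    print(f"holding {args.gib} GiB for {args.sleep}s")
    time.sleep(args.sleep)


if __name__ == "__main__":
    main()
```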
[19:57:53] sounds fun :(
[19:58:10] I'm a bit out of ideas at the moment, and I'm not familiar with the library package_version uses under the hood (https://docs.red-dove.com/distlib/index.html)
[19:58:32] I see mjolnir in site-packages (dist/conda_dist_env/lib/python3.10/site-packages/wmf_mjolnir-2.5.0.dev0.dist-info)
[19:59:02] but distlib does not collect it :|
[19:59:11] ebernhardson for some definition of fun :)
[20:04:08] can't say i'm familiar either sadly, i don't remember doing anything special when we first set it up
[20:11:08] ebernhardson: I think I did the build with my patch checked out instead of on master and forgot to merge. so tl;dr is I expect everything should be fine (& thanks for merging)
[20:11:50] ryankemper: no worries. One other question, i had expected debian (probably via puppet) to simply install the new package, but when i checked cloudelastic1007 it has the new .deb available via apt but not installed. Is that expected?
[20:12:01] i'm just creating the ticket now to restart cloudelastic and relforge for the new plugins
[20:13:01] ACK, we can take a look later today
[20:13:12] T389119
[20:13:12] T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119
[20:13:50] * ebernhardson is slowly working through the needs review column today ... looks like i might not make it to ready for dev :P
[20:14:44] ebernhardson: yup that's expected, we have to tell the rolling-operation cookbook to install plugins https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/elasticsearch/rolling-operation.py#256
[20:15:12] ahh, ok. I suppose i just assumed it would generally update all debian packages as new things become available. Are all packages like that, or just this?
[20:18:41] Generally we use 'present' instead of 'latest' with the Puppet package module, for the same reason you'd avoid 'latest' tags for Docker images I guess
[20:19:18] ahh, ok that makes sense. so `apt upgrade` is never run, just package-by-package. I suppose that makes sense to keep things from unexpectedly breaking
[20:20:14] We also don't allow puppet to restart Elastic on updates so we can control things a bit better
[20:25:25] Yeah I think it's one of those change management things. if we build a new deb we don't want it to immediately roll out to prod without us being ready to validate it
[20:53:05] It'll be nice to run the rolling operation against a 100% opensearch cluster, I'm not sure if we've done that yet
[21:39:52] gmodena: i dunno how helpful it is, but i noticed in the v2.4.0 mjolnir release console output it has `Successfully built wmf-mjolnir`, but the v2.5.0 release has `Successfully built wmf_mjolnir`.
[21:40:30] According to the commit 485f09606 I never understood what mangled the name, it was supposed to be wmf_mjolnir but something changed the name. Whatever that is is no longer changing the names
[21:41:08] When i run it all locally in the same container, pasting in the right commands, it works if I use the fixed PACKAGE_NAME with underscores
[21:43:25] could be something in conda-analytics i guess? the old build used conda-analytics 0.0.23, the new build is using 0.0.38. But i haven't been able to track down what changed, not sure how to get ahold of the 0.0.23 version
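A small debugging aid for the wmf-mjolnir / wmf_mjolnir question above. This is not the wmf_airflow_utils `package_version` code; it just shows what distlib can see in the built conda env and that the two spellings collapse to the same canonical name (PEP 503 folds '-', '_' and '.' together). The site-packages path is the one from the failing job; adjust to wherever the env was built.

```python
from distlib.database import DistributionPath
from packaging.utils import canonicalize_name

# both spellings normalize to the same project name
print(canonicalize_name("wmf-mjolnir") == canonicalize_name("wmf_mjolnir"))  # True

# list what distlib actually discovers in the conda dist env
site_packages = "dist/conda_dist_env/lib/python3.10/site-packages"
dp = DistributionPath([site_packages], include_egg=True)
for dist in dp.get_distributions():
    print(dist.name, dist.version)

# a lookup that tolerates either spelling
wanted = canonicalize_name("wmf-mjolnir")
match = [d for d in dp.get_distributions() if canonicalize_name(d.name) == wanted]
print(match or "not found by distlib")
```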
[21:44:10] anyway, i guess i would revert the 485f0960697 patch and try and re-ship it
[22:08:32] something weird is going on with `commonswiki_file_1727810550` in cloudelastic. 2 of the primary shards (not the same two, by the way) seem to be constantly relocating
[22:10:08] just running `_cat/shards?h=index,shard,prirep,state,unassigned.reason | grep -v STARTED`
[22:16:11] hmm, that is kinda odd
[22:17:44] but nothing in active recovery...
[22:18:38] yup, that's an odd one
[22:20:52] hmm, it changes. a minute ago it was 28 and 29, now it's 24 and 30
[22:21:04] i guess it was 28 and 30
[22:21:32] yeah, it keeps changing
[22:23:37] the reason is in the master's logs
[22:23:54] [2025-03-17T22:23:06,842][WARN ][o.o.c.r.a.AllocationService] [cloudelastic1009-cloudelastic-chi-eqiad] failing shard [failed shard, shard [commonswiki_file_1727810550][30], node[JXu_5LlaSEC-aYXCKDBrlg], relocating [endbhWF2TbOnvgy0iNoBag], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=YwgqMgGCRri8Ps1edPCNsA, rId=RBA6y3Q0QrGFBpGeZCbN1A], expected_shard_size[53667600585],
[22:23:56] message [failed to create index], failure [IllegalArgumentException[Failed to resolve file: system_core.dic
[22:24:17] hmm, so it wants the sudachi dictionary
[22:25:55] which brings us back to T389119
[22:25:55] T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119
[22:26:44] it says sudachi is only available on cloudelastic1012, wonder why only there? But also wonder what's going on since commonswiki_file doesn't reference sudachi
[22:26:51] I'll update the ticket, but we need a few patches to the rolling cookbook to get it ready for opensearch. I can do the cloudelastic restart manually if this is time sensitive
[22:26:52] (yet)
[22:27:33] It might be because cloudelastic1012's the last one I reimaged, which was Thursday
[22:28:07] hmm, so maybe it got the new plugins package? checking
[22:28:53] yea, that one has 1.3.20-2, i wonder if we did something wrong building the package, sec double checking. But it seems plausible
[22:30:59] looks to match my local setup that worked fine, with the dictionary in config/sudachi/system_core.dic
[22:33:04] it looks like it keeps trying to migrate shards from elsewhere to cloudelastic1012, and cloudelastic1012 is having issues creating the index due to the sudachi plugin...
[22:33:05] Filesystem{base=/etc/opensearch/cloudelastic-chi-eqiad/sudachi}
[22:33:48] ???
[22:33:51] hmm, so i guess that's a prod difference? in the test images the dictionary needed to be in /usr/share/opensearch/config/sudachi/system_core.dic
[22:34:09] here it wants it in /etc/opensearch/{name}/sudachi. I think we have to symlink it into place, no way we want cluster names in the .deb
[22:34:36] where do the plugins typically live in Elastic?
[22:34:54] inflatador: it's not about the plugins, it's about the support dictionary. Nothing else we use uses support files
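A rough sketch, in script form, of what the puppet change discussed next has to do: make the sudachi dictionary shipped by the plugins .deb visible inside each per-instance config directory, but only when the new package is actually present. The paths come from the chat above; it assumes the .deb drops the dictionary under /usr/share/opensearch/config/sudachi as in the test images, and it is not the actual puppet patch.

```python
from pathlib import Path

# where the .deb drops the dictionary (as seen in the test images)
DICT_SOURCE = Path("/usr/share/opensearch/config/sudachi")

# one config dir per instance, e.g. /etc/opensearch/cloudelastic-chi-eqiad
for instance_dir in Path("/etc/opensearch").iterdir():
    if not instance_dir.is_dir():
        continue
    link = instance_dir / "sudachi"
    # only act when the new plugins package shipped the dictionary,
    # roughly what the onlyif-guarded exec in puppet would express
    if DICT_SOURCE.is_dir() and not link.exists():
        link.symlink_to(DICT_SOURCE)
        print(f"linked {link} -> {DICT_SOURCE}")
```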
[22:35:11] i suppose it's basically "the config directory", and we use three separate config directories
[22:35:25] i can write you a quick puppet patch that will fix this i guess
[22:36:24] sure, or I can get started on one tomorrow
[22:39:56] inflatador: something like this should make it happy: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128547
[22:40:20] otherwise, the "proper" solution might be to figure out if opensearch can be configured to search for the dictionary in a secondary location
[22:40:37] jenkins didn't like me :P
[22:41:29] missing comma...always
[22:45:29] PCC doesn't like the `onlyif` ?
[22:45:47] hmm, i thought that was standard everywhere...maybe it's only on exec { ... } ?
[22:46:59] yea, docs say onlyif is an exec {} thing and not a default thing that exists everywhere :(
[22:47:20] could make the symlink with exec i guess
[22:48:31] damn, really? that's odd
[22:55:51] well, i think this new patch will pass. I don't know that it's super critical to do now though. Mainly it seems to mean cloudelastic1012 has no shards assigned
[22:56:45] so kinda like having 2 nodes out, since 1008 is also out
[22:57:07] the other solution is to force-install the older version of the package (if it's still available?) and restart
[22:58:06] nope, experimental still doesn't like me :P
[22:58:26] no worries, I think we can roll back the package and worry about Puppet later. Let me give it a shot
[22:58:39] oh silly me, bad name of the exec, needs to be per-instance
[23:00:25] i suppose it's also debatable where this exec { } belongs...might be more correct to have it in the profile since it's search specific?
[23:00:58] hmmm, psi does not like that I stopped cloudelastic1012...checking
[23:02:17] forgive me if you already checked, but I guess the file resource doesn't work for the symlink? Like `https://www.puppet.com/docs/puppet/7/types/file.html`
[23:02:48] inflatador: that doesn't have an `onlyif` option, and the symlink can only be created without failing puppet if the new plugin version is installed
[23:03:01] so it would pass on cloudelastic1012, and fail everywhere else
[23:03:14] ebernhardson ah OK, that was exactly what you were talking about earlier, sorry
[23:03:25] no worries :) is the alerting just due to not depooling?
[23:03:39] * ebernhardson guesses randomly
[23:03:58] no, it was a loss of quorum I think. I started it again and it's back
[23:05:17] confirmed, there are only 2 master eligibles
[23:06:05] hmm, with the new election process we should be able to have everything in cloudelastic be master-capable, i think (for another day :P)
[23:06:26] yeah, we can do the voting exclusion workaround d-causse found in the meantime
[23:08:51] OK, I really don't get this...1012 is not master eligible anyway
[23:11:58] hmm, i don't see the error you see? Just the alerts in -operations about the port not being available
[23:12:07] the master log only complained that 1012 left the cluster
[23:12:38] sorry, I was making assumptions based on the 503 I got immediately after I stopped the service
[23:13:22] that could just be nginx running but elastic not running
[23:14:15] ah, true. I also didn't look closely enough at the alerts, was thinking it was all psi, not just cloudelastic1012
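For the quorum question above ("confirmed, there are only 2 master eligibles"), a quick way to see which nodes are currently master-eligible. This assumes the chi instance answers on localhost:9200; adjust host/port per instance.

```python
import requests

# master-eligible nodes carry an "m" in the node.role column of _cat/nodes
resp = requests.get(
    "http://localhost:9200/_cat/nodes",
    params={"h": "name,node.role", "format": "json"},
    timeout=10,
)
resp.raise_for_status()
eligible = [n["name"] for n in resp.json() if "m" in n["node.role"]]
print(f"{len(eligible)} master-eligible nodes: {sorted(eligible)}")
```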
[23:14:27] it was also chi, not psi :P
[23:14:55] Weird, I saw the 503s on psi
[23:15:08] hmm, the alerts were all chi
[23:15:14] very strange
[23:15:27] anyway, it's late and clearly i should NOT be operating servers at the moment ;P
[23:15:43] yea it should be fine, but it does mean we are down 2 servers in cloudelastic. fix tomorrow :)
[23:16:43] yeah, I'll get after it tomorrow. Good night!
[23:18:21] i suppose i can fix it up, copied the .deb from /var/cache/apt/archives on one of the other servers and installed it on 1012, just need to restart opensearch now
[23:21:54] I'm still around if you need me to depool/repool or whatever
[23:22:53] oh right, i ran the `depool` command ... and now realizing there isn't a `repool` command...the proper one is probably documented somewhere though :)
[23:23:02] it looks to have worked, I see 2 shards in active recovery now going into 1012
[23:23:24] `pool`
[23:23:33] ok, should be back in action now
[23:23:40] alerts are clearing
[23:24:32] i'll keep an eye on it for a minute, but assuming these first two finish i'll assume the rest will balance out after an hour or two
[23:25:29] ACK, sounds good. Forcing myself to disconnect in 5, 4...
[23:27:31] both finished and appear to have gone live, 2 more shards are moving in. calling it a success
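A small sketch of the "keep an eye on it" step at the end: poll the cat APIs until no shards are still recovering or relocating. Same caveat as the previous snippet: it assumes the chi instance on localhost:9200, which is an assumption rather than a documented endpoint for this cluster.

```python
import time

import requests

BASE = "http://localhost:9200"

while True:
    # in-flight shard recoveries (the two relocations going into 1012, etc.)
    active = requests.get(
        f"{BASE}/_cat/recovery",
        params={"active_only": "true", "format": "json",
                "h": "index,shard,stage,target_node"},
        timeout=10,
    ).json()
    # shards not yet in STARTED state, mirroring the grep -v STARTED check above
    shards = requests.get(
        f"{BASE}/_cat/shards",
        params={"format": "json", "h": "index,shard,prirep,state"},
        timeout=10,
    ).json()
    unsettled = [s for s in shards if s["state"] != "STARTED"]

    if not active and not unsettled:
        print("all shards started, nothing recovering -- calling it a success")
        break
    print(f"{len(active)} active recoveries, {len(unsettled)} shards not STARTED")
    time.sleep(60)
```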