[10:28:24] lunch
[13:10:13] greetings
[13:20:10] o/
[13:21:58] woohoo, we are down to 7 stretch nodes in CODFW
[13:23:54] ^^ checking `elastic2029` as ryan-kemper mentioned above
[13:24:26] ah nm, looks like it did work anyway
[14:16:13] \o
[14:20:36] o/
[14:22:50] if anyone has any feedback on https://phabricator.wikimedia.org/T314301 LMK. Omega <-> psi instances can't hit each other's VIPs, wanted to know if that was intentional
[14:25:43] inflatador: do you need this?
[14:27:20] dcausse it's a minor annoyance, I'm writing a script that fetches cluster info and it doesn't work from any node because of this
[14:27:23] https://gitlab.wikimedia.org/repos/search-platform/sre/elastic-at-a-glance/-/blob/main/elastic-at-a-glance.py
[14:27:40] I should say, it doesn't work from **every** node
[14:27:58] I vaguely remember something related to lvs, it can't access itself IIRC
[14:28:27] yes that's an LVS problem, the gist of it is that the way we use lvs all nodes in the codfw cluster have the `search.svc.codfw.wmnet` ip address
[14:28:45] 9243 seems to work, though
[14:28:48] so you can't access search.svc.codfw.wmnet on a port that isn't hosted locally, because linux says "i know that ip address, it's me!"
[14:29:24] ah
[14:29:29] more fun details under the hood, but i suspect that's good enough :)
[14:29:54] I guess 9243 works because it's listening everywhere
[14:30:01] yes all nodes have the big cluster
[14:30:20] but the two small clusters are split with half of the servers running one, half running the other
[14:31:43] got it, will close ticket and work around
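A rough sketch of the loopback-VIP behaviour described above; the interface layout, the example address, and the small-cluster port are assumptions for illustration, not taken from the real LVS/puppet config:

```
# On an LVS realserver the service VIP is typically also bound on the loopback
# interface so the node accepts traffic forwarded for it (hypothetical output):
$ ip addr show dev lo
    inet 127.0.0.1/8 scope host lo
    inet 10.2.1.30/32 scope global lo        # search.svc.codfw.wmnet (example address)

# A connection to the VIP therefore never leaves the host:
$ curl -sk https://search.svc.codfw.wmnet:9243/   # fine: every node runs the big cluster
$ curl -sk https://search.svc.codfw.wmnet:9443/   # refused on a node that doesn't run that
                                                  # small cluster locally (example port)
```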
[14:48:42] is there a particular process for merging to operations/software/spicerack? I note gehel had +2'd this patch earlier but jenkins never started its gate-and-submit dance: https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/781009
[14:51:33] not that I know of, v-olans is on vacation but I can run it down with j-bond after sprint planning
[15:00:01] * ebernhardson realizes he named a variable foo to avoid naming something till later, then merged it :P
[15:00:13] it was code reviewed, not (only) my fault :P
[17:00:23] dinner
[17:01:00] my python script for reporting non-_doc indices finally reports almost nothing. It only lists .ltrstore, .tasks, toolhub_*, ttmserver, and bking_test
[17:01:09] inflatador: reminds me, can i delete bking_test on cloudelastic?
[17:01:20] https://cloudelastic.wikimedia.org:9443/bking_test
[17:02:08] ebernhardson: to your above question about spicerack, I believe the process is just to merge manually when it's been +1/+2'd, and then it of course doesn't actually go live until v.olans runs the next spicerack release
[17:02:29] so i suppose my plan today is to go back through cirrus and finally really remove anything that refers to mapping types and hardcoding _doc; that was mostly done already, but i think there were some straggling cases, and there's the reindexing code we re-added
[17:09:56] Here are the remaining codfw hosts to be reimaged. We'll need to do so manually since their elasticsearch service restart times are after the start datetime we were using (and we can't just change the start time without our rolling-operation cookbook reimaging hosts that are already on bullseye):
[17:09:58] https://www.irccloud.com/pastebin/t7XVzU6m/
[17:22:06] ebernhardson yes, feel free to delete bking_test
[17:22:26] done
[17:23:29] Since I'm reimaging the remaining codfw hosts manually, here's a sanity check I'm doing with jq before running the cookbook. It reduces the need to manually check for green cluster status (which the rolling-operation cookbook would otherwise do for us, ofc):
[17:23:32] https://www.irccloud.com/pastebin/Iehsb75l/
[17:23:59] Without the `-e`, jq will always return a zero exit code as long as the input is valid; `-e` makes the exit status reflect the filter result (nonzero when it's false/null)
[17:46:29] nice!
[17:46:36] also lunch, back in ~1h
[17:46:48] * ebernhardson resists the urge to code-golf it into a while loop that removes duplication :P
[17:46:56] does everything necessary already :)
[17:47:13] s/while/for/
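The actual check is behind the irccloud paste above; purely as an illustration of the `-e` point, a minimal sketch of that kind of green-cluster gate might look like this (the endpoint and the exact fields are assumptions, not the contents of the paste):

```
# Fail (non-zero exit) unless the cluster is green and nothing is relocating;
# `jq -e` maps a false/null result to exit code 1, so the && only fires on success.
curl -sk https://search.svc.codfw.wmnet:9243/_cluster/health \
  | jq -e '.status == "green" and .relocating_shards == 0' \
  && echo "cluster healthy, OK to reimage the next host"
```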
[18:41:42] back
[18:48:28] hmm, so it turns out cindy started failing with the Vector patch 'Search: Use Codex and Vue 3 instead of WVUI and Vue 2'. Somehow with this replacement the browser integration can no longer reliably type into the search box. Unfortunately this is a huge change, there isn't really any reasonable way we can review the patch and "fix" the piece of it that broke our integration testing.
[18:49:33] this is on debian 9.1, which has ancient chromium (74), i suppose i'll try spinning up a new instance with "modern" debian and see if a newer chromium makes it work
[18:49:50] (i would ideally like a working cindy to verify the 7.10 branch before we go live :)
[18:51:02] it aligns with a general idea i saw when searching for similar issues, where webdriverio doesn't manage to type into a search box: the javascript attached to the box was described as too slow and "eating" events
[18:51:19] s/a search box/a form input/
[18:55:16] hmm, i also wonder why the existing cirrus-integ image name in horizon is debian-10.0-buster, but /etc/debian_version says 9.1 and sources.list refers to stretch
[18:55:29] * ebernhardson could also be silly and dist-upgrade, might work but we generally avoid that
[19:08:19] doh, of course i'm being extra silly. chromedriver runs inside a docker container inside the horizon instance, it's debian 9.1 because that's the container. sigh, i should have realized this earlier :P
[19:12:49] * ebernhardson instead tries port-forwarding chromium-driver 89 in from the host to vagrant
[19:52:25] ryankemper: the eligible masters alert just fired
[19:53:32] RhinosF1: thanks, this is actually expected since the host being reimaged is a master; I'll ack that alert though so that it's clear
[20:01:05] hopefully after switching to 5 masters we can make it only alert when we get down to 3, so that 4 masters doesn't ping anyone
[20:01:16] (now it alerts any time a master-capable node restarts)
[20:05:25] ebernhardson: don't we still need to worry any time we have an even number due to the possibility of split-brain?
[20:05:43] although if we could have some logic like (if it's 4, only alert after a few hours) that'd be great
[20:06:05] ryankemper: not really, 3 have to be on the same side of the split, we have to tell elastic how many masters are needed to form consensus
[20:06:31] Oh I see
[20:06:34] Right that makes sense
[20:07:37] We have those informational emails for airflow, I'd like to see that for any master failure. And then like Ryan said, a more urgent alert after a few hours if it doesn't come back
[20:22:30] yea that would make sense
[20:53:43] lunch
[21:38:33] fyi, just bulk closed ~140 phab tickets that were low/lowest priority and over 6 months old with no activity in the last 6 months. some have been reopened already, but hopefully most can remain closed
[21:40:57] back
[21:41:05] surprised i was only subscribed to ~60 of those :)
[21:59:58] mpham: I'm not sure mass closing tickets is the best way to go about things. You might want to see some of the comments in #wikimedia-sre
[22:00:42] s/-sre/-tech/
[22:00:45] https://wm-bot.wmcloud.org/browser/index.php?start=08%2F01%2F2022&end=08%2F01%2F2022&display=%23wikimedia-tech
[22:01:34] thanks for the heads up RhinosF1
[22:04:50] mpham: I suggest responding soon
[22:04:59] working on it
[23:21:08] taking off for the day
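On the 5-master / split-brain exchange above (20:05–20:07): a sketch of the kind of setting that implements "tell elastic how many masters are needed to form consensus" on pre-7.x clusters. The endpoint is an assumption, and in 7.x+ the voting configuration is managed automatically, so this setting no longer applies:

```
# Pre-7.x Elasticsearch: with 5 master-eligible nodes, a quorum of 3 means no
# network partition can elect two independent masters (hypothetical endpoint).
curl -sk -XPUT https://search.svc.codfw.wmnet:9243/_cluster/settings \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"discovery.zen.minimum_master_nodes": 3}}'
```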