[03:22:29] Here are the settings I set back; can revert tomorrow if necessary:
[03:22:45] https://www.irccloud.com/pastebin/sPznjq6t/tuning
[03:23:10] I think running the max restore bytes per sec on the snapshot was probably not necessary, but worst case scenario it won't hurt anything
[03:55:09] inflatador: left off on `cloudelastic1004`, the initial rolling operation reimage failed (I think I missed the window where the interactive option is presented cause internet is spotty where I'm at right now); I've been trying to manually run `sudo cookbook sre.hosts.reimage --os bullseye -t T309343 --new cloudelastic1004` but I get `spicerack.ipmi.IpmiError: Remote IPMI for cloudelastic1004.mgmt.eqiad.wmnet failed (exit=1): b''`
[03:55:09] T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343
[12:56:49] ryankemper ACK
[13:22:39] Trey314159 What's the other team besides us that uses Java frequently? g-ehel mentions this a lot, was going to reach out to them for inspiration week
[14:53:03] inflatador: the main one i think of is analytics, but there might be more
[14:57:25] ebernhardson Thanks, will hit them up. BTW, almost done with the cloudelastic reimage
[14:58:37] sweet!
[15:16:03] will not make the unmtg, relocating to my parents' house atm
[15:17:17] huh, i hadn't even considered that not finishing the ITC was an option :) - "We have set a new record for the Foundation with 88% of staff completing their ITCs in the last quarter!"
[15:22:11] ebernhardson: one of the slides in yesterday's staff meeting said that merit increases are linked to completing your Q3 ITC... maybe that will increase compliance in the future
[15:24:32] lol, that would probably help :)
[15:24:47] what's ITC
[15:25:16] RhinosF1: "individual tuning conversation", it's basically a quarterly meeting where you and your manager answer three questions about what you did / what went wrong / how you can do better, and then talk about it
[15:25:45] i am not surprised that people don't like finishing it
[15:26:02] i'm not a huge fan either :) But i've done it every quarter they ask
[15:30:21] they always have the same questions, but they've been reworded almost every time to now sound a bit less like "explain why you should still be employed"
[16:30:13] I'm back
[16:31:52] cloudelastic1005 is failing reimage during the DHCP step, which is new. Not sure why, and I also don't see any available firmware updates for its external NIC
[16:32:11] So I'm skipping it for now and will attempt a reimage of 1001 once 1005 rejoins the cluster / status is green
[16:57:09] finding the Spark issue tracker tedious... they close issues as won't fix without any justification for why they won't fix the thing
[17:19:15] imperious!
[17:25:35] browsing through the related bits, it looks like in this case what they mean by won't fix is "someone submitted an issue and a PR, no one looked at it for 6+ months so we closed it". Seems like an odd definition of won't fix
[17:25:48] but i guess it's somewhat true, they had an opportunity to address it and no one cared enough :P
[17:43:41] Not saying I agree with the practice itself but that's not a horrible definition of "won't fix"
[17:43:56] Often the >6 month no-progress tickets never get worked
[18:03:39] cluster is back to green, reimage of cloudelastic1001 started
[19:22:13] inflatador: looks like cluster is done upgrading! still waiting on it to return to green but yeah
[19:22:28] inflatador: also puppet is still disabled on `cloudelastic1005`, we probably want to lift that because currently it's not in the puppetdb
[19:23:08] ryankemper cloudelastic1005 is still on Stretch
[19:23:31] ah, yeah I only ran `cat /etc/os-release` on the 5 hosts cumin knew about
[19:24:47] okay, one left to go then (once it gets green)
[19:25:09] Once the cluster goes back to green, we can try 1005 again, but I'm not optimistic wrt the DHCP failures
[19:25:55] the NIC config wrt PXE booting looks right
[19:26:46] FYI you can do stuff like `racadm get nic.nicconfig` to see NIC config from the DRAC, probably the web GUI is easier though
[19:40:06] went ahead and set the max bytes per sec to how we had it previously (the cmd I ran yesterday probably didn't do it globally): `curl -H 'Content-Type: application/json' -XPUT https://cloudelastic.wikimedia.org:9243/_cluster/settings -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "756mb"}}'`
[20:04:27] Nice, thanks
[20:16:44] got a scary alert in operations for unassigned shards in prod eqiad, but it just looks like replica shards
[20:19:28] [pro-tip] you can use the OS-named aliases in cumin that use facter, no need to run a command on the hosts to know the OS ;) (e.g. A:buster and A:youralias)
[20:22:50] oh yeah, did you see me running ansible against the hosts a bunch of times? ;P
[20:23:04] oh that was re: ryan's earlier comment, gotcha
[20:24:11] yes, was for the /etc/os-release one
[20:24:17] sorry for the confusion :)
[20:25:18] It's good info regardless
[20:32:10] ebernhardson and ryankemper I'm relocating, back in ~1 hr. I don't think the eqiad stuff is serious but if either of y'all wanna look into it, here's what explain says https://phabricator.wikimedia.org/P30995
[20:51:53] Will look in 15 mins
[21:33:33] Went ahead and ran `curl -s -X POST localhost:9200/_cluster/reroute?retry_failed=true` to make elasticsearch retry allocating the unassigned shards
[21:33:55] Seems to have worked, still in yellow status but the two missing shards are initializing now
[21:34:05] Should be healthy in 5 mins
[21:41:52] ACK, thanks ryankemper! I'm headed out... will not finish cloudelastic1005 this wk. See you Mon
[21:42:05] inflatador: have a good weekend!
[21:42:20] (eqiad cluster health back to green btw)
[21:42:28] {◕ ◡ ◕}
[21:47:35] :)
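
A hedged aside on the `spicerack.ipmi.IpmiError` at 03:55: before rerunning the reimage cookbook, it can help to confirm the management controller answers IPMI over LAN at all. The host is taken from the log; the `ipmitool` invocation, username, and password handling are assumptions for illustration, not the cookbook's actual call.

```bash
# Hedged sketch: manual IPMI-over-LAN check against the mgmt interface.
# -E reads the password from the IPMI_PASSWORD environment variable;
# the "root" username is an assumption, used here for illustration only.
export IPMI_PASSWORD='<mgmt-password>'   # placeholder, not a real credential
ipmitool -I lanplus -H cloudelastic1004.mgmt.eqiad.wmnet -U root -E chassis power status
```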
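
For the recovery throttle set at 19:40, a hedged way to confirm the persistent setting actually landed cluster-wide is to read the cluster settings back; `flat_settings` and `pretty` are standard Elasticsearch query parameters, and the endpoint is the same one used in the log. (The snapshot-restore throttle mentioned at 03:23, `max_restore_bytes_per_sec`, is a per-repository setting rather than a cluster-wide one, which may be why setting it felt unnecessary here.)

```bash
# Hedged sketch: read back the persistent recovery throttle applied above.
curl -s 'https://cloudelastic.wikimedia.org:9243/_cluster/settings?flat_settings=true&pretty' \
  | grep max_bytes_per_sec
```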
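
Around the 21:33 reroute, here is a hedged sketch of the checks that typically bracket a `retry_failed` retry: list unassigned shards with their reason, ask the allocator to explain the first one, retry the failed allocations, then wait for green. All endpoints are stock Elasticsearch APIs; `localhost:9200` mirrors the command in the log, and the timeout value is illustrative.

```bash
# Which shards are unassigned, and why (state is the 4th requested column)
curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | awk '$4 == "UNASSIGNED"'
# Detailed allocation explanation for the first unassigned shard
curl -s 'localhost:9200/_cluster/allocation/explain?pretty'
# Retry allocations that hit the max-retries limit (same command as in the log)
curl -s -X POST 'localhost:9200/_cluster/reroute?retry_failed=true'
# Block until the cluster reports green (timeout is illustrative)
curl -s 'localhost:9200/_cluster/health?wait_for_status=green&timeout=10m&pretty'
```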