[05:46:30] the RAID10 regression has been identified: https://phabricator.wikimedia.org/T393366#10798820
[05:48:30] nice!
[07:23:59] hello folks! I am going to roll out a change for raid::perccli (it will be renamed to raid::broadcom), with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1142518. It affects ~330 hosts, I disabled puppet and I am going to proceed very slowly, but if you see some raid-related noise please ping me
[08:02:48] elukey: is that change to support newer models?
[08:03:13] I see, yes, nice
[08:07:19] jynus: yes yes exactly!
[08:35:20] rollout completed
[08:37:38] ah no sorry, something is still off, checking
[08:38:31] lol something unexpected
[08:38:48] so we asked Supermicro for a new controller that was more JBOD-friendly for the ms-be use case
[08:39:05] and they proposed the SAS 38xx, which is installed on ms-be1091
[08:39:31] so I added support for it in raid.rb, but now I see that it is already present on Wikikube workers
[08:40:16] oh
[08:40:50] Dell ones though
[08:41:36] but given that we also match on the manufacturer fact, it will continue to use perccli, right?
[08:42:00] in theory yes, but it complains that neither perccli nor storcli is installed, so I probably missed something
[08:43:24] it was probably a puppet run that was missing
[08:43:28] elukey: a bunch of our nodes started alerting on that too (e.g. cloudceph*, cloudrabbit*, cloudvirt*, ...)
[08:44:05] (fyi, let me know if I can help)
[08:44:10] yeah it should be puppet related, I am going to re-run again
[08:45:29] there are two different errors, a warning `summary: communication: 0 OK` and an unknown `Failed to execute ['/usr/local/lib/nagios/plugins/get-raid-status-broadcom']: RuntimeError Neither storcli nor perccli64 present on the host, please check.`
[08:45:33] apparently "Broadcom / LSI Fusion-MPT 12GSAS/PCIe Secure SAS38xx" is widely used
[08:46:15] maybe nrpe does not like the prefix `communication: ` that the cli shows?
[08:46:26] https://www.irccloud.com/pastebin/Rx0rdJp0/
[08:46:43] I think it wants all of it, yeah
[08:46:58] so we do have the 38xx controller, but it is not used
[08:50:45] yes yes, confirmed: the nagios check expects multiple things, and if it doesn't find them it emits a warning
[08:52:33] going to do some tests to understand what's best, for the moment ignore the spam (sorry)
[08:54:34] ack, thanks! (quick question, is any of that info in prometheus? might be nicer long-term)
[08:55:06] what kind of info?
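A minimal sketch of the CLI-selection step behind the `Neither storcli nor perccli64 present` error quoted above. The function name `find_broadcom_cli` and the manufacturer string are assumptions for illustration; the real get-raid-status-broadcom.py and the raid.rb fact may be structured differently, but the idea is the same: prefer Dell's perccli64 when the manufacturer fact says Dell, otherwise storcli, and fail loudly when neither is installed.

```python
# Hypothetical sketch, not the actual get-raid-status-broadcom.py.
import shutil


def find_broadcom_cli(manufacturer: str) -> str:
    """Return the path of the RAID CLI to use, or raise if none is installed."""
    # perccli64 is Dell's build of the tool; storcli covers the other vendors.
    if manufacturer.lower().startswith('dell'):
        candidates = ['perccli64', 'storcli']
    else:
        candidates = ['storcli', 'perccli64']
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    raise RuntimeError('Neither storcli nor perccli64 present on the host, please check.')
```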
[08:59:09] raid status and such, essentially what's used for the alert
[09:01:25] ATM these are still NRPE-based
[09:02:12] ack
[09:02:51] that's discussed in T350360
[09:02:51] T350360: Evaluate "drop in" replacement for nrpe scripts - https://phabricator.wikimedia.org/T350360
[09:03:34] 👍
[09:07:01] this is interesting
[09:07:09] https://www.irccloud.com/pastebin/HFsZAqtr/
[09:07:34] the status is OK but it's considered an error xd
[09:08:00] maybe `139 errors = [status for status in status_list if status['Controller Status'] != 'Optimal']` should also filter 'Ok'
[09:09:09] anyhow, I'll leave you to it :)
[09:13:16] I also just found the curious case of ganeti1035, where puppet enabled the broadcom raid class
[09:13:28] but it's a config B, which doesn't have hw raid
[09:13:54] and "perccli64 show J" shows "Number of controllers: 0" in fact
[09:22:17] dcaro: I think that the error stems from physical_device_status, where it doesn't find 'PD LIST'
[09:22:28] so it returns NAGIOS_WARN
[09:22:44] lemme quickly check on a node
[09:22:53] because we could make it bail out early if no PD is listed
[09:23:01] if the controller is ok, then we are good
[09:25:46] doing the change I mentioned (to filter out 'OK') on cloudcephosd1040 seems to do the trick there (might be a different issue)
[09:25:50] https://www.irccloud.com/pastebin/9QtPMoDH/
[09:26:12] ah right, it says controller 1 OK
[09:26:14] didn't see it
[09:26:15] ufff
[09:26:25] let me re-read your idea
[09:26:59] so general_state is failing
[09:27:06] because it doesn't find "Status"
[09:27:23] it's because `
[09:27:46] it's because `'Controller Status': 'OK'` is not what it expects, it expects `Optimal`, so it considers it an error
[09:28:28] so if you use "status_list = lookup_by_key('Controller Status', data)"
[09:28:28] but then it extracts the status to show the error, but it's the string `OK`, so it shows `1 OK`
[09:28:29] it works
[09:29:18] it may be a different output from perccli, sigh
[09:29:35] it does find the status, it just interprets a status with `Controller Status: OK` as an error
[09:29:39] https://www.irccloud.com/pastebin/MELx3Tx1/
[09:30:08] because it's expecting only `Optimal`:
[09:30:11] `errors = [status for status in status_list if status['Controller Status'] != 'Optimal']`
[09:30:39] right okok!
[09:31:58] :) it's confusing because when showing the error, it extracts the status string, and it ends up being `1 OK`
[09:32:05] yes yes
[09:33:28] so imho it is fine to accept both OK and Optimal there
[09:33:47] if anything is weird we'll detect it later on
[09:36:52] like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143023
[09:39:25] moritzm, volans thoughts?
[09:40:29] if you don't like it, it's dcaro's fault :D
[09:40:35] (joking of course)
[09:40:37] hahahaha
[09:41:32] looking
[09:42:01] sure why not :D
[09:42:39] everybody trusts David
[09:42:55] me not a lot, I should have already guessed it
[09:43:29] rolling out the change!
[09:53:07] 🤞
[09:54:31] thanks for the help!
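A minimal sketch of the fix being discussed; the names `HEALTHY_CONTROLLER_STATES` and `controller_errors` are assumptions for illustration rather than the exact patch in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143023. The point is to treat both `Optimal` (older controllers) and `OK` (the SAS38xx output) as healthy, so a controller reporting `Controller Status: OK` no longer lands in the error list and produces the confusing `1 OK` warning.

```python
# Sketch of the proposed filter change; names are hypothetical.
# Older controllers report 'Optimal', the SAS38xx reports 'OK': both are healthy.
HEALTHY_CONTROLLER_STATES = ('Optimal', 'OK')


def controller_errors(status_list):
    """Return the controller entries whose status is not a known-good value."""
    return [status for status in status_list
            if status['Controller Status'] not in HEALTHY_CONTROLLER_STATES]
```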
[09:54:41] the number of warnings is reducing, so it seems to be working
[09:57:07] I think there might be an error in the patch, left a comment there
[09:57:40] you are totally right of course
[09:57:56] not my best day
[09:58:00] lol
[09:58:05] I blame luca
[09:58:15] sorry, not my finest hour either
[09:59:55] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143026/1/modules/raid/files/get-raid-status-broadcom.py should be enough
[10:00:57] E501 line too long (103 > 100 characters)
[10:01:30] hahahaha, of course, 4 chars added, line limit exceeded
[10:01:46] I am going to quit and dedicate the rest of my life to zen meditation
[10:02:27] I might join you xd
[10:02:41] * taavi starting a zen-meditation-as-a-service startup
[10:03:34] using a meditation-LLM ethically scraped from the internet
[10:07:25] dcaro: do you have a moment to review the patch so my mental sanity stays within reasonable bounds?
[10:07:48] sure
[10:08:44] elukey: LGTM
[10:08:51] <3
[10:09:08] last rollout
[10:09:12] hopefully
[10:11:44] 🎉
[10:13:45] Works for me :)
[10:13:49] https://www.irccloud.com/pastebin/Dlbp0u5N/
[10:16:25] thanks for the support dcaro!
[10:17:42] np, thanks for fixing/improving stuff :)
[10:18:10] I hope to stay away from raid for a long time
[10:34:00] all wmcs alerts are cleared now, thanks!
[15:35:58] inflatador: is your cookbook run still going or wedged? I'm stuck waiting on a lock on the sre.dns.netbox cookbook which is held by your sre.hosts.rename cookbook that started at 15:14...
[15:37:11] the new ping feature worked, as it did ping the owner in -operations:
[15:37:18] Wed 15:17:10 logmsgbot| bking@cumin2002 rename (PID 1578957) is awaiting input
[15:37:45] Emperor: I'm running a few at a time, let me check
[15:38:19] inflatador: looks like my cookbook decided your lock was too old: 'Releasing expired lock for key /spicerack/locks/cookbooks/sre.dns.netbox: {'concurrency': 1, 'created': '2025-05-07 15:31:00.906841', 'owner': 'bking@cumin2002 [1578957]', 'ttl': 300}'
[15:39:18] though the cookbook itself failed :(
[15:39:51] Is there a reason we prompt for DNS and network device changes? It seems like our validation is good enough that we could just apply
[15:41:13] which validation?
[15:41:34] you would apply *any* pending changes in netbox
[15:42:06] The automation picks the right IP address and network device changes, as far as I can tell
[15:45:34] Does Netbox have a limitation around making multiple changes to unrelated objects? We used Infoblox at an old job and it could do that OK
[15:52:21] FWIW, if I have a cookbook sitting at a prompt and it's blocking you from doing something, feel free to sudo to me and answer if it's something reasonable like DNS changes
[16:51:14] rzl: if you have time to look at another Envoy patch, I added you to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1142693 ... if I'm bugging you too much LMK and I'm happy to find someone else
[17:00:33] inflatador: not feeling over-bugged, but I don't know as much about the historical context of that setup -- _joe_ is who I'd ask if I were you
[17:03:24] rzl: ACK, thanks... it's clear I need to educate myself a bit more about Envoy too
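A minimal sketch of the lock-expiry behaviour visible in the quoted "Releasing expired lock" message; this is an illustration under assumptions, not Spicerack's actual implementation. A lock carries a `created` timestamp and a `ttl`, and another run considers it stale once its age exceeds the TTL, which is why a rename cookbook sitting at a prompt well past the 300-second TTL lost the sre.dns.netbox lock.

```python
# Illustrative sketch only -- Spicerack's real lock handling may differ.
from datetime import datetime, timedelta


def lock_expired(lock, now=None):
    """Return True if the lock's age exceeds its TTL (in seconds)."""
    now = now or datetime.utcnow()
    created = datetime.fromisoformat(lock['created'])
    return now - created > timedelta(seconds=lock['ttl'])


# With the values from the log above, the lock created at 15:31:00 with a
# 300-second TTL reads as expired by the time the dns.netbox run checked it:
lock = {'created': '2025-05-07 15:31:00.906841', 'ttl': 300}
print(lock_expired(lock, now=datetime(2025, 5, 7, 15, 38, 19)))  # True
```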