[08:17:13] Yay, my swift upstream fix has died in CI dependency hell (the 2022-10-16 release of PasteDeploy dropped py2 support, so the py2 gating tests now fail)
[08:43:02] :/
[08:49:54] let's see what they make of https://review.opendev.org/c/openstack/swift/+/861583 (the obvious dumb "fix")
[09:27:29] Amir1: Re: T320786 I think the command line failed to execute- although it could be (or not) a real issue
[09:27:29] T320786: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786
[09:27:49] hmm, okay
[09:27:58] it feels that it's not a real issue
[09:28:04] is it a new host?
[09:28:17] because maybe the raid code needs tweaking
[09:28:42] aka it is not being properly monitored, which would be a more interesting issue- consider involving riccardo
[09:29:52] yeah probably
[09:35:43] if I run "megacli -AdpAllInfo -aALL" it gets stuck
[09:36:12] or maybe you restarted it?
[09:40:06] I didn't
[09:40:49] then it crashed!
[09:41:31] AIUI the host should have 10 disks, all your outputs have only 9 in T320786
[09:41:31] T320786: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786
[09:42:23] and my ssh attempt is stuck right now so I can't debug it more, but I would guess that a disk has disappeared from the controller, would not be the first time
[09:43:34] the host is up, though
[09:44:02] will try to see if it kernel panicked
[09:44:17] I get ao to Authenticated to db1202.eqiad.wmnet (via proxy) using "publickey".
[09:44:25] oh
[09:44:27] then my ssh session gets stuck and I don't get the prompt
[09:45:11] even login doesn't work, I don't get asked for a password
[09:45:47] I'll leave the ssh attempt running, in case it makes it through at some point
[09:45:59] your permission to do a soft reboot, then a hard one from management?
[09:46:00] powercycle time 😈
[09:46:06] Amir1: doing
[09:46:11] let's do skip transaction for good measure
[09:46:17] ha ha ha ha
[09:46:27] don't worry, I will be the one doing the recovery here :-D
[09:46:43] mwhaha
[09:46:47] my bet is a controller issue, maybe triggered by a disk failure
[09:47:19] Amir1: the host is depooled/alerts disabled?
[09:47:46] btw, ICYMI master of s6 in eqiad became fully unreachable during the weekend, we did a powercycle and brought it back online then an emergency switchover
[09:48:01] but that was memory, right?
[09:48:02] see the incident doc in _security
[09:48:25] yeah, I think ipmi-sel says uncorrectable memory error
[09:48:33] just FYI, not related to this
[09:49:01] it is not rare to have unstable hw with new one
[09:49:22] it took a few iterations/bios upgrades/hw replacements to get all stable
[09:49:54] yeah
[09:50:05] I did a soft powercycle, if I don't see any response on console in a few minutes I will do a hard one
[09:50:22] I think it is coming up now
[09:52:04] bootup didn't complain about anything, but I can check the logs now
[09:54:06] see hw logs on task
[09:55:58] thanks jynus
[09:58:52] let me know if you want me to recover it after hw serviced
[09:59:19] sure. Thanks
[10:55:16] ...of course to make a change to requirements.txt in an openstack project you first have to make a change in the global-requirements file in a core project... https://review.opendev.org/c/openstack/requirements/+/861599
[10:59:05] ...but they generally don't allow version caps, so I fear I am going nowhere near space today :(
[16:11:52] looks like we're missing team updates in the SRE meeting, please put in updates for the past two weeks for your own areas :)
[16:23:45] Amir1: I just realized it is the same thing Robh mentioned on the task I brought out in our meeting, so this is just an additional heads up
[16:24:12] yeah
[16:24:14] so it is already queued for him at T319443
[16:47:54] sigh, I always talk too fast when presenting
[16:48:09] we all do
[16:48:11] it's really weird giving a talk just looking at your slides not the audience
[16:49:16] Emperor: :empathy: it is very hard for me to give a talk where I can't see the audience. I crave the head nods/confused looks to know when to speed up/slow down/rephrase.
[16:49:21] oh, if you have space, I normally set the slides on front and the audience/chat on the side
[16:50:29] * Emperor has but one monitor (and the slide-viewer on presentation mode)
[16:51:18] you definitely should buy one, I think the foundation gave you money to have laptop + extra screen?
[16:51:59] Meh, I have a decent enough monitor for my day-to-day :)
[16:53:03] I started with 2, then now with 3 monitors and now I cannot live with only having a tiny screen (e.g. just the laptop)
[16:53:24] for me it makes me way more productive than just alt-tabbing/tiny windows
[16:55:07] also, thanks for the talk, when I recently started playing with dquilt I thought who could have designed that command line api
[17:41:06] hi data persistence 👋 I'm following up from the db1131 breakage -- Amir1, or anybody else, have you already started a dcops task or would you like me to?
[17:41:12] (or whatever followup is needed)
[17:42:35] rzl: I haven't, is there a ticket for the general incident? I thought community made one
[17:43:04] I didn't see one, I was about to open one to attach the IR to, but if one exists I'll use that
[17:44:51] I have only the urgent switchover ticket T320879
[17:44:51] T320879: Switchover s6 master (db1131 -> db1173) - https://phabricator.wikimedia.org/T320879
[17:44:55] nod
[17:46:24] doesn't look like anything relevant was opened in phab on 2022-10-15 so I'll start a fresh one for tracking, if I missed it we can just merge later
[17:46:53] do you have anything that you all want to check into, first? or should I just open a dcops request with like "we think a DIMM is bad, please have a look"
[17:48:36] sounds good to me
[17:49:23] 👍
[17:58:56] Amir1: "db1131 is currently depooled, feel free to shut it down as needed" -- I can safely tell dcops this, right?
[17:59:32] rzl: a heads up is needed so we stop replication and shut down mysql
[17:59:40] got it thanks
[18:04:30] rzl: not sure if it was already mentioned, DIMM_A6 is the one that caused the failure
[18:05:38] volans: oh thanks, no I didn't have that -- where'd you get it?
[18:05:49] HW logs
[18:05:55] https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Show_logs
[18:06:15] ah cheers, I was just using the ipmi-sel excerpt in the doc
[18:06:19] will include, appreciate it
[21:42:28] My kludging of upstream CI was successful, my actual CR got merged now :)
[21:50:47] jy.nus: seriously, if you're not a Debian expert, leave quilt well alone :)