[08:17:13] <Emperor>	 Yay, my swift upstream fix has died in CI dependency hell (the 2022-10-16 release of PasteDeploy dropped py2 support, so the py2 gating tests now fail)
[08:43:02] <Amir1>	 :/
[08:49:54] <Emperor>	 let's see what they make of https://review.opendev.org/c/openstack/swift/+/861583 (the obvious dumb "fix")
[09:27:29] <jynus>	 Amir1: Re: T320786 I think the command line failed to execute- although it could be (or not) a real issue
[09:27:29] <stashbot>	 T320786: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786
[09:27:49] <Amir1>	 hmm, okay
[09:27:58] <Amir1>	 it feels that it's not a real issue
[09:28:04] <jynus>	 is it a new host?
[09:28:17] <jynus>	 because maybe the raid code needs tweaking
[09:28:42] <jynus>	 aka it is not being properly monitored, which would be a more interesting issue- consider involving riccardo
[09:29:52] <Amir1>	 yeah probably
[09:35:43] <jynus>	 if I run "megacli -AdpAllInfo -aALL" it gets stuck
[09:36:12] <jynus>	 or maybe you restarted it?
[09:40:06] <Amir1>	 I didn't
[09:40:49] <jynus>	 then it crashed!
[09:41:31] <volans>	 AIUI the host should have 10 disks, all your outputs have only 9 in T320786
[09:41:31] <stashbot>	 T320786: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786
[09:42:23] <volans>	 and my ssh attempt is stuck right now so I can't debug it more, but I would guess that a disk has disappeared from the controller, would not be the first time
[09:43:34] <jynus>	 the host is up, though
[09:44:02] <jynus>	 will try to see if it kernel panicked
[09:44:17] <volans>	 I get up to: Authenticated to db1202.eqiad.wmnet (via proxy) using "publickey".
[09:44:25] <jynus>	 oh
[09:44:27] <volans>	 then my ssh session gets stuck and I don't get the prompt
[09:45:11] <jynus>	 even login doesn't work, I don't get asked for a password
[09:45:47] <volans>	 I'll leave the ssh attempt running, in case it makes it through at some point
[09:45:59] <jynus>	 your permission to do a soft reboot, then a hard one from management?
[09:46:00] <Amir1>	 powercycle time 😈
[09:46:06] <jynus>	 Amir1: doing
[09:46:11] <Amir1>	 let's do skip transaction for good measure
[09:46:17] <jynus>	 ha ha ha ha
[09:46:27] <jynus>	 don't worry, I will be the one doing the recovery here :-D
[09:46:43] <Amir1>	 mwhaha
[09:46:47] <jynus>	 my bet is a controller issue, maybe triggered by a disk failure
[09:47:19] <jynus>	 Amir1: the host is depooled/alerts disabled?
[09:47:46] <Amir1>	 btw, ICYMI master of s6 in eqiad became fully unreachable during the weekend, we did a powercycle and brought it back online, then did an emergency switchover
[09:48:01] <jynus>	 but that was memory, right?
[09:48:02] <Amir1>	 see the incident doc in _security
[09:48:25] <Amir1>	 yeah, I think ipmi-sel says uncorrectable memory error
[09:48:33] <Amir1>	 just FYI, not related to this
[09:49:01] <jynus>	 it is not rare for new hw to be unstable
[09:49:22] <jynus>	 it took a few iterations/bios upgrades/hw replacements to get all stable
[09:49:54] <Amir1>	 yeah
[09:50:05] <jynus>	 I did a soft powercycle, if I don't see any response on console in a few minutes I will do a hard one
[09:50:22] <jynus>	 I think it is coming up now
[09:52:04] <jynus>	 bootup didn't complain about anything, but I can check the logs now
[09:54:06] <jynus>	 see hw logs on task
[09:55:58] <Amir1>	 thanks jynus 
[09:58:52] <jynus>	 let me know if you want me to recover it after hw serviced
[09:59:19] <Amir1>	 sure. Thanks
[10:55:16] <Emperor>	 ...of course to make a change to requirements.txt in an openstack project you first have to make a change in the global-requirements file in a core project... https://review.opendev.org/c/openstack/requirements/+/861599
[10:59:05] <Emperor>	 ...but they generally don't allow version caps, so I fear I am going no-where near space today :(
[16:11:52] <question_mark>	 looks like we're missing team updates in the SRE meeting, please put in updates for the past two weeks for your own areas :)
[16:23:45] <jynus>	 Amir1: I just realized it is the same thing Robh mentioned on the task I brought up in our meeting, so this is just an additional heads up
[16:24:12] <Amir1>	 yeah
[16:24:14] <jynus>	 so it is already queued for him at T319443
[16:47:54] <Emperor>	 sigh, I always talk too fast when presenting
[16:48:09] <volans>	 we all do
[16:48:11] <Emperor>	 it's really weird giving a talk just looking at your slides not the audience
[16:49:16] <bd808>	 Emperor: :empathy: it is very hard for me to give a talk where I can't see the audience. I crave the head nods/confused looks to know when to speed up/slow down/rephrase.
[16:49:21] <jynus>	 oh, if you have space, I normally set the slides on front and the audience/chat on the side
[16:50:29] * Emperor has but one monitor (and the slide-viewer on presentation mode)
[16:51:18] <jynus>	 you definitely should buy one, I think the foundation gave you money to have laptop + extra screen?
[16:51:59] <Emperor>	 Meh, I have a decent enough monitor for my day-to-day :)
[16:53:03] <jynus>	 I started with 2, now I have 3 monitors and I cannot live with only a tiny screen (e.g. just the laptop)
[16:53:24] <jynus>	 for me it makes me way more productive than just alt-tabbing/tiny windows
[16:55:07] <jynus>	 also, thanks for the talk, when I recently started playing with dquilt I thought who could have designed that command line api
[17:41:06] <rzl>	 hi data persistence 👋 I'm following up from the db1131 breakage -- Amir1, or anybody else, have you already started a dcops task or would you like me to?
[17:41:12] <rzl>	 (or whatever followup is needed)
[17:42:35] <Amir1>	 rzl: I haven't, is there a ticket for the general incident? I thought community made one
[17:43:04] <rzl>	 I didn't see one, I was about to open one to attach the IR to, but if one exists I'll use that
[17:44:51] <Amir1>	 I have only the urgent switchover ticket T320879
[17:44:51] <stashbot>	 T320879: Switchover s6 master (db1131 -> db1173) - https://phabricator.wikimedia.org/T320879
[17:44:55] <rzl>	 nod
[17:46:24] <rzl>	 doesn't look like anything relevant was opened in phab on 2022-10-15 so I'll start a fresh one for tracking, if I missed it we can just merge later
[17:46:53] <rzl>	 do you have anything that you all want to check into, first? or should I just open a dcops request with like "we think a DIMM is bad, please have a look"
[17:48:36] <Amir1>	 sounds good to me
[17:49:23] <rzl>	 👍
[17:58:56] <rzl>	 Amir1: "db1131 is currently depooled, feel free to shut it down as needed" -- I can safely tell dcops this, right?
[17:59:32] <Amir1>	 rzl: a heads up is needed so we stop replication and shut down mysql
[17:59:40] <rzl>	 got it thanks
[18:04:30] <volans>	 rzl: not sure if it was already mentioned, DIMM_A6 is the one that caused the failure
[18:05:38] <rzl>	 volans: oh thanks, no I didn't have that -- where'd you get it?
[18:05:49] <volans>	 HW logs
[18:05:55] <volans>	 https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Show_logs
[18:06:15] <rzl>	 ah cheers, I was just using the ipmi-sel excerpt in the doc
[18:06:19] <rzl>	 will include, appreciate it
[21:42:28] <Emperor>	 My kludging of upstream CI was successful, my actual CR got merged now :)
[21:50:47] <Emperor>	 jy.nus: seriously, if you're not a Debian expert, leave quilt well alone :)