[10:38:25] <taavi>	 hi, sorry for taking a while to follow up on https://gerrit.wikimedia.org/r/c/operations/puppet/+/964871 - is there some automation I can use to run the required commands on all the mariadb instances on the clouddb hosts?
[10:40:07] <marostegui>	 taavi: nope
[11:40:19] <marostegui>	 btullis: The alerts on -operations, is that you?
[11:40:36] <marostegui>	 Ah, I saw your message there in between all of them, thanks
[11:41:11] <btullis>	 Yes, mariadb on an-coord100[1-2] is being migrated to an-mariadb100[1-2] right now. Apologies for the noise.
[13:05:36] <arnaudb>	 running T348183 on s4
[13:05:37] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[13:08:52] <marostegui>	 heh that's going to take a while with the image table in s4
[13:11:36] <arnaudb>	 we'll run a switchover for the masters in anycase imho
[13:11:49] <marostegui>	 Yeah, I meant the slaves
[13:12:06] <marostegui>	 Data type changes cannot be run online, thus the master cannot run them
[13:26:12] <urandom>	 btullis: fyi: the bullseye upgrade of aqs1012 did not go well, it is currently down. :(
[13:28:04] <urandom>	 there was an error in the gerrit for partition reuse (https://gerrit.wikimedia.org/r/c/operations/puppet/+/974259) —and long-story-short— it read `reuse-parts.cfg partman/custom/aqs-cassandra-8ssd-2srv.cfg` when it should have read `reuse-parts.cfg partman/custom/reuse-aqs-cassandra-8ssd-2srv.cfg`
[13:30:24] <urandom>	 the install bombed with a mismatched preseed values error, and so I fixed it (the installer config), and retried, but I'm guessing it may have done something to the partition table
[13:31:59] <urandom>	 I'm not sure what rescue options are (or if rescue is even possible), so if you have any suggestions, I'm all ears :)
[13:35:31] <marostegui>	 urandom: Is that related to the netboot changes that have been done? (see email)
[13:37:40] <urandom>	 marostegui: I don't guess so, this happened yesterday
[14:03:51] <btullis>	 urandom: I've just arrived back at my desk. Having a look in a minute.
[14:04:32] <urandom>	 btullis: I'm currently attempting another reimage after "trying something" :)
[14:04:44] <urandom>	 o.O
[14:10:01] <btullis>	 OK, keep pinging me if you'd like to look at it together. 
[14:10:11] <urandom>	 btullis: so... 
[14:10:14] <urandom>	 hrmm
[14:10:36] <urandom>	 I ended up seeing something different after being dropped into the partitioner this time
[14:10:51] <urandom>	 and it doesn't look...right
[14:11:40] <btullis>	 You can always use the `reuse-parts-test.cfg` if you want, which drops you into the partitioner on purpose.
[14:11:53] <urandom>	 it seems like it's missing one block device —sdh— and one raid array —md2—
[14:11:57] <urandom>	 yeah, that's what I didn
[14:12:00] <urandom>	 did
[14:13:25] <urandom>	 and it's missing the mount points
[14:21:05] <urandom>	 btullis: ok, if I'm understanding what I'm seeing, it's this: For whatever reason, `/dev/sdh` isn't showing, and  as a result, /dev/md2 is configured.
[14:21:26] <urandom>	  /dev/md2 *isn't* configured
[14:22:06] <urandom>	 it's made up of sd[e-h]2, and get's mounted as /srv/cassandra-b
[14:22:39] <btullis>	 Is it worth trying a cold boot? Maybe the SATA controller has got wedged or something. Maybe the disk has just failed altogether, which would be an inconvenient time for it to happen.
[14:24:05] <btullis>	 Is it possible that the disks are just being detected out of order. Are disks sd[a-g] detected, but no sdh and no sdi ?
[14:25:25] <urandom>	 a-g are there, h is missing (it should go from [a-h])
[14:26:34] <urandom>	 I don't see sdh detected at boot (via dmesg)
[14:27:27] <urandom>	 so yeah, maybe a cold boot? 🤷‍♂️
[14:27:50] <urandom>	 then if that doesn't work, maybe dcops and reseat it?  not sure what else to try.
[14:28:00] <urandom>	 s/and/can/
[14:28:27] <urandom>	 I could setup / and /srv/cassandra-a and maybe sort of the rest once it's up (if other steps fail)
[14:28:56] <btullis>	 I'm logging into the web interface of the drac with `ssh -N -L 8443:aqs1012.mgmt.eqiad.wmnet:443 cumin1001.eqiad.wmnet` and see if I can see any events relating to strorage. But yes, I think a cold boot is a good idea.
[14:31:48] <urandom>	 btullis: i'm trying that now
[14:34:49] <btullis>	 urandom: Cool. If it's one disk missing in a raid 10 we should also be able to get it working with one missing, then get DCOps to replace the disk and we can rebuild it.
[14:35:28] <urandom>	 Ok, same thing after a cold-boot
[14:35:56] <urandom>	 Should I explore having dcops try to reseat it, or just see if I can get it back up with / and /srv/cassandra-a?
[14:36:45] <urandom>	 i.e. leave the rest as-is until after it's been imaged
[14:37:33] <btullis>	 Looks like one missing here too. https://usercontent.irccloud-cdn.com/file/ZxVl6fZd/image.png
[14:38:48] <btullis>	 Yeah, why not try asking dcops to reseat it first, that's the simplest option isn't it? We're not pressed for time. 
[14:39:48] <btullis>	 If that doesn't work, I think we should try to assemble `/srv/cassandra-b`with only 3 out of 4 disks, even if we have to do it manually in the installer.
[14:42:09] <btullis>	 Comparing that with aqs1011 - which clearly has 8 disks detected https://usercontent.irccloud-cdn.com/file/plOhItUv/image.png
[14:57:17] <urandom>	 my confidence that I could do that using the d-i interface is...low(ish)
[14:58:22] <urandom>	 s/do that/do that properly/
[16:04:21] <volans>	 FYI dbctl alert in icinga is firing (see -operations) about db1130
[16:13:49] <marostegui>	 arnaudb: ^
[16:15:22] <arnaudb>	 on it
[16:16:10] <arnaudb>	 fixed sorry!