[10:38:25] hi, sorry for taking a while to follow up on https://gerrit.wikimedia.org/r/c/operations/puppet/+/964871 - is there some automation I can use to run the required commands on all the mariadb instances on the clouddb hosts?
[10:40:07] taavi: nope
[11:40:19] btullis: The alerts on -operations, is that you?
[11:40:36] Ah, I saw your message there in between all of them, thanks
[11:41:11] Yes, mariadb on an-coord100[1-2] is being migrated to an-mariadb100[1-2] right now. Apologies for the noise.
[13:05:36] running T348183 on s4
[13:05:37] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[13:08:52] heh that's going to take a while with the image table in s4
[13:11:36] we'll run a switchover for the masters in any case imho
[13:11:49] Yeah, I meant the slaves
[13:12:06] Data type changes cannot be run online, thus the master cannot run them
[13:26:12] btullis: fyi: the bullseye upgrade of aqs1012 did not go well, it is currently down. :(
[13:28:04] there was an error in the gerrit for partition reuse (https://gerrit.wikimedia.org/r/c/operations/puppet/+/974259) and, long story short, it read `reuse-parts.cfg partman/custom/aqs-cassandra-8ssd-2srv.cfg` when it should have read `reuse-parts.cfg partman/custom/reuse-aqs-cassandra-8ssd-2srv.cfg`
[13:30:24] the install bombed with a mismatched preseed values error, so I fixed the installer config and retried, but I'm guessing it may have done something to the partition table
[13:31:59] I'm not sure what the rescue options are (or if rescue is even possible), so if you have any suggestions, I'm all ears :)
[13:35:31] urandom: Is that related to the netboot changes that have been done? (see email)
[13:37:40] marostegui: I don't think so, this happened yesterday
[14:03:51] urandom: I've just arrived back at my desk. Having a look in a minute.
[14:04:32] btullis: I'm currently attempting another reimage after "trying something" :)
[14:04:44] o.O
[14:10:01] OK, keep pinging me if you'd like to look at it together.
[14:10:11] btullis: so...
[14:10:14] hrmm
[14:10:36] I ended up seeing something different after being dropped into the partitioner this time
[14:10:51] and it doesn't look...right
[14:11:40] You can always use the `reuse-parts-test.cfg` if you want, which drops you into the partitioner on purpose.
[14:11:53] it seems like it's missing one block device (sdh) and one raid array (md2)
[14:11:57] yeah, that's what I did
[14:13:25] and it's missing the mount points
[14:21:05] btullis: ok, if I'm understanding what I'm seeing, it's this: for whatever reason, `/dev/sdh` isn't showing, and as a result, /dev/md2 isn't configured.
[14:22:06] it's made up of sd[e-h]2, and gets mounted as /srv/cassandra-b
[14:22:39] Is it worth trying a cold boot? Maybe the SATA controller has got wedged or something. Maybe the disk has just failed altogether, which would be an inconvenient time for it to happen.
[14:24:05] Is it possible that the disks are just being detected out of order? Are disks sd[a-g] detected, but no sdh and no sdi?
[14:25:25] a-g are there, h is missing (it should go from [a-h])
[14:26:34] I don't see sdh detected at boot (via dmesg)
[14:27:27] so yeah, maybe a cold boot? 🤷‍♂️
[14:27:50] then if that doesn't work, maybe dcops can reseat it? not sure what else to try.
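For reference, a minimal sketch of the checks being discussed here, run from a rescue shell or the installer's shell on aqs1012 (where these tools are available). Device and array names follow the conversation (sd[a-h], with md2 expected to be built from sd[e-h]2); exact output will differ per host:

```bash
# List the block devices the kernel has detected; sdh should appear alongside sda-sdg.
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT

# Check whether the kernel saw the disk at all during boot.
dmesg | grep -i sdh

# Show which md arrays were assembled; md2 should be listed here if it exists.
cat /proc/mdstat

# Detailed view of the array, if it was assembled.
mdadm --detail /dev/md2
```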
[14:28:27] I could set up / and /srv/cassandra-a and maybe sort out the rest once it's up (if other steps fail)
[14:28:56] I'm logging into the web interface of the drac with `ssh -N -L 8443:aqs1012.mgmt.eqiad.wmnet:443 cumin1001.eqiad.wmnet` to see if I can see any events relating to storage. But yes, I think a cold boot is a good idea.
[14:31:48] btullis: i'm trying that now
[14:34:49] urandom: Cool. If it's one disk missing in a raid 10 we should also be able to get it working with one missing, then get DCOps to replace the disk and we can rebuild it.
[14:35:28] Ok, same thing after a cold boot
[14:35:56] Should I explore having dcops try to reseat it, or just see if I can get it back up with / and /srv/cassandra-a?
[14:36:45] i.e. leave the rest as-is until after it's been imaged
[14:37:33] Looks like one missing here too. https://usercontent.irccloud-cdn.com/file/ZxVl6fZd/image.png
[14:38:48] Yeah, why not try asking dcops to reseat it first, that's the simplest option isn't it? We're not pressed for time.
[14:39:48] If that doesn't work, I think we should try to assemble `/srv/cassandra-b` with only 3 out of 4 disks, even if we have to do it manually in the installer.
[14:42:09] Comparing that with aqs1011 - which clearly has 8 disks detected https://usercontent.irccloud-cdn.com/file/plOhItUv/image.png
[14:57:17] my confidence that I could do that properly using the d-i interface is...low(ish)
[16:04:21] FYI dbctl alert in icinga is firing (see -operations) about db1130
[16:13:49] arnaudb: ^
[16:15:22] on it
[16:16:10] fixed, sorry!
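For reference, a minimal sketch of the fallback discussed at 14:34:49 and 14:39:48: bringing up /srv/cassandra-b as a degraded RAID 10 with only three of its four members, then re-adding the fourth disk once DC Ops have reseated or replaced it. The partition numbers, filesystem, and mount point are assumptions based on the conversation, not the actual partman recipe:

```bash
# Create md2 as a 4-device RAID 10 with one member deliberately absent ("missing").
mdadm --create /dev/md2 --level=10 --raid-devices=4 \
    /dev/sde2 /dev/sdf2 /dev/sdg2 missing

# Format and mount it as the Cassandra data volume (filesystem choice is assumed).
mkfs.ext4 /dev/md2
mkdir -p /srv/cassandra-b
mount /dev/md2 /srv/cassandra-b

# Once the replacement disk is partitioned, add it back and let the array resync.
mdadm --manage /dev/md2 --add /dev/sdh2
cat /proc/mdstat
```

The array runs degraded (with no redundancy on that mirror pair) until the fourth member is added back and the resync completes.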