[07:20:08] ms-be1040 put its filesystems back together at least enough that they mounted OK...
[07:23:28] (I'm off to see the physio again in ~an hour, so not going to start another reimage until I'm back from that)
[08:14:27] trying reset of ms-be1051 bmc
[08:17:14] marostegui: remember when you had to transfer around 12TB of data and it was painful?
[08:17:27] don't even remind me of that
[08:17:41] well: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=backup1008&var-datasource=thanos&var-cluster=misc&from=1651479261645&to=1651588809679&viewPanel=28
[08:17:59] jynus: :-////////
[08:18:32] jynus: I don't remember how long it took but I am interested in knowing how long that will take XD
[08:18:45] the good news is that I am on a 10G (although shared) network
[08:18:58] Ah smart!
[08:19:00] so 30TiB will take less than 24 hours
[08:19:02] Mine was 1G!
[08:19:20] we really need to move all stateful servers to 10G
[08:19:46] 90% of the time it is not needed, but the remaining 10% is really painful
[08:35:54] did the bmc-reset work?
[08:36:30] I see 8 interfaces down still
[09:13:53] going for breakfast while I wait for the transfer to complete
[09:13:55] jynus: yes, ms-be1051 now up, I'll work through the remaining ones
[09:14:15] oh, not trying to pressure, just hoping my suggestion worked for you
[09:14:37] I didn't want to do more than one before my physio appt in case of 🔥
[09:14:43] +1
[09:20:10] that's them all green again
[09:20:24] there are 2 other hosts that may have the same issue: relforge - is that releng? wmcs?
[09:20:36] so we can ping them for the easy fix
[09:22:19] IHNI [I Have No Idea], sorry, maybe ask in -sre?
[09:22:58] relforge is search
[09:23:45] Emperor: sorry, wasn't asking you, just speaking aloud hoping someone in the channel would know
[09:23:59] and I was right :-)
[09:49:47] godog: Hey! We are trying to improve the maps tile pregeneration performance as described here: https://phabricator.wikimedia.org/T307182
[09:50:17] Do you see anything problematic with listing all the object keys from the container once a day?
[09:50:29] like `swift list `
[09:50:50] We just run it manually and it takes like ~10 mins
[09:58:11] nemo-yiannis: no I don't think so, SGTM
[09:58:19] cool, thanks!
[09:58:32] sure np!
[10:30:14] Amir1: re: skipping db.run_mysql lines, i'm going to look at a better solution in the afternoon.
[10:30:29] (namely providing -BN to db-mysql, and checking that all calling locations are updated accordingly)
[10:30:45] noted. Thanks.
[12:53:04] Emperor: just out of idle curiosity: swift servers don't touch the spinning rust at all during install; mounting those disks/partitions is done afterwards by puppet?
[12:53:12] (just wanting to check if i understood)
[12:54:46] kormat: yes, though the late_command briefly mounts one to inspect it for uid/gid
[12:55:11] gotcha
[13:00:09] Amir1: are there any unit/integration/etc tests for auto_schema that i can use to make sure i'm not breaking things?
[13:00:38] i see a `test.py`, but i think that's probably not it, right?
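A minimal sketch of the uid/gid inspection described at 12:54 (the late_command briefly mounting a swift data partition before puppet takes over). The real late_command is a debian-installer shell snippet and almost certainly differs; the device path, mount point, and probed directory below are assumptions for illustration only.

```python
#!/usr/bin/env python3
"""Sketch: briefly mount a swift data partition read-only and read the
uid/gid of a directory on it, roughly what the late_command reportedly does.
Device, mount point and directory names are assumptions, not the real values."""

import os
import subprocess

DEVICE = "/dev/sda3"          # hypothetical swift data partition
MOUNTPOINT = "/mnt/inspect"   # hypothetical temporary mount point
PROBE_DIR = "objects"         # hypothetical directory whose ownership we want

os.makedirs(MOUNTPOINT, exist_ok=True)
subprocess.run(["mount", "-o", "ro", DEVICE, MOUNTPOINT], check=True)
try:
    st = os.stat(os.path.join(MOUNTPOINT, PROBE_DIR))
    print(f"swift data owned by uid={st.st_uid} gid={st.st_gid}")
finally:
    subprocess.run(["umount", MOUNTPOINT], check=True)
```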
[13:23:58] * jynus is close to celebrating a successful, fully automated reimage at last
[13:24:10] \o/
[13:24:46] it is not like it required 7 firmware reimages...
[13:25:00] s/reimages/updates/
[13:27:53] PROBLEM - MariaDB sustained replica lag on m1 on db2078 is CRITICAL: 8.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[13:28:33] I wonder, however, why the manual recipe fails critically when running grub-install - does it assume it has to install it on every mounted partition?
[13:31:24] RECOVERY - MariaDB sustained replica lag on m1 on db2078 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[14:09:34] godog: ms-be2043 has now ended up with two partitions labelled swift-sda3 (likewise swift-sda4) because of instability in drive ordering. I think the current approach is just too fragile
[14:09:53] ...since I'm now going to have to fix this by hand in a fiddly and error-prone way. Again.
[14:10:17] Also my ssh connections to the host keep freezing, which is just making everything even more aggravating
[14:13:07] Emperor: ah I'm guessing because the ssd swapped on the first reboot? I'd expect puppet to label all or nothing though the first time it runs
[14:14:07] Emperor: did the other reimaged hosts come up with the correct ordering eventually?
[14:14:37] godog: often, no.
[14:14:55] puppet only falls in a heap if sd{a,b} are wrong, though, it seems
[14:15:27] ah
[14:16:48] yeah agreed though with the order ~persistently wrong that's fragile, I'm happy to brainstorm ideas if that'd help
[14:45:04] I don't think this will impact you much, but es backups will come this week 1 or 2 days late; the important thing is they will finally happen
[14:46:59] grub install failed. I'll try again. IWBNI [It Would Be Nice If] one single solitary swift host reimaged successfully.
[15:13:11] Ugh, and a filesystem is nadgered too
[15:13:17] 2
[15:26:22] reimage is going to fail because it's going to time out long before these xfs_repairs complete.
[15:32:57] ...which I think will mean I'll need to remove the icinga downtime myself, and the other thing in the reimage cookbook is "Updated Netbox data from PuppetDB"; is that going to be straightforward to do myself later?
[15:35:04] although I'm not here... the reimage cookbook does a few more things, but you can run it again with the --no-pxe option to resume from a late-finished debian-installer where the previous one failed
[15:38:05] volans: thanks! [you mean that reimage --no-pxe doesn't do any reinstallation, just the later stuff?]
[15:39:26] exactly, it assumes you've fixed d-i manually one way or another and the host got rebooted into the new OS by d-i and is there waiting to be set up
[15:39:29] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/hosts/reimage.py#45
[15:41:08] the old Icinga downtime should be removed anyway; the old Alertmanager one might survive (for lack of full support yet)
[15:41:35] Thank you :)
[16:26:09] I think I will be winding down early today, as I am blocked waiting for long-running processes to finish; I'll come back early tomorrow when those are done
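As a footnote, the 08:17-08:19 transfer-time estimates are easy to sanity-check. A quick sketch, assuming the link runs at line rate with no protocol overhead (real copies won't quite reach that):

```python
# Rough transfer-time check for the 08:19 estimate: 30 TiB over a 10 Gbit/s
# link vs. the earlier ~12 TB copy over 1 Gbit/s.

TIB = 2**40  # bytes in a tebibyte

def transfer_hours(size_bytes: float, link_gbps: float) -> float:
    """Hours needed to push size_bytes over a link of link_gbps gigabits/s."""
    return size_bytes * 8 / (link_gbps * 1e9) / 3600

print(f"30 TiB @ 10 Gbit/s: {transfer_hours(30 * TIB, 10):.1f} h")  # ~7.3 h
print(f"30 TiB @  1 Gbit/s: {transfer_hours(30 * TIB, 1):.0f} h")   # ~73 h
print(f"12 TB  @  1 Gbit/s: {transfer_hours(12e12, 1):.0f} h")      # ~27 h
```

Which matches both claims: well under 24 hours on the shared 10G link, and more than a day for the earlier 12TB copy on 1G.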