[07:20:08] ms-be1040 put its filesystems back together at least enough that they mounted OK...
[07:23:28] (I'm off to see the physio again in ~an hour, so not going to start another reimage until I'm back from that)
[08:14:27] trying reset of ms-be1051 bmc
[08:17:14] marostegui: remember when you had to transfer around 12TB of data and it was painful?
[08:17:27] don't even remind me of that
[08:17:41] well: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=backup1008&var-datasource=thanos&var-cluster=misc&from=1651479261645&to=1651588809679&viewPanel=28
[08:17:59] jynus: :-////////
[08:18:32] jynus: I don't remember how long it took but I am interested in knowing how long that will take XD
[08:18:45] the good news is that I am on a 10G (although shared) network
[08:18:58] Ah smart!
[08:19:00] so 30TiB will take less than 24 hours
[08:19:02] Mine was 1G!
[08:19:20] we really need to move all stateful servers to 10G
[08:19:46] 90% of the time it is not needed, but the remaining 10% is really painful
[08:35:54] did the bmc-reset work?
[08:36:30] I see 8 interfaces down still
[09:13:53] going for breakfast while I wait for the transfer to complete
[09:13:55] jynus: yes, ms-be1051 now up, I'll work through the remaining ones
[09:14:15] oh, not trying to pressure, just hoping my suggestion worked for you
[09:14:37] I didn't want to do more than one before my physio appt in case of 🔥
[09:14:43] +1
[09:20:10] that's them all green again
[09:20:24] there are 2 other hosts that may have the same issue: relforge - is that releng? wmcs?
[09:20:36] so we can ping them for the easy fix
[09:22:19] IHNI [I Have No Idea], sorry, maybe ask in -sre?
[09:22:58] relforge is search
[09:23:45] Emperor: sorry, wasn't asking you, just speaking aloud hoping someone in the channel would know
[09:23:59] and I was right :-)
[09:49:47] godog: Hey! We are trying to improve the maps tile pregeneration performance as described here: https://phabricator.wikimedia.org/T307182
[09:50:17] Do you see anything problematic with listing all the object keys from the container once a day?
[09:50:29] like `swift list `
[09:50:50] We just run it manually and it takes like ~10 mins
[09:58:11] nemo-yiannis: no I don't think so, SGTM
[09:58:19] cool, thanks!
[09:58:32] sure np!
[10:30:14] Amir1: re: skipping db.run_mysql lines, i'm going to look at a better solution in the afternoon.
[10:30:29] (namely providing -BN to db-mysql, and checking that all calling locations are updated accordingly)
[10:30:45] noted. Thanks.
[12:53:04] Emperor: just out of idle curiosity: swift servers don't touch the spinning rust at all during install; mounting those disks/partitions is done afterwards by puppet?
[12:53:12] (just wanting to check if i understood)
[12:54:46] kormat: yes, though the late_command briefly mounts one to inspect it for uid/gid
[12:55:11] gotcha
[13:00:09] Amir1: are there any unit/integration/etc tests for auto_schema that i can use to make sure i'm not breaking things?
[13:00:38] i see a `test.py`, but i think that's probably not it, right?
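A minimal sketch of the uid/gid inspection described at 12:54 (the late_command briefly mounting a swift data partition before puppet takes over). The real late_command is a debian-installer shell snippet and almost certainly differs; the device path, mount point, and probed directory below are assumptions for illustration only.

```python
#!/usr/bin/env python3
"""Sketch: briefly mount a swift data partition read-only and read the
uid/gid of a directory on it, roughly what the late_command reportedly does.
Device, mount point and directory names are assumptions, not the real values."""

import os
import subprocess

DEVICE = "/dev/sda3"          # hypothetical swift data partition
MOUNTPOINT = "/mnt/inspect"   # hypothetical temporary mount point
PROBE_DIR = "objects"         # hypothetical directory whose ownership we want

os.makedirs(MOUNTPOINT, exist_ok=True)
subprocess.run(["mount", "-o", "ro", DEVICE, MOUNTPOINT], check=True)
try:
    st = os.stat(os.path.join(MOUNTPOINT, PROBE_DIR))
    print(f"swift data owned by uid={st.st_uid} gid={st.st_gid}")
finally:
    subprocess.run(["umount", MOUNTPOINT], check=True)
```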
[13:23:58] * jynus is close to celebrating a successful, fully automated reimage at last
[13:24:10] \o/
[13:24:46] it is not like it required 7 firmware reimages...
[13:25:00] s/reimages/updates/
[13:27:53] PROBLEM - MariaDB sustained replica lag on m1 on db2078 is CRITICAL: 8.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[13:28:33] I wonder, however, why the manual recipe fails critically when running grub-install - does it assume it has to install it on every mounted partition?
[13:31:24] RECOVERY - MariaDB sustained replica lag on m1 on db2078 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[14:09:34] godog: ms-be2043 has now ended up with two partitions labelled swift-sda3 (likewise swift-sda4) because of instability in drive ordering. I think the current approach is just too fragile
[14:09:53] ...since I'm now going to have to fix this by hand in a fiddly and error-prone way. Again.
[14:10:17] Also my ssh connections to the host keep freezing, which is just making everything even more aggravating
[14:13:07] Emperor: ah I'm guessing because the ssd swapped on the first reboot? I'd expect puppet to label all or nothing though the first time it runs
[14:14:07] Emperor: did the other reimaged hosts come up with the correct ordering eventually?
[14:14:37] godog: often, no.
[14:14:55] puppet only falls in a heap if sd{a,b} are wrong, though, it seems
[14:15:27] ah
[14:16:48] yeah agreed though with the order ~persistently wrong that's fragile, I'm happy to brainstorm ideas if that'd help
[14:45:04] I don't think this will impact you much, but es backups will come this week 1 or 2 days late; the important thing is they will finally happen
[14:46:59] grub install failed. I'll try again. IWBNI [It Would Be Nice If] one single solitary swift host reimaged successfully.
[15:13:11] Ugh, and a filesystem is nadgered too
[15:13:17] 2
[15:26:22] reimage is going to fail because it's going to time out long before these xfs_repairs complete.
[15:32:57] ...which I think will mean I'll need to remove the icinga downtime myself, and the other thing in the reimage cookbook is "Updated Netbox data from PuppetDB"; is that going to be straightforward to do myself later?
[15:35:04] although I'm not here... the reimage cookbook does a few more things, but you can run it again with the --no-pxe option to resume from a late-finished debian-installer where the previous one failed
[15:38:05] volans: thanks! [you mean that reimage --no-pxe doesn't do any reinstallation, just the later stuff?]
[15:39:26] exactly, it assumes you've fixed d-i manually one way or another and the host got rebooted into the new OS by d-i and is there waiting to be set up
[15:39:29] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/hosts/reimage.py#45
[15:41:08] the old Icinga downtime should be removed anyway; the old Alertmanager one might survive (for lack of full support yet)
[15:41:35] Thank you :)
[16:26:09] I think I will be winding down early today, as I am blocked waiting for long-running processes to finish; I'll come back early tomorrow when those are done
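As a footnote, the 08:17-08:19 transfer-time estimates are easy to sanity-check. A quick sketch, assuming the link runs at line rate with no protocol overhead (real copies won't quite reach that):

```python
# Rough transfer-time check for the 08:19 estimate: 30 TiB over a 10 Gbit/s
# link vs. the earlier ~12 TB copy over 1 Gbit/s.

TIB = 2**40  # bytes in a tebibyte

def transfer_hours(size_bytes: float, link_gbps: float) -> float:
    """Hours needed to push size_bytes over a link of link_gbps gigabits/s."""
    return size_bytes * 8 / (link_gbps * 1e9) / 3600

print(f"30 TiB @ 10 Gbit/s: {transfer_hours(30 * TIB, 10):.1f} h")  # ~7.3 h
print(f"30 TiB @  1 Gbit/s: {transfer_hours(30 * TIB, 1):.0f} h")   # ~73 h
print(f"12 TB  @  1 Gbit/s: {transfer_hours(12e12, 1):.0f} h")      # ~27 h
```

Which matches both claims: well under 24 hours on the shared 10G link, and more than a day for the earlier 12TB copy on 1G.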