[08:57:37] now I cannot unsee it, https://gerrit.wikimedia.org/r/c/operations/puppet/+/987401
[08:58:40] haha
[08:58:48] <_joe_> -1
[08:58:52] <_joe_> I like the typo
[08:59:05] <_joe_> Emperor: think of how much volans will cringe every time he logs into that server
[08:59:05] you'll like my branch name then ;p
[08:59:36] <_joe_> we should just leave it like that to add some spicerack to his life
[08:59:56] it is a lovely day in the global village and you are a horrible goose
[09:03:24] lol
[09:36:48] lolol
[09:37:52] that's "untitled goose" to you!
[10:56:18] <_joe_> I loved that game
[10:56:34] <_joe_> I did all the hard tasks minus the hardest, and I'm not a gamer
[10:56:38] /honk
[11:48:18] Ah, I've worked out why we didn't get an alert about the failed drive in ms-be2068 - our alerting thinks the megaraid is in an optimal state, because when the drive failed it was effectively removed from the system - so megacli thinks we have 23 happy drives, not 23 happy and 1 sad
[11:48:32] [well, we got an alert from failing puppet, but YGTI]
[11:53:48] * Emperor adds another entry to the "why I hate hardware RAID controllers" list
[11:54:02] that's a known issue, we do have some detection for those but IIRC it depends on how many disks there are in a virtual drive
[11:54:11] example T316565
[11:54:12] T316565: Degraded RAID on db2149 - https://phabricator.wikimedia.org/T316565
[11:55:08] volans: right, and these are all single-drive RAID-0 arrays, so when the PD goes the VD goes likewise. Is there a phab for the known issue so I can add it to my notes-to-self?
[11:55:42] running sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli -a, it's fairly easy to see that VD 7 is missing
[11:56:48] I had achieved as much, hence T354180 :)
[11:56:49] T354180: Disk (sdh) failed in ms-be2068 - https://phabricator.wikimedia.org/T354180
[11:57:59] I don't see a VD 1 either, but that might be intentional
[11:58:30] without knowing in advance how many disks there are, it's hard to make the check detect this in all cases
[11:58:56] I'm sure it was discussed in the past, not sure if there was a task, probably yes, but I couldn't find it right away
[12:00:37] NP, don't spend time on it, I'm mostly curious :)
[12:10:15] Emperor: The Hadoop workers have a similar problem, i.e. RAID0 for each data volume, so there is no real way for the MegaRAID check to know if the live config matches what it is supposed to be.
[12:11:23] It's perhaps not the most elegant solution, but the way it's handled currently is to check for a minimum number of mounted data volumes and fail a puppet run if it doesn't meet the minimum. https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/hadoop/common.pp#L394-L405
[13:03:28] btullis: we're moving away from RAID0 for swift drives in any case (and puppet failures likewise pick up some)
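[A minimal sketch of the "minimum mounted data volumes" approach described above, transplanted to a swift backend: since a dead single-drive RAID-0 VD simply vanishes and megacli keeps reporting an optimal state, counting mounted data filesystems catches the failure instead. The mount-point prefix and the threshold of 24 are illustrative assumptions, not taken from the actual Puppet manifest linked above.]

```bash
#!/bin/bash
# Fail loudly if fewer than MIN_VOLUMES data filesystems are mounted.
# Prefix and threshold are hypothetical; adjust per host profile.
MIN_VOLUMES=24
mounted=$(findmnt -rn -o TARGET | grep -c '^/srv/swift-storage/')
if [ "$mounted" -lt "$MIN_VOLUMES" ]; then
    echo "CRITICAL: only ${mounted}/${MIN_VOLUMES} data volumes mounted"
    exit 2
fi
echo "OK: ${mounted} data volumes mounted"
```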
[15:44:25] it seems that some of the certs we provide through puppet (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/base/files/ca/) need updating, as they have expired - is that something known?
[15:44:38] specifically the GlobalSign_Organization_Validation_CA_-_SHA256_-_G2.crt
[15:46:16] those are intermediates
[15:46:32] if they're dead, then hopefully we have nothing linking through them anymore
[15:47:06] it's non-trivial to figure that out, though :)
[15:48:22] I guess basically do an openssl CLI to output who the signer is on all the certs present in modules/profile/files/ssl/, and see if there are intermediates from the list you're looking at for which we don't have any certs using them there?
[15:48:52] even then, those certs might be historical and dead (e.g. we still have digicert-2021 there, which is definitely dead)
[15:48:56] we are getting errors on cloud vps instances when trying to verify vk.com, as it uses that one as an intermediate
[15:50:28] yeah, that makes a certain sense
[15:50:44] we deploy those intermediates to have local copies in /etc/ (for chain-building purposes), but once they're there...
[15:51:14] it could easily be the case that vk.com is cross-signed through multiple paths, and one of them is the expired intermediate, but we still deploy it locally so it's picked up
[15:52:33] cleaning it up in the general case is a little tricky. but we could start with cleaning up certs from files/ssl/ that are sufficiently expired, and then see if any remaining ones use the deployed intermediates that are expired, etc
[15:53:11] most of those certs are internal ones signed from palladium
[15:54:58] AFAICS, for this one particular case you mentioned: the only cert we have in there that uses it is "star.tools.wmflabs.org.crt", which itself expired in 2020
[15:55:05] I'm guessing that's not deployed anywhere
[15:55:52] (I don't see any refs in production puppet anyways, and it can't be useful)
[15:56:20] so maybe do a patch to remove star.tools.wmflabs.org.crt, and then to remove the expired GlobalSign_Organization_Validation_CA_-_SHA256_-_G2.crt?
[15:57:49] note when removing the sslcert::ca part in base/certificates.pp, you should step through an ensure=>absent deployment first, so that it can clean up messes.
[15:59:13] yep, was looking at the same stuff
[15:59:18] https://www.irccloud.com/pastebin/cENRJBae/
[16:00:29] I'll send a patch, yep, thanks!
[16:02:45] === DigiCert_High_Assurance_CA-3.crt Not After : Apr 3 00:00:00 2022 GMT
[16:02:45] === GeoTrust_Global_CA.crt Not After : Aug 21 04:00:00 2018 GMT
[16:02:45] === GlobalSign_Organization_Validation_CA_-_SHA256_-_G2.crt Not After : Aug 2 10:00:00 2022 GMT
[16:02:48] === RapidSSL_SHA256_CA_-_G3.crt Not After : May 20 21:39:32 2022 GMT
[16:02:50] === wmf_ca_2017_2020.crt Not After : Jul 18 20:43:26 2020 GMT
[16:02:53] ^ all of these sslcert::ca intermediates are expired, FWIW
[17:32:31] created T354295 to follow up 👍
[17:32:32] T354295: [puppet] Remove expired and unused certs from modules/profile/files/ssl/ and modules/base/files/ca - https://phabricator.wikimedia.org/T354295
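[A sketch of the openssl pass described above: print expiry and issuer for every deployed cert, so that expired intermediates with no remaining users can be spotted. The repo paths are the ones from the discussion; the loop itself is illustrative and assumes a checkout of operations/puppet.]

```bash
#!/bin/bash
# For each cert, show when it expires and who issued it; comparing the
# issuers seen here against modules/base/files/ca/ shows which deployed
# intermediates nothing chains through anymore.
for crt in modules/profile/files/ssl/*.crt modules/base/files/ca/*.crt; do
    echo "=== $crt"
    openssl x509 -in "$crt" -noout -enddate -issuer
    # -checkend 0 exits non-zero if the cert has already expired
    openssl x509 -in "$crt" -noout -checkend 0 >/dev/null || echo "    EXPIRED"
done
```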
[18:36:35] will "icu67" (the APT component/icu67) that is in distro "buster-wikimedia" not be needed in distro "bullseye-wikimedia"?
[18:37:07] as in: [apt1001:~] $ sudo -i reprepro list bullseye-wikimedia | grep icu has no output, but sudo -i reprepro list buster-wikimedia | grep icu does
[18:37:40] (https://icu.unicode.org/download/67)
[18:41:33] hmmm, https://phabricator.wikimedia.org/T350767 says "Now that the ICU67 migration has completed, the PHP 7.4 packages need to be rebuilt"
[18:41:52] but does that mean I should expect a component for icu67?
[20:48:04] um, it looks like I broke something and I'm not sure how (or what to do about it)
[20:49:05] I was reimaging some mw hosts in eqiad, mw1377-1383 (all row D), and after a reboot the network didn't come back
[20:49:30] no route to host on the normal interface, and I get permission denied on mgmt
[20:49:53] (yes, I shouldn't have run so many at once, sorry...)
[20:49:56] any pointers?
[20:50:38] kamila_: I can connect to mw1377
[20:50:47] uptime 3 min
[20:50:49] kamila_: I can connect to mw1377.mgmt.eqiad.wmnet just fine, and the serial console seems accessible
[20:51:01] did it _just_ come back?
[20:51:22] that's hilarious
[20:51:34] I'd been staring at it for what felt like half an hour
[20:51:39] ok, nvm, sorry for the noise...
[20:51:41] weird
[20:51:52] probably the first puppet run took that long
[20:52:03] sometimes it is like that, always when you ask
[20:52:11] okay, well
[20:52:23] if I hadn't made noise it would be hanging there all night
[20:52:31] yea :)
[20:53:25] fwiw, 4 hosts at a time was always my go-to number as well; split the terminal into 4 and it feels like you can still follow it, but it's not too slow either
[20:53:37] let's wire the reimage cookbook to automatically ask on IRC to make things faster? :D
[20:54:14] :D
[20:54:28] the part where the reimage cookbook claims it failed might be another story
[20:54:43] maybe it was only "failed to downtime", but maybe not
[20:56:26] kamila_: if you reimage while the host name is already assigned to a puppet role in site.pp, it might take long and fail unless the puppet code really works on the very first run, which isn't always the case - it might need 2 runs
[20:56:54] mutante: so my best bet would be to re-run the reimage, and then it actually has a chance of working this time?
[20:56:55] so if that's the case you could assign the "insetup" role, then reimage (quickly), and then assign the prod role and run puppet
[20:57:23] I am not sure what state the hosts are in
[20:57:31] are you changing the role from appserver to k8s?
[20:57:35] yes
[20:58:10] I think I would first move them just from appserver to the "insetup" role, then do the reimage, which should be failsafe
[20:58:22] then apply the new role on one host, run puppet and see what happens on the first run
[20:58:24] ok
[20:58:30] it might be that it works only after the second run
[20:58:51] thank you
[20:58:59] which doesn't matter once that second run has happened... but it does for the cookbook
[20:59:02] yw
[21:09:04] kamila_: a k8s worker does this whole calico networking setup that non-k8s servers don't get. that adds new network interfaces, and then it also has LVS, which adds all those lo: interfaces; that all happens on the first puppet run, as started by the cookbook, so I could imagine that's why it was down - maybe it restarted networking
[21:10:06] maybe... but it's a bit awkward because the same thing worked fine for the codfw hosts
[21:10:16] not the same HW, but still
[21:10:28] mw1377 did get the calico interfaces eventually, so:
[21:10:29] mw1377:/# ip a s | grep cali
[21:12:11] just saying maybe it's temporarily down during the initial run
[21:14:00] maybe... I am very confused now :D
[21:15:08] you will know more when you run puppet manually after changing the role, I think
[21:15:19] yeah
[21:15:22] or you can dig more into the cookbook logs
[21:15:59] well, I'll probably still be confused, given that I wanted to babysit these over dinner and suddenly it's 10pm '^^
[21:16:37] oh, yea, then don't do that tonight
[21:17:03] I would either do nothing and just downtime them with the downtime cookbook
[21:17:09] or do the reimage with insetup and then stop
[21:17:28] yeah, I'll give the reimage a try, and if not, then downtime it is...
[21:17:53] I just don't want to leave it in a state where it'll start alerting
[21:19:30] right! +1. the downtime cookbook should do it. assuming these are already set to inactive or completely removed from confctl, then deployers should not be affected either
[21:20:31] good to know, thank you
[21:20:51] so no renaming of the machines when they change role?
[21:22:02] yeah, renaming was deemed to be a silly amount of work considering that we have to do this for a lot of hosts
[21:22:47] gotcha
[21:26:39] 21:24:39 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-01-03-212406-publish (ran as mwdeploy@mw1377.eqiad.wmnet) returned [255]: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[21:26:39] @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
[21:26:39] @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[21:27:02] 21:26:46 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-01-03-212406-publish (ran as mwdeploy@mw1382.eqiad.wmnet) returned [255]: ssh: connect to host mw1382.eqiad.wmnet port 22: Connection timed out
[21:27:12] should those hosts have been depooled?
[21:36:21] mw1382 is pooled: inactive
[21:36:51] and in the kubernetes cluster
[21:38:12] maybe this is the part where not renaming actually created an extra step, heh
[21:38:48] I don't see it in another "scap / dsh group"
[21:41:08] deployment and reimage of an appserver might not be the best combo
[21:42:03] I did attempt to depool the hosts
[21:42:31] but I'm not sure what state they are in now
[21:43:16] yes, they are depooled. just needs a downtime
[21:43:16] sorry for the mess zabe, trying to fix it now...
[21:43:52] I think the list of k8s nodes to pull the images from is pulled via puppetdb
[21:43:57] ack, downtiming
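[For reference, a rough sketch of the "check the pooled state, then downtime" sequence discussed above, run from a cluster-management host. The confctl syntax follows common usage; the downtime cookbook's flag names are from memory and should be verified against its --help, so treat them as assumptions.]

```bash
# Show how conftool currently sees the host (pooled: yes / no / inactive).
sudo confctl select 'name=mw1382.eqiad.wmnet' get

# Mark it inactive so deployment tooling and LVS ignore it mid-reimage.
sudo confctl select 'name=mw1382.eqiad.wmnet' set/pooled=inactive

# Silence alerts for a few hours via the downtime cookbook
# (flag names are illustrative - confirm with `sudo cookbook sre.hosts.downtime --help`).
sudo cookbook sre.hosts.downtime --hours 8 -r "mid-reimage to k8s worker" mw1382.eqiad.wmnet
```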