[08:57:37] now I cannot unsee it, https://gerrit.wikimedia.org/r/c/operations/puppet/+/987401
[08:58:40] haha
[08:58:48] <_joe_> -1
[08:58:52] <_joe_> I like the typo
[08:59:05] <_joe_> Emperor: think of how much volans will cringe every time he logs into that server
[08:59:05] you'll like my branch name then ;p
[08:59:36] <_joe_> we should just leave it like that to add some spicerack to his life
[08:59:56] it is a lovely day in the global village and you are a horrible goose
[09:03:24] lol
[09:36:48] lolol
[09:37:52] that's "untitled goose" to you!
[10:56:18] <_joe_> I loved that game
[10:56:34] <_joe_> I did all the hard tasks minus the hardest, and I'm not a gamer
[10:56:38] /honk
[11:48:18] Ah, I've worked out why we didn't get an alert about the failed drive in ms-be2068 - our alerting thinks the megaraid is in an optimal state, because when the drive failed it was effectively removed from the system - so megacli thinks we have 23 happy drives, not 23 happy and 1 sad
[11:48:32] [well, we got an alert from failing puppet, but YGTI]
[11:53:48] * Emperor adds another entry to the "why I hate hardware RAID controllers" list
[11:54:02] that's a known issue, we do have some detection for those but IIRC it depends on how many disks there are in a virtual drive
[11:54:11] example T316565
[11:54:12] T316565: Degraded RAID on db2149 - https://phabricator.wikimedia.org/T316565
[11:55:08] volans: right, and these are all single-drive RAID-0 arrays, so when the PD goes the VD goes likewise. Is there a phab for the known issue so I can add it to my notes-to-self?
[11:55:42] running sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli -a, it's fairly easy to see that VD 7 is missing
[11:56:48] I had achieved as much, hence T354180 :)
[11:56:49] T354180: Disk (sdh) failed in ms-be2068 - https://phabricator.wikimedia.org/T354180
[11:57:59] I don't see a VD 1 either, but that might be intentional
[11:58:30] without knowing in advance how many disks there are, it's hard to make the check detect this in all cases
[11:58:56] I'm sure it was discussed in the past, not sure if there was a task, probably yes, but I couldn't find it right away
[12:00:37] NP, don't spend time on it, I'm mostly curious :)
[12:10:15] Emperor: The Hadoop workers have a similar problem, i.e. RAID0 for each data volume, so there is no real way for the MegaRAID check to know if the live config matches what it is supposed to be.
[12:11:23] It's perhaps not the most elegant solution, but the way it's handled currently is to check for a minimum number of mounted data volumes and fail a puppet run if it doesn't meet the minimum. https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/hadoop/common.pp#L394-L405
[13:03:28] btullis: we're moving away from RAID0 for swift drives in any case (and puppet failures likewise pick up some)
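[A minimal sketch of the "minimum mounted data volumes" approach described above, transplanted to a swift backend: since a dead single-drive RAID-0 VD simply vanishes and megacli keeps reporting an optimal state, counting mounted data filesystems catches the failure instead. The mount-point prefix and the threshold of 24 are illustrative assumptions, not taken from the actual Puppet manifest linked above.]

```bash
#!/bin/bash
# Fail loudly if fewer than MIN_VOLUMES data filesystems are mounted.
# Prefix and threshold are hypothetical; adjust per host profile.
MIN_VOLUMES=24
mounted=$(findmnt -rn -o TARGET | grep -c '^/srv/swift-storage/')
if [ "$mounted" -lt "$MIN_VOLUMES" ]; then
    echo "CRITICAL: only ${mounted}/${MIN_VOLUMES} data volumes mounted"
    exit 2
fi
echo "OK: ${mounted} data volumes mounted"
```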
[15:44:25] it seems that some of the certs we provide through puppet (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/base/files/ca/) need updating, as they have expired - is that something known?
[15:44:38] specifically the GlobalSign_Organization_Validation_CA_-_SHA256_-_G2.crt
[15:46:16] those are intermediates
[15:46:32] if they're dead, then hopefully we have nothing linking through them anymore
[15:47:06] it's non-trivial to figure that out, though :)
[15:48:22] I guess basically do an openssl CLI to output who the signer is on all the certs present in modules/profile/files/ssl/, and see if there are intermediates from the list you're looking at for which we don't have any certs using them there?
[15:48:52] even then, those certs might be historical and dead (e.g. we still have digicert-2021 there, which is definitely dead)
[15:48:56] we are getting errors on cloud vps instances when trying to verify vk.com, as it uses that one as an intermediate
[15:50:28] yeah, that makes a certain sense
[15:50:44] we deploy those intermediates to have local copies in /etc/ (for chain-building purposes), but once they're there...
[15:51:14] it could easily be the case that vk.com is cross-signed through multiple paths, and one of them is the expired intermediate, but we still deploy it locally so it's picked up
[15:52:33] cleaning it up in the general case is a little tricky. but we could start with cleaning up certs from files/ssl/ that are sufficiently expired, and then see if any remaining ones use the deployed intermediates that are expired, etc
[15:53:11] most of those certs are internal ones signed from palladium
[15:54:58] AFAICS, for this one particular case you mentioned: the only cert we have in there that uses it is "star.tools.wmflabs.org.crt", which itself expired in 2020
[15:55:05] I'm guessing that's not deployed anywhere
[15:55:52] (I don't see any refs in production puppet anyways, and it can't be useful)
[15:56:20] so maybe do a patch to remove star.tools.wmflabs.org.crt, and then to remove the expired GlobalSign_Organization_Validation_CA_-_SHA256_-_G2.crt?
[15:57:49] note when removing the sslcert::ca part in base/certificates.pp, you should step through an ensure=>absent deployment first, so that it can clean up messes.
[15:59:13] yep, was looking at the same stuff
[15:59:18] https://www.irccloud.com/pastebin/cENRJBae/
[16:00:29] I'll send a patch, yep, thanks!
[16:02:45] === DigiCert_High_Assurance_CA-3.crt Not After : Apr 3 00:00:00 2022 GMT
[16:02:45] === GeoTrust_Global_CA.crt Not After : Aug 21 04:00:00 2018 GMT
[16:02:45] === GlobalSign_Organization_Validation_CA_-_SHA256_-_G2.crt Not After : Aug 2 10:00:00 2022 GMT
[16:02:48] === RapidSSL_SHA256_CA_-_G3.crt Not After : May 20 21:39:32 2022 GMT
[16:02:50] === wmf_ca_2017_2020.crt Not After : Jul 18 20:43:26 2020 GMT
[16:02:53] ^ all of these sslcert::ca intermediates are expired, FWIW
[17:32:31] created T354295 to follow up 👍
[17:32:32] T354295: [puppet] Remove expired and unused certs from modules/profile/files/ssl/ and modules/base/files/ca - https://phabricator.wikimedia.org/T354295
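[A sketch of the openssl pass described above: print expiry and issuer for every deployed cert, so that expired intermediates with no remaining users can be spotted. The repo paths are the ones from the discussion; the loop itself is illustrative and assumes a checkout of operations/puppet.]

```bash
#!/bin/bash
# For each cert, show when it expires and who issued it; comparing the
# issuers seen here against modules/base/files/ca/ shows which deployed
# intermediates nothing chains through anymore.
for crt in modules/profile/files/ssl/*.crt modules/base/files/ca/*.crt; do
    echo "=== $crt"
    openssl x509 -in "$crt" -noout -enddate -issuer
    # -checkend 0 exits non-zero if the cert has already expired
    openssl x509 -in "$crt" -noout -checkend 0 >/dev/null || echo "    EXPIRED"
done
```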
[18:36:35] will "icu67" (the APT component/icu67) that is in distro "buster-wikimedia" not be needed in distro "bullseye-wikimedia"?
[18:37:07] as in: [apt1001:~] $ sudo -i reprepro list bullseye-wikimedia | grep icu has no output, but sudo -i reprepro list buster-wikimedia | grep icu does
[18:37:40] (https://icu.unicode.org/download/67)
[18:41:33] hmmm, https://phabricator.wikimedia.org/T350767 says "Now that the ICU67 migration has completed, the PHP 7.4 packages need to be rebuilt"
[18:41:52] but does that mean I should expect a component for icu67?
[20:48:04] um, it looks like I broke something and I'm not sure how (or what to do about it)
[20:49:05] I was reimaging some mw hosts in eqiad, mw1377-1383 (all row D), and after a reboot the network didn't come back
[20:49:30] no route to host on the normal interface, and I get permission denied on mgmt
[20:49:53] (yes, I shouldn't have run so many at once, sorry...)
[20:49:56] any pointers?
[20:50:38] kamila_: I can connect to mw1377
[20:50:47] uptime 3 min
[20:50:49] kamila_: I can connect to mw1377.mgmt.eqiad.wmnet just fine, and the serial console seems accessible
[20:51:01] did it _just_ come back?
[20:51:22] that's hilarious
[20:51:34] I'd been staring at it for what felt like half an hour
[20:51:39] ok, nvm, sorry for the noise...
[20:51:41] weird
[20:51:52] probably the first puppet run took that long
[20:52:03] sometimes it is like that, always when you ask
[20:52:11] okay, well
[20:52:23] if I hadn't made noise it would be hanging there all night
[20:52:31] yea :)
[20:53:25] fwiw, 4 hosts at a time was always my go-to number as well; split the terminal into 4 and it feels like you can still follow it, but it's not too slow either
[20:53:37] let's wire the reimage cookbook to automatically ask on IRC to make things faster? :D
[20:54:14] :D
[20:54:28] the part where the reimage cookbook claims it failed might be another story
[20:54:43] maybe it was only "failed to downtime", but maybe not
[20:56:26] kamila_: if you reimage while the host name is already assigned to a puppet role in site.pp, it might take long and fail unless the puppet code really works on the very first run, which isn't always the case - it might need 2 runs
[20:56:54] mutante: so my best bet would be to re-run the reimage, and then it actually has a chance of working this time?
[20:56:55] so if that's the case you could assign the "insetup" role, then reimage (quickly), and then assign the prod role and run puppet
[20:57:23] I am not sure what state the hosts are in
[20:57:31] are you changing the role from appserver to k8s?
[20:57:35] yes
[20:58:10] I think I would first move them just from appserver to the "insetup" role, then do the reimage, which should be failsafe
[20:58:22] then apply the new role on one host, run puppet and see what happens on the first run
[20:58:24] ok
[20:58:30] it might be that it works only after the second run
[20:58:51] thank you
[20:58:59] which doesn't matter once that second run has happened... but it does for the cookbook
[20:59:02] yw
[21:09:04] kamila_: a k8s worker does this whole calico networking setup that non-k8s servers don't get. that adds new network interfaces, and then it also has LVS, which adds all those lo: interfaces; that all happens on the first puppet run, as started by the cookbook, so I could imagine that's why it was down - maybe it restarted networking
[21:10:06] maybe... but it's a bit awkward because the same thing worked fine for the codfw hosts
[21:10:16] not the same HW, but still
[21:10:28] mw1377 did get the calico interfaces eventually, so:
[21:10:29] mw1377:/# ip a s | grep cali
[21:12:11] just saying maybe it's temporarily down during the initial run
[21:14:00] maybe... I am very confused now :D
[21:15:08] you will know more when you run puppet manually after changing the role, I think
[21:15:19] yeah
[21:15:22] or you can dig more into the cookbook logs
[21:15:59] well, I'll probably still be confused, given that I wanted to babysit these over dinner and suddenly it's 10pm '^^
[21:16:37] oh, yea, then don't do that tonight
[21:17:03] I would either do nothing and just downtime them with the downtime cookbook
[21:17:09] or do the reimage with insetup and then stop
[21:17:28] yeah, I'll give the reimage a try, and if not, then downtime it is...
[21:17:53] I just don't want to leave it in a state where it'll start alerting
[21:19:30] right! +1. the downtime cookbook should do it. assuming these are already set to inactive or completely removed from confctl, then deployers should not be affected either
[21:20:31] good to know, thank you
[21:20:51] so no renaming of the machines when they change role?
[21:22:02] yeah, renaming was deemed to be a silly amount of work considering that we have to do this for a lot of hosts
[21:22:47] gotcha
[21:26:39] 21:24:39 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-01-03-212406-publish (ran as mwdeploy@mw1377.eqiad.wmnet) returned [255]: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[21:26:39] @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
[21:26:39] @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[21:27:02] 21:26:46 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-01-03-212406-publish (ran as mwdeploy@mw1382.eqiad.wmnet) returned [255]: ssh: connect to host mw1382.eqiad.wmnet port 22: Connection timed out
[21:27:12] should those hosts have been depooled?
[21:36:21] mw1382 is pooled: inactive
[21:36:51] and in the kubernetes cluster
[21:38:12] maybe this is the part where not renaming actually created an extra step, heh
[21:38:48] I don't see it in another "scap / dsh group"
[21:41:08] deployment and reimage of an appserver might not be the best combo
[21:42:03] I did attempt to depool the hosts
[21:42:31] but I'm not sure what state they are in now
[21:43:16] yes, they are depooled. just needs a downtime
[21:43:16] sorry for the mess zabe, trying to fix it now...
[21:43:52] I think the list of k8s nodes to pull the images from is pulled via puppetdb
[21:43:57] ack, downtiming
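[For reference, a rough sketch of the "check the pooled state, then downtime" sequence discussed above, run from a cluster-management host. The confctl syntax follows common usage; the downtime cookbook's flag names are from memory and should be verified against its --help, so treat them as assumptions.]

```bash
# Show how conftool currently sees the host (pooled: yes / no / inactive).
sudo confctl select 'name=mw1382.eqiad.wmnet' get

# Mark it inactive so deployment tooling and LVS ignore it mid-reimage.
sudo confctl select 'name=mw1382.eqiad.wmnet' set/pooled=inactive

# Silence alerts for a few hours via the downtime cookbook
# (flag names are illustrative - confirm with `sudo cookbook sre.hosts.downtime --help`).
sudo cookbook sre.hosts.downtime --hours 8 -r "mid-reimage to k8s worker" mw1382.eqiad.wmnet
```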