[10:45:44] Stupid question - when doing an iDRAC update (on a system too old for the firmware update cookbook to work), which format should I be using? None of the offered formats seem obviously right (and the default is a Windows EXE)
[14:20:18] qq about registering a new k8s service under ingress. I'm following https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#Add_a_new_service_under_Ingress, and have checked that the service can be reached with https://superset-next.svc.eqiad.wmnet:30443/health (which is going to be the health check URL). Once I merge
[14:20:18] https://gerrit.wikimedia.org/r/c/operations/puppet/+/997857, I just want to confirm that I only have to run `sudo cumin 'O:lvs::balancer' 'run-puppet-agent'`, with no subsequent pybal restart
[14:28:10] brouberol: yep, no pybal restart with ingress
[14:28:41] thanks, so only running puppet on the LVS servers, or letting puppet run on its own then
[14:29:18] yeah, and when you go to state: production, a run on the alerting nodes is needed iirc
[14:30:50] and again, just to confirm, I can go from state: service_setup to state: production and skip state: lvs_setup altogether, right?
[14:31:59] why do you need to run puppet on lvs hosts if it's not touching lvs config?
[14:35:57] I think you're right. I might not need to run puppet on them at all. I'm just not 100% sure where the side effect of that config change will be reflected TBH
[14:37:17] hmmmmm
[14:37:58] the most obvious thing (to me) that the catalog entry is adding is the prometheus probes, but that only happens on state: production
[14:41:06] alright, thanks! so if it's not making any LVS changes, the change itself is pretty low-risk I take it
[14:41:37] (again, sorry if this is pretty basic, I just don't want to risk anything LVS-related on a friday)
[14:44:25] No, you're right to ask. taavi is right though, I misremembered, there's no need for a puppet run on the lvs servers for new services on an existing ingress
[17:28:30] oh man, it looks like I broke puppetmaster
[17:29:37] urandom: do you need help?
[17:29:38] I inadvertently ran rm -rf /var/lib/puppet/ssl
[17:29:44] cdanis: I think I do
[17:30:54] earlier I was fixing some broken hosts, I apparently ran the rollback script from T349619. I think everything there would be a no-op, except the rm -rf
[17:30:55] T349619: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619
[17:30:56] on which puppetmaster? 1001?
[17:30:59] 1001
[17:32:12] ok
[17:32:47] taavi: do you know what the fix is for this? I'd know what to do if it were any host other than the puppetmaster, but ...
[17:32:58] jhathaway: around?
[17:33:07] yup, happy to help as well
[17:33:13] /var/lib/puppet/ssl/ houses the client certs, so in theory the fix is to regenerate the client cert for 1001
[17:33:28] but given 1001 is the active CA host that might be a bit tricky
[17:33:33] right
[17:33:36] yeah
[17:33:49] chicken-egg
[17:34:57] Backed up on this host: var-lib-puppet-ssl
[17:35:03] does this mean we can just restore it from backups?
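Before anyone touches the broken directory, it is worth confirming that Bacula really does hold a recoverable copy. A minimal sketch of that check from bconsole on the backup director, assuming the client is registered as puppetmaster1001.eqiad.wmnet-fd (the name that appears in the daemon logs below); the commands themselves are stock bconsole:

    # on the backup director, open the Bacula console
    bconsole
    # confirm recent successful backup jobs exist for the affected client
    * list jobs client=puppetmaster1001.eqiad.wmnet-fd
    # interactively restore the most recent backup; files are staged under a
    # scratch path (here /var/tmp/bacula-restores/, as seen later in the log)
    # rather than being written back in place
    * restore client=puppetmaster1001.eqiad.wmnet-fd select current

The staging behaviour is what makes this safe to attempt mid-incident: nothing is overwritten until someone explicitly moves the restored files into place.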
[17:35:18] * taavi looks
[17:35:28] ooh nice, that would be great
[17:35:37] yeah, that would be a life-saver
[17:37:30] I'm attempting to do so right now, following https://wikitech.wikimedia.org/wiki/Bacula#Restore_(aka_Panic_mode)
[17:37:39] cdanis: I am already trying that
[17:37:57] haha, I was just about to ask so I didn't trample anyone
[17:38:01] oops
[17:38:08] I just started a restore before I saw this
[17:38:18] I saw "panic mode" and thought: hell yeah
[17:38:22] anyway it did not seem to work
[17:38:28] Feb 09 17:37:33 puppetmaster1001 bacula-fd[20219]: openssl.c:68 Error loading private key: ERR=error:0B080074:x509 certificate routines:X509_check_private_key:key values mismatch
[17:38:34] since all the other certs are in /var/lib/puppet/server/ssl, I think only puppet on puppetmaster1001 itself should be broken, I tried a buster host and there was no issue
[17:38:50] taavi: amusing
[17:38:59] don't we encrypt with the puppet keys?
[17:39:18] jhathaway: yeah, I broke it, but the hosts I was fixing were...fixed, and only then realized what I had done
[17:39:26] we do, but bacula keeps its own copy of the keys in /etc/bacula/ssl/
[17:39:45] https://wikitech.wikimedia.org/wiki/Bacula#Restore_from_a_non-existent_host_(missing_private_key)
[17:39:45] `PKI Master Key = "/var/lib/puppet/ssl/certs/ca.pem"` in bacula config references the now-broken directory
[17:39:47] as well
[17:40:53] so I think we have two options
[17:41:27] first is to re-construct that directory from the copies of the keys in /etc/bacula/ssl/
[17:41:41] second is to try to follow that restore-without-keys process
[17:42:13] would it be easier to reconstruct just enough of it for a standard bacula restore to work?
[17:42:22] sgtm
[17:42:33] should I try that?
[17:42:37] please
[17:42:46] sounds good, as a first attempt
[17:43:46] hold on, puppetmaster1001:/var/lib/puppet/ssl seems to exist
[17:43:54] what happened?
[17:44:05] I think the agent has run since it was deleted
[17:44:27] so it would have recreated the directory and contents, no?
[17:44:40] oh indeed, it might have re-generated it
[17:45:26] I don't think so from puppet.log on puppetmaster1001
[17:45:48] something recreated /var/lib/puppet/ssl/certs/ca.pem
[17:45:51] oh, actually, maybe
[17:45:56] because I was able to read it at one point
[17:46:12] it looks like a run at 15:58:33 recreated it?
[17:46:46] https://phabricator.wikimedia.org/P56596
[17:47:21] the run at 15:58 *also* overwrote /etc/rsyslog/ssl/server.key and /etc/bacula/ssl/server.key
[17:47:31] because of expose_agent_certs
[17:47:31] yeah
[17:47:37] ugh
[17:47:39] buh
[17:47:54] well, the good (bad) news is that the old key is in the log for puppet-agent-timer.service
[17:47:58] yes...
[17:48:18] ??
[17:48:37] so we can restore /etc/bacula/ssl/server.key based on that, and then restore the rest from bacula
[17:48:52] the files should be in the puppet clientbucket as well
[17:49:06] oh true
[17:49:09] I'll do that?
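The clientbucket rescue that follows is reproducible on any puppet agent: whenever the agent replaces a file, it first saves the old contents under /var/lib/puppet/clientbucket/, in a directory derived from the MD5 of those contents, alongside a paths file recording the original filename (the cp command below shows exactly this layout). A sketch of locating such a copy with nothing more than find and grep, assuming the default clientbucket location:

    # every bucketed file sits at .../<first 8 md5 chars, one per dir>/<md5>/,
    # containing 'contents' (the saved data) and 'paths' (where it came from)
    find /var/lib/puppet/clientbucket -name paths \
        -exec grep -l 'etc/bacula/ssl/server.key' {} +
    # pick the right match (mtime helps if there are several), then restore its
    # sibling 'contents' file; here $match is a paths file chosen from the
    # find output above
    cp "$(dirname "$match")/contents" /etc/bacula/ssl/server.key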
[17:49:40] we should disable the puppet timer first
[17:49:59] taavi: journalctl -u puppet-agent-timer.service | fgrep '/Stage[main]/Bacula::Client/Puppet::Expose_agent_certs[/etc/bacula]/File[/etc/bacula/ssl/server.key]/content) -' | fgrep -v -- '---' | cut -d- -f3
[17:50:03] stopped puppet-agent-timer.service
[17:50:30] taavi: if you know how to restore from the clientbucket go for it, I was just googling for the right command invocation
[17:50:40] we can restore from the puppet agent log, heh
[17:50:47] root@puppetmaster1001:/etc/bacula/ssl# cp /var/lib/puppet/clientbucket/b/e/1/f/7/d/0/0/be1f7d009a01adcf9969a571e5930f89/contents server.key
[17:50:51] aha
[17:50:54] even better
[17:51:04] nice
[17:51:57] Feb 09 17:51:37 puppetmaster1001 bacula-fd[26419]: Failed to load master key certificate from file /var/lib/puppet/ssl/certs/ca.pem for File daemon "puppetmaster1001.eqiad.wmnet-fd" in /etc/bacula/bacula-fd.conf.
[17:52:45] fixed
[17:53:45] taavi: like, fixed-fixed?
[17:53:52] fixed that cert error
[17:53:55] bacula is now running
[17:53:57] oh.
[17:54:48] I'll trigger a new restore now
[17:56:10] 552297 Restore 20 104.9 K OK 09-Feb-24 17:55 RestoreFiles
[17:56:53] root@puppetmaster1001:/var/lib/puppet# mv /var/tmp/bacula-restores/var/lib/puppet/ssl/ ssl
[17:57:08] cdanis: jhathaway: want to double-check everything looks good before I re-enable the agent timer?
[17:57:16] nod
[17:57:56] looks good, I would probably just do a manual run, then enable the timer
[17:58:02] +1
[17:58:15] a manual run will enable it anyway I think. but doing
[17:59:20] it seems to be working
[17:59:23] taavi: looks like your run 1) succeeded, 2) updated the rsyslog key, but 3) didn't update contents (but did tweak permissions) of the bacula key
[17:59:26] so that all seems great :)
[17:59:37] \o/
[17:59:45] nice work
[17:59:49] yeah
[17:59:51] and yeah it did re-enable the agent timer as well
[18:00:02] thanks taavi!
[18:00:07] indeed thanks taavi :)
[18:00:09] and jhathaway and cdanis ofc
[18:00:29] I owe ${beverages_of_choice} all around :)
[18:00:34] happy to help
[18:00:42] just a little bit of Friday afternoon fun :)
[18:00:49] ha
[18:01:25] cdanis: jhathaway: https://gerrit.wikimedia.org/r/c/operations/puppet/+/999964/
[18:01:39] I went to check that all the puppet failures I was fixing were all recovered, saw the puppetmaster failure, thought that was strange and had a look
[18:01:57] it was only then that I realized that I'd gotten terminals crossed at some point
[18:02:16] not a great move, worse for a friday
[18:02:48] urandom: about 20 years ago on some friday i was sitting at the console for the disk image server at the college computer lab i worked at
[18:03:22] and i was ssh'd into a random client workstation i was fiddling with on one terminal and had a root shell on the image server machine on another terminal
[18:03:34] and i ran the command to reimage with the workstation image
[18:03:44] oh damn.
[18:03:49] that was much more painful to clean up 😅
[18:05:51] LOL, we've all been there ;)
[18:06:16] * inflatador remembers rm -rfing a whole host due to the command crossing bind mounts
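On that last hazard: GNU coreutils rm ships a guard for exactly the bind-mount trap, worth wiring into any cleanup script. A small illustration (the path is hypothetical):

    # --one-file-system tells a recursive rm to skip any directory that sits on
    # a different filesystem from the command-line argument itself, so bind
    # mounts (and NFS mounts) nested under the target are left alone
    rm -rf --one-file-system /srv/scratch/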