[08:40:25] <godog>	 jayme: mmhh currently that knob is not exposed in service::catalog no, probably the easiest "fix" is to turn the probe into a tcp one instead
[08:41:19] <jayme>	 godog: but there is a knob? :)
[08:44:23] <godog>	 jayme: I misspoke, there is for prometheus::blackbox::check::http but not for service::catalog ATM
[08:44:56] <godog>	 the knob being auto-generating the alert(s) for the probe from puppet, as opposed to a generic alert in alerts.git
[08:45:34] <godog>	 setting auto renewal to say 10/12 days also would work of course
[08:48:09] <jayme>	 hmm....
[09:00:25] <jayme>	 ah, the probe is using the LVS IP, not the endpoints. I was wondering why the expiry time is flapping :D
[09:03:35] <godog>	 service catalog is indeed lvs based
[09:04:12] <jayme>	 yeah, makes sense mostly. Until it does not when the different hosts have different certs :)
[09:05:26] <godog>	 heheh working as intended
[09:10:48] <jayme>	 ah...I now see it
[09:11:09] <jayme>	 it's not even about the short lifetime of the certs, those are valid for a month
[09:14:41] <jayme>	 but the PKI code does not refresh them early enough for the alert to not trigger
[11:57:33] <moritzm>	 I'm currently doing a pass over hosts in insetup* roles to identify any Puppet5/Puppet7 gotchas (like insetup roles defaulting to Puppet 7 which then get a role applied which doesn't default to Puppet 5 yet, causing cert issues)
[11:58:07] <moritzm>	 for o11y there's just one thing:
[11:58:46] <moritzm>	 logging-hd2* are currently installed with the insetup::observability role which defaults to Puppet 7
[11:59:13] <moritzm>	 but we don't default to Puppet 7 in the general case yet and these will likely use a new role
[12:00:04] <moritzm>	 so let's already go ahead and create a stub role for them (using the base profiles) and mark it as configured for Puppet 7 on the Hiera role level
[12:00:20] <moritzm>	 then the later service build out will continue to use Puppet 7
[12:00:40] <moritzm>	 I can create a stub role, just tell me the name you prefer
[12:57:43] <godog>	 moritzm: thank you, I'd say insetup::observability::puppet5 or _puppet5 I'm not attached
[12:58:02] <godog>	 other than that I think we're fine to default insetup::observability to p7
[12:59:00] <moritzm>	 insetup::observability already uses Puppet 7
[12:59:20] <godog>	 indeed
[12:59:36] <moritzm>	 this is done to make sure that when the actual rol eventually gets applied we don't run into any issues, IOW to ensure that the later role gets configured for P7 as well
[13:01:05] <godog>	 yeah makes sense to me, thank you for reaching out!
[13:01:11] <godog>	 can't wait for the migration to be over
[13:05:56] <moritzm>	 It'll unfortunatetly drag on for a few more months, since we need to wait for all buster migrations to be completed
[13:06:08] <moritzm>	 any preference for the name of the new role?
[13:07:52] <godog>	 moritzm: insetup::observability::puppet5 SGTM
[13:08:23] <godog>	 ack re: buster, we're making good progress on the alerting hosts and I'm expecting early next Q to start reimaging
[13:09:57] <moritzm>	 no, you're missing my point: these hosts _are_ already using Puppet 7 by means of the insetup::observability role (which defaults to Puppet 7), we need to create a placeholder entry for the eventual logging-hd role to make sure they continue to use Puppet 7 when the actual service rampup happens
[13:13:28] <godog>	 moritzm: my bad, thank you for the explanation, since it is a placeholder please go with logging::opensearch::data::hd
[13:14:09] <moritzm>	 ack, thanks! I'll prepare something and will add you and Cole as reviewers
[13:16:28] <godog>	 cheers
[14:46:01] <andrewbogott>	 Hey kids!  I'm trying to throw together optional puppet-free workflows on cloud-vps and I'm ready for someone besides me to test. Would it be useful for me to enable a puppetless debian image in the pontoon project? Or somewhere else?
[14:48:09] <godog>	 andrewbogott: hey, yes please, the project is 'monitoring', thank you! I'm not sure when I'll be able to test though I'm subscribed to the "unmanaged instances" task
[14:49:04] <godog>	 that's https://phabricator.wikimedia.org/T326818
[14:49:06] <andrewbogott>	 godog: bookworm ok?
[14:49:13] <godog>	 andrewbogott: yeah totally
[14:49:28] <andrewbogott>	 oh boy, you've gotten a lot of emails from that task lately :)
[14:50:11] <godog>	 heheh indeed, thank you for working on that btw
[14:50:31] <andrewbogott>	 gotta run one more test and then I'll ping you again
[14:50:48] <godog>	 cheers
[14:55:51] <jayme>	 godog: coming back to the service::catalog probe thing from this morning: I could just not add the probes and instead include prometheus::blackbox::check::http to my profiles to more or less get the same, right?
[14:56:19] <godog>	 jayme: yes that's correct
[14:58:23] <andrewbogott>	 godog: ok, now you have a new available image in that project, 'debian-12.0-nopuppet'.  You'll want to explore the 'Key Pair' tab for access; keys selected there will be added to the 'debian' user.
[14:58:26] <andrewbogott>	 lmk how it goes!
[14:58:55] <godog>	 andrewbogott: lovely, thank you very much
[14:59:40] <andrewbogott>	 Hope it works!
[15:00:49] <jayme>	 godog: ok. I think I'll do that then
[15:38:43] <godog>	 jayme: yeah that's fair, especially for kinda of a corner case wrt cert expiration
[15:40:51] <jayme>	 it also solves my problem of not checking the servers individually
[15:41:09] <godog>	 indeed that too
[15:41:14] <jayme>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/982819 - if you have a minute
[15:41:28] <jayme>	 pcc is still running
[15:45:18] <godog>	 jayme: as I'm reviewing the patch I realized this is more or less equivalent to "prometheus can / can't scrape the apiserver(s)" isn't it ?
[15:45:35] <jayme>	 hum
[15:45:42] <jayme>	 yeah...I suppose
[15:46:01] <jayme>	 is there a way to check if that fired in the past?
[15:46:06] <jayme>	 https://phabricator.wikimedia.org/T353233
[15:46:39] <jayme>	 or maybe it's not...no
[15:47:01] <jayme>	 prometheus can't scrape for other reasons (like it's client-cert not being refreshed)
[15:47:14] <jayme>	 but that does not mean the apiserver is in trouble
[15:48:04] <godog>	 that's true yeah, FWIW next Q we'll likely be upgrading prometheus so that problem should be fixed
[15:48:16] <godog>	 anyways, patch LGTM
[15:48:19] <jayme>	 🎉
[15:50:43] <godog>	 jayme: something else that occurred to me, you can customise things like runbook, dashboard, etc
[15:50:50] <godog>	 all for later tho
[15:51:11] <jayme>	 I would have, if only I'd had a proper runbook 😇
[15:52:02] <jayme>	 updated the patch setting severity to critical only for now, thanks!
[15:52:45] <godog>	 neato, +1
[15:55:33] <jayme>	 let's find out :-p
[16:06:22] <jayme>	 godog: ran puppet on a apiserver and prometheus1005 afterwards but I don't see a config change regarding that
[16:08:06] <jayme>	 oh and..does the resource name need to be unique in whole puppet? (as it is exported)
[16:08:22] <godog>	 that is a good question, I don't know
[16:08:54] <jayme>	 eheh
[16:09:06] <godog>	 re: prometheus1005 not updating, a bit of a shot in the dark but could you try again?
[16:09:18] <jayme>	 lol, sure :)
[16:09:24] <jayme>	 should I reboot it before? :p
[16:09:36] <godog>	 hahah pls no
[16:10:20] <godog>	 to explain: how much time did pass between the apiserver puppet run and the prometheus one? I don't know but wonder how atomic puppetdb updates are
[16:11:45] <taavi>	 jayme: godog: I suspect the k8s prometheus instance is missing the config to import blackbox stuff from puppetdb
[16:12:19] <godog>	 taavi: ah that'd be much more reasonable than my guess
[16:13:17] <jayme>	 that is very plausible :)
[16:13:41] <jayme>	 which rises the question: should this be in ops?
[16:13:52] <jayme>	 I just put it into k8s prometheus instances because I can...
[16:13:59] <taavi>	 yeah prometheus::blackbox::import_checks is in ops and tools only
[16:15:19] <godog>	 jayme: yeah ops SGTM
[16:15:44] <godog>	 taavi: ack, I'll make a note to validate that via puppet types on 'prometheus_instance'
[16:16:06] <jayme>	 ack, changing to "ops"
[16:18:46] <godog>	 jayme: feel free to self merge
[16:19:08] <jayme>	 wilco
[16:24:37] <godog>	 jayme: I have to go shortly and will check back later too
[16:25:42] <jayme>	 ack. I'll leave a message on status
[16:41:39] <inflatador>	 We're getting PuppetFailure alerts for elastic1107 , which is in insetup...is this expected? Don't think I've seen that before
[16:42:58] <jayme>	 godog: "works" now but it seems like all apiservers merged together module wise - which makes sense in a way. But with different parameters (certfificate_expiry_days) it probably depends on the order of resoureces as to which one "wins"
[16:52:47] <jayme>	 ah, not true - everything gets pretty mixed up aparently
[16:53:06] <jayme>	 target=https://[10.192.16.48]:6443/readyz msg="Error for HTTP request" err="Get \"https://10.192.16.48:6443/readyz\": x509: certificate is valid for kubemaster2002, kubemaster2002.codfw.wmnet, kubemaster.svc.codfw.wmnet, kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster, kubernetes.default.svc.cluster.local, kube-apiserver, not ml-staging-ctrl2001.codfw.wmnet"
[17:06:47] <jayme>	 so basically "last defined resource wins" there as well (blackbox module that is)
[17:08:11] <jayme>	 def. good call on not making this paging initialls :D
[17:08:16] <jayme>	 *initially
[17:31:06] <jayme>	 all good now with https://gerrit.wikimedia.org/r/c/operations/puppet/+/982858 merged
[17:33:25] <jayme>	 apart fromt he fact that I messed up the certificate_expiry_days... 🤦
[17:45:10] <jayme>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/982889 - not merging this today. Was messy enough already :)
[17:47:33] <inflatador>	 I'm getting PuppetFailed alerts on a host that's inservice (elastic1107)...is this expected?
[17:48:40] <inflatador>	 err...insetup that is
[19:47:48] <cwhite>	 inflatador: puppetboard might have some more info about what's happening on that host: https://puppetboard.wikimedia.org/node/elastic1107.eqiad.wmnet
[19:50:00] <inflatador>	 cwhite Gotcha. This is a new host that's still in with DC Ops and is insetup, so I wouldn't expect it to alert. There are a few other hosts (elastic1104-06) in the same state that don't seem to be alerting
[19:50:23] <cwhite>	 seems some ssl issue `SSL_read: sslv3 alert certificate unknown`  - I know there's some puppet upgrade work going on; might get more detailed info from Infrastructure Foundations team
[19:50:33] <inflatador>	 Should I just set the host hiera variable that suppresses all alerts in the meantime?
[19:54:07] <cwhite>	 If you think it's reasonable to suppress the alerts for elastic1107, I've no issue with that.
[19:56:37] <inflatador>	 I'll suppress in AM for now, I might forget if I do it in puppet. Mainly I was curious about the insetup role and its effect on alerts. Prior to this, I had assumed alerts did not fire on insetup hosts
[20:04:48] <cwhite>	 I don't see anything on the Alertmanager WT page about role::insetup and silenced alerts.  It seems notifications are disabled automatically in icinga, though.
[20:06:32] <inflatador>	 Ah, thanks for clearing that up. Sorry for not phrasing my question clearly
[20:13:14] <cwhite>	 No problem, it's unclear to me as well if that behavior was replicated to alertmanager.  I'd suspect not, but maybe g.odog will be able to confirm for us later. :)