[03:32:18] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter.service on db2202:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:32:18] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter.service on db2202:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:53:39] why are these going off every 4 hours again?
[08:28:00] this host was reimaged yesterday, I guess there was some issue with the silencing part
[08:30:34] (but also, didn't we get them moved to only every 24h?)
[08:31:18] T357333
[08:31:18] T357333: SystemdUnitFailed alerts are too noisy for data-persistence - https://phabricator.wikimedia.org/T357333
[08:33:23] I've asked there (since that ticket matches my memory that these were meant to be every 24h now)
[08:36:39] anyway, the good thing is it's an alert that pops up on alerts.wm.o, so I'll mute it for a week ( cc jynus :wav
[08:36:45] 👋 )
[08:37:16] {{done}}
[08:48:18] TY :)
[08:48:27] those need to be manually disabled, as the puppet logic cannot do it
[08:49:19] there is a race condition because of which that shouldn't run, but I think it is detected, then disabled, and then the autorestart keeps hanging
[08:50:21] I don't like those autorestart services that moritz set up (I understand they are needed, but they should be architected differently)
[09:34:24] arnaudb: are you going to work with db2107 or can I reimage it?
[09:34:34] let me check
[09:34:42] it is the old codfw s2 master
[09:35:10] yep, I have to clone it to db2207, but I guess I can switch with any other?
[09:35:30] is that host going to be decommissioned? (db2107)
[09:35:35] yes
[09:35:44] ok then, use it to clone the other one
[09:35:48] I won't reimage it then
[09:35:53] ok :)
[13:28:19] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1017057/comments/b5a25ff5_731b11ca
[13:29:54] Emperor: o/ I noticed two Puppet-CA-based certs in warning state for swift_{eqiad,codfw}; I didn't follow whether we already use PKI or not, just raising the alert in here for awareness
[13:30:24] elukey: sorry, I think I need more context
[13:31:09] (ms-fe swift uses sslcert for TLS)
[14:01:50] Emperor: back, sorry, I was in a meeting
[14:02:29] so IIUC swift.discovery.wmnet points to a proxy that uses a cert issued by the Puppet CA, almost surely with cergen
[14:02:56] and it is going to expire in a few days
[14:02:57] Not Before: Apr 15 14:35:21 2019 GMT
[14:02:57] Not After : Apr 14 14:35:21 2024 GMT
[14:03:22] the CN = swift_eqiad (or swift_codfw)
[14:03:43] there is an alert for it in alerts.w.o, this is why I pinged you
[14:04:52] ah yes, I see https://phabricator.wikimedia.org/T356412
[14:10:20] does the sslcert puppetry not handle renewals automatically?
[14:11:11] (this strikes me as the sort of thing that ought to happen automatically, but maybe I'm just too used to dehydrated from elsewhere & home)
[14:11:49] IIRC nothing that uses the Puppet CA renews automatically, this is why we are moving everything to the new PKI and cfssl :)
[14:12:23] but lemme check sslcert
[14:13:11] elukey: my understanding was the cfssl _also_ required manual refreshes (of intermediate certs)?
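The expiry check quoted above (the Not Before / Not After lines) can be reproduced against the live endpoint. A minimal Python sketch, assuming the frontend answers TLS on port 443 and that the `cryptography` package is available; chain verification is disabled here because the served cert is signed by the internal Puppet CA rather than a public root:

```python
import socket
import ssl
from datetime import datetime, timezone

from cryptography import x509  # assumption: the cryptography package is installed

# Host comes from the chat above; port 443 is an assumption for the TLS frontend.
HOST = "swift.discovery.wmnet"
PORT = 443

# Skip chain verification: the served cert is signed by the internal Puppet CA,
# which is not in this script's trust store.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

with socket.create_connection((HOST, PORT), timeout=5) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        der = tls.getpeercert(binary_form=True)

cert = x509.load_der_x509_certificate(der)
not_after = cert.not_valid_after.replace(tzinfo=timezone.utc)
days_left = (not_after - datetime.now(timezone.utc)).days
print(f"{cert.subject.rfc4514_string()}: not after {not_after:%Y-%m-%d %H:%M:%S} UTC ({days_left} days left)")
```

The printed subject should line up with the CN = swift_eqiad / swift_codfw mentioned above, and the validity window with the dates elukey pasted.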
[14:14:46] Emperor: puppet is able to contact the right PKI intermediate to renew a cert when needed (on bare metal), and on k8s this happens via specific controllers, but all automatically. We have some daemons like Kafka that, in our current version, are not able to pick up the new cert without a restart, so we have to schedule some downtime
[14:15:32] https://wikitech.wikimedia.org/wiki/PKI/CA_Operations#Renewing_a_new_intermediate is what I found when I last looked at cfssl
[14:16:35] I wasn't aware of that, TIL, but on clients we use the Root PKI CA cert to validate
[14:17:19] on Kafka/Cassandra/etc. the daemons are instructed to return the chained cert, so the client only needs to know the Root PKI to trust it
[14:17:47] (I see modules/secret/secrets/certificates/certificate.manifests.d/swift.certs.yaml in the puppet private repo, and the cert is issued by the puppet CA, so it is definitely manual)
[14:18:42] ah, that's from 2019, long before I started here
[14:20:10] elukey: so, um, what docs are there that tell me how to sort this out?
[14:22:34] since the failure modes seem quite bad if I get it wrong
[14:22:52] I think https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate but now that I think about it, it has been a while since I renewed one
[14:24:57] lemme check some things
[14:25:09] I'm going to make a phab task
[14:29:46] the only other big example that I found is https://phabricator.wikimedia.org/T276029
[14:31:12] the procedure to follow should probably be discussed with the broader SRE team, maybe in the #sre channel
[14:32:11] in theory clients do not check whether the Puppet-CA-based certs have been revoked by the puppet CA, so the procedure on wikitech should be sound
[14:32:19] but I don't think I have ever done it
[14:32:49] moving to pki is also an option, but it is surely more work
[14:33:41] elukey: AFAICT that would still leave us having to do a manual update at some point?
[14:33:50] ah yes yes, that for sure
[14:34:04] :sadpanda:
[14:34:35] another example: https://phabricator.wikimedia.org/T304237
[14:35:00] I've opened T361844
[14:35:01] T361844: Swift TLS certificates will expire soon (14 April) - https://phabricator.wikimedia.org/T361844
[14:35:42] we could also think about pki, but the clients will need to be instructed to trust the Root PKI as well if they don't do it already
[14:36:25] also I am not sure if swift reloads its certs automatically
[14:36:45] I'm happy in principle to think about doing something about T356412, but it seemed like it was rather work for the sake of it (i.e. I'm not sure it saves us much)
[14:36:46] T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412
[14:37:01] elukey: it's envoy that does the TLS termination
[14:37:13] ah right, right, sorry, there is a proxy
[14:37:26] I don't know if that does or not, but we have a rolling-reload cookbook in any case
[14:38:14] envoy should be able to reload
[14:38:39] ok if we move the conversation to #sre?
[14:39:09] sure
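The conversation ends on whether envoy actually serves the renewed cert after the rolling-reload cookbook runs. One way to check that on a frontend is Envoy's admin `/certs` endpoint, which lists every loaded certificate with its remaining validity. A rough Python sketch, to be run locally on a frontend; the admin address and port are assumptions, and the JSON field names should be double-checked against the Envoy version actually deployed:

```python
import json
import urllib.request

# Assumption: the local Envoy admin interface is reachable at this address;
# adjust the port to whatever the envoyproxy instance on the host listens on.
ADMIN = "http://127.0.0.1:9631"

# Envoy's admin /certs endpoint returns the loaded certificates together
# with their remaining validity, so a reload that picked up the renewed
# cert shows up as a longer days_until_expiration.
with urllib.request.urlopen(f"{ADMIN}/certs", timeout=5) as resp:
    data = json.load(resp)

for bundle in data.get("certificates", []):
    for cert in bundle.get("cert_chain", []):
        print(f'{cert.get("path")}: expires {cert.get("expiration_time")} '
              f'({cert.get("days_until_expiration")} days until expiration)')
```

If the reported expiration jumps forward after the rolling reload, envoy picked up the renewed cert without needing a full restart.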