[06:36:31] there have been lots of issues with puppet overnight according to -operations, right?
[06:40:00] https://phabricator.wikimedia.org/P53069
[07:43:33] <_joe_> looks like someone didn't do their homework
[07:44:02] <_joe_> marostegui: yes all debmonitor things fail on all hosts AFAICT?
[07:44:13] <_joe_> but it seems it goes deeper
[07:45:18] <_joe_> yes seems impossible to renew certs with the pki?
[07:45:53] <_joe_> I hope it's not a general failure or soon enough we'll see applications have expired certs
[07:52:11] <_joe_> but yeah, puppet is failing everywhere
[07:53:30] <_joe_> I would guess this issue is tied to https://gerrit.wikimedia.org/r/c/operations/puppet/+/969937
[07:53:52] <_joe_> jbond: ^^ I think switching pki to puppet7 might have broken it
[07:58:05] <_joe_> marostegui: I think we should probably page, basically we're all blocked on any work that requires puppet to work
[07:58:41] _joe_: do you want me to page john?
[07:58:51] <_joe_> marostegui: yes
[07:58:55] ok
[07:58:56] one sec
[07:58:58] <_joe_> tbh i'm not sure what's wrong
[07:59:12] <_joe_> I'm trying to understand what is not ok with the PKI
[08:00:21] <_joe_> but I see things like
[08:00:23] <_joe_> Oct 30 18:44:41 pki1001 cfssl-ocsprefresh[812929]: ERROR:root:debmonitor issue with SQL query: (2003, "Can't connect to MySQL server on 'm1-master.eqiad.wmnet' ([SSL: CERTIFICATE_VERIFY_FAILED] certificate veri>
[08:00:31] <_joe_> which are *quite worrisome*
[08:00:40] <_joe_> ahhh I think I know what the problem is
[08:00:53] <_joe_> probably the m1-master cert is still the old puppet CA
[08:01:05] <_joe_> and for some reason it might be not installed on puppet7?
[08:01:11] <_joe_> I'm grasping at straws here
[08:01:30] I've sent him a splunk page
[08:01:37] <_joe_> but yeah timing coincides with the merging of that change
[08:01:59] what if we just revert?
[08:02:32] <_joe_> marostegui: not sure it's enough, trying to understand that
[08:02:43] <_joe_> marostegui: what's the TLS port on m1master ?
[08:03:53] <_joe_> to be clear we will need to keep the old puppet CA around for a good chunk of the foreseeable future and until then we need it on every server
[08:04:59] TLS port?
[08:05:30] I am going to try a normal page to john
[08:06:43] done
[08:07:22] I also wonder if a manual incident doesn't actually page people (even if it says it does)
[08:08:30] most of the puppet failures seem related to debmonitor so far right marostegui ?
[08:08:35] I see stuff like "Post \"https://pki.discovery.wmnet:443/api/v1/cfssl/authsign\": x509: issuer name does not match subject from issuing certificate"
[08:10:53] <_joe_> elukey: the problem is that it seems all of the pki infra has certificate problems of various kinds
[08:11:16] <_joe_> so I think the ocsp stuff is failing because the mysql client doesn't find the puppet CA in ca-certificates.crt
[08:11:21] <_joe_> anymore
[08:11:50] <_joe_> sadly I have no idea if it's possible to just revert the move to puppet7 as-is
[08:12:49] <_joe_> frankly, I had other stuff to work on
[08:13:32] <_joe_> so I think I diagnosed the issue
[08:13:41] <_joe_> do we have a task?
[08:14:34] I don't think so
[08:16:22] <_joe_> ok let me create it
[08:17:19] let us know how we can help, I haven't debugged anything atm but I can work on it
[08:19:04] the rollback procedure of a role migrated to puppet7 is described in the task description here T349619
[08:19:04] T349619: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619
[08:21:09] <_joe_> volans: ok, can your team take care of it? thanks.
[08:21:12] <_joe_> https://phabricator.wikimedia.org/T350111 is the task
[08:21:32] I'm still reading backlog...
[08:21:38] thanks for the task
[08:22:19] i think in addition to the db connection issue identified above, the cfssl clients are also failing when trying to validate the cert used by the cfssl server
[08:22:51] <_joe_> taavi: which might be related to ocsp refresh failing
[08:23:04] <_joe_> that's my point basically, it all seems tied
[08:23:10] yeah, true
[08:23:21] <_joe_> now I'm tempted to just copy over the puppet CA to /etc/ssl/certs
[08:23:30] <_joe_> and run update-ca-certificates
[08:23:40] <_joe_> and see if that's enough to fix the issue, as I suspect
[08:24:27] IIRC both CAs should be there, let me check a few things
[08:24:33] also I'm no expert in the above changes
[08:24:57] hi _joe_ just got message looking now
[08:24:58] pki1001:/etc/cfssl/db.conf somehow has 'tls=skip-verify' in the connection string
[08:25:00] <_joe_> volans: the CA is still there as part of "our" bundle but not the default one
[08:25:30] <_joe_> but yeah a rollback to the old puppet should solve the issue
[08:26:16] ah, /etc/cfssl/db.conf.json does specify /var/lib/puppet/ssl/certs/ca.pem explicitly. I think that should be using /etc/ssl/certs/wmf-ca-certificates.crt instead?
[08:28:03] ok im just going to do the roll back of the switch. sorry i thought i had checked a renew on both installations
[08:28:18] jbond: ack lmk if I can help
[08:28:30] <_joe_> taavi: probably
[08:28:49] it seems an easy test to do before rolling back completely no?
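Editor's note: the hypothesis discussed between 08:11 and 08:26 is that the old puppet CA is no longer in the bundle a client reads. A minimal Python sketch of how one might check that is given here, assuming both files are plain PEM. The helper names `pem_blocks` and `bundle_contains` are hypothetical and are not part of any tooling mentioned in the log. The check only tells you whether the exact same certificate is in the bundle; it says nothing about which bundle a given client actually loads.

```python
import re
import ssl

# Matches one PEM-encoded certificate, including its BEGIN/END markers.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def pem_blocks(text):
    """Return every PEM certificate block found in text (hypothetical helper)."""
    return PEM_BLOCK.findall(text)


def bundle_contains(ca_pem, bundle_pem):
    """Return True if the first certificate in ca_pem also appears in bundle_pem.

    Certificates are compared on their decoded DER bytes, so differences in
    line wrapping or surrounding comments in the bundle do not matter.
    """
    ca = pem_blocks(ca_pem)
    if not ca:
        raise ValueError("no certificate found in ca_pem")
    wanted = ssl.PEM_cert_to_DER_cert(ca[0])
    return any(
        ssl.PEM_cert_to_DER_cert(block) == wanted
        for block in pem_blocks(bundle_pem)
    )
```

On a host, one could read the two paths quoted in the log (for example `/var/lib/puppet/ssl/certs/ca.pem` and `/etc/ssl/certs/ca-certificates.crt`) with `open(path).read()` and pass the contents to `bundle_contains`. Whether those are the right files to compare on any particular host is an assumption.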
[08:29:10] <_joe_> +1
[08:29:16] jbond: --^
[08:29:17] <_joe_> taavi: I'm pretty sure that's it
[08:29:27] <_joe_> or at least one piece of the puzzle
[08:29:43] ack ill test
[08:29:49] <_joe_> jbond: so if it worked before, I'd bet the problem is indeed due to the failure to refresh ocsp changes
[08:29:56] https://gerrit.wikimedia.org/r/c/operations/puppet/+/970267/
[08:31:27] yes could be that the .json file is used by the python script
[08:36:01] * jbond deployed change starting all timers now
[08:46:57] ok we are hitting https://github.com/ikapelyukhin/go-x509-issuer-name-does-not-match-subject, we hit this before and thought we solved it, not sure what part is causing issues now. im going to depool pki2001 and rollback pki1001
[08:47:00] I am just catching up, but why did it start happening at 3 am ?
[08:47:13] is it an expiration issue?
[08:48:39] <_joe_> yes
[08:52:36] jbond: did you ever get the incident page I sent you via splunk? or it never worked?
[08:55:01] marostegui: yes i got it at ~8:20
[08:55:34] can you ack it on splunk?
[08:55:37] or resolve even
[08:56:42] ahh yes sorry
[09:09:07] ok now with the roll back of pki1001 and running run-puppet-agent --failed-only, everything is looking healthy again (cc joe jynus hnowlan )
[09:10:49] im going to grab a shower and coffee then try to fix the root cause
[09:11:43] <_joe_> ack thanks
[09:12:31] thanks
[09:13:02] later (not now) it would be nice to write up the full explanation for those of us that arrived late
[09:13:36] as it will be useful for later work and other potential incidents
[09:13:45] on the ticket
[09:14:48] marostegui: I am guessing you didn't do the scheduled s4 maintenance ?
[09:38:53] jynus: i have done a brief write up https://phabricator.wikimedia.org/T350118 and will continue the investigation on that task
[09:39:47] thank you!
[09:42:37] np
[09:56:18] jynus: it was done, why?
[09:59:06] oh, was watching for impact, and thought you may have postponed it because of the puppet failures. All good then
[09:59:25] as in, good it didn't affect you
[09:59:33] jynus: no, we didn't, the only thing that puppet runs is pt-heartbeat, but the script also brings it up, so it was all good :)
[10:02:52] marostegui: I am happy that terrible script keeps being useful to you so many years later!
[10:03:08] :-D
[11:27:51] jbond: I skipped your patch in labs/private that needs merging
[11:28:28] taavi: ahh thanks for the reminder :)
[13:15:11] jynus: hnowlan: i think i have fixed the root cause of the pki issue. im going to go grab food shortly but when im back i plan to add pki2002 back into service. ill aim for about 14:30
[13:15:21] UTC
[13:23:15] jbond: I spotted a new issue with puppet7 tls validation: https://phabricator.wikimedia.org/T350147
[13:25:16] jbond: ack
[13:29:05] thanks taavi, looking
[14:47:21] hnowlan: jynus: bringing pki2002 back now
[14:47:31] ack
[14:52:29] going for lunch, but I will be seconds away from the terminal
[17:13:00] hello on-callers
[17:13:20] as FYI I just deployed a new version of change propagation in eqiad
[17:13:27] task is https://phabricator.wikimedia.org/T348950
[17:13:48] so far it looks good, if you see any issue please rollback https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/969758
[17:14:15] (high backlog etc..)
[17:26:55] elukey: thanks!