[05:27:58] Release-Engineering-Team (Deployment Training Requests): Deployment training request for JKieserman - https://phabricator.wikimedia.org/T296024 (Ladsgroup) Noted. I will be there unless issues happen (sick, etc.)
[06:36:35] Project beta-scap-sync-world build #27991: FAILURE in 2 min 10 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27991/
[06:46:35] Project beta-scap-sync-world build #27992: STILL FAILING in 2 min 10 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27992/
[06:56:40] Project beta-scap-sync-world build #27993: STILL FAILING in 2 min 15 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27993/
[07:06:32] Project beta-scap-sync-world build #27994: STILL FAILING in 2 min 7 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27994/
[07:16:41] Project beta-scap-sync-world build #27995: STILL FAILING in 2 min 14 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27995/
[07:26:33] Project beta-scap-sync-world build #27996: STILL FAILING in 2 min 9 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27996/
[07:36:42] Project beta-scap-sync-world build #27997: STILL FAILING in 2 min 16 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27997/
[07:46:40] Project beta-scap-sync-world build #27998: STILL FAILING in 2 min 13 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27998/
[07:56:30] Project beta-scap-sync-world build #27999: STILL FAILING in 2 min 5 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27999/
[08:06:39] Project beta-scap-sync-world build #28000: STILL FAILING in 2 min 12 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/28000/
[08:11:27] Fatal error: Uncaught ConfigException: Failed to load configuration from etcd: (curl error: 60) SSL peer certificate or SSH remote key was not OK in /srv/mediawiki-staging/php-master/includes/config/EtcdConfig.php:205
[08:11:49] Last worked 06:24
[08:15:37] Can't see any config / mediawiki changes so possibly a beta issue
[08:16:37] Project beta-scap-sync-world build #28001: STILL FAILING in 2 min 11 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/28001/
[08:17:30] Puppet is failing on 16 hosts
[08:26:43] Project beta-scap-sync-world build #28002: STILL FAILING in 2 min 16 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/28002/
[08:35:24] Project beta-scap-sync-world build #28003: ABORTED in 59 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/28003/
[08:36:34] !log disabled beta-scap-sync-world
[08:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[08:39:58] thcipriani: is it worth a task for someone to pick up or are you looking?
[08:41:09] RhinosF1: good call as usual :) I believe this has something to do with recent changes to cdb generation in sync master. Lemme find the change and I'll file that task.
[08:42:10] for now, my log message was mostly about stopping the noise
[08:43:41] Ok
[08:44:16] Yeah noise isn't helpful
[08:48:52] Release-Engineering-Team, Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (thcipriani)
[08:50:23] Beta-Cluster-Infrastructure, Release-Engineering-Team, Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (thcipriani)
[08:51:11] hrm, while annoying, it does seem like scap was "successful" except at keeping the other deployment host in sync
[08:52:18] Ah
[08:58:32] thcipriani: "just" rebuilding the i18n cache failed because of some cert issues
[09:00:19] so, puppet at etcd fails...
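Curl error 60 means the peer presented a certificate that could not be validated against the client's trusted CA, which is exactly what happens when the trusted CA file gets swapped out from under a service. A hedged, self-contained sketch (not taken from the incident; all names and paths below are throwaway) that reproduces this class of failure with two local CAs, one standing in for the Cloud VPS puppet CA and one for the production CA:

```shell
# Sketch only: reproduce the "SSL peer certificate ... was not OK"
# (curl error 60) failure mode with two throwaway CAs.
set -e
dir=$(mktemp -d)
# CA 1: stand-in for the project-local (Cloud VPS) puppet CA.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -keyout "$dir/cloud-ca.key" -out "$dir/cloud-ca.pem" \
    -subj "/CN=Cloud CA" 2>/dev/null
# CA 2: stand-in for the production CA shipped by wmf-certificates.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -keyout "$dir/prod-ca.key" -out "$dir/prod-ca.pem" \
    -subj "/CN=Prod CA" 2>/dev/null
# Server certificate (hypothetical host name) signed by the cloud CA.
openssl req -newkey rsa:2048 -nodes \
    -keyout "$dir/etcd.key" -out "$dir/etcd.csr" \
    -subj "/CN=deployment-etcd.example" 2>/dev/null
openssl x509 -req -in "$dir/etcd.csr" -days 1 \
    -CA "$dir/cloud-ca.pem" -CAkey "$dir/cloud-ca.key" \
    -CAcreateserial -out "$dir/etcd.pem" 2>/dev/null
# Verification against the signing CA succeeds...
good=$(openssl verify -CAfile "$dir/cloud-ca.pem" "$dir/etcd.pem")
# ...but against the wrong CA it fails, which a TLS client such as
# curl surfaces as error 60.
bad=0
openssl verify -CAfile "$dir/prod-ca.pem" "$dir/etcd.pem" \
    >/dev/null 2>&1 || bad=$?
echo "good: $good"
echo "bad exit: $bad"
```

The etcd server itself was fine here; only the client-side trust anchor had changed, which is why the service kept "working" for anything not re-validating TLS.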
[09:07:33] ...puppet at all of beta fails
[09:07:52] Beta-Cluster-Infrastructure, Release-Engineering-Team, Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (Urbanecm) I want to note that puppet apparently fails at the etcd host. ` urbanecm@deployment-etcd02:~$ sudo run-pu...
[09:09:04] sigh
[09:10:05] alright...I guess--given it's like 2am--it's less disruptive to let beta-scap-sync-world continue to make noise and look into cert issues when I have time :(
[09:10:22] I'd leave it disabled honestly
[09:10:57] Just leave it disabled
[09:10:58] I have no strong opinions, so it doesn't take much to convince me in this case :)
[09:11:07] `Caused by: org.postgresql.util.PSQLException: SSL error: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target` is the best I can find right now
[09:11:12] (at puppetdb)
[09:11:55] that might be related to elukey's recent CA work
[09:12:08] might be
[09:12:16] I'd bet if we could get puppet to run it'd take care of itself
[09:12:16] it's like an hour or two old
[09:12:22] I'll have a look later, not at my laptop at the moment
[09:12:44] <3 majavah
[09:13:28] * majavah points towards the general direction of T215217
[09:14:31] I swear this is on my mind.
[09:14:40] thcipriani: pretty sure it would fix it
[09:17:02] urbanecm: might be worth pasting the puppetdb error on the task
[09:17:14] yeah, i'm just looking to see if I can find more
[09:17:28] Beta-Cluster-Infrastructure, Release-Engineering-Team, Scap, Infrastructure-Foundations, Puppet: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (RhinosF1)
[09:18:36] FWIW, https://goals.releng.team are my Big Goals™. Then trying to figure out something better than "majavah takes care of it" for beta is at the top of my list.
[09:20:34] might https://gerrit.wikimedia.org/r/c/operations/puppet/+/739266 be related?
[09:20:56] urbanecm: sounds sus
[09:22:22] Beta-Cluster-Infrastructure, Release-Engineering-Team, Scap, Infrastructure-Foundations, Puppet: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (Urbanecm) `Caused by: org.postgresql.util.PSQLException: SSL error: PKIX...
[09:22:24] That's 4 days old though
[09:22:31] So why did it only start failing 3 hours ago?
[09:22:36] not sure
[09:25:17] Beta-Cluster-Infrastructure, Release-Engineering-Team, Scap, Infrastructure-Foundations, Puppet: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (Urbanecm) ` Nov 20 07:16:02 deployment-puppetdb03 puppet-agent[18152]: Lo...
[09:25:34] might be a red herring. There's a 500 from the beta puppet server that's a bit inscrutable to me...
[09:25:57] I'm on my laptop now, let's see
[09:26:39] so we have multiple systems (etcd, puppetdb) relying on the puppet CA failing at the same time
[09:26:55] * urbanecm is disembarking the train soon
[09:26:58] so i'll leave it to you :)
[09:27:00] Beta-Cluster-Infrastructure, Release-Engineering-Team, Scap, Infrastructure-Foundations, Puppet: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (thcipriani) The puppet failure on the second deployment host is: ` thcip...
[09:31:00] I think puppetdb is having certificate issues connecting to its postgres database on the same host
[09:32:14] and postgres uses the puppet host certs
[09:38:48] umm, the puppet log urbanecm pasted to the task does not make any sense
[09:38:54] Nov 20 07:16:09 deployment-puppetdb03 puppet-agent[18152]: (/Stage[main]/Sslcert::Trusted_ca/File[/etc/ssl/localcerts/WMF_TEST_CA.pem]/ensure) ensure changed 'file' to 'link'
[09:39:13] um, hold on
[09:39:16] trusted_ca?
[09:40:22] majavah: i swear i didn't make it up!
[09:40:58] btw.. I just realized what's going on
[09:41:49] Yes?
[09:41:52] * urbanecm is curious
[09:42:44] so, as part of the mw-on-k8s work, serviceops have made a Debian package called wmf-certificates containing the internal CAs (cfssl and puppet) so that they could be installed in docker containers too
[09:42:58] that package is installed on all servers via puppet too
[09:43:25] ...so we started to suddenly use prod certs from the package? That...can't be right, can it?
[09:43:38] and as part of today's unattended-upgrades run, it overwrote the puppet CA with prod's one
[09:44:07] Assuming we have the old CA somewhere, it shouldn't be terribly hard to fix
[09:44:09] https://phabricator.wikimedia.org/P17782
[09:44:35] ... why has this not broken more things
[09:44:37] This will slowly break mostly everything, at least in beta
[09:44:53] I'm more worried about things like toolforge
[09:44:53] And I _hope_ it got installed only in beta (and not cloudvps-wide)
[09:45:01] * majavah looks
[09:49:20] wmcs-wide cumins take a while sadly
[09:52:12] it has overwritten /etc/ssl/certs/Puppet_Internal_CA.pem on all wmcs instances except toolforge
[09:52:40] that sounds bad
[09:52:53] but why not toolforge
[09:53:46] because I haven't got around to fixing T290494 yet
[09:53:47] T290494: Revisit Toolforge automated package updates and version pinnings - https://phabricator.wikimedia.org/T290494
[09:53:54] ah
[09:54:38] majavah: can you update the ticket with the findings and/or create a new one? This is like an UBN for cloud
[09:55:11] Doing
[09:56:47] Thanks
[09:57:40] i left a message in #wikimedia-cloud-admin
[09:58:28] T296127
[09:58:29] T296127: /etc/ssl/certs/Puppet_Internal_CA.pem overridden to production certs on cloud vps - https://phabricator.wikimedia.org/T296127
[10:00:09] in theory I could do a cloud-wide cumin for `cp /var/lib/puppet/ssl/certs/ca.pem /etc/ssl/certs/Puppet_Internal_CA.pem`, but I'm not going to do that without feedback from others who know what they're doing
[10:01:43] majavah: who would that be?
[10:01:53] !log root@deployment-puppetdb03:~# cp /var/lib/puppet/ssl/certs/ca.pem /etc/ssl/certs/Puppet_Internal_CA.pem && systemctl restart puppetdb.service # T296125
[10:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[10:01:57] T296125: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125
[10:02:12] and what's the best way to get them on a Saturday
[10:02:32] (also we need to make sure it doesn't get overwritten again)
[10:04:19] majavah: do we have an idea of which patch exactly was the breaking one?
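The manual fix logged above for deployment-puppetdb03 (detect the package-installed symlink, put the agent's own CA copy back) can be sketched as follows. This is a hedged illustration, demonstrated against a throwaway fake root so it is safe to run anywhere; the paths mirror the ones in the log, but the file contents are placeholders:

```shell
# Sketch of the detect-and-repair step, run against a fake root.
root=$(mktemp -d)
mkdir -p "$root/etc/ssl/certs" \
         "$root/usr/share/ca-certificates/wikimedia" \
         "$root/var/lib/puppet/ssl/certs"
echo "prod CA (placeholder)" > "$root/usr/share/ca-certificates/wikimedia/Puppet_Internal_CA.crt"
echo "local CA (placeholder)" > "$root/var/lib/puppet/ssl/certs/ca.pem"
# Broken state: the package left a symlink pointing at the prod CA.
ln -s "$root/usr/share/ca-certificates/wikimedia/Puppet_Internal_CA.crt" \
      "$root/etc/ssl/certs/Puppet_Internal_CA.pem"

ca="$root/etc/ssl/certs/Puppet_Internal_CA.pem"
if [ -L "$ca" ] && readlink "$ca" | grep -q '/usr/share/ca-certificates/wikimedia/'; then
    # Repair as in the SAL entry: restore the puppet agent's own CA
    # copy. (On the real host this was followed by restarting the
    # dependent service, puppetdb.) Remove the symlink first so cp
    # does not write through it into the package's file.
    rm "$ca"
    cp "$root/var/lib/puppet/ssl/certs/ca.pem" "$ca"
fi
result=$(cat "$ca")
echo "$result"
```

The symlink check matters because, as noted later in the log, a plain `cp` would otherwise dereference the link and clobber the package-owned file instead of replacing the link.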
[10:04:25] i can attempt to get someone to revert it
[10:05:13] urbanecm: it wasn't a puppet patch, it was a release of the wmf-certificates debian package being uploaded to apt.wm.o
[10:05:28] but something had to install it?
[10:05:32] hi folks, just seen the pings
[10:05:43] I believe that the change majavah refers to is https://gerrit.wikimedia.org/r/c/operations/debs/wmf-certificates/+/740119
[10:05:54] hey elukey!
[10:06:02] o/
[10:06:11] urbanecm: yes, that's done by the unattended-upgrades cronjob
[10:06:20] elukey@an-test-client1001:~$ ls -l /etc/ssl/certs/Puppet_Internal_CA.pem
[10:06:23] lrwxrwxrwx 1 root root 59 Nov 19 14:22 /etc/ssl/certs/Puppet_Internal_CA.pem -> /usr/share/ca-certificates/wikimedia/Puppet_Internal_CA.crt
[10:06:23] hey elukey, turns out it wasn't your change that broke things, sorry for the ping
[10:06:26] yeah
[10:07:13] majavah: no problem, thanks as always for spending time on this :) I was working with Jaime yesterday to release the new package, so I am kinda responsible as well :D I have to say that nobody really expected this problem
[10:07:25] that isn't replicated to https://github.com/wikimedia/operations-debs-wmf-certificates fwiw
[10:08:24] toolforge's k8s cluster uses puppet certs for etcd traffic, thankfully that seems unaffected
[10:08:48] for now majavah :D
[10:08:53] https://gerrit.wikimedia.org/g/operations/puppet/+/24cc1258080076d140eaf705b26f0f8ac63563c0/modules/profile/manifests/wmcs/kubeadm/control.pp#27
[10:10:40] * RhinosF1 off
[10:12:13] the list of things relying on that certificate file is https://codesearch.wmcloud.org/search/?q=%2Fetc%2Fssl%2Fcerts%2FPuppet_Internal_CA.pem&i=nope&files=&excludeFiles=&repos=
[10:15:20] majavah: an alternative to the cp could be a simple ln -s, basically flipping the links via cumin to the right location
[10:17:16] but anything with /etc/ssl/certs/Puppet_Internal_CA.pem -> /usr/share/ca-certificates/wikimedia/Puppet_Internal_CA.crt is clearly broken in cloud
[10:17:41] mm.. I don't think I want to be doing that cloud-wide on a Saturday if I can avoid it
[10:18:51] I can try to ping the cloud folks, this seems to be a big problem
[10:19:10] I totally agree that you shouldn't be the one responsible for fixing this
[10:20:04] majavah: it may not be all of cloud though, I suspect only hosts using the profile::base blabla that installs wmf-certificates
[10:20:45] so a cumin command like `cumin "R:package = 'wmf-certificates'"` should return the list
[10:21:12] (no idea where to run cumin from on cloud)
[10:21:25] we don't have a wmcs-wide puppetdb
[10:21:49] but the package gets applied to all instances via site.pp -> node default -> role::wmcs::instance -> profile::base::labs -> profile::base
[10:22:56] we ensure the package in profile::base::certificates
[10:23:33] 679 instances with the wrong production one, 136 in toolforge with the correct one, then a few with other project-local CAs
[10:23:36] and profile::base contains profile::base::certificates
[10:23:39] lovely
[10:25:59] I am checking the package's debian config, and removing it may do the right thing (of course we'd need to make sure that puppet doesn't re-deploy it)
[10:26:02] lemme check
[10:26:45] nope, the link gets removed
[10:27:25] majavah: how did you come up with the 679 number? (to understand the process)
[10:28:05] taavi@cloud-cumin-03:~$ sudo cumin "A:all" "openssl x509 -in /etc/ssl/certs/Puppet_Internal_CA.pem -text -noout | grep CN"
[10:33:53] do we have a "healthy" node on which we can check what Puppet_Internal_CA.pem links to?
[10:33:59] (just to double check)
[10:35:24] majavah: --^
[10:40:37] anyway, I think that the best course of action is to open a task
[10:40:43] and add people
[10:40:50] so that we can track work etc..
[10:40:55] going to open one in a bit
[10:43:39] elukey: there's one already
[10:43:49] elukey: https://phabricator.wikimedia.org/T296127
[10:43:56] I see, thanks!
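The 679/136 split above came from that cumin run: each host reports the subject CN of its installed CA file, and the production CA's CN stands out from the per-project puppetmaster CNs. The per-host extraction step can be sketched self-contained like this (the CN value below is a hypothetical project-local puppetmaster, not a real host):

```shell
# Sketch of the per-host step behind the cumin CN survey. Generate a
# throwaway CA first so the snippet is runnable anywhere.
dir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -keyout "$dir/ca.key" -out "$dir/ca.pem" \
    -subj "/CN=Puppet CA: deployment-puppetmaster.example" 2>/dev/null
# Like `openssl x509 ... -text -noout | grep CN` in the log, but
# extracting just the CN value so results are easy to count and sort.
cn=$(openssl x509 -in "$dir/ca.pem" -noout -subject | sed 's/.*CN *= *//')
echo "$cn"
```

Aggregating with cumin then amounts to grouping hosts by identical command output, which is how the wrong-CA instances could be counted without a wmcs-wide puppetdb.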
[10:46:02] elukey: any node in toolforge is untouched, we don't do unattended-upgrades for wmf packages there
[10:46:31] super
[10:46:39] (I have no clue how much access you have in the cloud realm, let me know if you need me to do something)
[10:50:05] thanks, np :)
[11:11:05] need to go, but it seems that the situation is not worth a page
[16:15:25] Beta-Cluster-Infrastructure, Release-Engineering-Team, Scap, Infrastructure-Foundations, Puppet: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (LucasWerkmeister) Web requests to the Beta cluster (e.g. https://en.wikip...
[20:11:49] urbanecm: looks like beta is now fully down
[20:12:18] RhinosF1: I'm aware, but the (likely?) root cause task is already an UBN ;)
[20:12:23] Could you have a glance and see if the bad certificate is back or if it's a new issue
[20:12:31] urbanecm: it is, yeah
[20:12:41] I can imagine it might still be Monday before a fix
[20:12:52] Just saw you were online
[20:13:00] it's definitely the same issue: ` Uncaught ConfigException: Failed to load configuration from etcd: (curl error: 60) SSL peer certificate or SSH remote key was not OK in /srv/mediawiki/php-master/includes/config/EtcdConfig.php:205`
[20:13:04] that's literally the same error message
[20:13:49] honestly the only thing left is the pager, but it's not _that_ urgent :)
[20:13:49] I wondered why it's now causing it to go down
[20:14:06] No, definitely not
[20:14:07] leaving it as "first thing to do on Monday" is enough for me
[20:14:11] K
[20:14:58] RhinosF1: the bad cert was never removed from anywhere except puppetdb
[20:15:43] majavah: It apparently only went down halfway through today though
[20:15:52] That's what confused me
[21:44:12] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook