[05:27:58] <wikibugs>	 10Release-Engineering-Team (Deployment Training Requests): Deployment training request for JKieserman - https://phabricator.wikimedia.org/T296024 (10Ladsgroup) Noted. I will be there unless issues happen (sick, etc.)
[06:36:35] <wmf-insecte>	 Project beta-scap-sync-world build #27991: 04FAILURE in 2 min 10 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27991/
[06:46:35] <wmf-insecte>	 Project beta-scap-sync-world build #27992: 04STILL FAILING in 2 min 10 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27992/
[06:56:40] <wmf-insecte>	 Project beta-scap-sync-world build #27993: 04STILL FAILING in 2 min 15 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27993/
[07:06:32] <wmf-insecte>	 Project beta-scap-sync-world build #27994: 04STILL FAILING in 2 min 7 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27994/
[07:16:41] <wmf-insecte>	 Project beta-scap-sync-world build #27995: 04STILL FAILING in 2 min 14 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27995/
[07:26:33] <wmf-insecte>	 Project beta-scap-sync-world build #27996: 04STILL FAILING in 2 min 9 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27996/
[07:36:42] <wmf-insecte>	 Project beta-scap-sync-world build #27997: 04STILL FAILING in 2 min 16 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27997/
[07:46:40] <wmf-insecte>	 Project beta-scap-sync-world build #27998: 04STILL FAILING in 2 min 13 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27998/
[07:56:30] <wmf-insecte>	 Project beta-scap-sync-world build #27999: 04STILL FAILING in 2 min 5 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27999/
[08:06:39] <wmf-insecte>	 Project beta-scap-sync-world build #28000: 04STILL FAILING in 2 min 12 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/28000/
[08:11:27] <RhinosF1>	 Fatal error: Uncaught ConfigException: Failed to load configuration from etcd: (curl error: 60) SSL peer certificate or SSH remote key was not OK in /srv/mediawiki-staging/php-master/includes/config/EtcdConfig.php:205
[08:11:49] <RhinosF1>	 Last worked 06:24
[08:15:37] <RhinosF1>	 Can't see any config / mediawiki changes so possibly a beta issue
[08:16:37] <wmf-insecte>	 Project beta-scap-sync-world build #28001: 04STILL FAILING in 2 min 11 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/28001/
[08:17:30] <RhinosF1>	 Puppet is failing on 16 hosts
[08:26:43] <wmf-insecte>	 Project beta-scap-sync-world build #28002: 04STILL FAILING in 2 min 16 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/28002/
[08:35:24] <wmf-insecte>	 Project beta-scap-sync-world build #28003: 15ABORTED in 59 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/28003/
[08:36:34] <thcipriani>	 !log disabled beta-scap-sync-world
[08:36:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[08:39:58] <RhinosF1>	 thcipriani: is it worth a task for someone to pick up or are you looking?
[08:41:09] <thcipriani>	 RhinosF1: good call as usual :) I believe this has something to do with recent changes to cdb generatation in sync master. Lemme find the change and I'll file that task.
[08:42:10] <thcipriani>	 for now, my log message was mostly about stopping the noise
[08:43:41] <RhinosF1>	 Ok
[08:44:16] <RhinosF1>	 Yeah noise isn't helpful
[08:48:52] <wikibugs>	 10Release-Engineering-Team, 10Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10thcipriani)
[08:50:23] <wikibugs>	 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10thcipriani)
[08:51:11] <thcipriani>	 hrm, while annoying, it does seem like scap was "successful" except at keeping the other deployment host in sync
[08:52:18] <RhinosF1>	 Ah
[08:58:32] <urbanecm>	 thcipriani: "just" rebuilding i18n cache failed because of some cert issues
[09:00:19] <urbanecm>	 so, puppet at etcd fails...
[09:07:33] <urbanecm>	 ...puppet at all of beta fails
[09:07:52] <wikibugs>	 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10Urbanecm) I want to note that puppet apparently fails at the etcd host.   ` urbanecm@deployment-etcd02:~$ sudo run-pu...
[09:09:04] <thcipriani>	 sigh
[09:10:05] <thcipriani>	 alright...I guess--given it's like 2am--it's less disruptive to let beta-scap-sync-world continue to make noise and look into cert issues when I have time :(
[09:10:22] <urbanecm>	 I'd leave it disabled honestly
[09:10:57] <RhinosF1>	 Just leave it disabled
[09:10:58] <thcipriani>	 I have no strong opinions, so it doesn't take much to convince me in this case :)
[09:11:07] <urbanecm>	 `Caused by: org.postgresql.util.PSQLException: SSL error: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target` is the best i can find now
[09:11:12] <urbanecm>	 (at puppetdb)
[09:11:55] <majavah>	 that might be related to elukey's recent ca work
[09:12:08] <urbanecm>	 might be
[09:12:16] <thcipriani>	 I'd bet if we could get puppet to run it'd take care of itself
[09:12:16] <urbanecm>	 it's like a hour or two old
[09:12:22] <majavah>	 I'll have a look later, not at my laptop at the moment
[09:12:44] <thcipriani>	 <3 majavah 
[09:13:28] * majavah points towards the general direction of T215217
[09:14:31] <thcipriani>	 I swear this is on my mind.
[09:14:40] <RhinosF1>	 thcipriani: pretty sure it would fix it
[09:17:02] <RhinosF1>	 urbanecm: might be worth pasting the puppet db error on task
[09:17:14] <urbanecm>	 yeah, i'm just looking if i can find more
[09:17:28] <wikibugs>	 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Scap, 10Infrastructure-Foundations, 10Puppet: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10RhinosF1)
[09:18:36] <thcipriani>	 FWIW, https://goals.releng.team are my Big Goals™. Then trying to figure out something better than "majavah takes care of it" for beta  is at the top of my list.
[09:20:34] <urbanecm>	 might https://gerrit.wikimedia.org/r/c/operations/puppet/+/739266 be related?
[09:20:56] <RhinosF1>	 urbanecm: sounds sus
[09:22:22] <wikibugs>	 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Scap, 10Infrastructure-Foundations, 10Puppet: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10Urbanecm) `Caused by: org.postgresql.util.PSQLException: SSL error: PKIX...
[09:22:24] <RhinosF1>	 That's 4 days old though
[09:22:31] <RhinosF1>	 So why only fail 3 hours ago
[09:22:36] <urbanecm>	 not sure
[09:25:17] <wikibugs>	 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Scap, 10Infrastructure-Foundations, 10Puppet: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10Urbanecm) ` Nov 20 07:16:02 deployment-puppetdb03 puppet-agent[18152]: Lo...
[09:25:34] <thcipriani>	 might be a red herring. There's a 500 from the beta puppet server that's a bit inscrutable to me...
[09:25:57] <majavah>	 I'm on my laptop now, let's see
[09:26:39] <majavah>	 so we have multiple systems (etcd, puppetdb) relying on the puppet ca failing at the same time
[09:26:55] * urbanecm is to disembark the train soon
[09:26:58] <urbanecm>	 so i'll leave it to you :)
[09:27:00] <wikibugs>	 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Scap, 10Infrastructure-Foundations, 10Puppet: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10thcipriani) The puppet failure on the second deployment host is:  ` thcip...
[09:31:00] <majavah>	 I think puppetdb is having certificate issues with connecting to its postgres database on the same host
[09:32:14] <majavah>	 and postgres uses the puppet host certs
[09:38:48] <majavah>	 umm, the puppet log urbanecm pasted to the task does not make any sense
[09:38:54] <majavah>	 Nov 20 07:16:09 deployment-puppetdb03 puppet-agent[18152]: (/Stage[main]/Sslcert::Trusted_ca/File[/etc/ssl/localcerts/WMF_TEST_CA.pem]/ensure) ensure changed 'file' to 'link'
[09:39:13] <majavah>	 um, hold on
[09:39:16] <majavah>	 trusted_ca?
[09:40:22] <urbanecm>	 majavah: i swear i didn't make it up!
[09:40:58] <majavah>	 btw.. I just realized what's going on
[09:41:49] <urbanecm>	 Yes?
[09:41:52] * urbanecm is curoous
[09:41:54] <urbanecm>	 *curious
[09:42:44] <majavah>	 so, as a part of the mw-on-k8s work, serviceops have made a debian package called wmf-certificates containing the internal CAs (cfssl and puppet) so that they could be installed in docker containers too
[09:42:58] <majavah>	 that package is installed on all servers via puppet too
[09:43:25] <urbanecm>	 ...so we started to suddenly use prod certs from the package? That...can't be right, can it?
[09:43:38] <majavah>	 and as a part of today's unattended-updates run, it overwrote the puppet ca with prod's one
[09:44:07] <urbanecm>	 Assuming we have old CA somewhere, shouldn't be terribly hard to fix
[09:44:09] <majavah>	 https://phabricator.wikimedia.org/P17782
[09:44:35] <majavah>	 ... why has this not broken more things
[09:44:37] <urbanecm>	 This will slowly break mostly everything, at least in beta
[09:44:53] <majavah>	 I'm more worried about things like toolforge
[09:44:53] <urbanecm>	 And I _hope_ it got installed only in beta (and not cloudvps wide)
[09:45:01] * majavah looks
[09:49:20] <majavah>	 wmcs-wide cumins take a while sadly
[09:52:12] <majavah>	 it has overridden /etc/ssl/certs/Puppet_Internal_CA.pem on all wmcs instances except toolforge
[09:52:40] <RhinosF1>	 that sounds bad
[09:52:53] <RhinosF1>	 but why not toolforge
[09:53:46] <majavah>	 because I haven't got around fixing T290494 yet
[09:53:47] <stashbot>	 T290494: Revisit Toolforge automated package updates and version pinnings - https://phabricator.wikimedia.org/T290494
[09:53:54] <RhinosF1>	 ah
[09:54:38] <urbanecm>	 majavah: can you update ticket with the findings and/or create a new one? This is like an UBN for cloud
[09:55:11] <majavah>	 Doing
[09:56:47] <urbanecm>	 Thanks
[09:57:40] <RhinosF1>	 i left message in #wikimedia-cloud-admin 
[09:58:28] <majavah>	 T296127
[09:58:29] <stashbot>	 T296127: /etc/ssl/certs/Puppet_Internal_CA.pem overridden to production certs on cloud vps - https://phabricator.wikimedia.org/T296127
[10:00:09] <majavah>	 in theory I could do a cloud-wide cumin for `cp /var/lib/puppet/ssl/certs/ca.pem /etc/ssl/certs/Puppet_Internal_CA.pem`, but I'm not going to do that without feedback for others who know what they're doing
[10:01:43] <RhinosF1>	 majavah: who would that be?
[10:01:53] <majavah>	 !log root@deployment-puppetdb03:~# cp /var/lib/puppet/ssl/certs/ca.pem /etc/ssl/certs/Puppet_Internal_CA.pem && systemctl restart puppetdb.service # T296125
[10:01:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[10:01:57] <stashbot>	 T296125: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125
[10:02:12] <RhinosF1>	 and what's the best way to get them on a Saturday
[10:02:32] <RhinosF1>	 (also we need to make sure it doesn't override again)
[10:04:19] <urbanecm>	 majavah: do we have an idea of which patch exactly was the breaking one?
[10:04:25] <urbanecm>	 i can attempt to get someone to rv it
[10:05:13] <majavah>	 urbanecm: it wasn't a puppet patch, it was a release of the wmf-certificates debian package being uploaded to apt.wm.o
[10:05:28] <urbanecm>	 but something had to install it?
[10:05:32] <elukey>	 hi folks, just seen the pings
[10:05:43] <elukey>	 I believe that the change majavah refers to is https://gerrit.wikimedia.org/r/c/operations/debs/wmf-certificates/+/740119
[10:05:54] <urbanecm>	 hey elukey! 
[10:06:02] <elukey>	 o/
[10:06:11] <majavah>	 urbanecm: yes, that's done by the unattended-upgrades cronjob
[10:06:20] <elukey>	 elukey@an-test-client1001:~$ ls -l /etc/ssl/certs/Puppet_Internal_CA.pem 
[10:06:23] <elukey>	 lrwxrwxrwx 1 root root 59 Nov 19 14:22 /etc/ssl/certs/Puppet_Internal_CA.pem -> /usr/share/ca-certificates/wikimedia/Puppet_Internal_CA.crt
[10:06:23] <majavah>	 hey elukey, turns out it wasn't your change that broke things, sorry for the ping
[10:06:26] <elukey>	 yeah
[10:07:13] <elukey>	 majavah: no problem, thanks as always for spending time on this :) I was working with Jaime yesterday to release the new package, so I am kinda responsible as well :D I have to say that nobody really expected this problem
[10:07:25] <RhinosF1>	 that isn't replicated https://github.com/wikimedia/operations-debs-wmf-certificates fwiw
[10:08:24] <majavah>	 toolforge's k8s cluster uses puppet certs for etcd traffic, thankfully that seems unaffected
[10:08:48] <urbanecm>	 for now majavah :D
[10:08:53] <majavah>	 https://gerrit.wikimedia.org/g/operations/puppet/+/24cc1258080076d140eaf705b26f0f8ac63563c0/modules/profile/manifests/wmcs/kubeadm/control.pp#27
[10:10:40] * RhinosF1 off
[10:12:13] <majavah>	 the list of things relying on that certificate file is https://codesearch.wmcloud.org/search/?q=%2Fetc%2Fssl%2Fcerts%2FPuppet_Internal_CA.pem&i=nope&files=&excludeFiles=&repos=
[10:15:20] <elukey>	 majavah: an alternative to the cp could be a simply ln -s, basically flipping the links via cumin to the right location
[10:17:16] <elukey>	 but anything with /etc/ssl/certs/Puppet_Internal_CA.pem -> /usr/share/ca-certificates/wikimedia/Puppet_Internal_CA.crt is clearly broken in cloud
[10:17:41] <majavah>	 mm.. I don't think I want to be doing that either cloud-wide on a saturday instead if I can avoid it
[10:18:51] <elukey>	 I can try to ping the cloud folks, this seems to be a big problem
[10:19:10] <elukey>	 I totally agree that you shouldn't be the one responsible to fixing this
[10:20:04] <elukey>	 majavah: it may not be all cloud though, but I suspect only hosts using the profile::base blabla that installs wmf-certificates
[10:20:45] <elukey>	 so a cumin command like 'cumin "R:package = 'wmf-certificates'" should return the list
[10:21:12] <elukey>	 (no idea where to run cumin from on cloud)
[10:21:25] <majavah>	 we don't have a wmcs-wide puppetdb
[10:21:49] <majavah>	 but the package gets applied to all instances via site.pp -> node default -> role::wmcs::instance -> profile::base::labs -> profile::base
[10:22:56] <elukey>	 we ensure the package in profile::base::certificates
[10:23:33] <majavah>	 679 instances with the wrong production one, 136 in toolforge with the correct one, then a few with other project local CAs
[10:23:36] <elukey>	 and in profile::base contain profile::base::certificates
[10:23:39] <elukey>	 lovely
[10:25:59] <elukey>	 I am checking the package's debian config, and removing it may do the right thing (of course we'd need to make sure that puppet doesn't re-deploy it)
[10:26:02] <elukey>	 lemme check
[10:26:45] <elukey>	 nope the link gets removed 
[10:27:25] <elukey>	 majavah: how did you come up with the 679 number? (to understand the process)
[10:28:05] <majavah>	 taavi@cloud-cumin-03:~$ sudo cumin "A:all" "openssl x509 -in /etc/ssl/certs/Puppet_Internal_CA.pem -text -noout | grep CN"
[10:33:53] <elukey>	 do we have a "healthy" node on which we can check what Puppet_Internal_CA.pem links to? 
[10:33:59] <elukey>	 (just to double check)
[10:35:24] <elukey>	 majavah: --^
[10:40:37] <elukey>	 anyway, I think that the best course of action is to open a task
[10:40:43] <elukey>	 and add people
[10:40:50] <elukey>	 so that we can track work etc..
[10:40:55] <elukey>	 going to open one in a bit
[10:43:39] <urbanecm>	 elukey: there's one already
[10:43:49] <urbanecm>	 elukey: https://phabricator.wikimedia.org/T296127
[10:43:56] <elukey>	 I see thanks!
[10:46:02] <majavah>	 elukey: any node in toolforge is untouched, we don't do unattended-updates for wmf packages there
[10:46:31] <elukey>	 super
[10:46:39] <majavah>	 (I have no clue how much access you have in the cloud realm, let me know if you need me to do something)
[10:50:05] <elukey>	 thanks np :)
[11:11:05] <elukey>	 need to go but it seems that the situation is not worth a page 
[16:15:25] <wikibugs>	 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Scap, 10Infrastructure-Foundations, 10Puppet: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10LucasWerkmeister) Web requests to the Beta cluster (e.g. https://en.wikip...
[20:11:49] <RhinosF1>	 urbanecm: looks like beta now fully down
[20:12:18] <urbanecm>	 RhinosF1: I'm aware, but the (likely?) root cause task is already an UBN ;)
[20:12:23] <RhinosF1>	 Could you have a glance and see if the bad certificate is back or if it's a new bad issue
[20:12:31] <RhinosF1>	 urbanecm: it is ye
[20:12:41] <RhinosF1>	 I can imagine it might still be Monday before a fix
[20:12:52] <RhinosF1>	 Just saw you were online
[20:13:00] <urbanecm>	 it's definitely same issue: ` Uncaught ConfigException: Failed to load configuration from etcd: (curl error: 60) SSL peer certificate or SSH remote key was not OK in /srv/mediawiki/php-master/includes/config/EtcdConfig.php:205`
[20:13:04] <urbanecm>	 that's literally the same error message
[20:13:49] <urbanecm>	 honestly the only thing left is the pager, but it's not _that_ urgent :)
[20:13:49] <RhinosF1>	 I wondered why it now causing it go down
[20:14:06] <RhinosF1>	 No definitely not
[20:14:07] <urbanecm>	 leaving it as "first thing to do on Monday" is enough for me
[20:14:11] <RhinosF1>	 K
[20:14:58] <majavah>	 RhinosF1: the bad cert was never removed from anywhere except puppetdb
[20:15:43] <RhinosF1>	 majavah: It apparently only went down half way through today though
[20:15:52] <RhinosF1>	 That's what confused me
[21:44:12] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook