[11:31:33] I'm trying to make sense of some alerts/warnings we see for some Wikidata-related instances: https://alerts.wikimedia.org/?q=%40state%3Dactive&q=wikidata&q=severity%3Dwarning
[11:32:12] especially the PuppetAgentNoResources and stale last run
[11:33:19] but "PuppetAgentNoResources" doesn't find anything on wikitech
[11:33:40] (or codesearch for that matter)
[11:34:11] I think maybe some of those instances have already been decommissioned, but I'm not sure
[11:37:09] I've seen [⚓ T324812 toolsbeta: puppet failing on multiple hosts](https://phabricator.wikimedia.org/T324812) but it is not really telling me much
[11:37:09] T324812: toolsbeta: puppet failing on multiple hosts - https://phabricator.wikimedia.org/T324812
[11:38:51] MichaelG_WMDE: let's go bit by bit
[11:39:16] 🙏
[11:39:17] Looking at wikidata-federated-properties, the VM is up and running, and running puppet correctly
[11:41:23] node-exporter seems up and running too (that's the process that exposes the metrics from the node to prometheus)
[11:42:04] it does report 0 resources though
[11:42:10] https://www.irccloud.com/pastebin/0qn1W0fi/
[11:43:03] I'm really not experienced with puppet, what does that mean?
[11:44:09] oh, the process generating the stats is dead though (prometheus-puppet-agent-stats.service)
[11:44:44] so the alert is complaining because the statistics we gather from that host about how many things are being managed by puppet (resources) say it's managing 0 things
[11:44:51] and that it has not run in a while
[11:47:44] Ok, thanks 🙏. Then those warnings might go away when the process is restarted?
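The alert logic being described can be sketched as a toy check over the node-exporter textfile metrics (the metric names and the staleness threshold here are assumptions for illustration, not the real Wikimedia alert definitions):

```python
# Toy model of the two warnings discussed above: puppet reports how many
# resources it manages and when it last ran, via a node-exporter textfile.
# Metric names below are hypothetical stand-ins for the real ones.

def parse_textfile(text):
    """Parse a node-exporter textfile-collector snippet into {metric: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

def check_puppet_alerts(metrics, now, max_age=3 * 3600):
    """Return the alert names that would fire for these metrics."""
    alerts = []
    # PuppetAgentNoResources: the host says puppet is managing 0 things.
    if metrics.get("puppet_agent_resources", 0) == 0:
        alerts.append("PuppetAgentNoResources")
    # Stale last run: the last recorded run is too far in the past.
    if now - metrics.get("puppet_agent_last_run_seconds", 0) > max_age:
        alerts.append("PuppetAgentStaleRun")
    return alerts

sample = """\
puppet_agent_resources 0
puppet_agent_last_run_seconds 1700000000
"""
print(check_puppet_alerts(parse_textfile(sample), now=1700020000))
# → ['PuppetAgentNoResources', 'PuppetAgentStaleRun']
```

This also shows why a dead prometheus-puppet-agent-stats.service produces exactly this pair of warnings: a stale file keeps reporting the last values it ever wrote.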
[11:48:58] yep, it seems to be failing to start though, looking. Running it manually works and reports >500 resources
[11:52:14] I think it was a permissions issue, the report file was owned by prometheus, but the process should create it as root instead :/, it's working now
[11:52:28] let's check wb-reconcile, seems similar
[11:53:07] for the record I did 'root@wikidata-federated-properties:~# /usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom', the same thing the service does
[11:53:48] yep, same stuff :)
[11:54:16] now wikidata-analytics-1
[11:55:29] that one seems to have trouble resolving DNS
[11:55:33] Error: /File[/var/lib/puppet/facts.d]: Failed to generate additional resources using 'eval_generate': Failed to open TCP connection to puppetmaster.cloudinfra.wmflabs.org:8140 (getaddrinfo: Temporary failure in name resolution)
[11:57:13] yeah, not sure if wikidata-analytics still exists. I recall there was some work to sunset parts of our old analytics infrastructure but I'm not sure if that included the cloud vps project
[11:58:27] the project is there, and the VM exists too
[11:58:34] (and is up and running)
[11:58:42] it uses a floating IP
[11:59:52] it's running a couple of containers
[12:00:00] quratorqcerevolver and quratorqcfrevolver
[12:02:23] MichaelG_WMDE: do you mind if we reboot that one? To force a network reset from scratch
[12:02:57] uh, let me ask the analytics people on our end, I'm not directly involved in the work there
[12:10:33] dcaro: I got the green light, that can be rebooted
[12:15:15] ack
[12:15:26] the others are reverting to the old permissions, so starting to fail again :/
[12:18:15] meh, that's unfortunate. But now that we know what's going on, it is not urgent to fix from our side
[12:19:16] !log wmdeanalytics reboot wikidata-analytics-1 to refresh the network
[12:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wmdeanalytics/SAL
[12:26:51] the reboot did not help xd
[12:27:51] gtg. be back in a bit to give it another look, feel free to open a task so we can keep track
[12:29:54] dcaro, no hurry, and thank you for all your help so far! ❤️
[12:30:06] * MichaelG_WMDE writes a phab task and then has lunch
[13:12:52] I've created a task here: https://phabricator.wikimedia.org/T354268
[13:13:05] feel free to adjust it as you see fit :)
[13:13:09] * MichaelG_WMDE is now on a break
[15:02:21] can anyone help with debugging why vk.com shows as certificate expired from cloudVPS? Thought it was the buster (ouch) instances, but a new bookworm instance has the same issue.
[15:03:48] chicovenancio: I get an ok certificate
[15:03:52] https://www.irccloud.com/pastebin/cX6pUQyU/
[15:04:39] oh, from within vps xd
[15:04:44] interesting
[15:04:53] I get expired too from the bastion, yes
[15:05:11] (I had understood that it was hosted in cloudvps and serving a bad certificate)
[15:09:25] interesting, from the tools bastion there's an expired certificate in the chain
[15:09:27] https://www.irccloud.com/pastebin/YgTGNQdf/
[15:09:34] tools.wm-what@tools-sgebastion-10:~$ openssl s_connect 87.240.132.67:443 -showcerts
[15:10:07] sorry, the command line is: openssl s_client -connect 87.240.132.67:443 -showcerts
[15:12:03] Yeah, but I imagine it's also sending a valid chain, since other clients accept it.
[15:15:51] what is 87.240.132.67?
[15:16:43] oh it's vk.com
[15:21:59] yep sorry, that's the same IP on both sides, otherwise I get different IPs
[15:22:09] (just trying to minimize variability)
[15:26:42] weird, from my Arch laptop it verifies fine but from a Bullseye container on the same machine I get a self-signed cert in chain error
[15:27:39] it vaguely reminds me of the Let's Encrypt root expiry issue, but that should be fixed in these versions of openssl, right?
[15:28:12] yes, otherwise we'd be getting a lot more reports of not being able to connect to the Wikimedia sites
[15:28:29] tunneling through the bastion and doing the check from my laptop also sees the cert as valid, so this has to be something on the host (old certs, openssl version, ...)
[15:30:33] the chain is the same between local Arch, local Debian, and the bastion
[15:31:44] Debian 11 locally seems to work
[15:33:26] hmm, Debian 11 on cloudvps does not work
[15:33:36] maybe the configured protocols or similar?
[15:36:52] interesting, so far everything under /etc/ssl seems similar (cloudvps has a couple more certs), and the openssl version is the same
[15:37:02] looks like /usr/local/share/ca-certificates/GlobalSign_Organization_Validation_CA_-_SHA256_-_G2.crt is the expired cert
[15:38:31] which then gets loaded into /etc/ssl/certs/GlobalSign_Organization_Validation_CA_-_SHA256_-_G2.pem by update-ca-certificates
[15:39:36] hmm, that file is not provided by any package?
[15:40:24] it's defined in modules/profile/manifests/base/certificates.pp
[15:40:38] yep, puppet pulls it in
[15:45:15] can confirm removing the cert and running `sudo update-ca-certificates` solves it
[15:45:23] (at least until the next puppet run?)
[15:45:39] I'm asking in -sre to see if they are aware of the issue
[16:03:15] I'm wondering if this isn't a bug somewhere in certificate validation as well. Should the expired root cert be enough to fail things even in the presence of a valid chain?
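The question at the end can be illustrated with a toy model of trust-path building (deliberately not real X.509; the names only mirror the GlobalSign G2 situation above). Path builders differ: a naive one anchors on the first matching trust-store entry and fails if that one is expired, while a more thorough one keeps trying alternative anchors.

```python
# Toy model: a "cert" is just (subject, issuer, expired) -- no real crypto.
from collections import namedtuple

Cert = namedtuple("Cert", "subject issuer expired")

def verify_first_path(leaf_chain, trust_store):
    """Naive builder: anchor on the FIRST trust-store cert whose subject
    matches the top of the server-sent chain, then stop."""
    top = leaf_chain[-1]
    for anchor in trust_store:
        if anchor.subject == top.issuer:
            return not anchor.expired  # no backtracking to other candidates
    return False

def verify_any_path(leaf_chain, trust_store):
    """Thorough builder: succeed if ANY matching anchor is still valid."""
    top = leaf_chain[-1]
    return any(a.subject == top.issuer and not a.expired for a in trust_store)

# Server sends leaf + intermediate; the intermediate is signed by "G2".
chain = [Cert("vk.com", "Intermediate", False),
         Cert("Intermediate", "G2", False)]
# Local store holds an expired copy of G2 (the puppet-installed file from the
# discussion above) AND a valid one shipped by the distro.
store = [Cert("G2", "G2", expired=True),
         Cert("G2", "G2", expired=False)]

print(verify_first_path(chain, store))  # False: expired anchor found first
print(verify_any_path(chain, store))    # True: a valid path exists
```

So in this model the answer is "it depends on the verifier": an expired cert in the local store can break hosts whose path builder stops at the first candidate, even though a valid chain exists, which is consistent with Arch working while the Cloud VPS hosts fail.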
[17:06:15] !log tools.eranbot enable cron jobs for frwiki, enwiki, eswiki, arwiki, and simplewiki T354145
[17:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.eranbot/SAL
[17:12:22] chicocvenancio: feels weird yes
[17:20:28] I think it's just how the protocol works, but I'm not an expert. I found some related discussions here https://security.stackexchange.com/a/191341/89594
[17:21:45] yep, I read that too, but it's not clear to me if the client should only take into account the chain passed by the server (that should have worked) or should prioritize the local filesystem one instead
[17:21:50] (that might be the case)
[17:22:58] I guess we might fall into this case "The client is allowed to discover its own chain to a trusted root"
[17:23:14] but I would need to read the SSL spec to be sure :)
[17:32:12] created T354295, feel free to add stuff there if you find anything
[17:32:12] T354295: [puppet] Remove expired and unused certs from modules/profile/files/ssl/ and modules/base/files/ca - https://phabricator.wikimedia.org/T354295
[21:17:30] !log tools deleting many stray core dumps throughout nfs storage
[21:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:21:37] !log truncating 200 logfiles to 5M on tools nfs
[21:21:37] andrewbogott: Unknown project "truncating"
[21:22:07] !log tools truncating 200 logfiles to 5M on tools nfs
[21:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
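The log-truncation housekeeping in the last SAL entries could look roughly like this sketch (the actual command used is not shown in the log; the `*.log` pattern and the demo paths are assumptions):

```python
import os
import tempfile

LIMIT = 5 * 1024 * 1024  # 5M, as in the SAL entry above

def truncate_large_logs(root, limit=LIMIT):
    """Truncate every *.log under root that exceeds limit down to limit bytes.

    Truncating in place (rather than deleting) keeps the inode, so processes
    holding the file open keep writing to the same file -- which matters for
    long-running jobs logging to NFS.
    """
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".log"):
                continue
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) > limit:
                os.truncate(path, limit)

# Demo: an 8 MiB sparse file gets cut to 5 MiB, a 1 MiB file is left alone.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "big.log"), "wb") as f:
    f.truncate(8 * 1024 * 1024)
with open(os.path.join(tmp, "small.log"), "wb") as f:
    f.truncate(1 * 1024 * 1024)
truncate_large_logs(tmp)
```

An equivalent one-liner with coreutils would be along the lines of `find . -name '*.log' -size +5M -exec truncate -s 5M {} +`.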