Wikimedia IRC logs browser - #wikimedia-cloud

Displaying 98 items:

2024-01-03 03:47:30 <Guest15> code
2024-01-03 11:31:33 <MichaelG_WMDE> I'm trying to make sense of some alerts/warnings we see for some Wikidata-related instances: https://alerts.wikimedia.org/?q=%40state%3Dactive&q=wikidata&q=severity%3Dwarning
2024-01-03 11:32:12 <MichaelG_WMDE> especially the PuppetAgentNoResources and stale last run
2024-01-03 11:33:19 <MichaelG_WMDE> but searching for "PuppetAgentNoResources" doesn't find anything on wikitech
2024-01-03 11:33:40 <MichaelG_WMDE> (or codesearch for that matter)
2024-01-03 11:34:11 <MichaelG_WMDE> I think maybe some of those instances have already been decommissioned, but not sure
2024-01-03 11:37:09 <MichaelG_WMDE> I've seen T324812 (toolsbeta: puppet failing on multiple hosts, https://phabricator.wikimedia.org/T324812) but it is not really telling me much
2024-01-03 11:37:09 <stashbot> T324812: toolsbeta: puppet failing on multiple hosts - https://phabricator.wikimedia.org/T324812
2024-01-03 11:38:51 <dcaro> MichaelG_WMDE: let's go bit by bit
2024-01-03 11:39:16 <MichaelG_WMDE> 🙏
2024-01-03 11:39:17 <dcaro> Looking at wikidata-federated-properties, the VM is up, running and running puppet correctly
2024-01-03 11:41:23 <dcaro> node-exporter seems up and running too (that's the process that exposes the metrics from the node to prometheus)
2024-01-03 11:42:04 <dcaro> it does report 0 resources though
2024-01-03 11:42:10 <dcaro> https://www.irccloud.com/pastebin/0qn1W0fi/
2024-01-03 11:43:03 <MichaelG_WMDE> I'm really not experienced with puppet, what does that mean?
2024-01-03 11:44:09 <dcaro> oh, the process generating the stats is dead though (prometheus-puppet-agent-stats.service)
2024-01-03 11:44:44 <dcaro> so the alert is complaining because the statistics we gather from that host about how many things are being managed by puppet (resources) say it's managing 0 things
2024-01-03 11:44:51 <dcaro> and that it has not run in a while
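
The two alert conditions described above (zero managed resources, and a last run that is too old) can be sketched roughly as follows. This is a minimal illustration and not the actual alert rule: the metric names `puppet_agent_resources_total` and `puppet_agent_last_run`, and the one-day staleness threshold, are assumptions made up for the example. The parser only handles label-free samples in the Prometheus text format.

```python
def parse_textfile_metrics(text):
    """Parse label-free samples from Prometheus text exposition format."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) < 2 or "{" in parts[0]:
            continue  # this sketch ignores labelled or malformed samples
        try:
            metrics[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return metrics


def puppet_alert_reasons(metrics, now, max_age_seconds=86400):
    """Return the reasons an alert would fire (hypothetical metric names)."""
    reasons = []
    # Assumed metric: number of resources managed by the puppet agent.
    if metrics.get("puppet_agent_resources_total", 0.0) == 0.0:
        reasons.append("no resources")
    # Assumed metric: Unix timestamp of the last puppet run.
    last_run = metrics.get("puppet_agent_last_run", 0.0)
    if now - last_run > max_age_seconds:
        reasons.append("stale last run")
    return reasons
```

If the stats generator is dead, the file stops being refreshed, so both conditions can fire even though puppet itself is running fine.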
2024-01-03 11:47:44 <MichaelG_WMDE> Ok, thanks 🙏. Then those warnings might go away when the process is restarted?
2024-01-03 11:48:58 <dcaro> yep, it seems to be failing to start though, looking, running it manually works and reports >500 resources
2024-01-03 11:52:14 <dcaro> I think it was a permissions issue, the report file was owned by prometheus, but the process should create it as root instead :/, it's working now
2024-01-03 11:52:28 <dcaro> let's check wb-reconcile, seems similar
2024-01-03 11:53:07 <dcaro> for the record I did 'root@wikidata-federated-properties:~# /usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom', the same as the service does
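
A quick way to check for the ownership mismatch mentioned above is to compare the file's owner with the user expected to write it. The sketch below is an illustration only: the helper name `owner_report` is hypothetical, and the expected owner `root` is taken from the discussion, not from the service definition.

```python
import os
import pwd


def owner_report(path, expected_user="root"):
    """Report who owns `path` and whether that matches `expected_user`."""
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)  # uid without a passwd entry
    return {
        "path": path,
        "uid": st.st_uid,
        "owner": owner,
        "matches_expected": owner == expected_user,
    }


# Example (path taken from the log):
# owner_report("/var/lib/prometheus/node.d/puppet_agent.prom")
```

Since the files were later seen reverting to the old permissions, running a check like this after a puppet run would show whether something is resetting the owner.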
2024-01-03 11:53:48 <dcaro> yep, same stuff :)
2024-01-03 11:54:16 <dcaro> now wikidata-analytics-1
2024-01-03 11:55:29 <dcaro> that one seems to have trouble resolving dns
2024-01-03 11:55:33 <dcaro> Error: /File[/var/lib/puppet/facts.d]: Failed to generate additional resources using 'eval_generate': Failed to open TCP connection to puppetmaster.cloudinfra.wmflabs.org:8140 (getaddrinfo: Temporary failure in name resolution)
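
The `getaddrinfo: Temporary failure in name resolution` part of the error above can be reproduced outside puppet with a small resolver check. This is a minimal sketch; the helper name `resolve` is hypothetical, and the host and port in the usage comment are the ones from the error message.

```python
import socket


def resolve(host, port):
    """Return the sorted unique addresses for host, or [] if resolution fails."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Same class of failure as "Temporary failure in name resolution".
        return []
    return sorted({info[4][0] for info in infos})


# Example (values from the error message):
# resolve("puppetmaster.cloudinfra.wmflabs.org", 8140)
```

An empty list from inside the VM, combined with a non-empty list from another host, would point at the VM's resolver configuration or network and not at the puppetmaster.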
2024-01-03 11:57:13 <MichaelG_WMDE> yeah, not sure if wikidata-analytics still exists. I recall there was some work to sunset parts of our old analytics infrastructure but I'm not sure if that included the cloud vps project
2024-01-03 11:58:27 <dcaro> the project is there, and the vm exists too
2024-01-03 11:58:34 <dcaro> (and is up and running)
2024-01-03 11:58:42 <dcaro> it uses a floating ip
2024-01-03 11:59:52 <dcaro> it's running a couple containers
2024-01-03 12:00:00 <dcaro> quratorqcerevolver and quratorqcfrevolver
2024-01-03 12:02:23 <dcaro> MichaelG_WMDE: do you mind if we reboot that one? To force a network reset from scratch
2024-01-03 12:02:57 <MichaelG_WMDE> uh, let me ask the analytics people on our end, I'm not directly involved in the work there
2024-01-03 12:10:33 <MichaelG_WMDE> dcaro: I got the green light, that can be rebooted
2024-01-03 12:15:15 <dcaro> ack
2024-01-03 12:15:26 <dcaro> the others are reverting to the old permissions, so starting to fail again :/
2024-01-03 12:18:15 <MichaelG_WMDE> meh, that's unfortunate. But now that we know what's going on, it is not urgent to fix from our side
2024-01-03 12:19:16 <dcaro> !log wmdeanalytics reboot wikidata-analytics-1 to refresh the network
2024-01-03 12:19:18 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wmdeanalytics/SAL
2024-01-03 12:26:51 <dcaro> the reboot did not help xd
2024-01-03 12:27:51 <dcaro> gtg. be back in a bit and give it another look, feel free to open a task so we can keep track
2024-01-03 12:29:54 <MichaelG_WMDE> dcaro, no hurry, and thank you for all your help so far! ❤️
2024-01-03 12:30:06 <MichaelG_WMDE> writes a phab task and then has lunch
2024-01-03 13:12:52 <MichaelG_WMDE> I've created a task here: https://phabricator.wikimedia.org/T354268
2024-01-03 13:13:05 <MichaelG_WMDE> feel free to adjust it as you see fit :)
2024-01-03 13:13:09 <MichaelG_WMDE> is now on a break
2024-01-03 15:02:21 <wm-bb> <chicocvenancio> anyone can help with debugging why vk.com shows as certificate expired from cloudVPS? Thought it was the buster (ouch) instances, but new bookworm instance has the same issue.
2024-01-03 15:03:48 <dcaro> chicocvenancio: I get an ok certificate
2024-01-03 15:03:52 <dcaro> https://www.irccloud.com/pastebin/cX6pUQyU/
2024-01-03 15:04:39 <dcaro> oh, from within vps xd
2024-01-03 15:04:44 <dcaro> interesting
2024-01-03 15:04:53 <dcaro> I get expired too from the bastion yes
2024-01-03 15:05:11 <dcaro> (I had understood that it was hosted in cloudvps and serving a bad certificate)
2024-01-03 15:09:25 <dcaro> interesting, from the tools bastion, there's a certificate expired in the chain
2024-01-03 15:09:27 <dcaro> https://www.irccloud.com/pastebin/YgTGNQdf/
2024-01-03 15:09:34 <dcaro> tools.wm-what@tools-sgebastion-10:~$ openssl s_connect 87.240.132.67:443 -showcerts
2024-01-03 15:10:07 <dcaro> sorry, this is the command line openssl s_client -connect 87.240.132.67:443 -showcerts
2024-01-03 15:12:03 <wm-bb> <chicocvenancio> Yeah, but I imagine it's also sending a valid chain, since other clients accept it.
2024-01-03 15:15:51 <dhinus> what is 87.240.132.67?
2024-01-03 15:16:43 <dhinus> oh it's vk.com
2024-01-03 15:21:59 <dcaro> yep sorry, I used the same ip on both sides, otherwise I get different ips
2024-01-03 15:22:09 <dcaro> (just trying to minimize variability)
2024-01-03 15:26:42 <AntiComposite> weird, from my Arch laptop it verifies fine but from a Bullseye container on the same machine I get a self-signed cert in chain error
2024-01-03 15:27:39 <wm-bb> <chicocvenancio> it vaguely reminds me of the let's encrypt root expiry issue, but that should be fixed in these versions of openssl, right?
2024-01-03 15:28:12 <AntiComposite> yes, otherwise we'd be getting a lot more reports of not being able to connect to the wikimedia sites
2024-01-03 15:28:29 <dcaro> tunneling through the bastion and doing the check from my laptop also sees the cert as valid, so this has to be something on the host (old certs, openssl version, ...)
2024-01-03 15:30:33 <AntiComposite> chain is the same between local arch, local debian, and bastion
2024-01-03 15:31:44 <dcaro> on debian 11 locally seems to work
2024-01-03 15:33:26 <dcaro> hmm, debian 11 on cloudvps does not work
2024-01-03 15:33:36 <dcaro> maybe the configured protocols or similar?
2024-01-03 15:36:52 <dcaro> interesting, so far, anything under /etc/ssl seems similar (cloudvps has a couple more certs), and the openssl version is the same
2024-01-03 15:37:02 <AntiComposite> looks like /usr/local/share/ca-certificates/GlobalSign_Organization_Validation_CA_-_SHA256_-_G2.crt is the expired cert
2024-01-03 15:38:31 <AntiComposite> which then gets loaded into /etc/ssl/certs/GlobalSign_Organization_Validation_CA_-_SHA256_-_G2.pem by update-ca-certificates
2024-01-03 15:39:36 <dcaro> hmm, that file is not provided by any package?
2024-01-03 15:40:24 <dhinus> it's defined in modules/profile/manifests/base/certificates.pp
2024-01-03 15:40:38 <dcaro> yep, puppet pulls it in
2024-01-03 15:45:15 <wm-bb> <chicocvenancio> can confirm removing the cert and running `sudo update-ca-certificates` solves it
2024-01-03 15:45:23 <wm-bb> <chicocvenancio> (at least until next puppet run?)
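
To confirm which locally installed certificate has expired, the `notAfter=` line printed by `openssl x509 -noout -enddate -in <file>` can be compared against the current time. The sketch below assumes that output format; the helper names are hypothetical and the date in the usage comment is illustrative, not the real expiry of the GlobalSign certificate.

```python
import ssl
import time


def not_after_seconds(enddate_line):
    """Convert an `openssl x509 -enddate` line to seconds since the epoch."""
    prefix = "notAfter="
    if not enddate_line.startswith(prefix):
        raise ValueError("expected a line starting with 'notAfter='")
    # ssl.cert_time_to_seconds parses e.g. 'Jan  5 09:34:43 2018 GMT'.
    return ssl.cert_time_to_seconds(enddate_line[len(prefix):].strip())


def is_expired(enddate_line, now=None):
    """Return True if the certificate's notAfter time is in the past."""
    if now is None:
        now = time.time()
    return now > not_after_seconds(enddate_line)


# Example with an illustrative date:
# is_expired("notAfter=Jan  1 00:00:00 2020 GMT")
```

Running this over each file under `/usr/local/share/ca-certificates/` would list the expired entries without removing anything, which matters here because puppet is expected to restore the file on its next run.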
2024-01-03 15:45:39 <dcaro> I'm asking in -sre to see if they are aware of the issue
2024-01-03 16:03:15 <chicocvenancio> I'm wondering if this isn't a bug somewhere in certificate validation as well. Should the expired root cert be enough to fail things even in the presence of a valid chain?
2024-01-03 17:06:15 <wm-bot> !log tools.eranbot <jjmc89> enable cron jobs for frwiki, enwiki, eswiki, arwiki, and simplewiki T354145
2024-01-03 17:06:19 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.eranbot/SAL
2024-01-03 17:12:22 <dcaro> chicocvenancio: feels weird yes
2024-01-03 17:20:28 <dhinus> I think it's just how the protocol works, but I'm not an expert. I found some related discussions here https://security.stackexchange.com/a/191341/89594
2024-01-03 17:21:45 <dcaro> yep, I read that too, but it's not clear to me if the client should only take into account the chain passed by the server (that should have worked) or should prioritize the local filesystem one instead
2024-01-03 17:21:50 <dcaro> (that might be the case)
2024-01-03 17:22:58 <dhinus> I guess we might fall into this case "The client is allowed to discover its own chain to a trusted root"
2024-01-03 17:23:14 <dhinus> but I would need to read the SSL spec to be sure :)
2024-01-03 17:32:12 <dcaro> created T354295, feel free to add stuff there if you find anything
2024-01-03 17:32:12 <stashbot> T354295: [puppet] Remove expired and unused certs from modules/profile/files/ssl/ and modules/base/files/ca - https://phabricator.wikimedia.org/T354295
2024-01-03 21:17:30 <andrewbogott> !log tools deleting many stray core dumps throughout nfs storage
2024-01-03 21:17:33 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
2024-01-03 21:21:37 <andrewbogott> !log truncating 200 logfiles to 5M on tools nfs
2024-01-03 21:21:37 <stashbot> andrewbogott: Unknown project "truncating"
2024-01-03 21:22:07 <andrewbogott> !log tools truncating 200 logfiles to 5M on tools nfs
2024-01-03 21:22:09 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL

This page is generated from SQL logs, you can also download static txt files from here