[08:50:22] hi, I have erased the Puppet SSL on integration-puppetserver-01.integration.eqiad1.wikimedia.cloud and I think that needs the new certificate to be accepted on the global WMCS puppetserver, if anyone can do that for me please ;)
[08:51:18] let me check
[08:54:47] I also managed to revoke the puppetserver certificate :/
[08:57:30] I just revoked the cert on the cloudinfra puppetserver
[08:58:00] I think it is because I blindly ran `rm -fR /var/lib/puppet/ssl`
[08:58:15] when that instance is attached to the global puppet server
[08:58:38] yep, might be, certs in puppet are tricky
[08:58:39] looks like the agent is happy now :)
[08:58:50] first puppet run worked ok, let's do a second to make sure
[08:59:14] it removed the comment in the puppet.conf file btw
[08:59:17] https://www.irccloud.com/pastebin/mrFnR5vM/
[08:59:31] (just in case you want to make it its own master again)
[08:59:33] ah yeah, that was me trying to debug
[08:59:35] second run worked too :)
[08:59:44] the next unrelated screw-up is that I revoked the cert, I think I did `puppetserver ca revoke`
[08:59:56] which one did you revoke?
[09:00:09] integration-puppetserver-01.integration.eqiad1.wikimedia.cloud
[09:00:10] :/
[09:00:13] from itself
[09:00:14] ?
[09:00:14] don't ask me why, I have no clue ;)
[09:00:16] yeah
[09:00:17] xd
[09:00:24] so now when I do `puppetserver ca list`
[09:00:26] we get:
[09:00:34] Fatal error when running action 'list'
[09:00:34] Error: Failed connecting to https://integration-puppetserver-01.integration.eqiad1.wikimedia.cloud:8140/puppet-ca/v1/certificate_statuses/any_key?state=requested
[09:00:34] Root cause: SSL_connect returned=1 errno=0 peeraddr=172.16.7.28:8140 state=error: certificate verify failed (certificate revoked)
[09:00:35] that might have made it fail when you were trying to use it as its own server
[09:01:12] oooohh, let me give that a look, it's kind of a catch-22 issue
[09:01:15] I think it is fine having the agent attached to the global WMCS server, that is how Andrew created it some weeks ago
[09:01:40] I think I had an issue running the puppet agent, and I thus went to blindly delete /var/lib/puppet/ssl as I usually do with other instances
[09:02:12] I broke it :/
[09:10:31] I'm regenerating the puppetserver cert, saved a copy of the old one just in case
[09:10:44] might require recreating the certs on all the clients :/
[09:11:00] https://www.irccloud.com/pastebin/BfcJTsPV/
[09:11:12] managing the server works now, can you check on one of the clients?
[09:11:49] it seems to be working for integration-agent-puppet-docker-1003.integration.eqiad1.wikimedia.cloud
[09:12:12] oh you are a magician
[09:13:14] the magic command was `puppetserver ca generate --ca-client --certname integration-puppetserver-01.integration.eqiad1.wikimedia.cloud` (after stopping the puppetserver service; it will ask you to remove some certs on the first run)
[09:13:17] and indeed some other agent works just fine
[09:13:59] thank you!
[09:14:18] yw!
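The recovery dcaro describes (stop the service, regenerate the server's own CA-signed cert, restart, verify) could be sketched roughly as below. Only the `puppetserver ca generate --ca-client --certname …` invocation is quoted from the log; the stop/start ordering, the availability guard, and the final `ca list` verification are assumptions, not a tested runbook:

```shell
# Rough sketch of the puppetserver cert recovery described above; run as
# root on the affected puppetserver. The guard lets the sketch degrade
# gracefully on machines without puppetserver installed.
FQDN=integration-puppetserver-01.integration.eqiad1.wikimedia.cloud

if command -v puppetserver >/dev/null 2>&1; then
    # (dcaro also saved a copy of the old cert first; the exact path
    # depends on your packaging, so it is not shown here.)
    systemctl stop puppetserver
    # Regenerate the server's own certificate, signed by the local CA.
    # On the first run it will ask you to remove the old (revoked) certs.
    puppetserver ca generate --ca-client --certname "$FQDN"
    systemctl start puppetserver
    # Verify the CA API answers again (this was the failing call earlier).
    puppetserver ca list --all
    status=ran
else
    echo "puppetserver not installed here; sequence shown for reference only"
    status=skipped
fi
```

After this, clients whose certs were not revoked should keep working; ones that were revoked need their agent-side SSL state cleared and a new CSR signed.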
:)
[09:14:21] somehow the puppet agent stopped working last week, which triggered some email notifications, and I wanted to fix the puppet run
[09:14:25] but I did it the wrong way
[09:14:38] next time I will ask for help immediately instead of making the situation worse!
[09:17:22] np, it might have saved you some hassle, but it's good to tinker around too (if that's what you want, of course)
[09:17:56] at least I have learned a bit more about `puppetserver` :]
[09:34:38] blancadesal: hmm, the versions script is telling me that https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/432 is not deployed in toolsbeta?
[09:34:49] oh wait
[09:34:51] it is
[09:34:52] builds-api (chart builds-api): builds-api-0.0.165-20240718140844-131a3480 (toolforge-deploy has builds-api-0.0.164-20240716153428-d1c47de5)
[09:35:00] it's telling me my toolforge-deploy repo is out of date xd
[09:35:25] :))
[09:35:53] "please solve your problems, then talk to me" xd
[09:40:57] xd
[09:42:05] quick review to avoid that from happening again: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/436
[09:42:35] lgtm
[09:44:09] dcaro: I'm adding components-api to toolforge-deploy. what should I put as the initial `chartVersion`?
[09:45:47] it can be 0.0.1, we don't really use semver (yet; I'm happy to, but we either do and follow it, or do not, anything in between loses all the advantages)
[10:42:34] * dcaro lunch
[10:59:26] My cats brought us a live snake as a gift into the house today. đŸ«„ How is your day going?
[11:06:44] they must have been so proud though xd 🐍
[11:11:07] definitely
[11:42:57] dhinus: gave another spin to https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1055420 including incorporating feedback from some taavi :-P
[11:52:58] arturo: just seen the comment from taavi :) I'll look at the patch after lunch!
[11:56:42] I'll be a bit late to collab, will join around 14:15 or so
[12:54:55] hi!
I have a Puppet change (scheduled for tomorrow's puppet window, remains to be seen if it'll be merged ^^) which potentially affects some Wikimedia Cloud Services infra
[12:55:13] https://phabricator.wikimedia.org/T370171#10002142 lists three systemd services running on cloudcontrol* nodes which have a timeout configured, but at the moment the timeout isn't actually enforced
[12:55:52] and I don't have enough access (AFAICT) to check whether those services currently finish below their configured timeouts, or whether they're accidentally exceeding them and would start to break once the Puppet fix is merged and the timeout is enforced
[12:56:09] maybe someone can SSH in and check what the current runtime of these services is?
[12:56:43] (disclaimer: it's also possible that I misread the Puppet config and the services are running on another host than I imagined ^^)
[12:56:47] Lucas_WMDE: all of them are low-stakes systemd services, nothing will break if you introduce the change
[12:57:00] but I can definitely check the runtime
[12:57:06] okay, thanks!
[13:01:35] Lucas_WMDE: updated the ticket
[13:35:33] arturo: +1d the patch, thank for the nice work!
[13:35:38] *thanks
[13:36:48] komla: have people already nudged you about the typo in the announce email? (I missed it too when proofreading). It says to go to idp.wikipedia.org (which doesn't exist) instead of idp.wikimedia.org
[13:49:41] cteam, I'm putting in a plug for deployment-prep upgrades during today's SRE meeting, does anyone have anything else I should mention there?
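Lucas's question boils down to comparing each service's last runtime against its configured timeout, which systemd exposes as unit properties. A minimal sketch of that comparison is below; the numeric values are made-up placeholders (the real ones would come from `systemctl show` on a cloudcontrol host, for the unit names listed in the ticket, which are not repeated in the log):

```shell
# Sketch: did a oneshot service's last run exceed its TimeoutStartSec?
# On a real host you would read the values with something like:
#   systemctl show -p ExecMainStartTimestampMonotonic,ExecMainExitTimestampMonotonic,TimeoutStartUSec <unit>
# All three values below are hypothetical, in microseconds.
start_us=1000000      # ExecMainStartTimestampMonotonic (placeholder)
exit_us=95000000      # ExecMainExitTimestampMonotonic (placeholder)
timeout_us=90000000   # TimeoutStartUSec, i.e. a 90s configured timeout

runtime_us=$((exit_us - start_us))
if [ "$runtime_us" -gt "$timeout_us" ]; then
    verdict="would break: runtime ${runtime_us}us exceeds timeout ${timeout_us}us"
else
    verdict="ok: runtime ${runtime_us}us is within timeout ${timeout_us}us"
fi
echo "$verdict"
```

With these placeholder numbers the run took 94s against a 90s timeout, so once the fix lands and the timeout is actually enforced, such a service would start getting killed mid-run.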
[14:02:53] andrewbogott: good luck :P
[14:03:18] I'm assuming you don't want to do /all/ of them :D
[14:03:23] there are a few ready patches btw
[14:03:36] which are blocking me from decomming some of the Buster instances
[14:03:47] great, I'll catch up shortly
[14:05:28] not going to do all of them, I don't want to become the new maintainer of deployment-prep
[14:05:44] although it's already too late, I guess
[14:06:44] at least I found out a few instances already have replacements available; pinged each instance creator to get an update on the status
[14:07:48] yeah, that part was a bit weird, doing 80% of the work and then stopping
[14:10:42] arturo: great, thanks a lot!
[14:15:02] Southparkfan: I appreciate you moving a bunch of these things to service addresses.
[14:15:23] yep, should make a future move much easier
[14:17:57] A couple of these config changes are complaining about merge conflicts
[14:19:29] looking
[14:20:47] I was able to rebase https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1055531; if I understand correctly this should be merged automatically, because the +2 has been copied over?
[14:21:06] I can't tell, I guess we'll see in a minute
[14:29:38] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1055539 has been fixed
[14:42:04] deployment-prep puppet was broken yet again, shrug
[14:45:47] andrewbogott: yes, a few have. I have since corrected it. Fortunately, this went out to a much smaller trial batch
[14:47:02] oh good :)
[14:47:49] komla: you are now officially smarter than CrowdStrike
[14:48:33] :-D
[14:49:00] andrewbogott: at least one person mentioned that, lol
[14:49:16] komla: I also went to the linked wikitech page and found very little info. I would suggest increasing the information on that page
[14:49:56] another question I had: shall I link my personal wiki account, or my WMF staff account?
[14:51:32] arturo: sure, thanks. it is still in draft.
hoping to populate it with more FAQs after this trial run.
[14:53:15] great! thanks
[16:48:22] I found a taker for at least one more deployment-prep VM
[16:48:40] hmm... andrewbogott: trying to create a new VM on the tools project ends up failing with a self-signed cert
[16:48:41] Warning: SSL_connect returned=1 errno=0 peeraddr=172.16.3.13:8140 state=error: certificate verify failed (self-signed certificate in certificate chain): [self-signed certificate in certificate chain for /CN=Puppet CA: tools-puppetmaster-01.tools.eqiad.wmflabs]
[16:49:05] weird
[16:49:07] what's the fqdn?
[16:49:26] tools-services-06.tools.eqiad1.wikimedia.cloud
[16:50:53] dcaro: is that not just the normal issue with flipping over from the central to the local puppetserver?
[16:51:11] I thought that was not needed anymore
[16:51:30] it still says tools-puppetmaster-01, instead of tools-puppetserver, is that correct?
[16:51:36] As far as I know it still is, although I think there are cookbooks to create nodes that work around it
[16:51:43] oh, it isn't -- that's a hiera thing, let's see...
[16:52:50] Actually, I think puppetmaster in the cert is OK because that was migrated over from the old one.
[16:53:31] ack
[16:53:39] yeah, I cleared the ssl things and signed the new request and things are working
[16:53:49] So I think this is working as expected, even though it's bad
[16:53:53] I'll run the refresh certs
[16:56:36] that worked :) `wmcs-cookbooks wmcs.vps.refresh_puppet_certs --fqdn tools-services-06.tools.eqiad1.wikimedia.cloud`
[17:27:42] hmm... something deleted all the tools apt packages
[17:27:50] (got a backup)
[17:35:22] dcaro: ok if I reboot tools-k8s-worker-nfs-3?
[17:35:37] alertmanager says 'has many processes stuck on IO'
[17:35:46] should be ok yes, you might want to use the cookbook though (to move as many pods gracefully as it can)
[17:36:19] `wmcs.toolforge.k8s.reboot` <- this one
[17:36:30] (should be somewhere in the runbook)
[17:37:44] ok, will do
[17:45:53] okok, stopped the toolforge-serivces0-05 buster VM, replaced by the bookworm -06 one, will delete in a couple of days if nothing comes up
[17:46:37] * dcaro off
[17:46:40] cya tomorrow!
[18:08:21] komla: I'm going to send a quick note to cloud-announce about the wikitech migration, encouraging people to not regard the email as spam (since any personal email saying "log in here and give it your password" seems suspicious). Hopefully emails from two different people will make it seem more credible :)
[22:19:26] komla: I WP:BOLD'ly started writing FAQs for the SUL migration at . It would be great if you could double-check that I am not lying to people.
[22:20:11] andrewbogott: I have seen it. Thanks!
[22:22:01] bd808: Simon and the team are targeting September 16th. I will loop him in to also cross-check the FAQ. Thank you!
[22:23:58] they are going to have a busy couple of months to get all the blockers done by then
[22:25:13] getting 2FA for developer accounts into bitu/idp and moving the Developer account locking system are both non-trivial projects
[22:27:30] I understand it is to align with the service-ops OKR of moving Wikitech to Kubernetes
[22:28:01] It initially had a much later deadline, but it was brought forward because of the above
[22:29:39] that's nice, but it doesn't remove the dependencies. Unless, I guess, SRE management gets to decide that Developer accounts no longer get 2FA or the ability to globally lock an account.
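For reference, the two cookbook runs that came up during the day can be collected into one snippet. Only the `wmcs.vps.refresh_puppet_certs` command line is quoted verbatim from the log; the cookbook name `wmcs.toolforge.k8s.reboot` appears without its flags, and the availability guard is an assumption for machines without `wmcs-cookbooks` installed:

```shell
# The two wmcs-cookbooks invocations mentioned above, collected for
# reference. Everything other than the quoted command lines is an
# assumption, not a verified procedure.
if command -v wmcs-cookbooks >/dev/null 2>&1; then
    # Regenerate a VM's puppet certs after its SSL state was wiped or
    # its cert was left self-signed/revoked (quoted from the log):
    wmcs-cookbooks wmcs.vps.refresh_puppet_certs \
        --fqdn tools-services-06.tools.eqiad1.wikimedia.cloud
    # Gracefully drain and reboot a Toolforge k8s worker; exact flags are
    # not shown in the log, check the runbook before running:
    # wmcs-cookbooks wmcs.toolforge.k8s.reboot ...
    status=ran
else
    echo "wmcs-cookbooks not available here; shown for reference only"
    status=skipped
fi
```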