[08:03:32] morning!
[08:04:51] greetings
[10:15:19] got a kinda-working first MR of the logs service :), reusing (copy-pasting basically) taavi's work on jobs-api https://gitlab.wikimedia.org/repos/cloud/toolforge/logs-api/-/merge_requests/1
[10:16:00] (tried only running on my laptop against lima-kilo, not really deploying it yet)
[10:16:55] very cool
[10:17:05] dcaro: you probably want to wait until the implementation in jobs-api is stable first, the current implementation in the main branch is not very asyncio friendly at the moment for example
[10:17:58] taavi: that's ok, I'll copy it over, the skeleton and such is the main boilerplate I want out of the door
[10:18:26] the implementation details will be easy to change, and it already uses asyncio/fastapi, so extra win
[10:21:30] dcaro: btw looks like I found the issue with components-api getting the authentication issue, if you don't add a trailing slash to the url fastapi will respond with a redirect that's an absolute https://jobs-api.jobs-api.svc... URL which then bypasses the x-toolforge-tool code in api-gateway (and the client cert components-api has lets it talk to
[10:21:30] jobs-api directly, which is not ideal but explains the 'missing header' error)
[10:22:23] ooohhhh, good catch
[10:23:00] it's easy to change that redirect to a 404
[10:23:26] yep, I would also expect to be able to configure the 'external url' relatively easily on the fastapi side
[10:23:32] handling the routes without a slash is going to require duplicating those route annotations unless I've missed something
[10:29:13] is there some specific reason we do want to handle both cases?
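
The redirect problem described at 10:21:30 can be illustrated with the standard library alone. This is only a sketch of how an HTTP client resolves a `Location` header, not the actual api-gateway or jobs-api code; both hostnames and the paths are hypothetical placeholders, since the log truncates the real URL.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical URL of a request sent through the gateway (not taken from the log).
GATEWAY_REQUEST = "https://api-gateway.example/jobs/v1/tool/mytool/jobs"


def follow_redirect(request_url: str, location: str) -> str:
    """Resolve a Location header against the request URL, as HTTP clients do."""
    return urljoin(request_url, location)


def stays_on_gateway(request_url: str, location: str) -> bool:
    """True if following the redirect keeps the client on the original host."""
    target = follow_redirect(request_url, location)
    return urlsplit(target).netloc == urlsplit(request_url).netloc


# A relative redirect keeps the gateway host, so gateway-side headers still apply.
relative = "/jobs/v1/tool/mytool/jobs/"
# An absolute redirect to an internal (hypothetical) host sends the client around the gateway.
absolute = "https://jobs-api.internal.example/v1/tool/mytool/jobs/"

print(follow_redirect(GATEWAY_REQUEST, relative), stays_on_gateway(GATEWAY_REQUEST, relative))
print(follow_redirect(GATEWAY_REQUEST, absolute), stays_on_gateway(GATEWAY_REQUEST, absolute))
```

A client that follows the absolute redirect talks to the backend directly, so anything the gateway would have added to the request (such as the tool header mentioned above) is never set.
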
[10:29:16] there's an `include_in_schema` option to the decorator to avoid them showing up in the openapi
[10:30:25] I think both cases are ok, it comes from not having a strict standard on handling trailing slashes and building the different apis differently
[10:30:55] * dcaro lunch
[10:31:00] cya in a bit
[10:42:00] fixed the MRs, now `toolforge components config generate` works again
[11:28:59] Yay! I'll test in a bit
[12:02:24] taavi: found one issue with the metrics endpoint, testing the rest
[13:43:34] andrewbogott: opentofu-infra-diff is alerting again with a diff in project "magnum"
[13:44:22] dhinus: I'll look. I created that project with the cookbook and merged the PR so in theory everything should be happy...
[13:44:32] could be just a git rebase thing I guess
[13:45:22] automating the update of a git repo: the third hardest problem in computer science
[13:46:39] hmm did you run the tofu cookbook after merging the PR?
[13:47:21] the project does exist, but tofu doesn't know about it
[13:48:15] The project creation cookbook has a stage where it pauses and asks you to merge the tofu PR. Then you confirm, and it does something... is that not it applying tofu? Isn't that where the project comes from in the first place?
[13:48:47] let me check, I'm not sure if the cookbook runs the "tofu apply" automatically
[13:49:15] anyway I'm running it now
[13:49:42] aaaand it fails because it's trying to create a project that already exists
[13:50:09] so I guess 'wmcs.vps.create_project' just doesn't work
[13:50:24] it did work for me once, but it probably has many edge cases where it doesn't :)
[13:50:43] did you run it from cloudcumin? if yes I can check the logs
[13:50:49] so probably I need to add a reference ID to tofu...
[13:51:21] andrewbogott: you can "tofu import" but you'll need to do it for all resources in the project as well, which is annoying
[13:52:33] I don't think I've used 'tofu import' before, is that wrapped in a cookbook or something I run on a cloudcontrol?
[13:54:07] on a cloudcontrol, it's pretty straightforward for a single resource, but it will be a bit boring for 14 of them like in this case
[13:55:12] I should probably put this in a wiki but the gist is, you run "tofu plan" from a cloudcontrol, that will tell you "I want to create resource x, y, etc." Then you do "tofu import x {some_id}" to tell tofu, the thing you want to create exists with id some_id
[13:57:06] where "x" is actually a long string like 'module.project["magnum"].openstack_identity_project_v3.project[0]'
[13:57:16] that you can copy/paste from the "tofu plan" output
[13:57:40] caveat: it might take longer to import 14 resources than to destroy the project and let tofu recreate it :)
[13:58:46] the process you're describing... it doesn't affect the tofu code itself, just the cached local tofu state?
[13:58:55] or is there a step where I make a new patch?
[13:59:06] yes
[13:59:13] no change to code
[13:59:25] this just modifies the "tfstate" file stored in the s3 bucket
[13:59:53] ugh
[13:59:59] well, meeting time now i guess
[14:00:03] yep
[15:04:56] dcaro: the alert link unfortunately does not point to the page with the import instructions :/
[15:05:03] because it's a generic alert
[15:05:14] 🤦‍♂️ there was something somewhere :/
[15:05:32] probably this one: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/tofu-infra#Timeout_when_applying_creates_object_in_openstack_but_not_in_tofu-state,_failing_on_the_next_run
[15:05:47] andrewbogott: I think that's exactly the command you need ^
[15:06:07] great, I was just searching for that page...
[15:06:47] yep, that was it :)
[15:07:10] It says
[15:07:12] │ This module is not yet installed. Run "tofu init" to install all modules required by this
[15:07:12] │ configuration.
[15:07:16] can I do that safely
[15:07:16] ?
[15:07:27] This is on cloudcontrol1011, maybe no one but me uses that one
[15:07:28] maybe you're on a different cloudcontrol than the one where the cookbook runs
[15:07:51] I don't remember the deets but I think it should be safe, or at least worth trying so we know
[15:08:05] * andrewbogott runs 'tofu init' and waits a while
[15:08:22] import seems to have worked
[15:08:28] now 'tofu apply'? Or should I use the cookbook for that?
[15:08:42] I think it's the same, the cookbook doesn't do anything special
[15:08:55] all cloudcontrols have the config for s3 state?
[15:09:02] (for tofu state on s3 I mean)
[15:09:07] I think they do
[15:09:12] but I also wasn't sure
[15:09:17] 👍 good to know
[15:09:22] seems happy
[15:09:26] now we'll see if the alert clears
[15:09:28] thanks dhinus
[15:10:46] I'll force a run of the opentofu-infra-diff.timer, so we don't have to wait until tonight
[15:12:26] ok!
[15:13:06] (sudo systemctl start opentofu-infra-diff.service)
[15:13:10] that worked!
[15:13:16] the alert is gone
[15:18:48] continuing with my alerts cleanup... what's up with the toolsbeta puppetserver?
[15:19:13] I tried running clean-stale-puppet-certs (per the cookbook) and I got "certificate has expired"
[15:21:12] hmm... that might have been me deleting an instance some time ago
[15:21:27] (did nothing weird though, just deleted the instance that had been stopped for a very long time)
[15:21:30] but in that case the clean-stale script should fix it
[15:22:06] yep
[15:22:14] (I think)
[15:23:20] oo
[15:23:23] the clean-stale script doesn't work because the ca cert is expired
[15:23:27] and that's when I stopped caring :)
[15:23:28] https://www.irccloud.com/pastebin/CUTxggWU/
[15:23:31] yep
[15:24:02] was there an alert for the CA cert? Wasn't that fixed some time ago?
[15:24:15] dhinus: the codfw1dev puppet alerts should clear shortly, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1176262
[15:24:20] (might just need a restart to pick up the new)
[15:24:36] dcaro: the follow-from-loki MR should be ready for review now
[15:24:37] dcaro: I fixed the ca cert issue in project-proxy but haven't paid attention to toolsbeta
[15:24:46] okok
[15:24:54] taavi: nice! I'll test in a bit, running some other tests
[15:26:26] nah, restarting puppetserver did not help xd
[15:26:37] I'll try a reboot of the toolsbeta puppetserver
[15:26:40] oops
[15:26:42] :P
[15:26:59] I still see 187 days of uptime though
[15:27:08] toolsbeta-puppetserver-1.toolsbeta.eqiad1.wikimedia.cloud
[15:27:22] sure, go ahead, it should not hurt
[15:27:29] (I restarted the service)
[15:27:35] ack
[15:27:44] the stuck cert is toolsbeta-harbor-2.toolsbeta.eqiad1.wikimedia.cloud
[15:28:04] * dhinus is literally wearing a tshirt saying "have you tried to turn it off & on again?" :P
[15:28:10] (but in italian)
[15:28:46] aannd... it did not fix it
[15:28:52] the alert cleared though
[15:29:03] interesting :D
[15:29:11] and it's back
[15:30:53] xd
[15:31:46] this wiki suggests we should build a new puppetserver https://wikitech.wikimedia.org/wiki/Help:Project_puppetserver#Renewing_puppetserver_CA_certificate
[15:33:29] this works though `root@toolsbeta-puppetserver-1:~# openssl s_client -connect 172.16.5.99:8140`
[15:33:32] weird
[15:35:42] we give 100 years long certs xd ` v:NotBefore: May 15 07:54:38 2025 GMT; NotAfter: May 16 07:54:38 2125 GMT `
[15:36:04] the CN is the old one though, so maybe it's that?
[15:36:06] ` i:CN = Puppet CA: toolsbeta-puppetmaster-04.toolsbeta.eqiad.wmflabs `
[15:36:20] (the issuer CN)
[15:37:33] I'm ok rebuilding if you want to go for it, I have not done it in a long time though, so it might even be good to try the process
[15:38:02] yeah I was thinking the same
[15:38:05] kubernetes etcd relies on puppet-issued certs, so that is not the simplest operation
[15:38:39] * taavi thought he already renewed that CA
[15:38:54] that might be tricky yep
[15:40:04] the date of this paste P76199 matches the creation time I get from openssl
[15:40:12] but somehow that new cert is not used by puppetserver
[15:42:33] found it
[15:42:42] this one is expired
[15:42:46] `/srv/puppet/server/ssl/certs/ca.pem`
[15:42:54] and for some reason, puppetserver ca pulls it
[15:43:19] https://www.irccloud.com/pastebin/apa5MMLf/
[15:44:15] I think we might be missing a copy/soft link or something
[15:45:23] yep, just created a soft link and now it's working
[15:45:27] nice!
[15:45:42] what is the path of the new one?
[15:45:46] https://www.irccloud.com/pastebin/rpsOYpBH/
[15:46:06] gotcha
[15:46:49] it's not in that wiki page, so maybe just removing it will fall back to the other path?
[15:47:19] nope, removing it just fails
[15:47:31] so I'll add to the wiki?
[15:47:43] yes please
[15:49:22] the alert is gone :)
[15:50:17] okok, added to the wiki
[15:50:59] (fyi. I found the cert by using strace to check the files the puppetserver ca opened)
[15:54:05] we are so close to alert-board zero!
[15:54:55] dcaro: thanks!
[15:55:23] andrewbogott: yes, that's nice :)
[16:00:25] 🎉 0 alerts 🎉
[16:02:16] are we sure the alerting system is not broken? /s
[16:03:12] hahahaha, fair enough concern :)
[17:17:00] I have the MR uploaded for the webservice cli deployment -- https://gitlab.wikimedia.org/repos/cloud/toolforge/webservice-cli/-/merge_requests/83
[17:17:44] I think https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Packaging#Deploy_the_package is basically next, but it looks like maybe dcaro does something slightly different when I look at https://gitlab.wikimedia.org/repos/cloud/toolforge/webservice-cli/-/merge_requests/82
[17:19:05] bd808: I follow that yes, the cookbook will also run the tests and report in the MR, then you can merge it and finally push the tag
[17:20:01] that mr specifically piled up a bunch of changes though :/, not ideal, but well, happens
[17:20:01] cool. So is `--git-branch bump_webservice-cli` how it finds the MR?
[17:20:17] no need :), it already tries to find that branch
[17:20:28] (advantage of predictable naming)
[17:20:46] would not be a problem if you pass it too though
[17:21:37] it might not deploy it to the old sgebastion though, I held it back manually there as it uses some code that's not working on python 3.7
[17:22:06] "Found already running tests (lockfile /data/project/test/functional_tests.lock, pid 2015160), can't run in parallel, aborting"
[17:22:55] https://phabricator.wikimedia.org/P80936
[17:23:02] oh, someone is deploying stuff already
[17:23:11] * taavi is deploying jobs-api
[17:23:52] * bd808 waits his turn
[17:24:08] bd808: have you had a chance to get any more info on T360488 ?
[17:24:08] T360488: Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488
[17:24:47] dcaro: I haven't looked at that at all since that last comment
[17:25:43] ack, I'll try to follow up at some point, I might even reconsider installing the packages in the bastions (though I would strongly prefer not to)
[17:26:00] (to be discussed with the rest of the toolforge admins of course)
[17:26:57] putting the perl runtime on the bastions would be a slippery slope change for sure.
[17:27:19] agree
[17:28:42] bd808: my deployment is now running in tools, the lock in toolsbeta should be clear now
[17:31:40] thanks taavi
[17:35:22] heh. dcaro I think your account is hard coded into the cookbook: "@@@@@@@@ Configuring toolforge-deploy for dcaro" in the output when running as me under sudo.
[17:35:51] or is that your account being used to run tests?
[17:38:18] taavi: clear for me to deploy to tools now?
[17:38:26] yes, was just about to ping you
[17:38:35] :) excellent
[17:40:28] bd808: yep, it's used to run the tests for now, pending moving to a service account of sorts (needs full k8s access)
[17:43:52] * dcaro off
[17:44:01] deploying the new webservice-cli package to the tools bastions is failing, but not really telling me why. https://phabricator.wikimedia.org/P80937 -- dcaro does this seem like the held back package you mentioned?
[17:44:04] bd808: happy deploying!
[17:44:15] just in time :), looking
[17:44:41] bd808: yep, that's likely
[17:44:59] I think that holding the package might make apt install --upgrade fail
[17:45:16] hmmm
[17:45:23] It looks like 0.103.17 made it to dev.toolforge. Checking other bastion
[17:46:05] yeah. updated on both. I guess that's cool then.
[17:46:14] yep, the only thing missing is running the tests
[17:46:32] though I think it's safe to merge + push tag, given the small changes
[17:46:42] if you want to run the tests to make sure, there's a cookbook for it
[17:46:52] (that just runs the tests, it does not comment on the MR though)
[17:49:52] this should help next time https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1176291
[17:59:49] bd808: anything else before I go?
[17:59:58] (were you able to test/deploy ok?)
[18:02:44] gtg. ping me if you still have issues, 🤞
[18:02:47] * dcaro off
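
The `tofu import` workflow described between 13:51 and 13:57 can be sketched as a small helper that turns the resource addresses copied from the `tofu plan` output into ready-to-paste command lines. This is a minimal sketch, not part of any existing cookbook: the function name and the placeholder ID are hypothetical, and the only command form assumed is the `tofu import ADDRESS ID` shape mentioned in the log. Quoting matters because the addresses contain double quotes and brackets that a shell would otherwise interpret.

```python
import shlex


def tofu_import_commands(resources: dict[str, str]) -> list[str]:
    """Build one shell-safe `tofu import ADDRESS ID` line per resource.

    `resources` maps a resource address (as printed by `tofu plan`)
    to the ID of the object that already exists.
    """
    return [
        " ".join(["tofu", "import", shlex.quote(address), shlex.quote(resource_id)])
        for address, resource_id in resources.items()
    ]


# The address is the example quoted in the log; the ID is a made-up placeholder.
pending = {
    'module.project["magnum"].openstack_identity_project_v3.project[0]': "PLACEHOLDER_PROJECT_ID",
}

for command in tofu_import_commands(pending):
    print(command)
```

The printed lines would still have to be run from the tofu working directory on a cloudcontrol, as described in the log; the helper only prepares the commands and does not touch the state.
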