[09:01:29] Good morning to everyone.
[09:03:09] atrawog: o/
[09:08:34] I'm pretty much ready to run a PAWS deployment from pawsdev-bastion. But I'm going to need some help with git-crypt and unlocking the secrets.yaml.
[09:13:25] atrawog: I think dhinus may have that knowledge more fresh. I think he will be around soon, and he may be able to assist with that particular bit
[09:14:12] And it looks like I have to install tofu, kubectl and helm on the bastion system too.
[09:15:35] Great, I have some time at hand today and can wait until dhinus is around.
[09:48:45] hey I'm available now
[09:49:22] so the git-crypt key is stored in the bastion under /home/rook I think, I have root access so I can make a copy of it to another home folder
[09:52:56] dhinus: we may want to save the key elsewhere? maybe in pws
[09:53:09] yes I think that's a good idea
[09:53:24] I can't ssh to the bastion for some reason, trying to fix it
[09:57:14] Thanks! Andrew made a new pawsdev-bastion deployment on codfw1dev. So it's possible that you need to reconfigure your keys.
[09:57:37] yes, that was the problem, I was using an old hostname that was slightly different :)
[09:57:56] I'll grab the keys from the production paws bastion
[09:58:49] the repo is using a single symmetric key, not user keys
[09:59:21] so that can be stored in pwstore as arturo was suggesting, but I guess atrawog doesn't have access to pwstore
[09:59:37] correct
[09:59:39] I will put a copy in /home/atrawog in pawsdev-bastion
[10:05:46] Thanks a lot.
[10:07:25] ok I found the key and it works, atrawog you have a copy in your home dir in pawsdev-bastion
[10:07:44] you can use it with "git-crypt unlock /path/to/key"
[10:10:02] arturo: I'm not sure it can be stored in pwstore as it is, because it's a binary format
[10:10:33] mmm
[10:10:53] I'm checking if I can convert it to an ASCII format and if git-crypt can still use it
[10:11:00] ok
[10:12:35] There is no git-crypt on the system. Shall I install the missing packages myself? It looks like I have sudo permission on pawsdev-bastion.
[10:13:02] yes you should be able to install it
[10:13:56] arturo: git-crypt only accepts binary keys, but I found this workaround: https://github.com/AGWA/git-crypt/issues/289
[10:14:25] dhinus: base64 sounds fine
[10:15:01] I can add a note in the file in pwstore with the command to decode
[10:15:17] yup
[10:15:20] sounds good
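The base64 round-trip they settle on, as a minimal sketch; the key file paths are placeholders, and the decode command is the kind of note dhinus mentions adding to the pwstore entry:

    # encode the binary git-crypt key to text so it can live in pwstore
    base64 /path/to/git-crypt.key > git-crypt.key.b64
    # later, decode it back to binary and unlock the repo with it
    base64 -d git-crypt.key.b64 > /path/to/git-crypt.key
    git-crypt unlock /path/to/git-crypt.key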
[10:26:25] PAWS deployment on pawsdev-bastion is running and I'm about to say yes to the tofu deployment.
[10:27:03] That didn't go too well:
[10:28:10] Error: Error updating openstack_containerinfra_clustertemplate_v1 69b972c7-88f1-475d-912c-ef28658fc760: Bad request with: [PATCH https://openstack.codfw1dev.wikimediacloud.org:29511/v1/clustertemplates/69b972c7-88f1-475d-912c-ef28658fc760], error message: {"errors": [{"request_id": "", "code": "client", "status": 400, "title": "Unable to find fixed network lan-flat-cloudinstances2b"
[10:28:44] I think the network names changed recently
[10:28:53] so maybe there's an old one hardcoded in tofu?
[10:29:03] yes, I think I sent a patch that was never merged
[10:29:26] atrawog: https://github.com/toolforge/paws/pull/485
[10:29:29] That would explain a lot :)
[10:30:53] How can I get in touch with supertassu, who's responsible for the github repo?
[10:31:13] that would be taavi :-)
[10:31:39] tangentially: I believe the repo also needs moving to our own gitlab
[10:32:01] "responsible" is a strong word there
[10:33:46] * dhinus in the meantime is still fighting with pwstore
[10:34:14] Moving to gitlab would make sense, and we probably should take a look at how the Jupyter images for PAWS get built too.
[10:46:27] arturo: atrawog: anyhow, did you need me to push some buttons in github?
[10:53:20] arturo: I managed to store the git-crypt keys for paws and quarry in pwstore. can you double check that you can use them?
[10:53:32] let me check
[10:54:27] dhinus: I can see them. Thanks for the additional instructions
[10:55:29] I guess an additional pointer in wikitech and/or the paws/quarry repos themselves may also be good
[10:56:43] I think the paws/quarry repos have the most recent docs at the moment, so I would put it in the readme there
[10:58:58] @taavi it would be great if you could take a look at the PR from @arturo and merge it https://github.com/toolforge/paws/pull/485
[11:03:32] I've checked out the T389942 branch from arturo and so far the deployment is running fine.
[11:03:32] T389942: openstack: rename lan-flat-cloudinstances2b to VLAN/legacy - https://phabricator.wikimedia.org/T389942
[11:08:44] good!
[11:08:57] so I guess that means you have tested the patch, and it can be merged
[11:11:40] Looks like it. The cluster is still being created at the moment ("openstack_containerinfra_cluster_v1.k8s_127a: Still creating... [8m50s elapsed]"), but the original "Unable to find fixed network" issue is fixed.
[11:13:47] arturo: https://github.com/toolforge/quarry/pull/73
[11:14:02] for PAWS, I updated this wiki instead: https://wikitech.wikimedia.org/wiki/PAWS/Admin#Deployment
[11:14:51] But I likely have to fix the kubernetes version for the updated codfw1dev, because the PAWS deployment is still using "v1.27.8-rancher2".
[11:14:51] dhinus: LGTM both
[11:16:22] arturo: I need you to add a "review" in github before I can merge
[11:16:44] dhinus: done
[11:16:47] thanks!
[11:20:57] * dhinus lunch
[11:40:19] https://github.com/toolforge/paws/pull/485#pullrequestreview-2749669096
[11:49:11] thanks
[11:53:25] taavi: does this ring any bell? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning/-/jobs/479948#L61
[11:53:29] it's a new setup I'm trying to introduce
[12:00:39] arturo: not immediately :(
[12:00:46] ok :-(
[12:01:14] taavi: do you remember how the tofu-infra repo knows where the endpoints for openstack are?
[12:04:39] nevermind, it seems to be hitting the right endpoint
[12:05:07] do you remember how to use different gitlab runners?
[12:06:10] * arturo reading https://wikitech.wikimedia.org/wiki/GitLab/Gitlab_Runner
[12:06:23] apparently with clouds.yaml, if I'm reading https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/openstack/base/opentofu.pp#62 correctly
[12:06:35] arturo: oh good point
[12:07:29] i think you need to specify `tags: [wmcs]` to get scheduled to a runner in cloud vps, we block access to the codfw1dev apis from outside
[12:07:42] ok
[12:08:53] trying ....
[12:10:25] it works! 🎉
[12:11:01] thanks taavi !
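In .gitlab-ci.yml terms that hint translates to something roughly like the job below; the job name, image and script are placeholders, only the tags entry comes from the discussion:

    # hypothetical CI job; the important bit is tags: [wmcs], which schedules it
    # on a runner inside Cloud VPS, where the codfw1dev APIs are reachable
    tofu-plan:
      image: debian:bookworm
      tags:
        - wmcs
      script:
        - tofu init
        - tofu plan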
[12:11:16] chuckonwu: here is yet another example of a working opentofu setup. https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning
[12:12:00] I've also made a number of updates to our docs to clarify a few things
[12:12:01] here https://wikitech.wikimedia.org/wiki/Help:Object_storage_user_guide#S3_API
[12:12:42] and here https://wikitech.wikimedia.org/wiki/Help:Using_OpenTofu_on_Cloud_VPS
[12:35:05] topranks: I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134700, could you add the matching config on the switch side when you have a moment?
[12:35:48] re adding the IPs/routes, right now adding cloud-private IPs without adding the DNS entry is quite difficult due to how the puppetization works. i guess that can be changed
[12:36:05] although I wonder how much things would break if we just added all the IPs and DNS names at once
[12:44:16] arturo: https://wikitech.wikimedia.org/wiki/Help:Using_OpenTofu_on_Cloud_VPS#Setup_options -- how do any of those options need a wmcs admin?
[12:45:36] ec2 creds
[12:45:53] can they be generated in horizon?
[12:46:26] no, but you can self-service those using the openstack cli
[12:48:04] I see
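Generating those EC2-style credentials yourself would look roughly like this, a sketch only; the access/secret pair they give you is what the S3-compatible tooling is then configured with:

    # create and list EC2-style credentials for your project with the OpenStack CLI
    openstack ec2 credentials create
    openstack ec2 credentials list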
[12:55:52] arturo: when you have a minute, can you test this on your unix laptop? https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/227
[12:58:58] sure
[12:59:10] is the lima VM always emulating amd64 regardless of the base platform?
[12:59:45] no, the VM is using ARM on my mac, but the docker containers inside the VM are emulating amd64
[13:00:23] and some binaries as well, like kind that runs outside of Docker, but it's an amd64 binary
[13:00:40] but the main OS is using ARM
[13:01:16] I don't remember why, but I think emulating the full OS was not working correctly, so we found this compromise
[13:08:39] taavi: the config on cloudlb2002-dev isn't right :(
[13:08:50] it's trying to use the prod-realm IP to source the BGP session
[13:09:02] https://www.irccloud.com/pastebin/OPeL5zwp/
[13:09:32] that's not the case for the v4 equivalent though, not sure how the IP is added to the config for that
[13:09:39] https://www.irccloud.com/pastebin/F8Hg3nQB/
[13:11:55] taavi: I also notice it's adding the peer IP with the "%" format, which is sometimes used when peering over IPv6 link-local. But in our case it's not needed, as the neighbor is a global unicast IP, and the interface is wrong so it definitely won't work
[13:14:48] what we really need there is:
[13:14:53] local 2620:0:860:118:10:192:20:2 as 64605;
[13:14:53] neighbor 2a02:ec80:a100:205::3 external;
[13:15:09] sorry...... not that
[13:15:34] local 2a02:ec80:a100:205::3 as 64605;
[13:15:34] neighbor 2a02:ec80:a100:205::1 external;
[13:15:39] ^^ this
[13:17:49] hmm, let me see
[13:34:32] topranks: still looking at the best way to fix that explicit interface, but https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135023 fixes the source address at least
[13:34:38] also https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135018
[13:41:45] this block in the template for bird.conf seems to be the issue
[13:41:52] <% if ! @_multihop -%>
[13:41:52] neighbor <%= neighbor_v6 %>%<%= @facts['networking']['primary'] %> external;
[13:41:52] <% else -%>
[13:41:52] neighbor <%= neighbor_v6 %> external;
[13:41:52] <% end -%>
[13:42:52] our issue here is that this shouldn't be multihop, but at the same time it's also not on IPv6 link-local (which is what requires the interface)
[13:43:10] there is a potential fix I think though
[13:43:18] give me a few mins
[13:43:20] i guess we could just add a check for neighbor_v6 being a link-local address
[14:10:37] arturo: andrewbogott: I'm gonna shut down clouddumps1001 and see if anything breaks T383723
[14:10:38] T383723: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723
[14:10:56] dhinus: ack
[14:13:58] I checked a random tools nfs worker: /mnt/nfs/dumps-clouddumps1001 is hanging, but hopefully all tools are using /mnt/nfs/dumps-clouddumps1002
[14:14:51] I'm still a bit worried that we don't have a clean way to shut down those connections when the server is not available
[14:16:22] no other alerts so far, I'll keep it shut down for 1 hour, then try to bring it back up
[14:21:44] taavi: I think we need to check if the address is a link-local address, alright
[14:22:11] we can use the "interface" directive in the bird conf rather than putting it together with the IP in the 'neighbor' statement
[14:22:26] which I thought would make it simpler but doesn't really. I think something like this is needed:
[14:22:27] https://phabricator.wikimedia.org/P74719
[14:23:11] I can try and work on a patch, but we need to be cautious and work out what else it might change
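A sketch of what that link-local check could look like in the quoted ERB block, keeping the existing variable names; a real patch would probably want a proper fe80::/10 test (e.g. via Ruby's IPAddr) rather than this naive prefix match, and would need the caution about other users of the template noted above:

    <%# only append the interface when the IPv6 neighbor is actually link-local -%>
    <% if ! @_multihop && neighbor_v6.start_with?('fe80:') -%>
    neighbor <%= neighbor_v6 %>%<%= @facts['networking']['primary'] %> external;
    <% else -%>
    neighbor <%= neighbor_v6 %> external;
    <% end -%>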
[14:48:31] arturo: what project are you getting the quota error for?
[14:48:33] andrewbogott: the error can be seen here: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning/-/jobs/479978
[14:48:41] testlabs @ codfw1dev
[14:48:59] it is a 403 QuotaExceeded error
[14:53:07] here is the quota info for that user:
[14:53:17] root@cloudcontrol2006-dev:~# radosgw-admin user info --uid testlabs\$testlabs
[14:53:17] {
[14:53:17] "user_id": "testlabs$testlabs",
[14:53:17] "display_name": "testlabs",
[14:53:17] "email": "",
[14:53:33] :-( flood
[14:53:40] oops
[14:53:59] is there any chance it's happening via another user by mistake? This resembles an issue chuckonwu was having yesterday...
[14:54:22] andrewbogott: yes, high chances, because it is a similar setup, with similar pitfalls
[14:55:58] My issue ended up being that I had the general openstack url where I needed the s3-compatible object storage url
[14:56:14] oh great, so you're unstuck now?
[14:56:25] arturo: I think that account maybe really is over quota. Going to adjust the quotas...
[14:56:34] andrewbogott: ok
[14:56:38] Yep! I've even imported the first resource, a floating ip from toolsbeta, and it's all working
[14:56:58] andrewbogott: I do see a bunch of containers, so it sounds legit
[14:57:00] https://usercontent.irccloud-cdn.com/file/1Xajf26d/image.png
[14:57:15] chuckonwu: 🎉
[14:57:18] this is all documented at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Projects_lifecycle#swift_/_S3_/_radosgw_/_object_storage btw
[14:57:25] chuckonwu: that's great!
[14:58:04] andrewbogott: I think it was me who created the docs ... I should know better
[14:58:25] they're good docs!
[14:58:58] try now?
[14:59:09] trying
[14:59:35] it works!
[14:59:39] thanks! 🎉
[14:59:42] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning/-/jobs/480461
[14:59:57] hey, something is working as intended!
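The quota bump itself happens with radosgw-admin on the cloudcontrol host; roughly something like the following, with placeholder limits (the actual procedure and values are in the admin page linked above):

    # inspect the current usage and limits for the project user
    radosgw-admin user info --uid 'testlabs$testlabs'
    # raise the per-user quota (example values) and make sure it is enabled
    radosgw-admin quota set --quota-scope=user --uid='testlabs$testlabs' --max-size=10G --max-objects=-1
    radosgw-admin quota enable --quota-scope=user --uid='testlabs$testlabs'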
[15:06:19] clouddumps being down is causing puppet to fail across all of cloudvps, not great :/
[15:06:30] :-(
[15:06:38] I'll try taking it back online and see if things go back to normal
[15:07:20] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu
[15:07:23] this is also concerning
[15:15:07] hmm let me check in grafana if there was a sudden spike
[15:15:15] there was
[15:15:18] I'm investigating
[15:15:43] clouddumps is back online in the meantime, and puppet is working again
[15:17:13] I'll allocate an additional worker meanwhile
[15:17:24] I think it will go back to normal in a few mins
[15:17:48] I can definitely see a spike in the graphs, but interestingly not a sudden one: a steady growth starting from when the clouddumps host went offline
[15:18:06] https://usercontent.irccloud-cdn.com/file/euXsIkeD/Screenshot%202025-04-08%20at%2017.17.56.png
[15:18:19] yeah, but also there were more than 70 new cronjobs created in the last 12 hours
[15:18:26] so, some of this growth is legit
[15:18:43] lots of "pending" pods https://usercontent.irccloud-cdn.com/file/rcXjupw6/Screenshot%202025-04-08%20at%2017.18.30.png
[15:18:55] and the graphs go down exactly when I restarted clouddumps1001
[15:19:06] yes, it can be related
[15:19:17] I see the other previous spike was during the outage on 2025-04-03
[15:19:28] so when there are outages, there are more pods in flight because more fail
[15:23:24] I'm trying to understand why pods were failing in this instance
[15:23:42] in theory they should've worked fine, because the symlink was pointing to the active clouddumps host
[15:25:34] I don't have a clear theory at the moment
[15:25:43] I'll open a task
[15:35:10] the cookbook to create a new worker node fails because it uses an old image
[15:41:22] arturo: good catch, is the image hardcoded?
[15:41:34] clouddumps issue: T391369
[15:41:35] T391369: If the inactive clouddumps host goes down, it causes a ripple effect on Cloud VPS and Toolforge - https://phabricator.wikimedia.org/T391369
[15:42:06] I haven't found where the image could be hardcoded, but it seems like it
[15:42:44] in general, we could set 'debian-12-bookworm' and this will be resolved to whatever latest uuid is defined in glance
[15:43:28] andrewbogott: you may be interested in a bunch of VMs unable to schedule in codfw1dev
[15:43:33] (testlabs project)
[15:44:03] tofu here: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning/-/jobs/480525
[15:44:13] unable to schedule for +6m
[15:46:39] arturo: the cookbook defaults to image=self.image or other_prefix_members[-1]["Image"]
[15:47:01] yeah, that explains it
[15:47:09] the previous members are using a deprecated image
[15:47:23] so we need self.image (default from the parser) to be `debian-12-bookworm`
[15:47:34] hm, what cursed thing do all those VMs have in common? I was just now able to create a VM in that project with no issues
[15:47:49] andrewbogott: created by tofu?
[15:47:55] no, from the UI
[15:48:01] Just confirming that the scheduler isn't broken
[15:48:12] I mean, what the broken ones have in common is tofu
[15:48:54] yep, but it must be something more than that. A constraint that isn't actually supported by the existing cloudvirts, probably
[15:49:25] dhinus: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1135067
[15:49:52] flavor?
[15:50:12] andrewbogott: they all use a tiny flavor, `g4.cores1.ram2.disk4`, which is maybe too small?
[15:51:24] Could be, I'll try that one
[15:51:28] arturo: +1d
[15:51:32] btw in the logs I see 'Exception during message handling: nova.exception.InstanceExists: Instance networktests-vxlan-ipv4only-floating already exists.' 13 minutes ago, is that anything?
[15:51:44] andrewbogott: oh, right
[15:51:48] that's my fault
[15:52:30] andrewbogott: they existed prior to the tofu run, created by hand. When I ran tofu it detected them and I removed them by hand, but newer ones were already in flight
[15:52:48] the cli creates with flavor g4.cores1.ram2.disk4, no issues
[15:53:11] I'll confirm this theory now
[15:53:29] running https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning/-/jobs/480527
[15:54:26] andrewbogott: confirmed, works now. My bad, sorry for the noise
[15:54:34] np!
[15:56:01] you all may want to explore the work I did today with https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning as another setup example for tofu + openstack s3 bucket + gitlab CI/CD
[15:56:29] it will potentially support 2 envs (eqiad1, codfw1dev) later when I add support for eqiad1
[15:56:40] arturo: nice!
[15:56:40] so far only codfw1dev is supported, but you can see the shape of the repo already
[15:56:47] arturo: are your tests exercising the proxy api?
[15:57:00] andrewbogott: no
[15:57:02] I only ask because tf-infra-tests does and something is broken w/ AAAA records
[15:57:02] ok
[16:00:29] arturo: I think the test failure in wmcs-cookbooks is the one related to the new spicerack version
[16:00:40] maybe rebasing your patch is enough to fix it?
[16:01:49] ack
[16:02:00] the main branch seems up to date
[16:02:15] hmm
[16:02:53] ah yes, this was not merged: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1133908
[16:04:01] andrewbogott: can you merge this patch? ^
[16:04:12] yep
[16:08:44] it's merged
[16:16:56] thanks!
[16:25:06] * arturo offline
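For readers following that example: the state backend in such a setup is the project's S3-compatible object storage, authenticated with the EC2-style credentials mentioned earlier. A rough sketch of what the backend stanza could look like; the bucket name and the endpoint URL are placeholders/assumptions, so check the repo above and the object storage user guide for the real values:

    # sketch of an OpenTofu S3-compatible state backend against radosgw;
    # credentials come from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY (the ec2 creds)
    terraform {
      backend "s3" {
        bucket = "networktests-tofu-state"                     # placeholder bucket name
        key    = "codfw1dev/terraform.tfstate"
        region = "us-east-1"                                   # ignored by radosgw, required by the backend
        endpoints = {
          s3 = "https://object.codfw1dev.wikimediacloud.org"   # assumed endpoint, verify in the docs
        }
        use_path_style              = true
        skip_credentials_validation = true
        skip_region_validation      = true
        skip_requesting_account_id  = true
        skip_s3_checksum            = true
      }
    }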