[09:01:08] for some reason the networktest VMs disappeared from the testlabs project @ codfw1dev
[09:01:18] I'm glad it was coded in tofu, so I just recreated them easily
[09:35:55] topranks: hi! quick question, cloudsw1-c8 was previously connected to asw2-b*, but now those interfaces are down, does that mean that now all the traffic between cloudsw<->asw flows through the core routers?
[09:36:09] (/me trying to refresh my memory/catch up on latest changes)
[09:37:36] we have this graph for ceph `Throughput to other switches (asw2-b) aggregated 2x10Gbps` that might not be relevant anymore if that flow does not exist
[09:44:44] hmpf... maintain-kubeusers VCR recording tests are really flaky...
[09:45:01] ?
[09:45:04] what happened?
[09:45:22] nothing happened, they just start failing after some time by themselves
[09:48:12] I think I fixed something a couple of years ago to make them date/time agnostic, if that is what you mean
[09:48:36] but the fix may not have been enough!
[09:56:33] no idea, looking, they failed without any code changes during the last MR, and now that I merged it they failed again
[09:56:46] the script to generate the VCR thingies is also broken btw, fixing
[09:57:08] I think some limactl syntax might have changed
[09:59:15] oh! :-(
[09:59:17] sorry about the mess
[10:05:29] dcaro: on the switches we decom’d that link as it was no longer being used, no more cloud hosts in row b
[10:05:36] it's ok, not your fault, we should find a better way™ to make those more reliable
[10:05:45] updating the diagrams is on my list to do this week
[10:05:57] topranks: ack, so I can scrub that graph :)
[10:06:03] Yup
[10:08:27] GM
[12:05:46] arturo: not urgent, but I would like you to review https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/180
[12:05:55] there's also a question in a comment inline
[12:12:26] ok
[12:28:50] arturo: thanks, merging now, then I'll create an MR to remove the "import" blocks and one to remove the legacy records
[12:28:51] when merging that one, remember that it will fail in codfw1dev
[12:28:56] yep
[12:29:07] 🚢 🇮🇹
[12:31:03] maybe we could remove the error "You can only run 'apply' for all clusters, i.e: don't specify --cluster_name"
[12:31:29] I was tempted to do that a few times
[12:31:48] but then, having apply run across both clusters ensures a certain level of consistency that I like
[12:31:57] yeah makes sense
[12:32:04] i.e., it kind of forces you to leave the repo in a consistent state
[12:32:39] and if you need a manual hack like the one you need today, it is obvious, and you pay attention, and you follow up with the patch, so the repo remains consistent
[12:33:25] that's the psychological effect I've felt when doing it anyway :-P
[12:34:45] arturo: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/194
[12:34:49] I also see value in the cookbook being more flexible, so I would be fine if that parser restriction was lifted
[12:34:54] 👀
[12:35:13] yeah I had similar thoughts, let's leave the cookbook as it is for now
[12:35:17] +1'd
[12:35:21] thanks
[12:39:21] arturo: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/195
[12:42:10] mmmm
[12:43:04] turns out, it is in use
[12:43:07] we need to replace it!
[12:43:11] haha, nice one!
[12:43:14] cloudinstances2b-gw.svc.eqiad.wmflabs
[12:43:18] is in use
[12:43:50] ok I can open a task if you want
[12:43:58] what about the compat one? can that one be deleted?
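Before deleting a legacy record like the one above, a quick in-use check helps. This is only a sketch, assuming a cloudcontrol-style host with the openstack CLI plus the designate plugin and the usual novaadmin entry in clouds.yaml; the zone name here is a guess based on the record.

```bash
#!/usr/bin/env bash
# Sketch: check whether a legacy designate record is still defined and still
# resolves before removing it from tofu-infra. Record taken from the chat
# above; the zone name and OS_CLOUD entry are assumptions.
set -euo pipefail

zone="svc.eqiad.wmflabs."
record="cloudinstances2b-gw.svc.eqiad.wmflabs."

# Is the recordset still present in designate?
OS_CLOUD=novaadmin openstack recordset list "$zone" | grep "$record" || echo "not in designate"

# Does it still resolve, i.e. could something still be pointing at it?
dig +short "$record"
```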
[12:44:11] yes, the compat one can be deleted
[12:44:20] ok then I'll delete that one as part of the current task
[12:44:24] yeah
[12:44:24] and leave the other one for later
[12:44:57] there are also some more records that I haven't imported yet (I added a TODO in the dns.tf file)
[12:45:02] https://openstack-browser.toolforge.org/project/cloudinfra/zone/cloudinfra.wmflabs.org.
[12:45:16] and also https://openstack-browser.toolforge.org/project/cloudinfra/zone/cloudinfra.wmcloud.org.
[12:45:47] it would be nice to add a description for those to clarify what they're for
[12:45:53] I see
[12:45:58] some of them feel important!
[12:46:05] (like the puppet-enc)
[12:48:03] I updated the MR to delete only the compat record
[12:48:15] ok
[13:05:21] dhinus: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/197
[13:08:57] paired with: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136719
[13:12:54] arturo: left 2 comments in gitlab
[14:03:29] arturo: I probably deleted those VMs -- sorry, I was cleaning up and mistook them for my own test hosts. Glad they were easy to recreate!
[14:22:51] no problem!
[14:28:16] andrewbogott: I'm getting some auth errors when running wmcs-wikireplica-dns from cloudcontrols
[14:28:38] I'm not sure if it's because of my recent changes to that script
[14:28:51] keystoneauth1.exceptions.http.Forbidden: You are not authorized to perform the requested action: identity:list_services.
[14:29:24] I tried exporting OS_CLOUD as per https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas/DNS
[14:29:29] but that didn't help
[14:29:42] That could be from one of the patches on https://phabricator.wikimedia.org/T330759 -- who does that script run as?
[14:29:49] root
[14:29:57] I mean, which keystone user?
[14:29:58] with OS_CLOUD=novaadmin
[14:30:38] oh! that's very unexpected then
[14:30:39] it's safe to run the script if you want to test it
[14:30:39] let me try
[14:31:03] it also has a --os-cloud that I didn't try to use
[14:31:37] can you do this?
[14:31:38] #OS_CLOUD=novaadmin openstack service list
[14:32:03] yes
[14:32:09] huh
[14:32:24] well, let's do our checkin and then maybe I'll look over your shoulder for this
[14:32:29] yup
[15:06:52] Raymond_Ndibe: ping
[15:21:53] chuckonwu: in case of doubt, you can follow this as a reference: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning/-/blob/main/.gitlab-ci.yml
[15:22:45] Will do
[15:38:24] hello atrawog! I'm busy for the next hour or so but I suspect that the next useful step might be getting shell on the worker nodes to make sure they have proper connectivity &c. Is that something you're able to do? In theory you can install a keypair on the fcos hosts I think
[15:50:45] taavi: it looks like you fell off of some invites because we've been including @wmf things rather than your personal email. David is fixing it for at least one meeting but it wouldn't hurt for you to also be vigilant. Sorry!
[15:55:13] andrewbogott: I can see my key in https://labtesthorizon.wikimedia.org/project/key_pairs but when I'm trying to jump into one of the newly created hosts, like paws-dev-127a-c5msmu4mmbbi-master-0, my key is getting rejected.
[15:55:50] I think you need to explicitly include the key in the magnum template
[15:55:56] What's the default user to connect to the Fedora-CoreOS-38 instances?
[15:56:05] But also I'm not 100% sure those fcos VMs run cloud-init :(
[15:56:39] atrawog: I don't know, since I didn't set this up initially. I can replace the fcos base image if you have an opinion about a known good platform
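On the SSH question above, a minimal sketch of recreating a test cluster with an explicit keypair so the nodes are reachable; the cluster, template and key names are made up. Fedora CoreOS images normally use Ignition rather than cloud-init, and the login user is core.

```bash
#!/usr/bin/env bash
# Sketch: create a throwaway Magnum cluster with an explicit keypair so the
# Fedora CoreOS nodes accept SSH. Cluster/template/key names are illustrative.
set -euo pipefail
export OS_CLOUD=novaadmin

openstack coe cluster create paws-dev-ssh-test \
  --cluster-template paws-dev-template \
  --keypair my-debug-key \
  --master-count 1 \
  --node-count 2

# Fedora CoreOS uses Ignition (not cloud-init); the login user is "core".
# From a bastion inside the project network:
#   ssh core@<node-ip>
```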
[15:57:20] * andrewbogott is still in a meeting so not very helpful
[16:03:23] Well I think we should take a look together at how the OpenTofu openstack_containerinfra_cluster_v1 resource deployment actually works behind the scenes. Because in theory Magnum should deploy a full Kubernetes cluster all by itself and just hand back the kubectl config.
[16:05:51] And if I understand things correctly the config in your Magnum cluster template has to match the config used to deploy the cluster in OpenTofu.
[16:57:36] atrawog: ok, back from meeting! I agree that magnum should be working. But we've never deployed paws in codfw1dev as far as I know, so we need more info about what is actually going wrong.
[17:01:43] Thanks! Have you tried deploying any other k8s cluster on codfw1dev yet? At the moment I'm not sure if it's an issue with the PAWS deployment itself or some misconfiguration with the new Magnum stack.
[17:09:04] I redeployed https://gitlab.wikimedia.org/cloudvps-repos/deployment-prep/tofu-provisioning fully (delete + deploy) last week, so at least in eqiad Magnum was working post upgrade.
[17:10:23] I have not however tried to update the Magnum template there to use anything that would have come from the newer OpenStack version.
[17:11:34] The things that generally can change after an OpenStack version bump are the labels -- https://gitlab.wikimedia.org/cloudvps-repos/deployment-prep/tofu-provisioning/-/blob/main/magnum.tf?ref_type=heads#L37-56
[17:15:58] Thanks a lot for the info. And it's quite possible that the issues with PAWS could be easily fixed by bumping the values to whatever is used by the current Magnum template on codfw1dev.
[17:45:49] atrawog: we could eliminate a variable by giving you a test project in eqiad1 instead... I'm trying to think through the implications of that
[17:46:13] It would mean consuming production wiki accounts but we'd already be doing that in codfw1dev anyway wouldn't we?
[17:47:54] bd808/anyone: can you think of any reason a second magnum+paws deployment in eqiad1 would leak out into prod paws and be a bad neighbor?
[17:49:35] andrewbogott: Yes, setting up a dedicated test project on eqiad1 could make sense. We just have to make sure that we don't end up mixing up the test and prod deployments on eqiad1.
[17:52:10] It shouldn't cause any disruption. Blue/green is how paws is normally deployed
[17:53:34] atrawog: let's try that, then we can find a whole other suite of magnum bugs in the other deployment
[17:55:34] There's a test oauth string in the configs that can be used to access a parallel paws in production that you can uncomment and deploy/update with. Or one can make their own
[17:57:06] Rook: is there already a test/dev project in eqiad1 or should we T392004
[17:57:06] T392004: Request creation of pawsdev project - https://phabricator.wikimedia.org/T392004
[17:58:16] andrewbogott: Rook: It's starting to get late for me today. But we can do a PAWS deployment on eqiad1 tomorrow once you're online.
[17:58:25] ok!
[17:58:52] dcaro, can I get a +1 on T392004 ?
[17:59:27] andrewbogott: +1d!
[17:59:42] I merged the cookbook, so it should already be on cloudcumin I think
[17:59:50] let me know if it breaks xd
[18:00:46] thanks!
[18:04:53] * dcaro off
[18:05:53] andrewbogott: I wouldn't have left a test environment there, no. I believe there is only the production environment. Though it's easy to check: if there is just one cluster, there is just production
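For the "easy to check" point, a minimal sketch of listing what Magnum actually has and inspecting a cluster's labels and health after the upgrade; the novaadmin credential and the cluster name are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: see what Magnum thinks exists, and compare a cluster's labels and
# health against its template. Cluster name "paws" is illustrative.
set -euo pipefail
export OS_CLOUD=novaadmin

openstack coe cluster list
openstack coe cluster template list

# Labels and health for one cluster:
openstack coe cluster show paws -c labels -c health_status -c status_reason
```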
[18:06:14] yeah, I just see the one
[18:06:38] Oh, one element of overlap with two is that they will share the NFS. So be aware when testing NFS things with a test cluster: they're testing on the production NFS
[18:07:25] Rook: oh, I was thinking we'd do it in a separate openstack tenant. So, it'll have its own nfs server
[18:08:23] Oh you can do that too I guess. Would probably need some code updates to work. Defining the name of the new tenant and the domain names
[18:09:03] true
[18:09:10] although we would've needed that in codfw1dev too
[18:11:28] No, that's already set up for there
[18:12:14] Also be aware of tf state overlap. If you don't change it, the new tenant install will try to use the existing one and may get confused
[18:15:38] Just give the deploy.sh the codfw1dev option and all that should work. Once the project in codfw is set up for it. Mostly needs a bucket for the tf state file. It will deploy without NFS but will fail to come up. Though Ansible and tf will be fine
[18:16:20] Hm...
[18:17:24] I don't know enough to know if the codfw1dev failure is a magnum issue or something further along. Andreas sent me this log snippet:
[18:17:31] https://www.irccloud.com/pastebin/FNNs4ao3/
[18:24:46] dhinus: this should fix the wikireplica script https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136780
[18:30:15] I'm not really seeing an error in there. Is it trying to deploy a k8s cluster and failing?
[18:30:57] yeah, I think so. it says
[18:31:00] Health Status Reason
[18:31:01] {"api":"The cluster paws-dev-127a is not accessible."}
[18:31:17] Rook: should I expect to be able to ssh to the worker nodes?
[18:31:25] Or log in from the console?
[18:32:26] You can add an ssh declaration to the cluster spec. I think on the cluster itself, not the template, to get ssh. It isn't there by default. So are a control node and workers being deployed?
[18:33:15] yep, control node and two workers.
[18:33:26] Where is the deploy being run from?
[18:33:39] both active for 22 hours
[18:33:46] I don't know, Andreas launched it
[18:35:09] Needs to be run from inside the network. Usually I have a bastion host and run it from there. Otherwise it won't be accessible to all of the deployment scripts. Not really sure where it fails if run from outside, but it will fail at least by Ansible
[18:35:38] ok. There is a bastion, let's see if I can find evidence that it was run there...
[18:36:16] The log you sent looks like tofu. It might be failing to contact the cluster to generate a k8s config file
[18:38:30] looks like it was launched there, or at least has been in the past
[18:42:16] Rook: if you'll be here for a few and it sounds useful I'll try to tear down and rebuild the cluster so I can see the tofu output
[18:43:39] Sure, let me know what you find
[19:34:32] of course now magnum is failing over and over
[19:35:17] That's not ideal
[19:36:09] You're running dalmatian now? The upgrade can cause problems but I think I've only seen it with the k8s version upgrades more than the OpenStack upgrade
[19:37:26] yes, dalmatian. I don't have a theory about why it fails 60% of the time, it's pretty inconsistent.
[19:41:36] Any idea how to debug something like
[19:41:37] | master_config_deployment | 5e10af8d-693e-4144-9162-caf8cf8d4538 | OS::Heat::SoftwareDeployment | CREATE_FAILED | 2025-04-15T19:33:04Z |
[19:41:41] ?
[19:41:53] I fear I lack any real thoughts on that front. Have the status messages on the failed clusters offered much?
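A possible way to dig into a CREATE_FAILED OS::Heat::SoftwareDeployment like the one pasted above; a sketch only, assuming the heat CLI plugin is installed, and the exact flags may differ between heatclient versions. The cluster name and deployment UUID are the ones quoted in the chat.

```bash
#!/usr/bin/env bash
# Sketch: walk from the Magnum cluster to the Heat resource that failed and
# read the software deployment output.
set -euo pipefail
export OS_CLOUD=novaadmin

# Magnum records the Heat stack id on the cluster:
stack_id=$(openstack coe cluster show paws-dev-127a -f value -c stack_id)

# Every nested resource that is not happy:
openstack stack resource list -n 5 "$stack_id" | grep -v COMPLETE

# Heat's own summary of what failed, including script output where available:
openstack stack failures list --long "$stack_id"

# Stdout/stderr captured for one OS::Heat::SoftwareDeployment:
openstack software deployment output show 5e10af8d-693e-4144-9162-caf8cf8d4538 --all --long
```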
[19:42:16] Are you getting a control node but no workers?
[19:43:10] yes
[19:43:21] I think because creating the control node failed, so it didn't move on to the workers
[19:44:00] hmmm it looks like maybe I've run out of cinder quota because of leaks
[19:44:03] let's see if that helps
[19:44:22] I've seen that happen for a number of reasons. Did the application cred get the extended permissions?
[19:45:15] It would need some cinder volumes. It would hopefully tell you that was the problem, but magnum isn't great at error messages
[19:45:41] Sometimes it was a problem with the fcos version
[19:46:52] There are usually problems with installing a new version; it takes some messing with the random variables that are in the docs. Usually you don't need all of the documented ones and indeed usually one or two will cause it to not work
[19:46:54] oh, huh, the prod cluster also shows 'unknown' health status, so I guess I shouldn't use that as a success requirement
[19:47:39] Oh yeah that shows up a lot. The CLI and horizon cluster status can be many things that don't seem to reflect reality
[19:49:02] * andrewbogott deletes a dozen volumes and tries again
[19:54:43] Tofu will remove volumes from the initial deploy but not volumes that something like paws deploys. So they collect
[19:54:59] failed again
[19:59:06] using the app credentials that you created, they have extended=true
[20:17:31] Does that cred still work? I thought the codfw project was removed at some point
[20:18:55] I thought so too but it must work if the initial cluster starts up...
[20:21:02] * andrewbogott makes new creds
[20:22:14] I don't need floating IPs for this do I?
[20:26:50] Nope, no floating IPs
[20:28:37] Yeah I guess if you got at least one cluster deployed with the cred it would be working. But it seems reasonable to put out a new one regardless. You've copied it into the terraform (don't need to right away, but it is also in the Ansible in a base64 bit for deploying the volume in paws)
[20:30:16] yeah, put the new cred in terraform but not ansible so far
[20:30:36] I'm getting consistent failures that in the DB look like
[20:30:37] "Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1"
[20:30:56] Which db?
[20:34:02] codfw1dev_heat, resources table
[20:34:35] I'm back to thinking this is a heat or magnum bug
[20:41:50] I would probably guess it is something going on in that venue. You can cut tf out and try deploying a cluster manually on the command line. There are some instructions for doing so on wikitech. Would need to update the versions and some of the options I suspect
[20:48:04] I think there's a db connection leak.
[20:48:17] I restarted all heat and magnum services and got to CREATE_COMPLETE right away
[20:48:23] going to try again for science
[20:49:01] OpenStack debugging: if it's not DNS or rabbit then it's the db :)
[20:49:23] !bash going to try again for science
[20:49:23] bd808: Stored quip at https://bash.toolforge.org/quip/FEs2O5YBvg159pQrtkXQ
[21:03:09] ok, it reproduced. So, this explains why increasing the # of db connections 'fixed it' but then it stopped working again the next day. Leaks!
[21:03:15] * andrewbogott not looking forward to debugging that
[21:03:58] andrewbogott: yay and boo at the same time. #hugops
[21:04:23] At least now I can get to the interesting bug, which is the failure of paws to deploy
[21:05:02] Rook: deploy.sh is now saying 'TASK [Deploy paws]' -- that means that k8s must have come up, right?
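A minimal sketch of confirming the suspected DB connection leak from the database side; the host and credentials are placeholders for whatever backs codfw1dev_heat.

```bash
#!/usr/bin/env bash
# Sketch: check whether heat/magnum are slowly eating database connections.
# Host and credentials are placeholders.
set -euo pipefail

mysql -h db-host.placeholder -u root -p <<'SQL'
SHOW GLOBAL VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Threads_connected';
-- Who is holding the connections?
SELECT user, SUBSTRING_INDEX(host, ':', 1) AS client, COUNT(*) AS conns
  FROM information_schema.processlist
 GROUP BY user, client
 ORDER BY conns DESC;
SQL
# If Threads_connected keeps climbing toward max_connections while the
# services sit idle, that points at the leak; restarting heat/magnum (as
# above) only resets the counter.
```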
[21:09:22] Yeah that sounds like it made it to Ansible. You should see both workers and a control node in horizon if it got the cluster running. Status usually shows a success condition, at least early on; it sometimes changes later
[21:09:43] the script completed with state 0
[21:10:04] PLAY RECAP **********************************************************************************************************************
[21:10:04] localhost : ok=13 changed=7 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
[21:10:15] So maybe there's for real only the one bug
[21:10:31] Silly question, how do I see if it's actually working now?
[21:10:39] Don't forget to update the Ansible file with the base64 bit with the new cred. Otherwise the cluster will deploy fine but paws won't be able to deploy a PV and will fail
[21:11:04] unless the old cred also still works
[21:11:11] Well it will fail on the above. But you can still connect it up with the proxy and get it to that failure
[21:11:17] in which case it should've been fine to do some with the old and some with the new
[21:11:22] Oh true, if the old cred is working
[21:11:37] Yeah it should be fine with a mix of them
[21:11:41] ok, so now I make a web proxy pointing to... what?
[21:11:44] Assuming they all work
[21:12:05] I usually point it to the first node on whatever the node port is
[21:12:15] 30000something
[21:12:40] Do a kubectl get all to see the service and that is the node port you want
[21:12:53] Or copy it out of prod. It's the same number
[21:13:11] so a worker node, at the magic port?
[21:13:24] it's 30001
[21:13:31] Though the proxy name has to match whatever it is. I think that's noted in the configs. Otherwise oauth won't work
[21:13:38] Sounds right
[21:13:55] Control node will work too. Any of them will get you to the mesh network
[21:14:16] https://pawsdev.codfw1dev.wmcloud.org/ -- 404
[21:15:50] how do I get kubectl to access the remote cluster? Is that ready-made or do I set it up now piece by piece?
[21:20:23] Tofu should have made you a kube.config file in the tofu dir
[21:20:58] You can point the env var to that or put it in .kube/config I think to use it
[21:23:20] yep, there it is
[21:24:29] dang, I was expecting more
[21:24:33] https://www.irccloud.com/pastebin/Jrdnc97d/
[21:50:18] Rook, I have to cook dinner and you probably need to wrap up for the night if you haven't already. Thank you for your help; I'm much further than before!
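A minimal sketch of the kubectl/proxy steps discussed above, assuming the kube.config that tofu wrote (the path is illustrative):

```bash
#!/usr/bin/env bash
# Sketch: point kubectl at the new cluster via the kube.config tofu generated
# and find the NodePort the web proxy should target. Paths are illustrative.
set -euo pipefail

export KUBECONFIG="$HOME/paws/tofu/kube.config"   # or merge it into ~/.kube/config

kubectl get nodes -o wide

# "kubectl get all" only covers the default namespace, which is why the
# output above looked sparse; look across all namespaces instead:
kubectl get pods -A
kubectl get svc -A | grep NodePort    # the ~30001 nodePort for the proxy
```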