[07:47:00] please review: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1142547
[08:02:26] also https://gerrit.wikimedia.org/r/c/operations/puppet/+/1142546
[08:15:31] taavi: LGTM
[09:35:37] taavi: please review https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/57
[09:36:37] arturo: can you update https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning/-/merge_requests/25 to show it working?
[09:37:05] taavi: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning/-/jobs/508968
[09:37:36] thanks, lgtm
[09:37:45] thanks
[10:04:36] Morning! I have this change that affects `ceph.conf` - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1144583
[10:06:58] This touches your clusters too, and will need a rolling restart to take effect, I believe. Would you like to co-ordinate around the timing and testing?
[10:13:36] btullis: could the patch be merged without a restart of the daemons?
[10:15:13] Oh, sorry. Yes, I just meant that it won't take effect until a restart happens. Which is fine. It would just be nice to know that it restarts cleanly afterwards, but you could do that at any time you like.
[10:16:52] I plan to do a rolling restart of the cephosd100[1-5] cluster as soon as it is merged, but I am running reef.
[10:16:54] so I guess the answer to your original question is yes -- we would like to coordinate the timing
[10:18:27] I suppose I could re-work the patch to make it select on clusters. But I'm not sure it is worth it for this change. What do you think?
[10:18:34] btullis: I've sent a calendar invite for tomorrow
[10:18:55] Ack, nice.
[10:19:12] david is out today. I guess we can do the rolling restart tomorrow in that slot, if that works for you
[10:19:31] Perfect. Thanks.
[10:20:23] thank you :-)
[10:51:45] taavi: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning/-/merge_requests/26 and https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/60
[10:53:42] arturo: lgtm
[10:56:17] thanks
[10:57:10] taavi: please also approve this one: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/26
[10:57:55] done
[10:58:00] thanks
[11:14:48] taavi: are you interested in me migrating https://gitlab.wikimedia.org/repos/cloud/metricsinfra/tofu-provisioning to the new layout used by the toolforge one?
[12:03:34] created https://gitlab.wikimedia.org/repos/cloud/metricsinfra/tofu-provisioning/-/merge_requests/2 but it is missing the creds, which I will only generate if you agree with the change
[12:04:35] arturo: sure. the main question is figuring out how to handle the various database credentials etc it provisions, which currently just live in a gitignored file in my local checkout of that repo
[12:04:57] i think we want to structure that in a way where we can at some point provision that at codfw1dev
[12:05:12] I guess puppet is the way to go for such secrets
[12:06:22] the opentofu code needs those secrets as they're fed to the trove api, how do you get puppet to do that?
[12:06:36] mmm right
[12:06:41] so they need to live in the repo
[12:06:43] also if we're going to have a lot more of those service accounts soon we're in need of something more scalable than https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/base/files/labs/notify_maintainers.py#31
[12:06:55] i guess you could use gitlab secrets
[12:07:07] but creating them by hand and then copying them to puppet is not the best
[12:09:05] yes gitlab secrets could be nice
[12:09:47] could you deploy a secret into a VM filesystem from opentofu? :-S
[12:11:05] no idea
[12:11:21] this may just be another instance of not having a good secrets solution overall
[12:12:34] didn't andrew try to deploy openstack barbican at some point?
[12:32:54] I just created https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/OpenTofu with the intention of it being the entry point for how we use tofu across projects
[12:36:52] arturo: nice, thanks!
[12:39:20] yw
[12:53:46] please review https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/69
[13:13:54] Hi, FYI cloudbackup200[3-4] and cloudrabbit200[1-3]-dev have had puppet disabled for almost a week. The disable message links T390914. We shouldn't leave hosts with puppet disabled for long periods of time.
[13:13:54] T390914: Upgrade cloud-vps openstack to version 'Epoxy' - https://phabricator.wikimedia.org/T390914
[13:17:34] andrewbogott: ^
[13:17:55] volans: thanks for the poke, I will resolve that shortly
[13:18:02] thanks!
[13:21:12] arturo: I have a couple of codfw1dev networking questions. First, new VMs created there with the dual-stack network look like this:
[13:21:17] https://www.irccloud.com/pastebin/snkhSLfY/
[13:21:32] My very sophisticated question is: what's the deal with having two v6 addresses?
[13:22:35] And my followup question is... is there any chance that's related to me getting a 503 from the cloudlb when that VM tries to talk to radosgw?
[13:22:51] one of them is a link-local address and the other is the globally routable "real" address
[13:23:15] unlikely
[13:23:28] which 503 are you getting exactly, and from where?
[13:25:00] ok but eqiad1 VMs don't seem to have that link-local address do they?
[13:25:19] taavi: the 503s are happening here:
[13:25:20] root@tfbastion:~/tf-infra-test# TF_LOG=DEBUG tofu apply -var datacenter=codfw1dev
[13:25:56] it can talk to everything except radosgw. And I /can/ talk to radosgw from labtesthorizon
[13:26:06] they do? in general their ipv6 connectivity would be totally broken without it?
[13:26:17] can you just paste the error?
[13:28:07] So even if a VM is only set up in the legacy network it still has the v6 link-local address
[13:28:15] I think that's what was confusing me
[13:29:08] OK, so I will ignore ipv6 as a candidate for this
[13:29:16] Here's a snip of a tofu debug output:
[13:30:10] https://www.irccloud.com/pastebin/9iQVIpiN/
[13:30:21] the same action works in eqiad1.
[13:30:41] Last night I was sure that the 503 was coming from haproxy and not from radosgw, but today I'm no longer sure about that
[13:39:26] trying the url mentioned in the stack trace with curl manually results in a 403
[13:39:38] and i don't see anything strange in any of the haproxy metrics
[13:39:48] so that to me suggests an issue with one of the rados backend services
[13:41:18] ok. I spent ages trying to pry logs out of rados and never saw evidence that anything was hitting it other than health checks. But I can take another stab at that.
[13:41:42] (that lack of rados logs was why I started to blame the proxy)
[13:47:59] huh, when I curl I see it in the rados logs. But when tofu tries the same thing... no logs.
[13:53:38] * andrewbogott wants to read hidden RH docs for the first time ever https://access.redhat.com/solutions/6986506
[14:12:10] same creds and same action work with the openstack cli.
[15:51:57] * arturo offline
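
On the credentials question from the 12:04–12:09 exchange, a minimal sketch of the "gitlab secrets" approach, assuming a hypothetical `trove_db_password` variable (the real metricsinfra provisioning code may be organised differently). OpenTofu reads any environment variable named `TF_VAR_<name>` as the value of the matching input variable, so a masked GitLab CI/CD variable can feed the Trove credentials into the plan without ever committing them to the repo:

```hcl
# variables.tf (sketch) -- the password never lives in the repository.
# In GitLab, define a masked project-level CI/CD variable named
# TF_VAR_trove_db_password; OpenTofu picks it up automatically when the
# pipeline runs `tofu plan` / `tofu apply`.
variable "trove_db_password" {
  type      = string
  sensitive = true # redacts the value from plan/apply output
}
```

The variable would then be referenced by whatever Trove database/user resources the repo provisions. The gap noted at 12:07 remains: the masked variable still has to be created by hand and mirrored into puppet (or wherever else the same secret is needed).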