[00:14:59] andrewbogott: they are self-serve somewhere in your Phab settings. I’m away from my laptop or I’d give you better directions.
[00:15:07] Using strikerbot’s is fine too
[08:04:29] I'm reading the g4 flavors announcement, and suddenly I had a question
[08:04:59] what if a bunch of folks decide to migrate and we don't have enough HVs migteated yet to accommodate all the new VMs
[08:05:08] migrated*
[08:30:20] arturo: you mean migrate from buster to newer OSes?
[08:32:14] we have 3 ~empty HVs on OVS atm
[08:51:57] ok
[08:52:06] I meant migrate to OVS
[08:53:06] migration is not self-service
[08:55:49] ok -- re-reading now: only for new VMs
[08:56:17] apparently I read only a fraction of the email :-)
[08:56:24] is super clear now
[10:13:53] am I reading right the timestamps of these two log entries?
[10:13:59] Jun 12 17:03:51 tools-k8s-control-7 kubelet[760]: E0612 17:03:51.972023 760 kubelet.go:2427] "Error getting node" err="node \"tools-k8s-control-7\" not found"
[10:13:59] Jun 12 13:45:31 tools-k8s-control-7 kubelet[495]: I0612 13:45:31.401570 495 reconciler.go:352] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-proxy\"
[10:14:24] the second one has an earlier timestamp, but is listed later in the log
[10:14:56] oh, wow, the kubelet log has plenty of this
[10:15:05] I imagine they could be caused by the overload
[10:26:12] super-quick review here would be much appreciated: https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/40
[10:27:26] 👀
[10:28:45] blancadesal: LGTM
[10:30:13] arturo: thanks!
[12:07:23] [1/4, retrying in 3.00s] Server is in unexpected status: Server status is 'VERIFY_RESIZE', not in any of S, H, U, T, O, F, F
[12:07:25] bah
[12:07:33] FYI, clouddb1018 failed to reboot for some hardware issues, I filed T367499
[12:07:33] T367499: hw troubleshooting: server fails to reboot for clouddb1018.eqiad.wmnet - https://phabricator.wikimedia.org/T367499
[12:24:25] looking for code review: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1043744
[12:47:19] andrewbogott: ok to re-enable Puppet on cloudcontrol2004/6-dev?
[12:50:04] Yes please
[12:57:42] done
[13:23:18] reimaging cloudvirt1034 to OVS
[13:37:25] andrewbogott: not sure what you did yesterday but now the scheduled os-deprecation run fails with 'ModuleNotFoundError: No module named 'arrow''
[13:38:27] Yeah :/ I'll be back at my keyboard in 30 or so
[14:32:18] taavi: fixed now (I just needed to rebuild the venv)
[14:54:00] taavi: my plan was to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1043203 and then use cumin to touch (or remove) *zed* cloud-wide so that the update fires
[14:54:38] I was worried that adding that flag to every 'apt get update' everywhere would add some security risks for other repos
[14:54:38] why are we provisioning files for zed in the first place?
[14:54:48] zed is the latest release packaged for bullseye
[14:55:03] hmmm I see
[14:55:12] and we install osbpo by default on all the VMs
[14:55:13] ?
[14:55:16] We could remove everything and fall back on the debian-packaged clients but they're even more out of date (V, I think?)
[14:55:48] Yes, at the moment we do. Because we need those libraries for some monitoring things iirc
[14:57:14] hmh. would be great if we didn't do that, but i don't think we can change that retroactively on vms that already have them (so before trixie in practice)
[14:58:07] I don't think we need to patch apt::repository if openstack::clientpackages::vms::bobcat::bullseye has its own apt-update that runs first?
[14:59:37] I tried that already, it doesn't fire at the right time
[14:59:48] dhinus: suggests that we try https://gerrit.wikimedia.org/r/c/operations/puppet/+/1043806 first
[14:59:59] Which I think won't solve the issue but is correct in any case
[15:00:09] we need that anyway, but it won't solve the issue
[15:02:57] ok, +1'd both
[15:03:49] thx taavi
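
A minimal sketch, for context on the osbpo discussion above, of how one might check on a bullseye VM which OpenStack backports (osbpo) release the apt sources point at and where a client package would actually come from. The sources file locations and the example package name are assumptions, not taken from the log:

    # List any osbpo apt sources configured on the VM (exact filenames are an assumption)
    grep -ri osbpo /etc/apt/sources.list /etc/apt/sources.list.d/

    # See which repo a typical client package resolves to (osbpo zed vs. plain Debian);
    # python3-novaclient is just an illustrative package
    apt policy python3-novaclient
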
[15:04:11] now I just have to remember the cumin syntax for all VMs
[15:07:57] grrrrrrr
[15:08:13] andrewbogott: I think your patch got lost and must be reapplied to cloudcumin
[15:08:21] I added some details at T346453
[15:08:21] T346453: [cumin] [openstack] Openstack backend fails when project is not set - https://phabricator.wikimedia.org/T346453
[15:09:14] sure would be nice if that got fixed in upstream cumin, huh?
[15:09:29] it's still in my radar but I already lost too much time on it :)
[15:10:00] I can write another six alternative solutions but I'm sure they'll just be ignored too
[15:10:06] * andrewbogott stops complaining and hotfixes again
[15:16:55] dhinus: actually my hack is still there, so something else interesting is happening :/
[15:17:02] hmmm interesting
[15:18:42] did something change in bobcat maybe? I don't think I used cumin after the bobcat upgrade
[15:18:49] likely project name/id confusion
[15:18:54] :(
[15:20:24] in /var/log/cumin/cumin.log there are some details
[15:23:16] yeah, I'm not sure what's happening yet. We also have a new domain (toolsbeta) so that could be part of it
[15:26:40] dhinus: fixed; novaobserver didn't have permission to see the new domain
[15:27:02] ah that makes sense!
[15:27:21] I think it's looping and trying to get a new token for each domain
[15:28:00] I think it needs to
[15:28:53] I think you can potentially create a globally-scoped token, but then maybe nova won't accept it? I did kind of understand the scoping at some point, but I forgot :)
[15:29:13] if it works, I think it's good enough for now
[15:29:38] globally scoped token would work for user with that access but I think we were trying to avoid assuming weird user setup
[15:30:11] I see what you mean yeah
[15:30:31] there are potentially infinite combinations of domains/users/RBAC rules :)
[15:31:54] yep
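
A hedged illustration of the token-scoping question above: the same credentials can request either a project-scoped or a domain-scoped token, and a domain-scoped request fails if the user has no role assignment in that domain, which is consistent with the novaobserver permission fix described earlier. Whether this is exactly what cumin's OpenStack backend does per domain is an assumption; the domain and project names below are examples only:

    # Project-scoped token (the usual case)
    openstack --os-project-domain-name default --os-project-name admin token issue

    # Domain-scoped token for a specific domain; fails without a role assignment there
    openstack --os-domain-name toolsbeta token issue
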
[17:33:29] taavi: Is there anything I can do to help you wrap up your day? And is there anything in particular I shouldn't touch while we're mid-ovs-migration?
[17:35:06] andrewbogott: the main thing you should be careful with at this point is draining cloudvirts still on linuxbridge. I've so far only ran the flavor cache fixing script just before draining hypervisors since I'm still a bit scared by it just poking the database
[17:37:32] the other thing in my mind at this exact moment is fixing the flavors for a) the canary VMs on OVS cloudvirts and b) for the one toolsbeta node that was accidentally migrated but does not have a matching g4 flavor yet
[17:39:33] taavi: for canaries it's easy enough to just delete them all and rebuild
[17:39:37] after https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1043149 is merged
[17:40:06] & I will definitely not drain or rebuild any cloudvirts over the weekend
[18:03:59] taavi or dhinus: how does this look to you? https://etherpad.wikimedia.org/p/busterreminder
[18:22:34] andrewbogott: looks good, I would s/those that are using/those that are still needed using/, because on first read I didn't parse that sentence
[18:24:08] 'Please take some time to delete VMs that are no longer needed, and rebuild those that are still needed with a more modern release, ideally Debian Bookworm.'
[18:26:45] ok, sent. thanks!
[21:59:31] andrewbogott: I saw you deleted the clouddb-services project, I believe it hosted the DNS zone for db.svc.wikimedia.cloud
[22:01:35] I still see the zone in "openstack zone list --all-projects", and the project is listed as "clouddb-services", so maybe it's a zombie project now? :P
[22:08:09] left a comment in T365975
[22:08:10] T365975: [cloud-vps] migrate DNS zones away from deprecated clouddb-services project - https://phabricator.wikimedia.org/T365975
[22:08:24] * dhinus off
[22:54:45] argh
[22:55:52] dhinus: do you know what IPs those entries should point to?
[22:56:35] hmmm vaguely :P
[22:57:12] I mean I think we can find them in a /reasonable/ time
[22:57:46] I can still see the zones with "wmcs-openstack zone list --all-projects"
[22:57:56] so maybe you can still dump the records?
[23:01:16] yes I can dump them, I will paste them to the phab task
[23:07:32] actually I think it's fine, I'm going to transfer them to cloudinfra
[23:07:40] can you stand by for 5 minutes and then confirm that it's working?
[23:08:46] sure
[23:09:03] I have a full dump in the meantime on my computer in the meantime
[23:09:19] * dhinus cannot type at 1 am :P
[23:09:41] ok, those domains are now owned by cloudinfra and look to still be populated as before.
[23:09:45] Look that way to you too?
[23:09:53] checking
[23:11:00] yep those look fine, but I found 4 more zones :P
[23:11:03] "openstack zone list --all-projects |grep clouddb-services"
[23:12:24] those are just the standard set of project-associated domains... do you think there are still refs to them other places?
[23:12:30] hmm maybe not
[23:12:40] the only ones I know about are the 3 ones you migrated
[23:12:59] maybe "svc.clouddb-services.eqiad1.wikimedia.cloud"?
[23:13:45] there are only ns records though in that one
[23:13:50] looks empty to me, just the SOA
[23:14:01] yep
[23:14:05] I think we're good
[23:14:15] So I'm going to delete those 4
[23:14:19] sgtm
[23:14:27] And continue to think about making a project deletion cookbook because there are always too many parts
[23:14:34] Sorry for whatever temporary panic I caused you!
[23:14:41] np
[23:15:41] I randomly spotted your change
[23:15:59] and thought it was better to double check now rather than risking an outage later :P
[23:16:36] glad you noticed
[23:23:57] all looks good to me
[23:25:34] thanks for the fix, out of curiosity, do you think if you didn't move the zones they would keep on working indefinitely, and were just invisible in horizon?
[23:35:54] anyways, not an important q :) I'm off to sleep
[23:55:33] I don't know, probably they would live on for a while but might get lost during upgrades or similar
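
A sketch of the Designate commands involved in a move like the one above: listing zones per project, dumping recordsets before touching anything, and handing a zone to another project via a transfer request. The project id placeholder and the trailing-dot zone name are illustrative, and the log does not say which mechanism was actually used for the transfer:

    # Find zones still owned by the old project
    openstack zone list --all-projects | grep clouddb-services

    # Dump the records before changing ownership
    openstack recordset list db.svc.wikimedia.cloud.

    # One way to move a zone between projects: create a transfer request targeted at
    # the new project, then accept it using the returned id and key
    openstack zone transfer request create --target-project-id <cloudinfra-project-id> db.svc.wikimedia.cloud.
    openstack zone transfer accept request --transfer-id <request-id> --key <key>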