[00:16:58] bd808: if I can convince cloud-init to inject it can you think of any downside to just setting ResolveUnicastSingleLabel=yes?
[00:17:27] I could also add a cloud-init rule to remove the packages I guess, maybe I'll start with that and see where it gets me
[00:20:26] andrewbogott: I like the idea of trying to remove the packages if only because having bookworm vms use a completely different resolver library sounds like a foot gun. But if we should be using it, allowing single label lookups seems fine.
[00:20:55] Debian folks seem to think that it shouldn't be installed by default...
[00:22:29] of course cloud-init only supports installing and not purging, but I can probably just do a last-minute command
[16:55:10] andrewbogott: hey, so I'm trying to work on building the new VM for T334891. I ran into some issues on the new VM, and deduced it might simply be because the volumes aren't mounted to it. So I tried first detaching `pickle_storage02` from `wikiwho-api`. It's now been stuck in a "Detaching" state for a long while :( Is that maybe expected given how large (5TB) the volume is?
[16:55:11] T334891: Add more languages to WikiWho and build new VM - https://phabricator.wikimedia.org/T334891
[16:57:24] I see `ls /pickles-02/` is now empty on the `wikiwho_api` instance, so it apparently *did* detach. It's not showing as detached in Horizon, however
[16:58:17] I remember creating the volume took a long time, but that's because it had to format it. If I remember correctly, simply mounting it didn't take this long, so I'm beginning to think there's an issue of some sort
[16:58:41] attaching/detaching should be nearly immediate
[16:59:56] OpenStack is all event-driven and occasionally gets weird because it missed an event notification. Seems like the potential issue here, but andrewbogott would certainly be better equipped to debug
[17:00:02] that's what I thought. Then I guess we have an issue! I have no options in Horizon except to update metadata
[17:02:29] musikanimal: When I click on the instance and then its action log I can see "req-534b0760-52a1-49e4-b35d-508d7ee962b1 detach_volume June 6, 2023, 4:48 p.m. musikanimal Error "
[17:02:44] bah
[17:02:47] https://horizon.wikimedia.org/project/instances/98aae6b8-767b-482f-a7a2-ce5cf20411db/
[17:03:44] I don't know how to fix it, so I guess you should start a phab task at least
[17:04:03] can do
[17:05:54] musikanimal: you could try the "hit it with a hammer" suggestion of rebooting the instance that used to mount the volume from https://platform9.com/kb/openstack/volume-shows-attached-after-detach-operation-fails-with-error-u
[17:06:54] hmm, I'm a little afraid to do that... there's yet another volume attached to it that I haven't detached yet. That 2nd volume is the one in heavy use
[17:07:47] the act of rebooting shouldn't break the mounting to the 2nd volume right? I would think not
[17:08:25] I would hope not. Instances certainly get rebooted as part of their normal lifecycle
[17:08:53] (I should be clear, the "2nd volume" here is actually the first, so `pickle_storage`, whereas `pickle_storage02` is the one I'm trying to detach)
[17:08:57] okay, I'll give it a try
[17:12:39] no dice, I tried both a soft and hard reboot
[17:12:44] Phab task incoming
[17:14:09] musikanimal: is there useful data on pickle_storage02 or can I delete it?
[17:14:42] the backend error seems to be
[17:14:43] yes, about 2 weeks' worth of imports. I'm hoping we don't need to delete it
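For context, a minimal sketch of the kind of admin-side inspection and reset a volume stuck in "Detaching" usually calls for — not necessarily what andrewbogott actually ran. The volume name and instance UUID are taken from the chat above; forcing the state requires OpenStack admin credentials.

```
# What does cinder think the volume's status and attachments are?
openstack volume show pickle_storage02

# What does nova think is still attached to the instance?
openstack server show 98aae6b8-767b-482f-a7a2-ce5cf20411db

# If the guest has genuinely released the device but cinder is wedged in
# "detaching", an admin can force the status back so the volume is usable
# again. This is the "hit it with a hammer" class of fix: risky if the
# volume is actually still in use.
openstack volume set --state available pickle_storage02
```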
[17:14:48] ok
[17:15:22] found by searching for the "req-534b0760-52a1-49e4-b35d-508d7ee962b1" event ID value that was in Horizon
[17:16:11] musikanimal: can you try to re-attach now?
[17:17:17] doing
[17:18:19] error: `Invalid volume: volume b4f3393d-923a-4af7-9781-aa8bc3ce9842 is already attached to instances: 98aae6b8-767b-482f-a7a2-ce5cf20411db (HTTP 400) (Request-ID: req-50b6f885-c321-493d-918b-f1abfac3313b)`
[17:18:36] that was when I tried to attach to the new instance, `wikiwho01`
[17:18:48] ok, in that case...
[17:19:19] I see the `/pickles-02` directory is still on the old `wikiwho_api` instance, but the mount doesn't seem to show up via `df -kh`
[17:19:43] oh wait, no it is there
[17:19:54] (it wasn't earlier though!)
[17:20:25] so it is back to being attached to `wikiwho_api` (the old instance), but Horizon thinks it's unattached
[17:21:12] yep, I'm still seeing what I can reset
[17:21:35] ok thanks
[17:28:43] musikanimal: how do things look now?
[17:29:59] hmm, it says attached to `wikiwho-api` now, but that doesn't actually appear to be the case
[17:31:35] so like, the opposite of the situation we were in earlier, lol
[17:34:19] ok, be back in a few
[17:34:36] no problem, thanks
[17:46:57] * dcaro off
[17:48:38] now `pickle_storage` (the older volume) is unmounted; is that expected?
[17:51:27] musikanimal: yes, I'm trying to get everything to a known state
[17:51:32] the attachment labels were wrong
[17:51:53] okay, got it
[18:37:44] musikanimal: it looks like this may be an upstream bug with volume detachment generally
[18:38:12] did you make a phab task for this?
[18:38:37] hmm okay. No, I got one sentence in then you pinged me. I'll finish creating the task now
[18:42:08] could probably use some more info but here we are https://phabricator.wikimedia.org/T338262
[18:42:41] !log tools.wikiloves Deploy 19b5d50
[18:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikiloves/SAL
[18:42:43] for the short term, is there anything you think we can do to get the volumes reattached? even if it's back to the old instance, that's fine
[18:54:37] I don't have anything good to suggest so far
[18:58:21] oh no :(
[18:58:45] Hey folks, I'm having issues launching my PAWS instance, is there an issue with it currently?
[18:58:59] perhaps https://phabricator.wikimedia.org/T338225
[18:59:06] Is it giving you the "respawn" button?
[18:59:09] musikanimal: ok if I reboot your VM again?
[18:59:12] That doesn't respawn?
[18:59:24] andrewbogott: yes go right ahead
[19:00:15] andrewbogott: I was going to say, you managed to get `pickle_storage02` re-mounted to `wikiwho_api` earlier, but at the cost of Horizon saying it's not mounted. Is it possible to go back to that? It's fine (for now) if Horizon says the wrong thing
[19:00:28] @Rook yes
[19:00:48] It sure looks like the bug you posted, thanks
[19:00:52] same for `pickle_storage`. That one is the more important one (it contains enwiki / dewiki)
[19:01:17] Ah yeah, normally I would redeploy the trove db and get a fresh one. But trove dbs aren't deploying (https://phabricator.wikimedia.org/T337882) so I had to reuse the old one, which has the old (no longer valid) cookie information
[19:01:28] Log out and back in usually gets it all cleared and good to go
[19:02:22] musikanimal: they're now attached at sdb and sdc. I'll leave getting them mounted up to you.
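A minimal sketch of mounting the re-attached volumes by hand, assuming the existing filesystems survived (they were only detached, never recreated). The chat doesn't say at this point which of sdb/sdc is which volume, so check first; `/pickles-02` is the path mentioned above for `pickle_storage02`, while `/pickles` is only a guess for `pickle_storage`'s mount point.

```
# Confirm which block device carries which filesystem before mounting anything.
lsblk -f
sudo blkid /dev/sdb /dev/sdc

# Mount by hand (device-to-volume mapping assumed; verify with the output above).
sudo mkdir -p /pickles /pickles-02
sudo mount /dev/sdb /pickles        # assumed: pickle_storage
sudo mount /dev/sdc /pickles-02     # assumed: pickle_storage02

# Sanity check.
df -h /pickles /pickles-02
```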
[19:02:36] And be warned that if you detach, this will all begin again (although I have a slightly better idea how to manage it now)
[19:03:14] so I should use the `mount` command directly?
[19:03:43] probably. prepare-cinder-volume won't do anything bad but it's really not designed for multiple volumes on one server
[19:04:19] I wonder if I should first try on the new instance, since that's the one we're trying to move to. You still need to update the hypervisor, right?
[19:04:43] I should mention the new instance is still a little buggy. From this incident I believe I confirmed that not having the volumes mounted isn't the issue
[19:04:54] but with some time I'm sure we can get the new instance working
[19:05:05] Thanks @Rook, worked like a charm.
[19:05:14] 👍
[19:06:57] !log admin increased trove secgroups, instances, volumes quotas from 40 to 100. Trove is too popular! T337882
[19:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[19:07:02] T337882: Cannot create trove db in horizon/terraform - https://phabricator.wikimedia.org/T337882
[19:09:08] !log admin also increased RAM and secgroup-rule quota for Trove T337882
[19:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[19:10:30] musikanimal: let me know if you need help moving the volumes someplace else
[19:11:15] well, I'm trying to decide whether I should try mounting to the new or old instance, because I know you were waiting for us to move to the new one to upgrade the hypervisor (or whatever gobbledygook ;)
[19:14:59] ragesoss: let me rope you in here. So for the new instance (currently at https://wikiwho2.wmcloud.org/ ), we're getting a 400 and I've been having a hard time debugging that. Not all services are up and running yet either, but going by the setup instructions, we should have a working frontend now. https://github.com/wikimedia/wikiwho_api/pull/12/files are the things I did differently this go-around
[19:16:05] the celery systemd config (T318746) for sure isn't right yet, but that isn't our main issue
[19:16:05] T318746: Use systemd to autorestart Celery workers - https://phabricator.wikimedia.org/T318746
[19:16:07] * andrewbogott out for a bit
[19:16:16] okay, thanks for all of the help, andrewbogott !
[19:16:46] lemme figure out how to log in to it...
[19:17:05] `ssh wikiwho01.wikiwho.eqiad1.wikimedia.cloud`
[19:18:34] the Django logs aren't being stored anywhere as far as I can tell. I believe it's supposed to log to /var/log/django/django.log but it's not (and it didn't on the old instance either)
[19:18:50] that I assume is where I should look to debug the 400
[19:19:21] hmm...
[19:20:37] I tried moving the LOGGING config from settings_live.py to settings.py in the event the former isn't getting loaded, but that didn't seem to have any effect
[19:22:05] or wait, Horizon actually says the volumes are mounted to the old instance (see above discussion). So I wonder if we're asking for more bugginess by mounting to the new one
[19:22:49] using the old instance is certainly the preferred route as far as getting us back up and running, but the whole point here was to move to a new instance so that's why I thought I'd try that. I'm not sure what to do!
[19:23:14] might be the allowed_hosts value in settings_wmcloud.py
[19:23:19] it does not include the new URL
[19:23:27] ah!
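For reference: with DEBUG off, Django answers a bare HTTP 400 (DisallowedHost) for any Host header not listed in ALLOWED_HOSTS, which is why the new hostname produced a 400 with nothing obvious in the logs. A quick way to confirm this from a shell — the checkout path on the VM is a guess, settings_wmcloud.py is the file named above:

```
# Reproduce the 400 against the new hostname.
curl -sI https://wikiwho2.wmcloud.org/ | head -n 1

# Find where ALLOWED_HOSTS is defined and check the new hostname is listed.
# /srv/wikiwho_api is an assumed checkout location.
grep -rn "ALLOWED_HOSTS" /srv/wikiwho_api/
```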
[19:24:39] that was it, sorry for my dumb mistake
[19:24:56] that's what friends are for
[19:25:00] hehe :)
[19:26:10] okay, so now it's just a question of chancing it with the new instance given the OpenStack bug
[19:26:26] I guess I'll try... Andrew didn't say not to when I asked above
[19:27:16] whee!
[19:29:00] didn't work. The new instance doesn't know about the volumes (which makes sense); I think we have to update OpenStack first, which given the bug will need intervention from andrewbogott
[19:32:46] in the meantime I'll get `wikiwho01` in better shape, i.e. try to figure out why the celery systemd config isn't working
[19:50:54] musikanimal: so which VM am I moving to? And what labels do you want for each volume?
[19:52:20] let's move to `wikiwho01`. And sorry, not sure what you mean by labels?
[19:53:29] it's two volumes, right? So one will be /dev/sdb and one /dev/sdc
[19:53:32] which is which?
[19:54:35] yeah two volumes. sdb is named `pickle_storage`, the other `pickle_storage02`
[19:54:48] we want both attached to `wikiwho01`
[19:56:28] ok, will try
[20:00:05] musikanimal: does that look right? (again, you'll need to set up the mounts yourself)
[20:01:44] andrewbogott: looks good! thank you :)
[21:03:48] !log tools.ifttt Updated to e55fea9
[21:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.ifttt/SAL
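A possible follow-up once the mounts are working on `wikiwho01`, not something discussed in the chat: persistent fstab entries so the kind of reboots that featured in this incident don't drop the volumes again. The UUIDs and mount points are placeholders and the ext4 type is an assumption; read the real values from blkid before editing anything.

```
# Look up the real filesystem UUIDs and types.
sudo blkid /dev/sdb /dev/sdc

# Append entries; nofail keeps the VM bootable even if a volume is missing
# or stuck mid-detach. Replace the placeholder UUIDs, paths, and fs type.
echo 'UUID=<sdb-uuid>  /pickles     ext4  defaults,nofail  0 2' | sudo tee -a /etc/fstab
echo 'UUID=<sdc-uuid>  /pickles-02  ext4  defaults,nofail  0 2' | sudo tee -a /etc/fstab

# Mount everything in fstab that isn't mounted yet, as a test of the entries.
sudo mount -a
```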