[15:06:36] !log tools rebooting tools-k8s-worker-nfs-33, stuck processes
[15:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:36:15] Krinkle: It looks like maybe you figured this out already, but the Toolforge X-Wikimedia-Debug handler is gone now.
[15:45:19] Hello, I just started a resize of a Cloud VPS instance to go from 16G to 32G of RAM and it looks like the process went sideways. Project is named `mwoffliner` and instance is `mwcurator`. Status is now "Error" and power state "Shut Down", and it looks like I don't have the permissions to restart it. Can anybody help? Of course, it is a kind of production instance (not a high SLA, but still we have users every now and then).
[15:46:46] benoit74: I will have a look. Did that host have a cinder volume attached by chance? (That should work, but I've been seeing some bad behavior associated with that recently.)
[15:46:56] Yes it does
[15:47:07] ok, great, I'll try to revive it and then we can try again w/out the attachment
[15:47:09] And data on it is pretty important (we do have backups ...)
[15:47:58] just curious: I thought the data on those mwoffliner hosts was just temporary/staging; isn't it regenerated from dumps on each run?
[15:48:20] not on this one :D
[15:49:30] mwcurator is the machine running wp1.openzim.org where we build the selections on demand; most data is in a trove database or on another machine, but the whole app stack is installed on this cinder volume
[15:49:51] makes sense.
[15:50:07] OK, it is back up, do you want to try to detach the volume and retry the resize?
[15:50:16] let me check
[15:51:27] looks like the plan indeed; what is the correct process? shutdown, detach, resize, reattach?
[15:52:01] yes, that's what I would do.
[15:52:43] btw all I did was a 'hard' reboot which in theory you should have been able to do too, unless having it in 'error' state is an edge case permissions-wise
[15:56:16] looks like it caused permission issues indeed, I had no action possible besides "silly things" related to floating IPs and editing the instance and its metadata
[15:56:56] strange, the UI must hide in panic if something is in error state :/
[15:57:32] probably someone hurt the UI during a similar situation :/
[15:57:36] yeah
[16:03:36] Is it normal that the volume takes ages to detach? I started the process more than 5 minutes ago, probably 10, and it is still "Detaching"
[16:04:57] I shut down the instance from "inside" (through SSH) and from the Horizon console ("Shut off"), so it is not supposed to be using this volume anymore
[16:06:05] it's not 'normal' but it's not unfamiliar. I'll look.
[16:06:31] What is the new normal? :D
[16:11:35] benoit74: how serious is it for you to have downtime on this server? I need to do some investigation (possibly a day of investigation if the last time I had this problem is any example.)
[16:11:55] If you're not in a rush I'll do that; if you are in a rush I'll restore that volume to a different ID and you can move ahead and ignore me
[16:12:05] but might have to rebuild the VM in that case
[16:12:57] the shorter the better, but I'm definitely ok with you taking time to investigate, someone needs to sacrifice for the community :D And I would definitely prefer we do not have to rebuild the whole VM
[16:13:53] I'm in Paris TZ, so might not be highly available in the coming hours, so do not worry if I do not reply soon
[16:14:08] ok. I will dig into the logs and see if I can find anything. Unfortunately my results were not great the last time this happened.
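(Aside: a minimal sketch of the shutdown → detach → resize → reattach sequence agreed on above, driving the standard `openstack` CLI from Python. This is an illustration of the order of operations only, not the procedure the admins actually ran; SERVER, VOLUME, and NEW_FLAVOR are hypothetical placeholders rather than the real mwcurator values, and it assumes the project's OS_* credentials are already loaded in the environment.)

```python
#!/usr/bin/env python3
"""Illustrative sketch of shutdown -> detach -> resize -> reattach via the
`openstack` CLI. Placeholder names throughout; not the real mwcurator setup."""

import subprocess
import time

SERVER = "example-instance"            # placeholder instance name
VOLUME = "example-volume"              # placeholder cinder volume name
NEW_FLAVOR = "example-flavor-ram32"    # placeholder flavor with 32G RAM


def osc(*args: str) -> str:
    """Run one `openstack` CLI command and return its stdout."""
    result = subprocess.run(
        ["openstack", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def wait_for_status(wanted: str, timeout: int = 600) -> None:
    """Poll the server until it reaches the wanted status (e.g. SHUTOFF)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = osc("server", "show", SERVER, "-f", "value", "-c", "status").strip()
        if status == wanted:
            return
        time.sleep(10)
    raise TimeoutError(f"{SERVER} never reached status {wanted}")


# 1. Shut the instance down cleanly before touching the attachment.
osc("server", "stop", SERVER)
wait_for_status("SHUTOFF")

# 2. Detach the cinder volume (the step that hung in the log above).
osc("server", "remove", "volume", SERVER, VOLUME)

# 3. Resize to the bigger flavor, then confirm once Nova reports VERIFY_RESIZE.
osc("server", "resize", "--flavor", NEW_FLAVOR, SERVER)
wait_for_status("VERIFY_RESIZE")
osc("server", "resize", "confirm", SERVER)

# 4. Reattach the volume and start the instance again.
osc("server", "add", "volume", SERVER, VOLUME)
osc("server", "start", SERVER)
```

Nova leaves a resized server in VERIFY_RESIZE until the resize is confirmed (or reverted), which is why the sketch waits for that status before confirming and only reattaches the volume afterwards.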
[16:14:18] great, good to know. I'm UTC-5
[16:16:18] do your best, no one is expected to do the impossible :D
[20:36:32] is everything supposed to be broken?
[20:38:21] Toolforge times out
[20:46:08] Not supposed to be, but I'm looking
[20:47:08] AntiComposite, yetkin, any chance things are better now?
[20:47:59] anticompositetools is back up, https://cvn.wmflabs.org/ doesn't look like it yet
[20:50:07] It is back, thanks (re @wmtelegram_bot: AntiComposite, yetkin, any chance things are better now?)
[20:57:01] we had a network flap, yep, sorry for the noise; things should start coming back up now, though there are some other changes we have to do that might make the network flap again (it was not planned)
[21:03:26] !status FLAKY doing an intervention, things are unstable
[21:10:10] !status OK