[14:16:33] !log tools reboot tools-static-15 nfs is stuck
[14:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:24:08] !status NFS misbehaving on toolforge, some tools might be stuck reading/writting to disk
[14:24:26] !status NFS misbehaving on toolforge, some tools might be stuck reading/writing to disk
[15:55:15] !status NFS misbehaving on toolforge, some tools might be stuck reading/writing to disk (T383238)
[15:55:15] Too long status
[15:55:16] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[16:53:04] Is it possible to upgrade the linux distro of a vm in-place (bullseye -> bookworm) or do I have to spin up a new VM?
[16:57:52] subbu: it is possible but often regretted :)
[16:58:37] It makes things a little bit confusing when it comes time to deprecate OS versions because it's hard to distinguish between 'this VM was built with a Bullseye image' and 'this VM is running Bullseye'
[16:58:47] and also of course it wreaks havoc with any copy-on-write efficiencies
[16:59:10] but there's no real reason for you not to do it from a user perspective outside of "It is a good exercise to rebuild your servers now and then so you know what's on them"
[17:01:00] got it ...
[17:01:24] subbu: is that actually a useful answer or am I just making things worse?
[17:02:01] I am thinking of parsing-qa-02 which is a largish vm ... not puppetized, but i can probably just do a backup of all the script files and commit them in git.
[17:03:41] given quota, i am not sure I can spin up a new instance of that vm first .. i will have to look.
[17:03:56] we can give you more quota very easily
[17:04:17] yeah, quota to rebuild things is a sure thing
[17:04:27] having scripts backed up somewhere is also a very good idea
[17:04:48] And any 'data' can move into a cinder volume for easy transfer to a new VM
[17:05:28] ok .. this one has been running for 3 years .. but, yes, probably a good exercise. :) but yes, i can see that quotas will need to be bumped.
[17:06:56] subbu: ping me when you have a request ticket and we'll get it approved quickly
[17:07:14] will do. thanks.
[17:07:27] or I guess ping blancadesa.l who is on clinic duty until tomorrow. either way
[17:08:20] ack
[17:14:58] related qn. I have https://phabricator.wikimedia.org/T295907 ... that server is now a much simplified version of parsing-qa-01 after we retired a bunch of stuff from that VM ... so I could probably check in a bunch of /etc/ and /lib/systemd config files in a CTT repo and then just use that repo to initialize the new server.
[17:15:09] But, curious if labs VMs can also be puppetized.
[17:16:21] subbu: they definitely can be puppetized. The challenge is that we don't have branch management so you either need your own puppetserver with your local puppet config there, or you need to get a root (e.g. me) to merge your changes into the main repo.
[17:16:53] hiera and node classifiers (like, 'what classes go on this VM') can be adjusted without any of the above, there's a Horizon UI for config
[17:18:45] * subbu has forgotten most of the puppet stuff from 8 years back
[17:18:51] I'll see what feels quicker / simpler. :)
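(As a rough illustration of the "config files in a repo" idea from 17:14:58: a minimal Python sketch that copies tracked /etc and /lib/systemd files from a git checkout onto a fresh VM. The repo path and layout are made up for the example, not the actual CTT repo.)

    #!/usr/bin/env python3
    """Sketch: restore tracked config files from a git checkout onto a fresh VM.

    Assumes a hypothetical repo layout mirroring the filesystem, e.g.
      repo/etc/...         -> /etc/...
      repo/lib/systemd/... -> /lib/systemd/...
    Paths and layout are illustrative only.
    """
    import shutil
    from pathlib import Path

    REPO = Path("/srv/ctt-config")      # hypothetical checkout location
    TARGETS = ["etc", "lib/systemd"]    # subtrees tracked in the repo

    def restore(dry_run: bool = True) -> None:
        for subtree in TARGETS:
            for src in (REPO / subtree).rglob("*"):
                if not src.is_file():
                    continue
                dest = Path("/") / src.relative_to(REPO)
                print(f"{src} -> {dest}")
                if not dry_run:
                    dest.parent.mkdir(parents=True, exist_ok=True)
                    shutil.copy2(src, dest)

    if __name__ == "__main__":
        restore(dry_run=True)   # flip to False once the output looks right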
[17:21:57] andrewbogott, https://phabricator.wikimedia.org/T349941 is the main motivation for this rabbit hole I started on ... i suppose technically I could just install updated nodejs binaries as an admin and go with that without upgrading to bookworm .. but I suppose cloud VPS will deprecate bullseye in the next year or two?
[17:23:48] yeah. The idealized timeline is here: https://wikitech.wikimedia.org/wiki/Operating_system_upgrade_policy
[17:24:27] which says that bullseye is already in the deprecation phase :(
[17:28:02] alrighty ... in that case, makes sense to do the migration now.
[17:29:48] probably unless you hope to get lucky and skip to Trixie which will not be out for a while
[17:44:18] andrewbogott, https://phabricator.wikimedia.org/T383251 (and the associated chain of tasks above it for context).
[17:44:36] Also, see https://phabricator.wikimedia.org/T383251#10441998. :)
[17:51:16] subbu: you need a fresh cinder volume vs. transferring over the old volume?
[17:51:50] (related question, can the quota increase be reverted after the migration?)
[17:52:43] transfer is fine ... I don't necessarily understand all the mechanics for this .. so i am just asking assuming everything needs to be duplicated.
[17:53:38] and yes, it can be reverted after migration ... but, in the comment, I am asking for a generally bumped up quota for cpu, ram, and disk .. if that is acceptable, then I assume the new quota will still be higher than what it is now.
[17:53:46] you can definitely detach a cinder volume from one place and attach it to another :)
[17:54:12] ok.
[17:54:19] Yes, I think I'd like you to have one ticket for migration/revertable quota, and a different one for permanent expansion for tracking reasons.
[17:54:42] If you're patient with the red tape :)
[17:57:59] I can do that.
[18:00:10] Okay .. so, I assume then that once we boot up the VM, you can just assign additional cpu/ram/disk resources to ctt-qa-03? Or, even though they are separate tickets, when you bump up quota, you will do the bump all at once?
[18:01:02] yeah, can do the initial bump all at once.
[18:01:21] Resizing cpu/ram on VMs is possible but annoying, best if you start with what you expect to need.
[18:01:30] but you can transfer the existing volume from the old VM to the new one, that's pretty easy
[18:02:14] is it possible to get the callback url of a test oauth consumer changed on meta?
[18:02:16] Or am I just better off making a new one?
[18:02:39] @addshore: make a new consumer
[18:02:48] will do! ty :)
[18:03:00] happy new year! :)
[18:03:20] timezone appropriate greetings :)
[18:03:39] currently dark and snowy xD
[18:04:52] andrewbogott, okay https://phabricator.wikimedia.org/T383252 is the separate ticket.
[18:07:30] subbu: to detach and reattach a volume, there are some docs here https://wikitech.wikimedia.org/wiki/Help:Adding_disk_space_to_Cloud_VPS_instances#Reattaching_to_a_different_instance
[18:08:45] thanks.
[18:17:51] subbu: done, I think -- do the numbers look right to you?
[18:19:35] cpu & ram look right, but the only qn I have is: the cinder volume on parsing-qa-02 is 1tb ... if i unmount and remount it on the new vm, it will still be 1tb ... how do we bump that up to 1.5 tb?
[18:23:46] after you unmount it from the old VM and disconnect it in horizon, there's a menu option to resize.
[18:24:04] Then when you remount it on the new VM you'll need to resize2fs to get the filesystem caught up with the block device size
[18:24:24] I can walk you through that when the time comes
[18:29:56] thx
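(A minimal sketch of the filesystem-grow step described at 18:24:04, assuming the volume has already been resized in Horizon and reattached to the new VM. The device name and mount point below are placeholders, not taken from the conversation; check lsblk on the new VM first.)

    #!/usr/bin/env python3
    """Sketch: grow the filesystem on a reattached, already-resized cinder volume."""
    import subprocess

    DEVICE = "/dev/sdb"              # assumed device name; confirm with `lsblk`
    MOUNT_POINT = "/srv/testreduce"  # assumed mount point on the new VM

    # With no size argument, resize2fs grows the ext4 filesystem to fill the
    # block device. Run this after the volume is mounted on the new VM; ext4
    # supports growing online.
    subprocess.run(["resize2fs", DEVICE], check=True)

    # Confirm the filesystem now reports the new size.
    subprocess.run(["df", "-h", MOUNT_POINT], check=True)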
[22:06:47] andrewbogott, I see that g4.cores16.ram32.disk20 is the most powerful VM I can spin up .. there isn't a cores32.ram64 one. If necessary, I can work around that by spinning up two VMs with cores16.ram32 since the testing service doesn't require that all test clients run on the same VM, but thought I would check first.
[22:10:26] it is possible to create a custom, bigger VM flavor but if you can do it with two smaller ones that seems generally better; are there advantages to having it in one?
[22:18:24] Just less annoying to kick off test runs, etc. but we could probably script all of it to ssh to both servers and update configs, when we want to kick off a new run. So, we can do it. Let me plan to do that for now and I'll get back to you if I need the custom flavor.
[22:23:04] ok. I don't understand your exact use case but it's generally nicer to have more, smaller VMs, they can migrate between hypervisors more easily, are less likely to be noisy neighbors, and maybe you don't have an outage when one of them dies :)
[22:28:13] Actually, there are a couple of common use cases where this breaks for us now ... the clients generate screenshot images + diffs which we then serve via a webapp ... if the tests now run on different VMs, those images are now split across the two disks and cannot all be served from the webapp without a bunch of complexity to transfer images between the VMs. In the single VM scenario, all clients share the same disk as the webapp.
[22:28:39] unless you have ideas for me there. :)
[22:31:17] Think of this as mapreduce (hence it is called testreduce) where there is a test server serving test cases (wiki titles), and test clients processing tests and returning results to the server. Normally the results are all lean and can be sent as part of the web post request to the server, but in this QA use case, the clients also dump 3 largish images on disk (Parsoid HTML screenshot & Legacy Parser HTML screenshot & diff image).
[22:34:53] subbu: store the images in S3 compatible buckets? https://wikitech.wikimedia.org/wiki/Help:Object_storage_user_guide
[22:36:30] or build a project-local NFS server I guess
[22:39:36] okay .... the rabbit hole got deeper. :) In that case, for now, I'll probably spin up a single VM with the same capacity as the current one (and we are no worse off than before) .. and explore the object storage / nfs server solution separately later.
[22:45:09] subbu: I'm a rabbit hole factory. when you run out, stop by and I'll show you some more. ;)
[22:56:15] good to know :) alright .. ttyl. to be continued tomorrow.
[23:22:59] wm-bb: is the bridge up?
[23:23:26] yes it is. not sure why I got an email saying otherwise...
[23:32:26] there was one duplicated message earlier (apparently) (re @wmtelegram_bot: !status NFS misbehaving on toolforge, some tools might be stuck reading/writting to disk)
[23:32:29] but otherwise everything seemed fine afaict
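(For reference, a minimal sketch of the object-storage idea from 22:34:53: test clients upload screenshots/diffs to an S3-compatible bucket so the webapp can serve them regardless of which VM produced them. The endpoint URL, bucket name, and credentials below are placeholders; see the Help:Object_storage_user_guide page linked above for the real endpoint and credential setup.)

    #!/usr/bin/env python3
    """Sketch: upload testreduce screenshots to an S3-compatible bucket via boto3."""
    import boto3

    S3_ENDPOINT = "https://object-storage.example.org"  # placeholder endpoint
    BUCKET = "testreduce-screenshots"                    # hypothetical bucket name

    s3 = boto3.client(
        "s3",
        endpoint_url=S3_ENDPOINT,
        aws_access_key_id="ACCESS_KEY",        # placeholder project credentials
        aws_secret_access_key="SECRET_KEY",
    )

    def upload_result(test_id: str, path: str) -> str:
        """Upload one image and return the key the webapp can use to fetch it."""
        key = f"{test_id}/{path.rsplit('/', 1)[-1]}"
        s3.upload_file(path, BUCKET, key)
        return key

    if __name__ == "__main__":
        print(upload_result("enwiki-Main_Page", "/tmp/parsoid-screenshot.png"))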