[00:06:55] bstorm: So.. apparently cinder is important.
[00:07:17] I'm setting up a Bullseye agent, and the mounting of /srv/ isn't happening: https://phabricator.wikimedia.org/T284774#7332576
[00:07:36] searching on phab brings me to https://phabricator.wikimedia.org/T278641 which says "lvm" is deprecated in favour of cinder
[00:08:06] Although it seems /mnt did get created and contains the expected 40G
[00:08:22] so maybe the puppet code just needs to alias that when on bullseye?
[00:09:36] This comes from role::ci::slave::labs::docker > profile::ci::slave::labs::common > profile::labs::lvm::srv > profile::labs::lvm::srv
[00:11:12] is there a recommendation for what to do for roles that should work the same on older and new instances so that the extra space is always at /srv?
[00:11:34] Yes…. Lemme find the new doc
[00:12:41] Krinkle: This is updated to reflect the new setup. https://wikitech.wikimedia.org/wiki/Help:Adding_Disk_Space_to_Cloud_VPS_instances#Cinder:_Attachable_Block_Storage_for_Cloud_VPS
[00:12:50] We also have a new replacement role you can use
[00:12:53] maybe there is a variable or some other thing I can use to detect it, and then somewhere high in the CI manifests include either lvm::srv or cindermount::srv
[00:12:58] if you prefer that kind of thing...
[00:13:07] I'm not sure how to make both work.. I don't want to break all the other instances
[00:13:49] We've addressed that problem because we had it :)
[00:14:37] Probably the easiest example is our toolforge servers. You might like just looking at what we did in puppet. The LVM volumes won't be affected if you use the new cinder profiles
[00:14:59] We used this in puppet: modules/profile/manifests/toolforge/grid/node/all.pp
[00:15:20] Lots of the grid nodes for toolforge are really using LVM
[00:15:26] It doesn't break them
[00:15:44] In fact, none of them are really using cinder, but that's another issue
[00:15:59] They are really using ephemeral disk like before in those cases
[00:16:08] since we didn't want to actually preserve the data
[00:16:29] This is mostly going over my head :) - from that help page, it looks like it wants me to create each srv volume I want locally to a given instance separately in the GUI, give it a globally unique name, and then attach it to an instance.
[00:16:41] If the CI instances don't need to preserve data on a larger separate disk, you may want us to make you a special image that includes ephemeral disk... and then you apply the "cinder" profile
[00:16:42] That feels wrong, but maybe I'm reading the wrong part.
[00:16:44] and it works
[00:17:11] Yeah, if you need disposable storage instances that can be easily replaced, don't do that.
[00:17:11] I believe that's what `g3.cores8.ram24.disk20.ephemeral40` is?
[00:17:16] yes! :)
[00:17:23] That's the image I'm already using
[00:17:26] Cool
[00:17:33] but the puppet manifest is failing because it's set to ensure lvm
[00:17:39] So you don't need to create actual cinder
[00:17:59] You need to replace the LVM stuff with things like we did in modules/profile/manifests/toolforge/grid/node/all.pp
[00:18:05] I need the puppet manifest to continue to work on other instances but also put the ephemeral 40G at /srv for instances that use cinder
[00:18:09] it uses the cinder name, but it is lying in that case
[00:18:30] It will continue to work on the older ones
[00:18:39] It should not alter those
[00:18:50] You should not need actual cinder for anything
[00:19:05] We did not document this particular case.
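For reference, a minimal sketch of the swap being described here, using only the profile names mentioned in this conversation; the real profile::ci::slave::labs::common in operations/puppet is more involved than this.

    # Hypothetical sketch: replace the LVM profile with the cinder-named one.
    class profile::ci::slave::labs::common {
        # Old approach: carve /srv out of the instance's extra disk with LVM.
        # include profile::labs::lvm::srv

        # New approach: mount whatever unused extra disk is present at /srv.
        # On flavors such as g3.cores8.ram24.disk20.ephemeral40 that is the
        # ephemeral disk rather than an actual Cinder volume; existing LVM
        # setups on older instances are left alone.
        include profile::labs::cindermount::srv
    }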
[00:19:13] However, we did encounter it, obviously
[00:19:27] cinderutils::ensure will use the ephemeral disk from the image
[00:19:32] Instead of actual cinder
[00:19:33] https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/ci/slave/labs/common.pp#L12
[00:19:47] If you have an existing LVM setup, it will leave that alone
[00:20:17] You want to replace that with modules/profile/manifests/labs/cindermount/srv.pp
[00:20:23] Here is where I'm at. If I understand correctly, you're saying I can remove that, put a call to cinderutils::ensure { mount_point: '/srv' } there, and that will alias or move the 40G I got on this new instance from /mnt to /srv?
[00:20:48] You can swap it out with
[00:20:51] profile::labs::cindermount::srv
[00:20:52] and on instances that still use lvm, it will somehow know to mount lvm there instead?
[00:20:55] yues
[00:20:57] *yes
[00:21:16] The thing is, it won't do anything if there isn't a bunch of extra disk around
[00:21:19] or will it no-op on those instances, and creating a new one with lvm would not work?
[00:22:02] Basically, LVM is gone. This will just mount whatever disk is extra, which in this case won't really be cinder, but it's a cinder-named profile
[00:22:19] Most users do not have ephemeral extra disk like CI does
[00:22:40] The LVM class is all exec resources, so it doesn't clean up when removed
[00:22:52] okay, let me try that
[00:24:51] I was really careful about testing this on a few things myself just because I really didn't want to lose existing LVM things, but it's safe to remove the profile from pretty much anything. The new role will mount the extra disk without bothering with LVM (it'll just format it and mount it)
[00:25:03] s/role/profile/
[00:26:01] It would almost be easier if we named it "extrastoragemount" instead of "cindermount" because it really will work with any extra disk.
[00:26:48] I'm cherry-picking it now to see if that gets me further along in provisioning this new instance
[00:26:52] Ok :)
[00:27:11] https://gerrit.wikimedia.org/r/c/operations/puppet/+/717732/
[00:27:37] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, No mount at /srv and no volumes are available to mount.
[00:27:37] To proceed, create and attach a Cinder volume of at least size 3GB. (file: /etc/puppet/modules/cinderutils/manifests/ensure.pp, line: 87, column: 13) (file: /etc/puppet/modules/profile/manifests/labs/cindermount/srv.pp, line: 5) on node integration-agent-qemu-1002.integration.eqiad1.wikimedia.cloud
[00:28:11] I'm guessing something made it go to /mnt already during the initial boot?
[00:28:26] it shouldn't unless there's some other profile in there
[00:28:39] /dev/sdb 40G 49M 38G 1% /mnt
[00:28:47] That's been there since the first boot afaik
[00:28:51] And that's the ephemeral disk, right
[00:29:04] Yeah, it's 40G
[00:29:18] Nothing should do that...
[00:29:26] That's new for me
[00:29:46] And it's formatted?
[00:29:53] You didn't run any cinder scripts did you?
[00:29:59] Like it shouldn't even be a filesystem
[00:29:59] Nope
[00:30:21] Created fresh, added 1 role in Horizon, did all the puppet ssl dancing, and then ran puppet
[00:30:22] what's the instance name?
[00:30:31] integration-agent-qemu-1002
[00:31:25] $ ack -C10 mnt /var/log/syslog
[00:31:36] Sep 3 23:29:06 integration-agent-qemu-1002 systemd[1]: prometheus_puppet_agent_stats.service: Succeeded.
[00:31:36] Sep 3 23:29:11 integration-agent-qemu-1002 systemd[1]: Starting Discard unused blocks on filesystems from /etc/fstab...
[00:31:36] Sep 3 23:29:12 integration-agent-qemu-1002 fstrim[5393]: /mnt: 39.1 GiB (41956876288 bytes) trimmed on /dev/sdb
[00:31:36] Sep 3 23:29:12 integration-agent-qemu-1002 fstrim[5393]: /boot/efi: 117.8 MiB (123555840 bytes) trimmed on /dev/sda15
[00:31:37] Sep 3 23:29:12 integration-agent-qemu-1002 fstrim[5393]: /: 17.5 GiB (18760306688 bytes) trimmed on /dev/sda1
[00:31:37] Sep 3 23:29:12 integration-agent-qemu-1002 systemd[1]: fstrim.service: Succeeded.
[00:31:37] `/dev/sdb /mnt auto defaults,nofail,comment=cloudconfig 0 2` this is in fstab
[00:32:20] This is a new bullseye instance?
[00:32:25] yes
[00:32:42] These syslog entries are 10min after I added the role in Horizon
[00:33:05] so maybe something from the old manifests for lvm caused it to end up there on a run that would eventually fail on lvm not being there
[00:33:25] The lvm things shouldn't impact any of this.
[00:33:35] That comment is weird "cloudconfig"
[00:33:44] * bstorm scratches head a little
[00:34:04] Not many people are using bullseye and you might be the first person to use it with a bigger ephemeral disk
[00:34:14] Soooo, this might be something weird about the cloud-init or something
[00:35:06] I'm checking some things
[00:35:29] so, in a nutshell, you're saying the problem we're having isn't "how to make /srv take over /mnt (cinder-ish) OR mount lvm" but rather, how to not make it have /mnt in the first place.
[00:35:35] Basically, that disk should still be unused until you add a role. It's not. It looks like something not-puppet did that to it
[00:35:46] okido
[00:35:47] No, it's both :)
[00:35:52] You need that patch you made
[00:36:15] yeah, I understand. but I was thinking my next problem is to enhance it so that it usurps /mnt if it's non-lvm
[00:36:16] But I think we need a ticket for the WMCS bunch to fix this other thing possibly about bullseye images
[00:36:29] It would do that even if it was LVM
[00:36:34] On a new instance
[00:36:57] can I do something ad-hoc for now that frees it up just once so the next puppet run works? I can document that meanwhile as a workaround.
[00:37:40] Unmount it and delete the entry from fstab
[00:37:52] You *might* also need to destroy the filesystem, but maybe not
[00:38:39] Have you got time to make a ticket about the weirdness we are seeing here? I need to get to dinner. The /etc/cloud config stuff is very different from other VMs.
[00:38:54] https://phabricator.wikimedia.org/T290372
[00:39:28] Thanks!
[00:39:56] The last time I used mounts on Linux was some 5 years ago when someone challenged me to mount a USB disk on a friend's laptop. That eventually worked, but I'd rather not google copy-paste this.
[00:40:04] But yeah, that puppet stuff should just work if it finds the disk *unused*; it tries to leave things alone that are already set up (which is why I consider it safe for the older VMs)
[00:40:24] That is to say, I don't know what to do.
[00:40:35] here lemme try something (I'm on the VM)
[00:40:41] ack
[00:41:04] looks like that worked
[00:42:12] Oh?
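For reference, the ad-hoc workaround suggested above (unmount /mnt, delete its fstab entry, and possibly destroy the filesystem) could also be written as a one-off Puppet snippet. This is only a sketch, assuming the ephemeral disk really is /dev/sdb as the fstab entry above shows; it is not part of the actual manifests, and in the conversation the cleanup was done by hand.

    # One-off cleanup mirroring the workaround described above.
    # Assumes the ephemeral disk is /dev/sdb, per the fstab entry quoted earlier.
    mount { '/mnt':
        ensure => absent,   # 'absent' both unmounts /mnt and removes its fstab entry
    }

    # Optionally wipe the filesystem signature so the disk looks unused again
    # (the "destroy the filesystem" step; later done by hand with wipefs).
    exec { 'wipe-ephemeral-filesystem':
        command => '/usr/sbin/wipefs --all /dev/sdb',
        onlyif  => '/usr/sbin/blkid /dev/sdb',
        require => Mount['/mnt'],
    }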
I got `Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Could not find resource 'Mount[/srv]' in parameter 'require' (file: /etc/puppet/modules/profile/manifests/ci/slave/labs/common.pp, line: 24) on node integration-agent-qemu-1002.integration.eqiad1.wikimedia.cloud`
[00:43:41] Yeah, cinderutils::ensure isn't doing the mount "/srv" in the else path
[00:43:45] I don't know exactly what that means.
[00:43:49] but I did see /mnt disappear
[00:43:52] so that's progress
[00:43:55] yeah I did that :)
[00:44:03] I probably need to destroy the filesystem
[00:44:11] that sounds exciting
[00:44:26] but this can wait, so we can pick this up on Monday
[00:44:26] I'd probably use require instead of include in your patch, but it probably doesn't matter here
[00:46:36] ok. that's a new keyword for me in puppet. I've only ever seen include. but I see a few other files use that. I'll give that a go
[00:48:24] I'm gonna reboot it. It needs to reset the custom facts
[00:48:51] reboot is easier than remembering commands :-p
[00:50:05] Krinkle: for your filesystem destroying: wipefs /dev/sdb
[00:51:05] Yup, that's what I did :)
[00:51:15] (I have been reading this interesting cinder-not-using-cinder saga)
[00:51:15] Doesn't seem to have pleased our little puppet module
[00:51:55] I'm not really a puppet speaker
[00:52:37] was it automatically mounted again as /mnt ?
[00:52:44] Yes it was.
[00:52:49] Have you seen that before?
[00:52:52] I mean, it isn't now
[00:53:19] no
[00:53:27] Ah ok
[00:53:38] there are a few packages that could do an automatic mount
[00:53:45] I'm reading the puppet facts and not sure why this is skipping it
[00:53:47] but the formatting as well is interesting
[00:53:56] That seems to be cloud-init whatnot
[00:54:00] that caused that
[00:54:30] bstorm: btw, so you're thinking that when the puppet module is correctly mounting it at /srv, it'll always be from the if-branch where it defines the `mount[/srv]` resource? (hence the issue is foremost the mount not appearing, not the puppet resource existing)
[00:55:20] The mount is not there, so puppet fails. The error is that puppet is not finding this disk and processing it... even though now it doesn't fail for the same reason
[00:55:43] I got the disk to reset, but this puppetization doesn't work right for bullseye, I'm starting to think
[00:55:46] I can force it
[00:55:51] Since puppet just runs a script :-D
[00:56:20] https://www.irccloud.com/pastebin/WgHZF0Aa/
[00:56:24] it just runs that
[00:56:30] That's just not actually happening
[00:56:44] It is and not working
[00:56:51] Sorry, or it is and not working
[00:56:56] * Platonides whispers to Krinkle: rmdir /srv && ln -s /mnt /srv
[00:57:44] Platonides: right, that might be an easier workaround at this point, given I'll be documenting at least one shell command to run upon launch, could add that to the mix
[00:58:01] where's prepare-cinder-volume ?
[00:58:20] I'd still need a way though for it to go past this point in the provisioning.
[00:58:21] pretty ugly, but should work as duct tape
[00:58:53] This should not need me to do this by hand.
[00:59:03] I assumed mount[] in puppet is like other resources in that it refers to something declared, not what exists on the server. So if we create it manually it wouldn't satisfy Mount[/srv]. but maybe Mount is special-cased in puppet.
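For reference, a hypothetical illustration of the require-versus-include distinction mentioned above; these class names are made up and are not taken from the real CI manifests.

    # 'include' only ensures the named class ends up in the catalog; it creates
    # no ordering relationship between that class and the resources declared here.
    class profile::example::with_include {
        include profile::labs::cindermount::srv
    }

    # 'require' also puts the class in the catalog, but additionally makes the
    # surrounding class depend on it, so every resource declared here is applied
    # only after the mount profile has finished its work.
    class profile::example::with_require {
        require profile::labs::cindermount::srv
    }

As for the Mount[/srv] question: resource references in Puppet always point at resources declared in the catalog, not at whatever happens to exist on the node, so mounting /srv by hand would not satisfy a `require => Mount['/srv']`.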
[00:59:06] You cannot do what Platonides suggested anymore... I killed the volume :)
[00:59:28] But yes, it needs to be mounted at /srv
[00:59:47] Krinkle, I can force it to work probably. However, I cannot make puppet reliably work tonight
[01:00:04] I can make this VM work. I don't think that's what you need, though?
[01:00:56] bstorm: well, it'd allow me to chip away at the next/main thing for this instance (new Qemu, and all the Jenkins-specific stuff)
[01:01:09] Ok, if that's useful, lemme try the command
[01:02:11] you're running a nested qemu ?
[01:03:08] Yes
[01:03:22] Because that's how we test docker itself
[01:03:22] https://www.irccloud.com/pastebin/Cbp410xJ/
[01:03:50] puppet is running and happy now
[01:03:55] I may be far ahead of you time-zone-wise, but it's my time to tend to dinner too
[01:04:01] Thanks bstorm
[01:04:12] That command is what puppet would have run if it wasn't being strange
[01:04:18] I don't know why it was skipping the device
[01:04:22] But it was
[01:04:33] This is the fun of a new OS, right?
[01:06:17] Ok, I'm out of here for the weekend
[01:06:22] Have a good one!
[14:59:57] so, I like the new "Stop" button in Quarry, but there's just one problem
[15:00:08] every time I press it I get a 500 error
[19:50:20] !log tools.heritage Deploy latest from Git master: b4d3e0e, 339838b (T289929), 7816a36 (T289930)
[19:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.heritage/SAL