[13:30:12] o/ we've been getting `Puppet failure on...` messages for our Cloud VPS project `recommendation-api` for the past week or so. The first set of messages were for the three oldest instances that were up for removal anyways, so I deleted them. I just received a new one though for an instance I created last week with a standard configuration and haven't even touched beyond ssh-ing in to verify it had been created. so maybe more systemic
[13:30:13] problem and at least a few puppet failure phab tasks from recent days showed up when i searched `puppet failure vps`. could someone help me figure out what's going on (or ask me to make a phab ticket out of it is fine too)?
[13:31:20] it would be helpful to know what puppet fails with
[13:32:26] * RhinosF1 is pretty sure he can guess why but without an error it's impossible to know
[13:33:36] https://www.irccloud.com/pastebin/rzWCJHX2/
[13:34:52] so it seems it did complete but with some errors? the email messages themselves don't seem to give much info though. they'll say things like this:
[13:34:58] https://www.irccloud.com/pastebin/eVSCQVzm/
[13:35:14] isaacj_: that looks like a missing hiera value
[13:35:39] it's trying to mount something but can't because lvm isn't there
[13:36:32] ok -- if you have any pointers for fixing, that'd be great. my more general concern though is that I didn't change hiera etc. when setting up this instance so I'm hoping to not have to make this fix every time I create a new instance
[13:37:39] I think it's a role
[13:37:44] that needs adding
[13:38:41] isaacj_: your VM is using a role that depends on an LVM disk being available, but those were deprecated earlier this year: https://techblog.wikimedia.org/2021/02/05/cinder-on-cloud-vps/
[13:39:21] majavah: the docs do tell you to do that https://wikitech.wikimedia.org/wiki/Help:MediaWiki-Vagrant_in_Cloud_VPS
[13:39:52] the docs are outdated, then
[13:39:54] hm... I can help refactor all that if majavah doesn't already have a patch in progress :)
[13:40:06] no, I'm trying to focus on metricsinfra things :D
[13:40:26] do you know how I managed to add this role through the standard instance creation process on Horizon? (i've created tens of instances in the last year and my other ones haven't had this issue)
[13:41:17] because old instances will still need lvm i guess
[13:41:40] you've done everything i can see right
[13:41:41] the role::labs::lvm::srv role is applied to all instances in that project via the "project puppet" panel on horizon
[13:42:16] the process you followed is just very outdated so guaranteed to fail
[13:43:52] hmm... is it possible to update Horizon so I don't use the outdated process? not sure if that's what andrewbogott was referring to?
[13:45:38] i assume andrewbogott will be able to advise on what needs setting and to fix anything wrong
[13:45:45] i'll also note that all four of the instances that triggered these failures are in the same project (`recommendation-api`) and our other projects haven't had puppet failures, so not sure if it's relevant to the config of that particular project
[13:46:03] it is probably our oldest project on cloud vps, though it predates me so i'm not fully sure
[13:46:22] as majavah said, you apply the lvm role everywhere on that project
[13:47:00] it probably needs changing, but the vagrant docs say to use it, so i assume some updates are needed on what to set up
[13:49:29] I am trying to catch up... are we sure that all the VMs in that project are failing for the same reason?
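(Since the pastebins above aren't preserved, here is a rough sketch of how the diagnosis could be confirmed by hand on the affected instance. The role name `role::labs::lvm::srv` and the `/srv` mountpoint come from the discussion; the exact error output will differ, and `run-puppet-agent` is only mentioned as a wrapper that may exist on the image.)

```bash
# Run the agent interactively to see the full error instead of the truncated
# email summary (standard Puppet CLI; run-puppet-agent, if present on the
# instance, wraps roughly the same call).
sudo puppet agent --test

# Check whether the LVM-backed mount the role expects is actually there --
# on a new (Cinder-era) instance there is no extra disk for it to use.
sudo lsblk
mountpoint /srv   # prints "/srv is not a mountpoint" when the mount is missing
```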
[13:50:24] andrewbogott: i didn't run puppet manually for the other three and they're deleted now so I'm not sure how to verify that
[13:50:48] isaacj_: ok, so we're not talking about all the VMs in that project, just one that's there and three that aren't.
[13:50:53] So which VM specifically is of interest?
[13:51:18] the one that's been failing since Friday is `public-turnilo`. you're right that there are three older instances that seem ok
[13:51:27] also: is this something you set up initially isaacj_ or did you inherit existing config?
[13:51:43] (If the latter I won't ask you "why didn't you..." type questions)
[13:51:55] i created the public-turnilo instance last week but the project existed before me
[13:52:06] the three that i deleted were not mine and were all much older instances
[13:52:59] ok. Do you understand roughly how puppet config works per VM? How to see the per-VM/per-prefix/per-project configs?
[13:53:42] no - i've actually never touched puppet before because most of our instances are prototypes that eventually get deleted so there's no strong need for better automation around them
[13:54:21] happy to try to figure it out if need be though
[13:54:49] ok
[13:56:08] So while I catch up, have a look at some Horizon things: the 'Puppet' menu on the left and also the 'Puppet Configuration' tab on each instance detail page. You should see that there's a way to set config for a single VM but also for a whole project or for all VMs with names that start with a given string.
[13:57:26] Right now you have that obsolete puppet class assigned to the whole project. That's working on old VMs because that puppet class still works on legacy VMs but won't work on new ones. So for starters we should probably remove that project-wide setting but add it to each individual older VM.
[13:57:31] Then it won't plague us for new VMs.
[13:57:44] lmk when you're caught up and see how to do that, or if you have questions.
[13:58:06] The offending role is 'role::labs::lvm::srv'
[13:58:20] Works for me - any wikitech page to use for this?
[13:59:02] somewhere... looking
[13:59:08] but in theory the horizon UI is somewhat clear :)
[13:59:37] haha, okay, i'll just dive in then. meetings for the next hour but hopefully will get back to this later this morning or afternoon. thanks!
[14:00:10] sure.
[14:00:53] After that's sorted we'll need to add some new thing to the new VMs if you want storage beyond the default 20GB. That's documented at https://wikitech.wikimedia.org/wiki/Help:Adding_Disk_Space_to_Cloud_VPS_instances#Cinder
[16:36:43] following up: I did as recommended. I removed the mediawiki vagrant role altogether because I know what instance that was used for and it no longer exists. I moved the lvm::srv role to just the one instance where I think it might be used (though I'm not certain that it is actually used by it). i manually reran puppet on everything and they all seemed to complete cleanly. thanks for helping me debug this. i guess the true test
[16:36:43] will be whether the emails stop showing up but for now i'm pretty happy :) i've been meaning to test out cinder anyways, so this is good motivation to do that as well
[18:29:58] isaacj_: that sounds good. The only real room for danger with cinder is that you might want to mount the new volume at a mountpoint that already has useful data, in which case you'll have to shuffle things around. Should be straightforward though.
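(For the Cinder follow-up mentioned above, a rough sketch of the manual attach-and-mount flow behind the linked wikitech page. The device name `/dev/sdb` and the `/srv` mountpoint are assumptions for illustration; the wikitech page also documents a Cloud-VPS-specific helper that wraps these steps, which would be the preferred route.)

```bash
# After creating the volume in Horizon and attaching it to the instance,
# find the new block device (the name varies; /dev/sdb is assumed here).
sudo lsblk

# One-time format, then mount it wherever the data should live (/srv assumed).
sudo mkfs.ext4 /dev/sdb
sudo mkdir -p /srv
sudo mount /dev/sdb /srv

# Make the mount survive reboots; a UUID is safer than the raw device name.
echo "UUID=$(sudo blkid -s UUID -o value /dev/sdb) /srv ext4 defaults 0 2" \
  | sudo tee -a /etc/fstab
```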
[18:34:54] andrewbogott: :thumbs up: I'll play around with it
[19:21:31] !log tools.lexeme-forms deployed de5ab0e740 (l10n updates)
[19:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
[19:34:53] !log tools.wd-image-positions deployed 57b861aaf8 (useful page title)
[19:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wd-image-positions/SAL
[20:03:45] !log tools.wd-image-positions deployed 61358b4346 (one more title and a crash fix)
[20:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wd-image-positions/SAL
[20:05:00] ^ I did that last deployment with `kubectl rollout restart deployment wd-image-positions` instead of `webservice restart`
[20:05:10] still managed to get a 503 in the browser before the tool came back up again, though
[20:05:55] maybe at some point I’ll try to measure whether this results in more or less downtime compared to what `webservice restart` does (delete the pod and expect kubernetes to recreate it)
[20:06:54] i don't think webservice sets a readiness probe, so k8s thinks your app is running when the container is up but uwsgi is still starting
[20:07:19] makes sense
[20:08:47] and we can't really set one for all tools, because not everyone has a suitable monitoring endpoint that it could just blindly use
[20:29:29] would that be something that could go in the `service.template`?
[20:29:58] a `livenessProbe`/`readinessProbe`/`startupProbe` YAML snippet
[20:34:50] lucaswerkmeister: yeah, it could. I was playing with that as an idea at one point but then decided that it really could wait until we get buildpacks
[20:36:46] but it would not be a huge amount of work to let a service.template define some yaml chunks that mostly get copied into the deployment that `webservice` creates
[20:44:02] ok :)
[21:20:20] majavah: I tried a startupProbe connecting to the http port, but still got 503s, so I guess uwsgi starts accepting connections before it’s fully ready :/
[21:21:22] or I did something wrong, that’s also an option ^^
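(On the probe thread at the end, a sketch of the kind of readiness probe that usually closes this gap, patched onto the deployment by hand. The container name "webservice", port 8000, and the /healthz path are all assumptions rather than anything `webservice` actually sets, and a hand-applied patch would presumably be discarded the next time `webservice` rewrites the deployment. The point is the check type: a probe that only opens a TCP connection passes as soon as uwsgi starts listening, while an httpGet probe only passes once the app actually answers requests.)

```bash
# Assumed names: container "webservice", uwsgi on port 8000, a /healthz route
# in the app -- adjust to whatever the real deployment uses.
kubectl patch deployment wd-image-positions --patch '
spec:
  template:
    spec:
      containers:
      - name: webservice
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8000
          initialDelaySeconds: 2
          periodSeconds: 5
'
```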