[07:54:17] !log tools created bullseye VM tools-package-builder-04 (T273942)
[07:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[07:54:21] T273942: sbuild isn't behaving well in tools - https://phabricator.wikimedia.org/T273942
[09:35:36] !log tools live-hacking tools puppetmaster with a couple of ops/puppet changes
[09:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:00:50] !log tools shutdown tools-package-builder-03 (buster), leave -04 online (bullseye)
[12:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:05:44] Hey cloud folks - I have a use-case (an api with auth to access commercial feed data) and wanted to get a sense of how secure wmcs would be (in your opinions) as a possible hosting option. Thanks.
[20:23:25] sbassett: So it's the login/password for the commercial API that you're concerned about?
[20:25:15] sbassett: is this the API I think it is? ;)
[20:28:02] andrewbogott: not necessarily, unless there's no possible way on wmcs to have any kind of private config. the biggest concern would be the ability to wall commercially-licensed data, only providing access to a trusted set of users. and then the next concern, assuming that is feasible, would be building a secure api on top.
[20:28:45] GenNotability: yes, it is. money doesn't appear to be the problem. resourcing and a home is the bigger issue atm.
[20:29:15] sbassett: niiiiice!
[20:29:29] sbassett: this is probably best discussed on a project request ticket. Overall I don't have a ton of concerns; you can restrict access to your VMs to approved users. If you're storing PII on the VMs then that would likely violate our terms of use though
[20:29:43] And if you're just writing a trivial API proxy then we're probably not the best place for that
[20:29:59] andrewbogott: I'll file a protected bug for now with more details, thanks.
[20:30:01] sbassett: I'll note that the WMCS TOU prohibits non-free content, not sure if it applies here
[20:30:29] yeah, it might
[20:30:31] majavah: I'll note that in the task.
[20:30:35] It depends on what content the API is serving
[20:31:40] the api would be serving the commercially-licensed data that we'd need to protect. so if that can't happen on wmcs, then i guess it can't happen. I was basically trying to find any way possible to avoid having to create a production service for this or have it piggyback on something like the api gateway.
[20:32:42] the proprietary software restriction wouldn't be a problem, but the proprietary content restriction could be
[20:32:59] the ToU wording is "Do not use or create content unless it complies with the Wikimedia Licensing policy.", referring to https://foundation.wikimedia.org/wiki/Resolution:Licensing_policy
[20:33:53] AntiComposite: sounds like literally the opposite of what we'd need :) I'd be more than happy to make the api/data storage code open-source or even just use some existing templates (which likely already exist for this problem)
[20:34:04] AntiComposite: the software one might be an issue too if they can't reveal which API it is
[20:34:50] (and I'm not at all a fan of hosting something like that, even if the ToU was not a problem)
[20:34:57] nah, just put the URL in the config file with the keys, and don't make references to the specific site
[20:36:16] well, it would be nice to eat our own dog food (though wmcs is much nicer than dog food) but if we can't do that, then we'll need to keep searching for a home. there's a similar use-case already in wikimedia production for maxmind data, but i didn't really want the api for this to be Yet Another MW Extension.
[20:37:57] I haven't looked at the responses for this particular API, but I doubt it would meet Feist v. Rural
[20:37:58] and a full-on production service is generally a much heavier lift imo
[20:40:01] now if we were copying their entire database over the API there might be a copyright problem :)
[20:40:03] with maxmind you're at least willing to tell us what it is and exactly what it's used for
[20:41:12] majavah: https://phabricator.wikimedia.org/T265845, just subbed you
[20:41:51] tl;dr spur.us has an extremely helpful product to combat res proxy LTAs. but it's not free.
[20:44:20] sbassett: is there a reason you don't want to create a production service for it?
[20:44:54] usually the main obstacle people face when deploying stuff in production is, *checks notes* passing a security review ;-)
[20:45:01] legoktm: no. just that it's more involved and a heavier lift and the secteam isn't quite resourced to be a full-on security tooling engineering team.
[20:46:07] there's an extreme need for such a team at the foundation, imo, but i could blather for hours about that and don't want to ruin everyone's friday in the cloud channel :)
[20:47:11] I would posit that once the initial setup is done (definitely more involved), deploying a service on prod k8s is less involved in day-to-day upkeep than maintaining a cloud VM
[20:48:07] ok, i guess i would've assumed the same. it's just finding some initial cycles for such an effort. and possibly some collaborators.
[20:48:39] anyhow, i'll plan to file a bug for this soon and maybe see how far i can run with it.
[20:49:11] thanks all for the input.
[20:49:24] :)
[22:19:24] If I'm getting `channel 0: open failed: administratively prohibited: open failed` for sshing to a new instance (but the old instances in the project still work fine) is there something obvious I've screwed up?
[22:29:31] James_F: try ssh -vvv and pastebin the output? which instance?
[22:38:42] legoktm: integration-agent-docker-1021.integration.eqiad.wmflabs
[22:39:13] new instances won't have .eqiad.wmflabs hostnames
[22:39:18] Oh.
[22:39:29] Well, right, that'd break all my scripts. :-)
[22:39:30] try integration-agent-docker-1021.integration.eqiad1.wikimedia.cloud
[22:39:42] https://openstack-browser.toolforge.org/server/integration-agent-docker-1021.integration.eqiad1.wikimedia.cloud
[22:39:48] Yup, that works perfectly.
[22:39:51] Duh, thanks.
[22:39:55] :)
[22:40:26] Hmm, it's asking me for a password to sudo.
[22:40:40] Will it also not know how to route to .eqiad.wmflabs?
[22:42:06] I see you listed in the sudo rule, so I don't think that's the issue
[22:42:35] do you mean if it needs to talk to another *.eqiad.wmflabs instance? that'll work fine, as long as that instance actually has that as a hostname
[22:42:37] It's almost like it's been two years since I created a CI agent in WMCS and things have moved on since I last did so.
[22:42:45] > Puppet does not seem to have run in this machine. Unable to find '/var/lib/puppet/state/last_run_report.yaml'.
[22:42:52] I suspect that's the issue somehow
[22:42:55] Yeah, it's a silly question anyway because I'm trying to trigger the initial puppet run.
[22:43:12] First job on our agents is to run `sudo rm -fR /var/lib/puppet/ssl && sudo puppet agent -tv`.
[22:43:22] (How did this work before?)
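(For anyone hitting the same `administratively prohibited` error: new Cloud VPS instances only get *.eqiad1.wikimedia.cloud names, so ssh configs and scripts keyed on *.eqiad.wmflabs need a matching stanza for the new domain. A minimal sketch, assuming the same ProxyJump/bastion host already used for the old hostnames; the placeholder bastion and username below are illustrative, not taken from the log:

    # ~/.ssh/config -- route new-style Cloud VPS hostnames through the existing bastion
    Host *.eqiad1.wikimedia.cloud
        ProxyJump your-bastion-host    # assumption: same jump host as the *.eqiad.wmflabs stanza
        User your-shell-username

    # the connection from the log then works as:
    #   ssh integration-agent-docker-1021.integration.eqiad1.wikimedia.cloud
)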
[22:43:48] and of course, my root key doesn't work because puppet hasn't run yet
[22:44:05] Yup.
[22:44:34] andrewbogott: if you're around, we're having trouble with puppet not running on integration-agent-docker-1021.integration.eqiad1.wikimedia.cloud
[22:45:06] https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup is the runbook, FWIW.
[22:46:27] oh, I think a.ndrew already went offline, probably worth a phab task then
[22:47:15] Yeah, I guess. So much for quickly getting this up and running so I could test it over the weekend. ;-(
[23:11:10] legoktm: is that a new host or an old one?
[23:11:26] and can you ssh to it?
[23:11:31] new, I think bstorm is looking into it though
[23:11:40] yeah, I could ssh in, but only as myself, not with my root key
[23:11:56] https://phabricator.wikimedia.org/T290775 is the issue
[23:12:07] andrewbogott: it wouldn't run puppet even the first time
[23:12:11] cert error
[23:12:22] It has a project puppetmaster, but that shouldn't matter?
[23:12:43] cloud-init didn't work on either puppet run
[23:13:22] have you tried creating other VMs or just that one?
[23:13:54] And do you mean it to be buster even if you're building a new service?
[23:14:26] James_F: ^^
[23:17:35] andrewbogott: Hey. It's bullseye not buster, unless I mis-clicked?
[23:17:54] It looked like buster when I looked quickly
[23:18:06] we're talking about integration-agent-docker-1021 right?
[23:18:18] It's buster
[23:18:38] Yeah.
[23:19:34] Because of the local puppetmaster, puppet should break, but only after it finishes one good run. I think you should delete that buster VM and try again (with bullseye and a different hostname) and see if it goes better
[23:19:43] Good reason to try again? I could manually resolve on the console. I find the state it got into strange.
[23:20:00] Happy to re-image it, sure.
[23:21:04] It may not fail due to the LVM issue I mentioned since one of those patches was cherrypicked at one point
[23:21:31] Re-build triggered.
[23:21:33] Yeah, I don't recognize the particular error (something about cert mismatch?) but it seems moot since it's the wrong OS anyway
[23:21:42] I'll be curious if it happens repeatedly :)
[23:21:55] Yeah. It acted like it couldn't contact the puppetmaster correctly...or talked to the wrong one
[23:22:00] If I don't encounter it with 1021 and it's a fault with the class I'm sure to do so with 1022 etc.
[23:22:45] The LVM issue might not bite you because one of those patches was cherrypicked (I think) at one point. If not, that will stop you. The old LVM classes don't work anymore.
[23:23:07] But it was a red herring for the original problem. I was just guessing :)
[23:23:10] The custom puppetmaster doesn't have an entry for the 1021 box.
[23:23:14] Ha. :-)
[23:23:25] I'm curious to see how to adapt our needs for the new "ephemeral" storage.
[23:23:37] Which is likely beyond me. But I can play around and see.
[23:24:51] James_F: can you try a different hostname please?
[23:25:38] Oh. Sure?
[23:26:17] deleting and recreating a host with the same puppet cert and dns name is asking for trouble, lots of async things that can collide
[23:26:24] !log tools cleared error state for tools-sgeexec-0907.tools.eqiad.wmflabs
[23:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:26:31] I did a "re-image" instead.
[23:26:35] especially if we /know/ that the puppetmaster is mad at that particular cert
[23:27:02] But the puppetmaster isn't?
[23:27:22] Anyway, 1022 spinning up now.
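(Background on the "same puppet cert and dns name" warning above: rebuilding under the same FQDN collides with the certificate the puppetmaster already signed for the old instance. A sketch of the usual cleanup if you do want to keep the name, assuming a puppet 5-era project puppetmaster; exact commands vary by puppet version, and the FQDN is the one from the log:

    # on the project puppetmaster: forget the old instance's signed cert
    sudo puppet cert clean integration-agent-docker-1021.integration.eqiad1.wikimedia.cloud
    # (on puppet 6+, the equivalent is: sudo puppetserver ca clean --certname <fqdn>)

    # on the rebuilt instance: drop the stale client-side certs and re-run the agent
    sudo rm -rf /var/lib/puppet/ssl
    sudo puppet agent -tv
)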
[23:29:05] Yeah, looks like puppet is running. Thanks!
[23:29:55] sure thing. I'm trying to clean up whatever went wrong with 1021
[23:30:07] you know how to reset the certs when it switches masters right?
[23:30:54] https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup says to do `sudo rm -fR /var/lib/puppet/ssl && sudo puppet agent -tv` and then set the new puppet role (including the new puppetmaster). Is that still right?
[23:31:24] yeah that should do it
[23:31:30] assuming it's a fresh hostname
[23:31:58] * James_F nods.
[23:35:11] Now I'm on to the next error, which is probably the lack of the patches bstorm pointed to.
[23:37:02] With those, somebody has to merge them.
[23:37:50] Yeah, but I can pull them onto the project puppetmaster first to try them out?
[23:38:29] Well, the one now has conflicts
[23:38:41] Oh, and https://gerrit.wikimedia.org/r/c/operations/puppet/+/717732/ is already soft-deployed there.
[23:38:51] heh
[23:38:58] Thanks, Krinkle. ;-P
[23:39:18] I think that one should just remove the require Mount[srv] because I think it is unnecessary since the require at the top takes care of it
[23:39:30] it's just not obvious and fuzzy (which I get)
[23:39:39] That's why it's not merged
[23:39:50] * James_F pretends he knows what this stuff does.
[23:40:00] It fails rspec because the Mount is only created if puppet has facts
[23:40:11] puppet doesn't have host facts in rspec, only in puppet compiler
[23:41:05] But yeah, the puppetization probably still needs the changes for the docker volumes as well
[23:41:16] Almost certainly.
[23:41:24] We're replacing instances with 80G of disk.
[23:41:39] Going to 20+40 is probably OK, but going to 20 is not. :-)
[23:41:41] I haven't updated my cherry-pick since the initial draft
[23:41:51] the patch has evolved since
[23:41:54] Yeah, no worries, I'll fiddle.
[23:42:05] Re-applying it should remove the Mount errors
[23:42:22] The "cinder" volumes accept the ephemeral storage. Is there enough ephemeral storage for all your needs in these images?
[23:43:06] bstorm: take note of the docker volume problem though. This is why I marked my patch as WIP because andrewbogott and hashar were looking at this a few weeks ago and seemed to find a problem when there are two things wanting different amounts of the space.
[23:43:06] bstorm: Maybe? Mostly having less disk space just means CI will be slower as there'll be more juggling of CI docker images when running patches.
[23:43:20] for the qemu agent, /srv takes all the space
[23:43:35] for the docker agents, /srv should take the remainder with some other portion going to the docker cache
[23:43:38] Ok, so this might work for qemu, but not so much for docker
[23:43:51] See also https://gerrit.wikimedia.org/r/c/operations/puppet/+/670524
[23:43:56] yeah
[23:44:29] Ok, so in that case, I don't expect this to be super easy for you to test with this weekend James_F :(
[23:44:44] No, but if it was easy it wouldn't be worth doing.
[23:44:56] lol, well that's the spirit :)
[23:46:32] * Krinkle plays a subtle echo of Kennedy saying "[…] we do this not because it is easy, but because it is hard"
[23:46:36] Good luck. I'm not going to have time tonight to help sort it all out really
[23:47:09] lol
[23:47:17] :-)
[23:48:56] I think the docker runners are going to need multiple disks...and possibly a bit more structure.
[23:49:10] at least the way this is all put together right now
[23:49:18] Ack.
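(Aside from the puppetization in the Gerrit patches above, the underlying idea of giving the docker image cache its own slice of the extra disk can be sketched by hand. A minimal, hand-rolled sketch, assuming a secondary disk is already formatted and mounted at /srv; the paths are illustrative, not what the patches actually do:

    # stop docker, point its image/layer store at the big disk, restart
    sudo systemctl stop docker
    sudo mkdir -p /srv/docker
    echo '{ "data-root": "/srv/docker" }' | sudo tee /etc/docker/daemon.json
    sudo systemctl start docker

    # confirm where docker is now writing its data
    docker info --format '{{ .DockerRootDir }}'    # expect /srv/docker
)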
[23:49:39] if you create an *actual* cinder volume, you can attach it and use it
[23:49:49] but that seems silly for a CI thing
[23:51:23] If I'm right about that, then you might need some new flavors created.
[23:51:48] Yeah; we're much more CPU/RAM bound than disk, really, but…
[23:51:58] All the existing agents are g2.cores8.ram24.disk80 boxes.
[23:52:33] with 70% lvm going to docker cache and the rest to /srv/
[23:53:55] I see...which is not really what the cinderutils classes do
[23:53:55] the ephemeral40 flavour is private and seems to be created specifically for CI
[23:54:03] I wonder why with 40 when afaik most instances have 80
[23:54:25] 20+40=...🤷🏻‍♀️
[23:54:27] Trying to push down on storage?
[23:54:40] That may have just been for testing
[23:54:52] I can claim ignorance because I wasn't helping then
[23:55:04] the old "disk80" was 80G for extended and the base presumably 10 or 20 right?
[23:55:34] No, typically the first 20 was the root and then the rest was unallocated
[23:55:39] ah okay
[23:55:45] LVM added the rest for you
[23:56:13] I'm curious why the separation exists - apart from end-users sometimes preferring to manage disk full errors separately
[23:56:28] Now, you have a disk and you can add other disks. Most people just have cinder, but in some cases we've made flavors that include an additional ephemeral disk (ephemeral as in attached to the life of the VM)
[23:56:30] is there a benefit in openstack to having boot disks be small or separate from other instance-specific data?
[23:56:47] The cinder disk is detachable and thus "persistent"
[23:57:09] We moved most people to that model, but it doesn't make sense for CI or toolforge nodes
[23:57:18] right
[23:57:29] In openstack, a VM is really an instantiation of an image
[23:57:46] that eases upgrades and such and presumably decreases the need for backup scripts and NFS etc
[23:57:58] decreases space on our storage cluster
[23:57:59] since the cinder disk isn't going to go away when the instance dies
[23:58:03] yeah
[23:58:38] If there's a single cinder disk shared across all the CI agents that'd be quite good.
[23:58:41] Again, not so useful for your stuff, though. You can create them and use them for testing just to get around problems here, but it's not very repeatable and is a bit silly since you don't value the data per se
[23:58:50] ok, so for the ephemeral case, the boot part and the other parts are all local and stored the same way, it's just partitions with the same VM being managed by the same cinder utils?
[23:58:59] James_F, that you can't really do
[23:59:01] Assuming we can convince docker to see it as a shared local cache of images so we only have one copy per instance.
[23:59:03] Ah. :-(
[23:59:10] Typical, dreams running ahead of reality.
[23:59:14] Unless you connect it to something you use as an NFS server
[23:59:22] Since it's a filesystem, etc.
[23:59:59] Krinkle, yes. They are just disks in the VM and cannot be separated
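(For reference, the "actual cinder volume" route mentioned above amounts to attaching the volume to the VM via Horizon or the OpenStack CLI and then formatting and mounting it like any other disk. A minimal sketch from inside the instance, assuming the attached volume shows up as /dev/sdb; the device name and mountpoint are illustrative, and Cloud VPS documentation may provide wrapper tooling that does the same thing:

    # find the newly attached, still-empty disk
    lsblk

    # format it and mount it at /srv (destructive: only run mkfs on an empty volume)
    sudo mkfs.ext4 /dev/sdb
    sudo mkdir -p /srv
    sudo mount /dev/sdb /srv

    # make the mount survive reboots
    echo '/dev/sdb /srv ext4 defaults 0 2' | sudo tee -a /etc/fstab
)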