[00:38:48] * bd808 off
[09:25:29] morning
[09:26:43] o/
[09:28:03] o/
[09:30:07] morning
[09:36:26] how do you prevent each ansible run of lima-kilo from downloading all the binaries again?
[10:01:58] please approve: https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/110
[10:07:03] you use the ansible tags to focus on what you want to test
[10:13:00] mmm ok
[10:19:12] dcaro: now that I have your attention, may you please review this one? https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/109
[10:21:35] 👀
[10:23:30] the branch naming you use is a bit cumbersome
[10:23:54] https://usercontent.irccloud-cdn.com/file/FVvd5XAU/image.png
[10:24:05] they are cut so they don't read the whole thing
[10:24:09] xd
[10:24:36] yeah, I usually put a string limit so it doesn't get too long
[10:24:53] you can strip your name and the bug id I guess (if that's a bug id)
[10:25:02] it's a random int
[10:25:45] what's it for?
[10:26:12] so I generate the branch name from the commit title
[10:26:26] (I have a little script https://github.com/aborrero/ansible-setup/blob/main/roles/git/files/gr.sh)
[10:26:46] and it happened to me that I ended up generating the exact same branch name, which is a disaster, that's why the random int
[10:27:46] which commit title do you use if you have more than one commit?
[10:28:22] I don't usually do more than one commit, the script does not handle that case
[10:28:56] it would be fully manual if I ever need to do that, so whatever branch name I create by hand
[10:36:40] thanks!
[10:38:02] is there an easy way to load my .bashrc on the lima-vm?
[10:42:38] copying it over I guess (either using the mount, or limactl copy)
[10:42:57] ok
[10:42:58] we could add something to the scripts to add it if present or similar
[10:57:53] metricsinfra-puppetserver is crashing when starting the puppetserver, it complains about missing directories and permissions. It's using puppet7, is that something that we are working on? (I think andrewbogot.t was doing something there?) anyone know?
[10:58:36] maybe that is in the middle of the movement to the cinder volume for the puppet dir?
[10:59:17] maybe, looks like all the puppet manifests and such are not there
[11:00:01] I'll ping andrewbogot.t later just in case
[11:00:33] probably https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009588
[11:00:51] yeah
[11:19:23] I'm investigating the haproxy alert for cloudcontrol1007
[11:20:30] 2024-03-07 20:04:09.839 826784 ERROR sqlalchemy.pool.impl.QueuePool pymysql.err.OperationalError: (2006, "MySQL server has gone away (BrokenPipeError(32, 'Broken pipe'))")
[11:20:31] ???
[11:23:21] restarted the service, the error went away
[11:24:42] I think I've seen that error before
[11:24:58] it usually happens when the connection to the DB times out, but sqlalchemy should be able to restart it
[11:25:11] but well, maybe not this time
[11:54:54] is this URL responding for you? https://openstack-browser.toolforge.org/project/tools
[11:56:48] it did open, but it took a long time (like 10 seconds)
[11:57:00] ok, same here, but it was almost 1 minute
[12:01:26] * dcaro lunch
[12:01:51] can I get a review for https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/218?
[12:02:07] taavi: approved
[12:02:12] thank you
[12:39:51] new upstream k8s deb repo doesn't have a component? I wonder how that will work with reprepro
[12:41:26] see my comment in your cr. i may or may not have solved that problem in my homelab some time ago
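The "MySQL server has gone away" error pasted at [11:20:30] is the classic symptom of a pooled connection idling past the database's wait_timeout and only being found dead on the next checkout. Below is a minimal sketch of the SQLAlchemy-side mitigation, with a placeholder DSN; the actual OpenStack services tune this behaviour through their oslo.db configuration rather than in code like this:

```python
# Sketch: keep a SQLAlchemy pool from handing out connections that MySQL
# has already closed ("MySQL server has gone away"). The DSN is a placeholder.
from sqlalchemy import create_engine, text

engine = create_engine(
    "mysql+pymysql://user:secret@db.example.org/nova",
    pool_pre_ping=True,   # cheap liveness check before reusing a pooled connection
    pool_recycle=3600,    # retire pooled connections older than 1h, below MySQL's wait_timeout
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())
```

With pool_pre_ping enabled a stale connection is transparently replaced instead of surfacing as an OperationalError on the next query.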
[13:14:11] taavi: thanks. I guess you had to read the reprepro source code for that? :-P
[13:28:38] would you like to +1 the patch?
[13:30:04] yes, done
[13:30:09] thanks
[13:37:56] taavi: follow-up patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009754 please approve
[13:38:02] looking
[13:38:09] +1
[13:38:15] thanks
[13:51:18] taavi: another follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009760 please approve
[13:52:46] +1
[13:55:03] thanks
[14:11:40] taavi: I notice looking at this host that you're prepping it for some of the new openstack POC?
[14:11:48] i.e. with self-serve networks etc?
[14:12:20] are there any docs or wikitech on that? it's motivated me to finish that diagram I was working on and share it with you
[14:15:09] topranks: yeah, cloudvirt2001-dev is currently in use as an OVS testbed. so far I have the agent running, but VMs are still failing to start. I'm temporarily using https://wikitech.wikimedia.org/wiki/User:Majavah/Cloud_VPS_Open_vSwitch and T358761 to document what I've done, this will all be improved once I actually have a VM running
[14:15:10] T358761: Deploy OVS test setup in codfw1dev - https://phabricator.wikimedia.org/T358761
[14:15:42] ok, I have no "end state" docs but I will add a little text and some diagrams there
[14:15:57] FWIW I think we will need a vlan stretched between the c8/d5 "spine" racks :(
[14:16:41] the alternative is another VRF on the cloudsw but I think that's too much complication for now
[14:16:50] thanks!
[14:18:48] thank you!
[15:02:05] andrewbogott: are the puppet alerts (https://alerts.wikimedia.org/?q=team%3Dwmcs) related to the puppet7 upgrade?
[15:02:19] dhinus: yep
[15:02:34] taavi: I'm having one problem after another with puppet7 servers, and then I noticed that puppetmaster1001 is still running puppet 5... do you know what the story is with production? I thought everything in prod was upgraded but it seems not.
[15:14:09] andrewbogott: a large portion of production is upgraded. but there is still a non-zero number of hosts that haven't been upgraded, most notably all buster hosts and some databases
[15:14:26] puppetserver* are the puppet 7 hosts in the wikiprod realm
[15:15:22] ok, I see now. So we still have some of our hosts (e.g. cloudcontrol1006) to migrate as well it seems
[15:15:45] So setting up a puppetserver is working for someone :/
[15:16:00] I mean, I can make it work eventually but certainly not by setting it up with puppet
[15:17:07] huh? all of our non-ceph hardware should be on puppet 7 already
[15:17:57] that's what I thought too, I must be missing something.
[15:18:01] if you have a non-working instance I can take a look a bit later
[15:18:13] Ok, so I'm on cloudcontrol1006 and I'm trying to answer the question 'what puppetserver are you using?'
[15:18:33] I look in /etc/puppet/puppet.conf and no server is mentioned. So I assume that means it's using the default which is 'puppet'
[15:18:50] does the MOTD have a yellow 'this host has been migrated' line?
[15:19:13] yes
[15:19:49] So something in "So I assume that means it's using the default which is 'puppet'" must be wrong
[15:20:14] it's using SRV discovery DNS records
[15:20:56] https://gerrit.wikimedia.org/g/operations/dns/+/036a2a202d4b606c94f57343ead38914aedef18e/templates/wmnet#20
[15:22:18] hmmmm
[15:22:30] so the hostname 'puppet' is a different host than the service 'puppet' :/
[15:25:19] if the config option to use SRV records is set, yes
[15:27:01] I hate that but I'll live with it for now
[15:29:50] I think the magic you're missing is https://gerrit.wikimedia.org/g/operations/puppet/+/a5150ff71674ff42d45ce9372d8bb0afdfb48a78/modules/profile/manifests/puppet/agent.pp#60
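For reference, the SRV discovery described above is why puppet.conf has no server line: when use_srv_records is enabled the agent asks DNS which hosts provide the puppet service for its srv_domain, instead of resolving the bare hostname 'puppet'. A rough sketch of that lookup using dnspython, assuming the standard _x-puppet._tcp service name and an illustrative srv_domain (the real records live in the dns repo linked above):

```python
# Rough illustration of puppet's SRV-based server discovery: the agent
# resolves the service record rather than a literal "puppet" hostname.
import dns.resolver  # pip install dnspython

SRV_DOMAIN = "example.wmnet"  # assumption: whatever srv_domain puppet.conf sets

answers = dns.resolver.resolve(f"_x-puppet._tcp.{SRV_DOMAIN}", "SRV")
# Simplified ordering: lowest priority first, then heaviest weight.
for record in sorted(answers, key=lambda r: (r.priority, -r.weight)):
    host = record.target.to_text().rstrip(".")
    print(f"{host}:{record.port} (priority {record.priority}, weight {record.weight})")
```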
[15:41:02] * arturo hey, I'm thinking of upgrading toolsbeta k8s to 1.24 next monday. Please let me know if that works for you, or if you think I should reschedule.
[15:48:19] are we backing up etcd nodes?
[15:53:07] andrewbogott: did you have a broken puppet 7 server you wanted me to look at?
[15:53:30] arturo: no, T339934
[15:53:30] T339934: [etcd] Find a backup solution for the etcd database - https://phabricator.wikimedia.org/T339934
[15:57:24] taavi: I'm building fresh, will let you know when it's ready (= not ready)
[16:05:03] taavi, arturo: I added some detail to taavi's wikitech page
[16:05:04] https://wikitech.wikimedia.org/wiki/User:Majavah/Cloud_VPS_Open_vSwitch
[16:05:18] 👀
[16:05:29] let me know what you think, the new vlan mentioned is not yet deployed but we can do it fairly quickly when needed
[16:09:27] taavi: it's going better this time; I suspect that many of the issues I'm having were disagreements between 5 and 7 about permissions. I'll see if I can pin that down when I migrate the next project
[16:12:13] topranks: thanks, great write-up and diagrams
[16:12:24] topranks: I have a couple of questions
[16:12:45] np, we can obviously change things as we work them out, these are just my initial thoughts on it
[16:12:49] arturo: yeah fire away
[16:13:39] 1) we most likely will need to migrate each VM individually, because they will need to detach from the current neutron "port" and attach to a new one in the new self-service virtual subnet. I mention this because it means the 2 setups will definitely need to co-exist for a while
[16:14:30] co-existence is fine, but my gut sense is it's better to "rebuild" the VMs/services on the new setup, as opposed to providing a way to migrate them?
[16:14:55] I'm agnostic on that though, if we need to do something network-wise to support what you have in mind we can discuss
[16:15:01] I'm not sure we will have the option of a full rebuild, but we will see
[16:15:04] 2) I don't fully understand why the stretched vlan, instead of routed BGP between all the parties (cloudsw, cloudgw, neutron)
[16:15:43] on 2) if the cloudsw is between cloudgw and neutron then it's participating in the routing
[16:15:45] you mention in the docs the problem of 2 different BGP peers announcing the default route. Can't that be filtered somehow?
[16:15:57] cloudgw needs to announce a default route for neutron to use and send all external traffic to
[16:16:31] if the cloudgw sends that default to the cloudsw (for cloudsw to propagate onwards to neutron), then the cloudsw now has a default route that takes all the traffic to the cloudgw
[16:16:48] the question then becomes - how does the cloudgw send traffic to the cloudsw that's supposed to go out to the internet?
[16:17:23] can't we just filter default routes in cloudsw to only be accepted from CR?
[16:17:29] in the scenario I describe cloudgw will forward internet traffic to the cloudsw - and cloudsw will send it right back to cloudgw because that's where it's got a default route
[16:18:01] arturo: if the cloudsw does not accept the default the cloudgw would send, then traffic from neutron won't get sent to cloudgw - it will get sent to CR
[16:18:31] you have the fundamental problem that the CR and cloudgw both send a default
[16:18:54] they are in conflict. the only way to resolve it is to have two VRFs, an "inside" and "outside" one
[16:19:01] inside with the default to cloudgw
[16:19:06] outside with the default to CR
[16:21:41] ok, I was under the impression that BGP peering and filtering could solve this
[16:22:04] if you get *really* creative with BGP and policy routing you could potentially solve it without another vrf
[16:22:19] but a vrf is by far the better way to achieve it - that would be an awful mess :)
[16:23:35] I think for now the L2 in both racks is the easier way. Ultimately it's the same idea - a new virtual network between cloudgw and neutron, but making that just a vlan/L2 means we don't need to deal with the conflicting defaults learnt on the switches
[16:23:51] I mean, just to clarify: if the peerings are just cloudsw <-> cloudgw and cloudgw <-> neutron, we cannot just tell cloudgw to only announce the default route on the neutron peering side?
[16:24:37] cloudsw sends the default to cloudgw
[16:24:49] cloudgw propagates the default it learns from cloudsw to neutron
[16:25:06] we can totally control what we send/accept on the cloudgw
[16:25:15] ok
[16:25:38] "we cannot just tell cloudgw to only announce the default route on the neutron peering side"
[16:25:49] we can if we keep the network between cloudgw and neutron at L2
[16:26:45] if we jam the switch between the cloudgw and neutron then we need to move that separation up the stack with a VRF at L3
[16:28:24] I'll try to wrap my head around this
[16:31:59] topranks: so, if cloudgw and neutron are connected via cloud-private, and cloudgw announces the default route, then neutron would no longer know how to get to cloudgw via cloud-private itself. Is this another way to express the same thing that you are explaining?
[16:32:42] cloudgw and neutron are connected on the new L2 vlan and exchange routes directly between each other over that
[16:33:08] the cloud-private subnets are in the cloud-vrf on the cloudsw, which is learning a default route from the CRs
[16:35:58] the reason I'm focusing on this is because the L2 adjacency requirement is something that has bitten us in the past. Maybe a limitation in the current design. If we could not have this limitation in a potential new design, that would be better
[16:36:57] I prefer not to do it myself. But it's totally safe because it only needs to extend between two switches, cloudsw1-c8 and cloudsw1-d5
[16:37:10] there is no "triangle", "square" or any other loop formed at layer-2
[16:40:28] ok
[16:41:06] tbh I can't really see any downside to it.
[16:41:53] if you feel strongly I can propose the second vrf idea, but when that was discussed before (for a different but similar reason) it wasn't met with wild enthusiasm
[16:42:38] I also think it's a lot of complication to add right now, I figure we are better off keeping it simple and reviewing how the POC goes
[16:42:58] I agree. This is only speculation at this point
[16:43:22] but previously it feels like we are bound to 2 racks that are manually configured (the L2 adjacency thing). So I'd rather not have that
[16:43:55] well those are the racks where the CRs connect, so they are the head-end for the network regardless
[16:44:20] like I said on the call last week, that's where you need to put the cloudgw and neutron head-end, for physical topology reasons
[16:44:21] meaning: if we ever racked a cloudnet box on a different rack/switch, we would need additional manual config
[16:44:21] ok
[16:44:45] it'd be a bad idea to give people the option to do that though :)
[16:47:12] imagine you had cloudnet in E4, and a cloudvirt in F4.
[16:47:25] external traffic from cloudvirt would go F4 -> [C8/D5] -> E4 -> [C8/D5] -> CR
[16:47:33] if you put the cloudnet in C8 or D5 it goes
[16:47:38] F4 -> [C8/D5] -> CR
[16:47:55] 3 hops between racks rather than 5
[16:50:37] ok
[16:52:09] openstack has a way to address that, called distributed virtual routing, or DVR, which kind of injects the necessary routing on each hypervisor, removing the need for egress traffic to hit the cloudnet server
[16:52:50] that works if we don't have the cloudgw in the mix
[16:52:54] I haven't investigated how to introduce that in our network so far. It may be easier now with OVS
[16:54:20] as in, the current problem is how do we get traffic from cloudnet to the cloudgw
[16:54:37] removing the cloudnet leaves us with the same problem, it just becomes how do we get from cloudvirt to cloudgw
[16:55:15] ok
[16:55:54] if we did have that problem we'd *have* to add another vrf, as we can't stretch a vlan across more than 2 racks
[16:56:19] I see
[16:56:44] but as long as the cloudnet is part of the picture we should locate it in c8/d5 where the spine switches / CR links are
[16:57:23] and if we have cloudnet/cloudgw in just those 2 racks then we can do a simple vlan (5 mins work) between them
[16:57:39] ok taavi, here's the latest: metricsinfra-puppetserver-1.metricsinfra.eqiad1.wikimedia.cloud seems to be unable to see /srv/git/operations/puppet (client metricsinfra-meta-monitor-1.metricsinfra.eqiad1.wikimedia.cloud)
[16:57:53] ok, let's see
[16:58:29] there must be log files about this someplace...
[17:01:33] andrewbogott: metricsinfra-puppetserver-1:/srv/git/operations/puppet exists but seems to be missing some local commits from metricsinfra-puppetmaster-1:/var/lib/git/operations/puppet
[17:02:12] is that all that's happening? I tried moving that 'puppet' dir out of the way entirely and got the same failure on the client
[17:02:35] https://phabricator.wikimedia.org/P58693
[17:02:54] role::wmcs::metricsinfra::meta_monitor only exists locally because that patch is still in development/discussion
[17:03:56] arg, this is different from what I was seeing 10 minutes ago. Hang on...
[17:04:39] * arturo offline
[17:08:32] taavi: ok, here's a different issue:
[17:08:40] https://www.irccloud.com/pastebin/gdyxOXTR/
[17:09:20] that's the issue I was trying to fix on the puppetserver which probably resulted in my accidentally resetting the local patches
[17:09:41] the whole point of -a is to not change the ownership, so why is git worried now?
[17:10:37] (this is not even a puppet issue anymore, I'm getting ever further from my goal)
[17:13:20] we should honestly just disable that git anti-feature
[17:13:50] Assuming it's just a pointless complaint I can set that directory as allowed
[17:14:08] but I don't understand why it's confused by the permissions of that dir when it should be an exact duplicate
[17:14:23] (... which leads me to think it isn't really a duplicate)
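The pastebin above is git's "dubious ownership" safety check (the safe.directory setting introduced in git 2.35.2): git refuses to touch a repository whose on-disk owner differs from the user running it, and rsync -a deliberately preserves the original owner, which is how a freshly copied repo trips the check. A small sketch of what is being checked and the usual allow-list workaround, with the repo path taken from the discussion and everything else illustrative:

```python
# Sketch: why git complains after an "rsync -a" copy, and the usual fix.
import os
import pwd
import subprocess

REPO = "/srv/git/operations/puppet"

repo_owner = pwd.getpwuid(os.stat(REPO).st_uid).pw_name
current_user = pwd.getpwuid(os.geteuid()).pw_name

if repo_owner != current_user:
    # git >= 2.35.2 reports "detected dubious ownership" here unless the
    # path is explicitly allow-listed in the system or global config.
    subprocess.run(
        ["git", "config", "--system", "--add", "safe.directory", REPO],
        check=True,
    )

# With the allow-list entry in place, normal git operations work again.
subprocess.run(["git", "-C", REPO, "status", "--short"], check=True)
```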
[17:23:59] * dcaro off
[17:24:01] have a good weekend
[17:27:02] taavi: ok, /now/ you can see what I was seeing, which is that the class is present on the puppetserver but the client still complains about not seeing it
[17:30:52] ok
[17:30:58] did you run the puppetserver-deploy-code script?
[17:31:22] dp
[17:31:28] I don't know what that is but it sounds promising
[17:31:36] yeah, that's needed now
[17:31:39] is that not something that puppet should do as part of setting up the server?
[17:32:16] basically that needs to be run every time something is changed in /srv/git/operations/puppet. there's a git hook that will normally take care of it, but if you just rsync data in or otherwise bypass git you need to do it manually
[17:32:56] should git-sync-upstream do it?
[17:33:37] if it changes something
[17:34:06] crap, apparently running puppetserver-deploy-code requires 5x as much space as the git repo takes up otherwise
[17:34:08] * andrewbogott resizes
[17:35:02] do you have a cwd on metricsinfra-puppetserver-1? I need to unmount /srv
[17:35:12] logged out
[17:35:28] thx
[17:44:22] yet more monkeying with permissions and it seems to be working now. Thanks taavi!
[17:44:32] great!
[17:44:55] * andrewbogott checking now to see if that puppetserver works with other clients
[17:57:10] Do we have an existing method of connecting between hosts in a project with ssh?
[18:09:41] Hey, I was just struggling with that earlier in the week.
[18:09:58] Not really a ready-made method. If you want to copy files there's a puppet class to install an rsync server
[18:10:25] the firewall should be open to port 22 between hosts, but auth doesn't have a great solution
[18:12:22] So if I wanted to rsync all the files from server a to server b, there aren't many direct options?
[18:28:02] * bd808 lunch
[19:18:03] Rook: sorry, missed you before. Easiest is probably to use a cinder volume. You can also scp from your laptop with two remotes but that will take hours.
[19:18:53] What if we allowed local ssh key verification?
[19:19:53] I'm not 100% sure that we don't already
[19:20:26] If it's currently disabled, I'd rather we enable it on a temporary basis rather than across the board, in order to discourage users from leaving private keys sitting around for years
[19:22:05] It's kind of allowed, until puppet runs
[19:22:23] The key has to be in /etc/ssh/userkeys which gets scrubbed when puppet runs
[19:23:31] Are you trying to solve a one-off problem or automate this for repeated use?
[19:23:49] the latter
[19:24:11] I would like to be able to manage nodes within a project from a bastion node within the project with ansible, thus ssh
[19:24:44] Ah, so rsync isn't good then
[19:25:08] running it all externally is less pretty, as k8s needs to be local, and some kind of tunnel or the like would need to be set up. And it messes with the user's laptop settings to do so.
[19:25:09] would it work to have puppet install a keypair with limits on the remote commands?
[19:25:15] I think we have examples of that already.
[19:25:17] Yeah, sometimes it's rsync, but a temp key will solve that
[19:25:42] The hope was to avoid puppet entirely
[19:26:20] Wait, so when you said automate for future repeated use... you meant just on one server
[19:26:29] not, like, something that can be reproduced on other hosts at other times?
[19:26:49] Yes it would be one server in the project
[19:27:33] ok -- I think you either need to do it with puppet, or (probably) just stop puppet entirely on the host in question.
[19:27:48] hmm...
[19:27:57] if you look at /etc/ssh/userkeys/root.d/ there's a cumin key that's installed pretty much everywhere
[19:28:16] Running all the stuff from a laptop feels icky, but is sounding better...
[19:29:21] Another option would be to run things from a cloudcumin host, and re-use the cumin keys.
[19:29:36] That might require some puppet engineering but it would give us a future ansible solution cloud-wide
[19:29:39] (well, for admins at least)
[19:30:04] I'm trying to make something that doesn't feel like a bespoke wiki thing, so that one doesn't have to know a lot of strange things to work in our env
[19:30:46] sure
[19:31:04] But the only bespoke thing we're really facing is the installation of keypais
[19:31:07] *pairs
[19:31:22] I would put cumin and puppet in the bespoke category as well
[19:32:15] Right, what I mean is:
[19:32:42] if we write an 'ansible keypair' puppet class that dumps a private key on a 'controller' and dumps the public key on the 'clients' for a given project...
[19:33:03] That's bespoke puppet code but it's doing something extremely obvious and normal (copying keys onto hosts)
[19:33:08] after that you're just doing normal ssh things
[19:33:20] It doesn't seem super bespoke to say "here's how to set up your keys"
[19:34:18] Controlling things from your laptop also seems like a very normal ansible use
[19:34:55] Hey, the immutable trick still works! I wonder how upset that would make the monitoring...
[19:35:18] you mean puppet can't remove a file if it's marked immutable?
[19:35:44] yeah, the last time I was using puppet (about a decade ago) we did that all the time
[19:38:43] So a class that makes a key for a control node and shares the public half with the other debian systems shouldn't require a local puppet master, yes?
[19:39:14] Hmmmm
[19:40:55] The most trivial version of this is just installing a public key on every host in a project, right? So that would mean supporting a hiera setting along the lines of "extra_private_keys: []"
[19:41:36] Then you'd just make your keypair yourself, manage the private key yourself, and add the public key to the horizon UI.
[19:41:42] puppet couldn't go to the control node, generate the keypair, then copy the public half from there?
[19:41:48] If we want puppet to /make/ the keypair then that's harder.
[19:42:02] Because puppet nodes are largely opaque to one another.
[19:42:15] And we do this to ourselves why?
[19:43:04] You mean the opaque state thing, specifically?
[19:43:16] Do you want a technical explanation or a design explanation?
[19:43:31] I meant why do we still use puppet, but that would be one of the reasons that raises the question
[19:44:04] I'll see if I can get kubectl working through a proxy. I don't like that so much as it requires local setup that would be hard on, say, a windows instance. The current assumption is an admin for a project only needs ssh, which even windows has by default now
[19:44:43] Basically I feel we are working against our interest of letting anyone (with any kind of laptop) work on our projects with some of our approaches
[19:44:57] https://phabricator.wikimedia.org/T355963
[19:46:50] That is one option. Though we lose collaboration on that front
[19:52:58] Anyway thank you for musing on possibilities
[20:04:16] yep!
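For the record, the "most trivial version" floated above boils down to: generate a keypair once, keep the private half on the in-project control node, and have something (puppet, the hiera setting mentioned, or the Horizon key UI) distribute the public half to the other instances. A minimal sketch of that first step using the cryptography library; the file names are illustrative only, and anything dropped by hand into /etc/ssh/userkeys gets scrubbed on the next puppet run, as noted earlier:

```python
# Sketch: generate a project-local ed25519 keypair for an in-project
# ansible control node. File names are illustrative, not a convention.
from pathlib import Path
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

key = Ed25519PrivateKey.generate()

# Private half: stays on the control node only.
private_openssh = key.private_bytes(
    serialization.Encoding.PEM,
    serialization.PrivateFormat.OpenSSH,
    serialization.NoEncryption(),
)
priv_path = Path("ansible_project_key")
priv_path.write_bytes(private_openssh)
priv_path.chmod(0o600)

# Public half: this is the line that would be distributed to the other
# instances (via puppet/hiera or the Horizon UI, per the discussion above).
public_openssh = key.public_key().public_bytes(
    serialization.Encoding.OpenSSH,
    serialization.PublicFormat.OpenSSH,
)
print(public_openssh.decode() + " ansible-control@project")
```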
[21:49:36] andrewbogott: seems like metricsinfra-puppetserver-1.metricsinfra.eqiad1.wikimedia.cloud is frozen. puppet 7 seems to want a bit more resources than cores1.ram2 instances have
[21:50:00] hm, I've been seeing it work ok with 2 but I'll try to resize.
[21:50:38] meanwhile, I'm back to frustrating permissions problems on project-proxy-puppetserver-1.project-proxy.eqiad1.wikimedia.cloud
[21:51:11] (I'm especially frustrated because I get problems like this every time, but when I tried to fix it my patch kept getting "works for me!" -1s until I gave up on actually fixing things)
[21:54:48] uhoh, puppet just updated max-active-instances from 1 to 2 when I resized that host. So it may be determined to go into swap death no matter how big I make it
[22:03:14] where on project-proxy are you seeing permission errors?
[22:03:50] I fixed them already
[22:04:02] but I have to manually fix permissions in 3 or 4 places every time I set up a puppetserver
[22:04:15] the most obvious one is in /srv/puppet-code/environments
[22:05:27] that one is only created or touched by a script that is only ever run by the root user, so I'm very curious how you get permission errors there
[22:05:31] that's https://gerrit.wikimedia.org/r/c/operations/puppet/+/975089 which is a chicken/egg thing... if the dir exists then puppetserver doesn't try to create it and is happy
[22:05:47] but if it doesn't exist puppetserver tries to create subdirs there and fails
[22:05:56] 'it' in this context is 'environments/production'
[22:07:35] anyway, I need to wrap up shortly so I can bug you about this again when I start on the next puppetserver :)
[22:07:38] ok, that should be created by the first puppet run, let me see why that fails
[22:08:14] aha, it's that git security anti-feature again :/ https://phabricator.wikimedia.org/P58697
[22:08:44] that's a different path though, isn't it?
[22:09:38] the important thing is at the very bottom: (/Stage[main]/Profile::Puppetserver::Git/Exec[puppetserver-deploy-code]) Skipping because of failed dependencies
[22:10:05] so because the provisioning of /srv/git/operations/puppet 'fails', it won't try to run that magic script
[22:10:22] oh!
[22:10:23] yep
[22:10:48] i'll send a patch or two
[22:11:42] thanks
[22:21:42] ok, that's project-proxy sorted, I'm going to stop now before I break more things
[22:24:03] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009805 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1007396