[07:48:55] * arturo online
[07:51:51] morning!
[07:53:21] i'm renewing the puppet CA in toolsbeta, so in case you see puppet errors that's probably why
[07:54:46] ack
[08:03:04] o/
[08:17:09] certs have been renewed
[08:17:18] (+ docs updated etc)
[08:17:33] thanks!
[08:30:04] arturo: I think we need a generic fix to prevent the *-tofu service accounts from receiving puppet failure emails
[08:31:15] we can maybe just update their email address to a blackhole
[08:31:34] we only needed a valid address for the initial account setup
[08:44:22] wikitech-static is failing with (Cannot access the database)
[08:46:27] the disk is full
[08:47:58] maybe similar to T338520
[08:47:58] T338520: Shellbox is broken on wikitech-static due to disk fullness - https://phabricator.wikimedia.org/T338520
[08:51:43] yep, the same fix worked
[08:55:28] arturo: is there a specific reason to make octavia-lb-mgmt ipv4-only?
[08:56:21] taavi: it's just a healthcheck network, I don't see the value of it being dualstack. We could make it IPv6 only however
[09:01:08] arturo: i would honestly just do dualstack, cloud-private doesn't use ipv6 everywhere just yet so can't make it v6-only but I also think it should support v6 from the start so dualstack seems the best option to me
[09:03:07] if cloud-private doesn't work on IPv6 then not having any IPv6 sounds the right option to me. Also, adding IPv6 dualstack at the openstack level is a very easy change via tofu-infra, it can be done at any later time without disruption
[09:04:57] cloud-private will support v6 very soon
[09:05:26] (it's a matter of figuring out how to add the addresses and dns records without disruption, everything else is already there)
[09:05:43] if you add it later then you'd still have all the existing instances with no v6 addresses
[09:05:48] it's easier to just have that from the start
[09:08:30] ok
[09:10:44] taavi: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/232
[09:12:39] +1
[09:13:49] I'll let andrew merge that one
[10:13:19] please review: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1145093 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1145094
[10:21:55] done
[11:31:56] * arturo brb
[12:55:21] taavi or topranks, want to double check https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146598 and https://netbox.wikimedia.org/search/?q=octavia-lb-mgmt ?
[12:56:43] andrewbogott: that looks good to me in theory
[12:56:48] what is this network for though?
[12:57:32] the only consideration I would have is whether it possibly should be a private network
[12:57:38] rather than on public address space
[12:57:55] It's for lbaas, the load balancer instances will live there. There needs to be communication between the load balancers and cloudcontrol nodes.
[12:58:07] I don't think I have an opinion about public vs private, that's maybe an arturo question
[12:58:15] so it doesn't sound like you need them to get hacked into
[12:58:44] For IPv4 the discussion is different, cos all the openstack networks are on private space
[12:58:53] and we can decide what we want to do NAT for or not
[12:59:13] For IPv6 we should choose a block from space open to the internet or not, depending on the use case
[12:59:36] hm
[12:59:58] so possibly should choose a block from https://netbox.wikimedia.org/ipam/prefixes/1080/
[13:00:06] I mean, as load balancers, they should be reachable on the internet. But since this is meant to be the management network that should probably be handled separately.
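(Aside on the dualstack decision above: the actual change went in declaratively through tofu-infra, per the merge request linked at 09:10. A rough hand-rolled equivalent with the plain OpenStack CLI might look like the sketch below; the network name, subnet name, prefix and address mode are placeholder assumptions, not the values actually used.)

    # Minimal sketch only -- the real change was made via tofu-infra (MR 232).
    # "lb-mgmt" / "lb-mgmt-v6" and the documentation prefix are placeholders.
    openstack subnet create \
        --network lb-mgmt \
        --ip-version 6 \
        --subnet-range 2001:db8:100::/64 \
        --ipv6-address-mode dhcpv6-stateful \
        --ipv6-ra-mode dhcpv6-stateful \
        lb-mgmt-v6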
[13:00:41] you don't want people managing them from the internet
[13:00:46] you'd assume
[13:00:50] agreed
[13:01:33] I'm confused, though, 2a02:ec80:a100:2::/64 is under that cidr you just linked me to isn't it?
[13:02:23] oh wait, I confused my tabs
[13:02:42] It's not no. It's from 2a02:ec80:a100::/55 which ends at 2a02:ec80:a100:01ff::/64
[13:03:08] 2a02:ec80:a100:100::/56 is a separate block
[13:04:22] so I am a bit overwhelmed by all the bits, how does this look? https://netbox.wikimedia.org/ipam/prefixes/1208/
[13:05:25] andrewbogott: yeah that's perfect
[13:05:33] ok, will update the gerrit patch
[13:05:49] you could maybe also use 2a02:ec80:a100::/64 direct
[13:05:53] https://netbox.wikimedia.org/ipam/prefixes/1080/prefixes/
[13:06:18] 101 is the second network in the /55
[13:06:22] but both work as well
[13:06:41] the question I can't answer is if it needs direct internet access or not
[13:08:44] I'm pretty sure not but I'll cross that bridge when I get to it
[13:08:57] Do you feel strongly about me moving it to 2a02:ec80:a100::/64 ?
[13:09:02] cool, yeah the reason I asked is the name "lb-mgmt" didn't _sound_ like it needed it
[13:09:16] andrewbogott: it'll do my ocd in so yeah
[13:09:19] I'll change it in netbox
[13:09:22] thx
[13:09:25] if you want to prep the patch with that
[13:12:27] hm, if we're provisioning the v6 subnet in a way where it has no internet access we probably should do the same for the v4 subnet just to be consistent
[13:14:02] taavi: indeed yes
[13:14:17] that's the regular firewall/nat question for the private IPv4 range
[13:16:07] Isn't the v4 range we're using already private by default?
[13:16:13] Unless we nat it?
[13:16:33] it is yeah, that's the key difference with the IPv6, and why there are two blocks to choose from
[13:17:16] andrewbogott: yes but I believe the only options on cloudgw at the moment are "do not let any traffic in or out of this network" and "permit all egress and NAT it"
[13:18:27] right but that means this is a firewall question and not an ip range question?
[13:18:41] exactly, yes
[13:19:30] yes
[13:20:30] ok... so now I have https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/233 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146598
[13:21:20] and eventually https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146515
[13:43:19] very quick implementation of that firewall change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146629
[13:46:10] hmpf... striker uses compose to build a blubber image... things become tricky for podman
[13:50:03] thx taavi -- I'm not sure I follow the routing rule
[13:50:11] well, I guess we're just using the existing one so that seems safe
[14:15:12] dhinus: looks like the tofu tests passed yesterday, is that possible?
[14:15:29] andrewbogott: didn't check, looking
[14:16:38] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146635 should have an equivalent for the v6 subnet I think?
[14:17:03] andrewbogott: yesterday, no. but today (two hours ago) yes!
[14:17:09] dhinus: amazing!
[14:17:16] we're on a winning streak
[14:17:22] :)
[14:17:26] taavi: you're right, although I think it's moot for my immediate testing. I'll add anyway.
[14:17:32] :-)
[14:24:01] ok, next networking blocker: I would expect this to work (or at least not time out)
[14:24:03] cloudcontrol2004-dev:~# telnet 172.16.131.236 9443
[14:28:02] oh hang on, that VM didn't even get assigned an IP
[14:28:07] So that .236 is going nowhere
[14:45:24] taavi: here's that v6 firewall change https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146641
[14:46:09] andrewbogott: you can pass an array to srange, no need for two rules
[14:46:23] mixed v4 and v6? I looked for examples of that but didn't find them
[14:46:33] yes
[14:46:38] ok...
[14:48:26] updated
[14:50:35] hm, a comma would be nice
[15:30:05] arturo: last time I started an amphora it didn't get an IP. I'm guessing that's because the project it's in (service) didn't have access to the mgmt network, so I wrote https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/234 -- does that seem reasonable?
[15:30:17] And if so I need to figure out how to delete the old network so that tofu can create that one.
[15:30:43] mmm
[15:31:04] the docs specifically said the network should belong to admin
[15:31:30] oh, ok, maybe we just need to make it public or shared or something
[15:31:43] yeah shared was set to false originally by me
[15:32:03] ok, let's try flipping it to true and see what we get...
[15:32:04] can we share that with the specific project instead of making it globally available?
[15:33:12] I'm not familiar with how all that permission model works. If you make the network tenant-owned, is the admin neutron router able to get a port on that network to act as a gateway?
[15:33:39] if not, then we will need a tenant router to act as a gateway, with a leg in the flat network
[15:33:52] otherwise it won't get egress/ingress routing
[15:34:23] my feeling is that if we make it shared=true then _any_ tenant may be able to create ports on that network
[15:34:47] oh, this time around it got an IP! that's better than last...
[15:35:11] what did you do?
[15:35:29] switched the mgmt network to 'shared' so that the service project can use it
[15:35:44] ok, arturo, now I see 'Connection to 172.16.131.114 timed out' on the cloudcontrol.
[15:36:09] if you don't do this change via tofu-infra it will get reverted on the next run
[15:36:18] * andrewbogott nods
[15:36:25] I also can't ping 172.16.131.114 from a cloudcontrol
[15:36:36] can you take a look at that while I make the tofu change?
[15:36:41] yes
[15:38:19] I can ping from the neutron main router
[15:38:22] https://www.irccloud.com/pastebin/wpcDAaAE/
[15:38:50] I can't ping from cloudgw, so that gives an indication of where the failure could be
[15:38:54] aborrero@cloudgw2002-dev:~ $ ping 172.16.131.114
[15:38:54] PING 172.16.131.114 (172.16.131.114) 56(84) bytes of data.
[15:38:54] ^C
[15:38:54] --- 172.16.131.114 ping statistics ---
[15:38:54] 2 packets transmitted, 0 received, 100% packet loss, time 1011ms
[15:39:54] mmm sorry wrong VRF
[15:40:33] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/235/diffs
[15:41:50] maybe update the comment?
[15:42:04] good idea
[15:42:32] found the problem
[15:42:47] andrewbogott: try now
[15:44:03] on cloudgw servers we added a required network route to /etc/network/interfaces, but puppet updating that file won't trigger a service reload. The clean thing to do is a reboot
[15:44:06] I'll reboot now
[15:44:42] ok, so I should wait to retry?
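(Side note on the route issue described at 15:44: when a route only lives in /etc/network/interfaces, it can also be pushed into the running kernel by hand rather than waiting for a reboot. A minimal sketch with placeholder prefix, gateway and interface, not the actual values involved here:)

    # Add the missing route by hand (placeholder addresses/interface).
    sudo ip route add 198.51.100.0/24 via 192.0.2.1 dev eno2
    # Confirm the kernel would now use it for a host in that range.
    ip route get 198.51.100.10

(The reboot remains the clean way to make sure the persistent config and the running state agree, as noted above.)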
[15:44:56] no, I applied the routing change by hand
[15:44:58] it should work now
[15:45:00] 'k
[15:45:04] it's just that we are pending a reboot
[15:45:09] * andrewbogott watches the logs
[15:45:19] actually, I can't reboot them today, I need to leave the laptop now
[15:45:52] Mark ACTIVE in DB for load balancer id: 9c9df77e-39e2-4410-980f-4b02f9a5c393
[15:46:01] good news?
[15:46:12] yeah, I think that means health checks are working!
[15:46:21] So now I just have to see if can actually do anything :)
[15:46:36] I can do the reboots -- just cloudgw2xxx-dev ?
[15:46:47] yeah, the 2 of them
[15:46:57] * arturo offline
[16:16:55] clouddb alert triggered, dhinus are you doing something?
[16:17:05] oops yes sorry
[16:17:11] my fault
[16:17:14] ack np
[16:17:27] * dcaro fire mode off
[16:18:12] I briefly stopped one mysql instance, then I restarted it but forgot to run START SLAVE :)
[16:19:57] alert cleared
[16:25:10] xd
[16:25:23] got bitten by that too at some point
[16:26:56] * dhinus offline
[16:43:29] fyi taavi I need to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146680 on dynamic proxy hosts and downgrade python3-flask; at the moment if that service goes down it won't come back up
[16:48:51] andrewbogott: where is the new package coming from? osbpo?
[16:49:32] * andrewbogott looks
[16:49:36] https://www.irccloud.com/pastebin/3wA2Emrf/
[16:49:48] ok, so it's not currently broken in eqiad1, it's just waiting to be broken when I upgrade
[16:49:57] So just applying that pin will be enough
[16:51:12] although now I'm worried that unattended upgrades could cause similar breakage for our users...
[16:51:39] honestly I don't think we need to ship osbpo to all the VMs. surely the APIs don't change enough between versions to make the libraries included in debian proper be too old within a given VM's lifetime
[16:51:55] and this is certainly not the first time something similar has happened
[16:52:25] That seems right, at least for 90% of the time...
[16:54:36] * andrewbogott makes T394438
[17:58:24] * dcaro off
[19:47:21] T394453
[19:47:21] T394453: Emails to cloudservices@wikimedia.org from root@beta.toolforge.org bouncing - https://phabricator.wikimedia.org/T394453
[20:30:17] I have a silly tcpdump question...
[20:30:19] I see things like
[20:30:20] 20:28:01.062140 IP 172.16.131.144.51294 > cloudcontrol2004-dev.private.codfw.wikimedia.cloud.rplay: UDP, length 314
[20:30:56] My question is: how do I know that those messages are actually getting to a server on that host, and not just bouncing off it? The service certainly doesn't log anything if it's getting them.
[20:32:31] (bd808, I'm not ignoring that bug but will probably let someone more involved in those service accounts fix it)
[20:41:39] I have achieved the strange landmark of 'everything works except the status fields'
[21:07:27] andrewbogott: no worries about that bouncing email thing. Somebody will figure things out. The easiest thing to do is open the list for arbitrary senders, but that might turn out to be more annoying if there are piles of spam bots already bouncing too.
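(A note on the tcpdump question at 20:30: "rplay" in that output is just tcpdump resolving the destination port through /etc/services; running it with -nn shows the numeric port instead. One hedged way to tell whether the datagrams are reaching a listener or bouncing off is sketched below -- port 5555 is an assumption, based on what "rplay" maps to in /etc/services and on Octavia's usual health-manager default.)

    # Is anything actually bound to the UDP port on the cloudcontrol?
    # (5555 is assumed; adjust to the real health-manager port.)
    sudo ss -ulpn | grep ':5555'
    # If nothing is listening, the kernel answers each datagram with an ICMP
    # port-unreachable; watching for those headed back to the sender is a
    # quick tell that packets are bouncing off rather than being delivered.
    sudo tcpdump -nni any '(icmp or icmp6) and host 172.16.131.144'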