[10:19:42] Krinkle: you might be best to ask in the -cloud-admin channel. The firewall rules are likely being set by OpenStack. You need a rule in the main network namespace on the docker host, for traffic arriving on the (likely bridge) interface with 172.17.0.1 from the containers.
[10:20:13] localhost will not work, as it has to route over the veth link from container to main host net ns
[10:21:10] unfortunately I don't know enough about the cloud orchestration to know exactly where to add such rules
[12:07:52] I need to debug something on beta cluster (deployment-prep.deploy03), I’m going to edit one of the PHP files by hand and run a command to get CLI output
[12:08:14] As I cannot reproduce identical behaviour locally, I need a stacktrace from beta cluster
[14:39:16] hello folks!
[14:39:22] I was chatting with Emperor about https://phabricator.wikimedia.org/T361844
[14:39:57] TL;DR: the swift.discovery.wmnet cert is going to expire on the 14th, and afaics it is managed by Cergen (unless PEBCAK)
[14:40:00] swift TLS certs expire Quite Soon - I think the question is whether https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate is still a correct and safe procedure
[14:41:09] (given the staff meeting in 20 minutes and it being Thursday, I'd be looking to do this early next week)
[14:41:17] makes sense, yes
[14:41:52] I realized that I have never done such a procedure; do we need to destroy the cert first in the puppet CA and then regenerate via cergen?
[14:42:26] in theory even if revoked/destroyed we should be ok, I am not aware of internal clients checking it
[15:00:13] I'm not sure "whoops I broke swift" would be the best way to find out something does in fact check :)
[15:05:48] elukey: I don't believe there's any support for revocation internally
[15:10:39] you might want to consider switching it from puppetCA/cergen to cfssl
[15:11:12] since that is going on for a bunch of other services right now (if it's an option here)
[15:11:27] it will make it a lot easier to handle certs as a side-effect
[15:13:14] +1 to using CFSSL, soooo much easier
[15:13:38] does cfssl not still require you to renew an intermediate cert periodically?
[15:13:50] it's basically yaml engineering as opposed to doing edits in 3 different repos
[15:14:08] I don't know the answer to the renewal question at this time
[15:15:09] but tbh I also never had an expiring cergen cert
[15:15:22] they typically last ~5 years AFAICT
[15:15:37] I think cfssl automatically renews everything? At least, I hope ;)
[15:15:47] inflatador: https://wikitech.wikimedia.org/wiki/PKI/CA_Operations#Renewing_a_new_intermediate rather suggests otherwise
[15:15:48] aha, that could mean we switched just in time before hitting this
[15:16:21] cdanis: ack thanks!
[15:16:32] Emperor: interesting. I definitely need to read/understand this a bit more
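For reference, a quick way to confirm the expiry date being discussed here is to inspect the served certificate directly. A minimal sketch, assuming swift.discovery.wmnet terminates TLS on port 443 and is reachable from where this is run:

```bash
# Print the validity window and issuer of the certificate currently served
# by swift.discovery.wmnet (illustrative; port 443 is an assumption)
echo | openssl s_client -connect swift.discovery.wmnet:443 \
    -servername swift.discovery.wmnet 2>/dev/null \
  | openssl x509 -noout -dates -issuer -subject
```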
[15:17:07] * inflatador is used to pointing at a Hashi Vault path and letting it work its magic
[15:17:23] also see "phase out cergen" https://phabricator.wikimedia.org/T357750
[15:17:46] swift.certs.yaml is already a checkbox
[15:18:56] folks, there are multiple issues in migrating to cfssl directly, like checking that all clients trust the new PKI intermediate
[15:19:10] and we have ~10 days, including a weekend :)
[15:19:21] yeah, I think trying to migrate swift to a new TLS setup before the current certs expire would be "brave"
[15:19:26] so probably it is safer to first renew via cergen, then migrate to cfssl (my 2c)
[15:21:09] also re the intermediate: I think it is related to the intermediate PKI CA certs, something to follow up on, but the wiki page could also be stale. I've never heard/seen John talking about this
[15:22:56] probably serviceops have more experience with renewing cergen discovery certs, I see https://phabricator.wikimedia.org/T304237
[15:23:29] Cc: jayme and rzl, who afaics worked on it and may shed some light :)
[15:25:20] if moving to cfssl meant no more manual renewals, that would change my appetite for doing the change quite a lot :)
[15:26:55] 👀
[15:27:52] wasn't there the trouble with tegola that more or less broke after swift tried to move to cfssl?
[15:28:17] https://phabricator.wikimedia.org/T344324
[15:28:29] that's a different swift probably
[15:28:53] paged for a db server
[15:29:14] db2214
[15:29:41] jayme: that was thanos-swift; this is ms-swift, but yes, the potential for 🔥 is probably non-zero, hence this discussion about whether the wikitech docs are current and correct and safe
[15:30:15] we should bash "whether the wikitech docs are current and correct and safe" :)
[15:31:14] 😿
[15:31:39] so the last time I did https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate - that worked
[15:32:32] more or less. According to the page history that was in Sept. 2023
[15:32:44] jayme: ok, perfect, this is exactly what I needed to hear: we need to destroy the cert first and force its recreation via cergen
[15:32:45] (and I only updated a path in the docs)
[15:32:55] then do the dance to update public and deploy
[15:33:10] yep
[15:33:16] mutante: it's a slave server in s6
[15:33:33] Emperor: are we conflating renewing the *intermediates* vs renewing the host/service certs themselves?
[15:33:40] yep
[15:34:27] creating a new intermediate in pki/cfssl is a manual process, but issuing a new cert via that intermediate is not
[15:34:38] Emperor: ack, it's back up, depooled and no other effect
[15:34:43] but I'm still catching up on scrollback here so maybe I missed something
[15:35:12] cdanis: and renewing that intermediate likewise requires manual work?
[15:35:13] cdanis: I think that https://wikitech.wikimedia.org/wiki/PKI/CA_Operations#Renewing_a_new_intermediate was the concern IIUC
[15:35:28] Emperor: yes, but intermediates are generally quite long-lived iirc
[15:36:08] the cert causing the hassle right now is 5y old...
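As context for the point that issuing a leaf cert from an existing intermediate is cheap while creating the intermediate itself is manual, here is a minimal sketch using the stock upstream cfssl CLI (this is generic tooling usage, not the WMF puppet/cfssl integration, and all file names are hypothetical):

```bash
# Sign a new server certificate with an existing intermediate CA using the
# plain cfssl tools (illustrative only; signing-config.json and swift-csr.json
# are hypothetical inputs)
cfssl gencert \
  -ca intermediate.pem \
  -ca-key intermediate-key.pem \
  -config signing-config.json \
  -profile server \
  swift-csr.json | cfssljson -bare swift-discovery
```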
[15:36:58] anyhow, I think the takeaway is that the wikitech procedure for updating the swift certs will probably work, and I should attempt that next week
[15:37:16] sure
[15:38:06] Emperor: the manual work that you pointed out is nonetheless way less risky and time-consuming than this one; the work that we all have done to move to PKI is used by bare metal and k8s constantly, and everything is managed transparently without any toil :)
[15:39:05] also we are not sure how old that comment in the page is, the github issue is from 2020
[15:39:18] maybe we could just try to automate it away and that's it
[15:39:35] the direction towards PKI/cfssl is good and we should invest in it more, this is my point :)
[15:54:36] topranks: Interesting. So you think an OpenStack-level rule (i.e. the firewall I set in Horizon) is preventing a container from talking to its own host over the host IP?
[15:55:19] Krinkle: that would be my guess
[15:55:36] The patch mentions localhost as a workaround in combo with --network=host, but indeed I'd prefer to use the docker host IP
[15:55:56] is the Hound service running in a container also?
[15:56:07] Yes, on the same VM
[15:56:12] ok
[15:56:54] The entries in iptables appear to be local and docker-specific, so I assumed it's something we can/have to do in puppet
[15:57:03] so really what you need to do is allow the direct IP traffic inbound in the container running Hound
[15:57:04] But the IP ranges are indeed very similar
[15:57:21] and also ensure that the VM allows forwarding between the containers directly, and won't try to NAT
[15:57:24] So I'll try to rule out Horizon first
[15:57:47] if I could log onto an example VM I could probably give clearer instructions
[15:57:52] I guess what we need to know is:
[15:58:02] 1) what controls the rules on the VM - in the main network namespace
[15:58:16] 2) what controls the rules in each container, i.e. in the network namespace for each container
[15:58:36] Well, we have a port map, and the level of abstraction we use makes me not want to try to know and pass down each container's IP; that's presumably a harder problem to solve than allowing container-to-own-host, which works by default but seems to fail on this VM
[15:59:19] "container to own host" isn't what's needed though
[15:59:25] We have 10-20 hound instances and each maps to a preset port that the frontend knows. So I'd rather just do the host IP.
[15:59:33] the traffic needs to route from CONTAINER_NS -> MAIN_NS (VM) -> CONTAINER2_NS
[16:00:22] You could perhaps use some NAT rules on the VM netns to forward packets sent to it on to the hound container
[16:00:41] I thought container -> docker host ip :3002 -> and from there it goes to the right container/IP like it does today already.
[16:01:00] yeah that might work
[16:01:14] there is probably a NAT rule on the VM for "docker host ip :3002" doing that
[16:01:51] the question is if the inbound interface is specified in that rule or not - it might be, and that might be why it doesn't work when the traffic comes from another container on the same VM (and thus doesn't come in on the outside primary interface)
[16:03:05] if you wanted to paste the contents of "iptables -L -v --line -n", "iptables -L -v --line -n -t nat" and "ip netns list" it might give an idea
[16:03:17] I understand about half of that. I think using the right host IP addresses that. It does for me locally and elsewhere I tried, but on this VM the connection just hangs, which made me check iptables and I see some suspicious entries there indeed
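Run on the VM, the diagnostics being asked for above might look like the sketch below. Port 3002 is taken from the discussion; the last command is an extra check of the Docker-managed DNAT rules, since Docker's default port-publishing rule typically carries a `!docker0` in-interface match, which would explain traffic from another container on the same bridge not matching it:

```bash
# Dump the filter and nat tables plus any extra network namespaces,
# as suggested above
sudo iptables -L -v --line -n
sudo iptables -L -v --line -n -t nat
sudo ip netns list

# Look specifically at the DNAT rule Docker created for the published port;
# if it only matches traffic NOT arriving on docker0, hairpin traffic from
# another container will bypass it
sudo iptables -t nat -L DOCKER -v -n --line-numbers | grep 3002
```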
[16:03:47] But I don't know how to find what in puppet is putting it there or how our puppet module can cooperatively undo that
[16:05:43] if it's being set by our normal puppet stuff it can probably be allowed with a firewall::service definition in the role, or possibly ferm::rule if it's complex
[16:06:50] I thought perhaps openstack managed the rules for cloud VMs but I could be completely wrong on that
[16:07:52] alright. I'll have a look and try some things
[16:08:18] In general, OpenStack has security groups (FW rules) that won't be visible at the VM level. Can't speak much to our specific env though
[16:08:58] yeah, within a VM anything outside shouldn't matter
[16:09:12] I'd concentrate on the NAT rule on the VM that maps traffic for "docker host ip :3002"
[16:09:44] look at it and see if there is any reason that won't work if the traffic comes from an IP on 172.17.x or comes in on the docker0 bridge device rather than from outside
[16:58:11] !log T355281 executed “mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=aawiki --skipclusters=main,echo,growth,mediamoderation,extstore en wikipedia test2wiki test2.wikipedia.beta.wmcloud.org” on deployment-deploy03.deployment-prep
[16:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:14] T355281: Set up some beta cluster wikis with different registrable domain - https://phabricator.wikimedia.org/T355281
[16:58:43] Ah, sorry, wrong channel, that should go to releng. Sorry
[20:20:57] moritzm: If you are still about, I have an interesting problem with your reject patch about the 'eject' package, see T361749 (and read from the bottom up)
[20:20:58] T361749: cloud-init timeout too short on Bookworm - https://phabricator.wikimedia.org/T361749
[20:21:39] I'm kind of surprised that ensure->absent seems to remove packages that depend on the absented package, but maybe it's always been that way?
[20:24:30] hmmh, I'll have a closer look and come up with a prod-specific fix; can you for now simply merge a revert of the patch which introduced the absenting of eject?
[20:25:18] I need to double-check what the package provider calls under the hood, but if it's "apt-get remove foo", then dependent packages would in fact get removed along with it
[20:25:33] the more surprising fact is WTF cloud-init is depending on eject, though?
[20:26:03] it's a tool to open an optical drive, could not imagine anything less cloud-needed :-)
[20:26:31] maybe because cloud-init uses virtual CD-ROMs as config-drives
[20:26:49] but these would get unmounted, not ejected?
[20:27:49] anyway, we can also simply tighten the dependency in standard_packages to also check for the production realm, but a simple revert to unstick cloud-init is also perfectly fine. I had checked typical reverse dependencies beforehand, but had not expected cloud-init :-)
[20:28:02] I'm used to seeing them unmounted, yeah
[20:28:33] it is a weird dependency for sure
[20:32:27] moritzm: I'm going to try the explicit include of cloud-init on cloud-vps and see if that stops the removal without errors.
[20:32:30] If that fails I'll revert.
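The reverse-dependency behaviour described above can be checked without changing anything on the box. A small sketch; per the discussion, on an affected Bookworm cloud VM the simulated removal would be expected to list cloud-init alongside eject:

```bash
# Show which installed or available packages declare a dependency on eject
apt-cache rdepends eject

# Simulate the removal; apt lists every dependent package it would take
# along with it, without touching the system
sudo apt-get remove --simulate eject
```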
[20:35:04] ack, thanks, and sorry for the VPS-induced fallout
[20:55:49] I'm still surprised that it removes dependent packages, but I guess today I learned
[21:04:32] andrewbogott: given eject is a dependency of cloud-init, cloud-init will get removed every time puppet removes eject; since puppet uses force (which it shouldn't), it will then get reinstalled when puppet tries to install cloud-init. So I think a revert is the better option, or update the patch to exclude the removal on OpenStack VMs via the hypervisor fact or the realm, as moritzm mentioned
[21:05:08] jhathaway: ok, I agree that removing/replacing seems bad
[21:05:09] thanks
[21:06:17] yup, also glad to know we are still ejecting things in 2024!! Just waiting on my latest burned CD to eject right now :)
[21:10:51] gives me flashbacks to an old job, where one of our customers insisted on booting his servers off CD-ROM (FreeBSD ofc) because read-only is more secure! That was only 8 or 9 years ago.
[21:37:12] love it!
[21:38:38] Immutable infrastructure indeed
[21:39:08] I'm sure DC Ops just LOVED that guy
[23:21:08] topranks: I've pasted the output at https://phabricator.wikimedia.org/T361899
[23:25:24] inflatador: hm.. interesting, maybe my mental model of 'docker host ip' isn't right (very likely isn't, but I thought it'd be close enough for this). Are you saying that OpenStack exerts control / is part of the picture when a container on a given VM tries to connect to the docker host IP of/inside that same VM? I know the IP ranges are close (both 172.x) but I assumed it never leaves the VM to do that. So the security group and firewall rules in Horizon seem like they should not be able to make a difference.
[23:29:37] with my limited knowledge, I can't rule out that the 'parent' network plays a role when addressing an IP that belongs to it, in the same way that, with my home router, I'm guessing with 192.x, even the one that points to "me", the router might play some role in deciding whether or not I'm allowed to reach myself. That is, unless there is something in the kernel turning that into a loopback (maybe? assuming that it knows who it is, maybe it won't delegate). But since I *can* reach the docker host IP from the VM outside the container, I figured that to whatever approves this connection, the outgoing connection is the same either way, since the individual container IPs are only valid within the same VM, so the OpenStack network presumably can't tell the difference between something connecting from within a container vs outside a container.
[23:30:39] feel free to tell me I'm way off. I've not really thought much about this before.
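One way to test the assumption that this traffic never leaves the VM is to watch the docker0 bridge while reproducing both cases. A sketch with hypothetical names: the container name would need to be substituted, 172.17.0.1 is the bridge IP mentioned earlier in the discussion, and curl is assumed to exist inside the container:

```bash
# On the VM: capture traffic for the published port on the bridge interface
sudo tcpdump -ni docker0 tcp port 3002 &

# Reproduce the working case from the VM itself...
curl -sv --max-time 5 http://172.17.0.1:3002/ -o /dev/null

# ...and the hanging case from inside a container (container name is
# hypothetical). If packets appear on docker0 in both cases, the traffic
# stays inside the VM and Horizon security groups are out of the picture.
docker exec some-hound-container \
  curl -sv --max-time 5 http://172.17.0.1:3002/ -o /dev/null
```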