[06:43:48] morning. I'm starting the cloudcontrol1007 reimage
[07:02:04] I'm hitting T347375 it seems
[07:02:05] T347375: First Puppet run of a cloud_private connected node fails - https://phabricator.wikimedia.org/T347375
[08:08:50] FYI, I'll merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/959894 in ~ 10m
[08:09:12] morning
[08:11:00] there's lots of puppet alerts for toolforge, looking
[08:13:16] it seems already fixed, puppetdb seemed to be down for a bit earlier
[08:13:42] several hours ago though
[08:14:08] hmm
[08:14:17] cloudcontrol1007 is back online! \o/
[08:14:21] \o/
[08:15:08] found a "fun" chicken-and-egg issue with that, which is that the first puppet run by default doesn't have the required data to provision the cloud-private subnet, as the netbox hiera sync runs later by default. (https://phabricator.wikimedia.org/T347375)
[08:15:48] how did you break through it?
[08:16:22] saw a Sep 26 08:15:02 cloudcontrol1007 nova-api-wsgi[2007]: 2023-09-26 08:15:02.578 2007 ERROR oslo.messaging._drivers.impl_rabbit [-] [e2bfb3c5-88fc-40af-bc60-98d2637be27e] AMQP server on rabbitmq02.eqiad1.wikimediacloud.org:5671 is unreachable: [Errno 104] Connection reset by peer. Trying again in 0 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
[08:16:28] only one though
[08:17:11] manually changed the server from 'staged' to 'active' in netbox (which the cookbook would do right after) and ran the sync-netbox-hiera cookbook
[08:18:55] 👍
[09:00:26] taavi: did you find some non-puppetized manual steps beyond the netbox hiera thing?
[09:01:43] rejoining the galera cluster comes to mind, for example
[09:02:41] arturo: I had to manually start the timer to sync the keystone fernet keys, but otherwise it seemed to go relatively smoothly. although I found one annoying puppet ordering issue, sent https://gerrit.wikimedia.org/r/c/operations/puppet/+/961018 to fix it
[09:02:52] it automatically joined the galera cluster for example
[09:03:24] ok
[09:34:49] * taavi lunch
[09:51:54] mcrouter is now in the bookworm wikimedia apt repository (T346762) :)
[09:51:55] T346762: Package mcrouter for Debian Bookworm - https://phabricator.wikimedia.org/T346762
[09:52:21] real quick +1 here? https://gerrit.wikimedia.org/r/c/operations/debs/mcrouter/+/961060
[09:53:55] dhinus: +1'd. It can be `sudo -i` too
[09:54:32] don't need to, unless it uses anything from the interactive login setup (bash profile and such)
[09:54:32] true that! I'll just leave it with "#", which matches what's in the wiki as well
[09:55:00] I think it's maybe coming from bash_profile?
[09:55:38] dcaro: it uses some environment variables to do the gpg signing stuff
[09:55:49] then it might need the -i
[09:55:52] I always forget the rules for env vars, but reprepro was definitely failing
[09:56:15] or using the --preserve-env option
[09:57:26] isn't preserve-env to get the user's env vars, and not root's env vars?
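(For reference, a minimal sketch of the sudo behaviour being discussed here; GNUPGHOME is only an illustrative variable, not necessarily what reprepro actually reads:)
```sh
# Plain sudo resets most of the environment and does not read root's login
# files, so a variable exported only in root's ~/.bash_profile is missing:
sudo env | grep GNUPGHOME            # likely prints nothing

# sudo -i starts a login shell as root, so root's profile is sourced and
# anything exported there (e.g. gpg signing settings) becomes available:
sudo -i env | grep GNUPGHOME

# --preserve-env only forwards the *calling* user's variables through sudo,
# which does not help when the values live in root's profile:
GNUPGHOME=/srv/keys sudo --preserve-env=GNUPGHOME env | grep GNUPGHOME
```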
[10:00:02] to pass them through sudo yes
[10:00:45] but in this case, the vars are not there if you're not root, so I don't think they can be "preserved"
[10:00:54] ahhh, okok
[10:01:01] then sudo -i something something should do it yes
[10:01:22] I would then at least add 'as root, run:'
[10:02:03] though the previous commands will only work as root anyhow xd
[10:02:25] but that's on a different host, I would just add it in case
[10:02:27] yes, I think "sudo -i" would work, I thought of just copying what is in the wiki (linked two lines below) so there's no confusion
[10:03:14] I thought I was smart and modified it to use "sudo" instead, but that didn't work :)
[10:03:15] it does not mention root there either
[10:03:34] it has the "#" that to me is a good indication (but others might not notice it)
[10:04:01] that will depend on your os/prompt setting
[10:04:15] (and what you are used to)
[10:04:28] there's a mention after, if you get an error xd
[10:04:47] (kind of like it, go straight, and if after 10 min you see a hotel, that was not the direction)
[10:04:48] Note: if you get an error like Error opening config file './conf/distributions': No such file or directory(2), then you forgot to do sudo -H bash
[10:05:01] haha I didn't read that line in the wiki
[10:05:20] I personally think the current version of the readme (plus the link to the wiki) is enough, but if you want to improve it I'm happy to +1
[10:05:53] it's ok, whatever works
[10:05:56] topranks: hi, trying to deploy https://gerrit.wikimedia.org/r/c/operations/homer/public/+/961055/ but I don't see 1007 in the diff - is there some cache that needs to be updated?
[10:06:58] ah, I think I need to run the netbox script first
[10:07:03] taavi: hey
[10:07:14] yep think you got it - as that server got updated yesterday/today
[10:07:37] this needs to run probably:
[10:07:37] https://netbox.wikimedia.org/extras/scripts/capirca.GetHosts/
[10:08:00] yep, doing that now
[10:15:08] hmmm
[10:23:20] "hmmm" ??
[10:24:04] that was about an unrelated thing, sorry
[10:24:44] arturo: I think we're going to eventually need to replace `::openstack_controllers` with a structure that has the host names in both realms, since the current method of overriding it on some hosts is quite messy
[10:24:59] taavi: I agree
[10:25:13] something similar is what andrew started yesterday for designate
[10:27:38] in the meantime, https://gerrit.wikimedia.org/r/c/operations/puppet/+/961068/
[10:28:11] * dcaro lunch
[10:28:53] taavi: LGTM
[10:30:42] do you know how to deploy the grants for the new hosts? (and drop the old grants)
[10:33:12] I don't :-(
[10:33:28] I would ask in #-data-persistence
[10:33:43] or create a phab ticket and send it to folks
[11:33:50] taavi: they are added manually, the sql there is for reference only iirc
[11:34:32] but yeah, they will know better in data persistence
[12:20:18] dcaro: what if I add TLS to harbor in lima-kilo? with self-signed certs. Would the insecure-registry allow the buildpack lifecycle to work normally?
[12:20:38] self-signed certs are not supported in harbor
[12:20:46] (that was the first try we did back in the day)
[12:21:14] the docs say something different
[12:21:16] To configure HTTPS, you must create SSL certificates. You can use certificates that are signed by a trusted third-party CA, or you can use self-signed certificates.
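(As an aside, a minimal sketch of the docker-side "insecure-registry" setting referred to above; the hostname and port are made-up examples, not the real lima-kilo values:)
```sh
# The registry address listed in daemon.json has to match what the builder
# pushes to. With this in place, docker will talk plain HTTP to that registry.
cat /etc/docker/daemon.json
# {
#   "insecure-registries": ["harbor.local:80"]
# }
sudo systemctl restart docker                  # the daemon only re-reads this on restart
docker info | grep -A2 'Insecure Registries'   # confirm the entry was picked up
```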
[12:21:41] https://github.com/buildpacks/lifecycle/issues/524
[12:22:57] specifically https://github.com/buildpacks/lifecycle/issues/1077
[12:23:05] https://github.com/buildpacks/lifecycle/issues/1077 <--- wow isn't this the exact same problem?
[12:24:11] (sorry I'm on a meeting and can't focus on this atm)
[12:25:00] similar yes, old issue https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/blob/main/utils/parse_harbor_config.py#L86
[12:44:42] dcaro have you already tried the workaround described at https://github.com/buildpacks/lifecycle/issues/1077#issuecomment-1531731329 ?
[12:46:16] not really no, that would require us to inject the self-signed certs from harbor into k8s and then mount them in the pods, though
[12:46:21] (and really, this is working locally)
[12:46:38] so I never felt the need to continue trying
[12:48:29] but if there's no way to fix this on lima-kilo, it might be the way to go
[12:50:14] ack
[12:51:08] I think that some version of https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/12 will be enough
[12:52:20] dhinus: btw. have you tried the vagrant lima-kilo setup?
[12:52:30] not yet no
[12:54:52] ack, if it works for everyone, we can greatly simplify the lima-kilo code
[14:26:20] dcaro: I wonder how your local setup works without TLS. In other words, how is it possible you are not finding the same issue
[14:26:40] essentially, the docker config
[14:26:58] allowing insecure registries allows it to push/pull from http repositories
[14:27:30] but that is the setting lima-kilo is using
[14:28:02] I mean, there is nothing hidden or convoluted in the lima-kilo setup. The config used is in plain sight, and mostly it comes from the builds-api/builds-builder repos
[14:28:08] it has to match exactly (proto, ip and port)
[14:28:33] and there's a bit of a mess on how lima-kilo does that, as it injects the port in places where the helm charts don't expect it
[14:28:54] (as locally I use the standard port 80 for helm)
[14:28:57] *harbor
[14:30:42] I think the setup in general should be made robust enough to support running stuff on different ports, if possible!
[14:31:40] that's this no? https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/12
[14:31:59] there's many places where we have hardcoded ports
[14:32:04] (3003 and such)
[14:32:08] 30003?
[14:32:10] whatever xd
[14:32:25] but yes, if you need to start something on a different port, support to do so is needed
[14:32:58] I do think though that it's better to address on a need basis, instead of trying to support every configuration option from the start
[14:33:06] (for this at least)
[15:03:00] (sorry, very busy with meetings, I guess we can follow up tomorrow)
[16:52:05] hmm, it might be a runtime issue
[16:52:10] it seems lima-kilo uses containerd
[16:52:13] https://www.irccloud.com/pastebin/7x9kBNyi/
[16:52:17] while my local uses docker
[16:52:26] https://www.irccloud.com/pastebin/AsSU6zHn/
[16:52:37] I think containerd ignores dockerconfig
[17:02:36] * dcaro off
[19:01:35] Hey folks I made a mistake and changed a default for nginx in wmcs, while trying to fix another mistake :(, would love some advice on how to repair the damage. https://gerrit.wikimedia.org/r/c/operations/puppet/+/960708
[19:02:47] huh, I just pinged you in a different channel, we can swap tasks for a bit :)
[19:04:29] jhathaway: I'm still trying to figure out what that change actually does. Do you see evidence of issues on wmcs so far?
[19:06:10] was that causing the weird nginx "no such file or directory" issues we saw yesterday evening?
[19:07:07] taavi: probably
[19:08:20] the change flips a toggle in our puppetry to mount /var/lib/nginx on tmpfs, sorry for the lack of context, here is the first patch, https://gerrit.wikimedia.org/r/c/operations/puppet/+/959226
[19:08:51] prior to my mistake this feature was gated to production, but it has now been applied in wmcs
[19:08:52] yep I saw the first patch as that was breaking puppet on some VMs, but didn't connect the two issues together
[19:09:23] unfortunately it is not a trivial rollback, since flipping the toggle will not umount the volume
[19:09:39] I don't think I follow why we can't just switch the default patch to the proper default (and restart nginx everywhere)
[19:10:02] ok, so patch, unmount, restart nginx
[19:10:11] the volume to unmount would be the same everywhere right?
[19:10:18] andrewbogott: right
[19:10:55] those steps should work, but I'm unfamiliar with how to do that in wmcs, and whether I have the proper permissions
[19:10:56] so a rollback might cause service interruption to hosts that /want/ to use the temp volume, but puppet will correct it on a subsequent run
[19:11:05] and we may have service interruption happening already
[19:11:26] * andrewbogott waits for taavi to agree or disagree with that plan
[19:11:55] we may need to stop nginx prior to unmounting if linux considers the volume busy
[19:12:03] true
[19:12:45] yeah. and I don't think the current puppetization takes the possibility of a rollback into account, so we would need to clean fstabs manually
[19:13:25] ugh, that is also true, puppet's fstab handling is embarrassingly poor
[19:13:25] wait, it adds an fstab entry too?
[19:13:43] my understanding is that mount{} does
[19:13:45] is there a specific reason we would want to have the directory (not) on a tmpfs on cloud vps?
[19:13:55] ok, for starters, what's a good command to run to detect whether or not that flag has done something on a given VM?
[19:14:09] mount|grep /var/lib/nginx
[19:14:45] Oh, this is nginx, so I guess there's no state there to be lost on a reboot
[19:14:57] so maybe we can just ignore this and pretend like this was a change for the better :)
[19:15:26] restarting nginx on the affected instances where that wasn't already done by hand might be a good idea
[19:16:15] I'm also happy to help revert, just to reduce complexity for no obvious gain
[19:17:01] the main downside is probably more memory usage, but I haven't tried to quantify how much
[19:17:06] * andrewbogott runs a query to find affected instances
[19:18:47] * andrewbogott once again remembers how much time https://gerrit.wikimedia.org/r/c/operations/software/cumin/+/869332 would save
[19:22:48] damnit does cumin work on vms anywhere now?
[19:22:52] taavi: used cumin lately?
[19:23:00] I'm getting different failure messages no matter where/what I try
[19:23:17] I haven't tried it recently
[19:26:13] Today may be the straw that breaks my back, cuminwise
[19:27:00] :)
[19:37:51] apparently I am going to spend my day fixing cumin rather than using cumin to fix this issue.
[19:51:05] the true mark of a good sysadmin ;)
[20:05:07] jhathaway: can you please document the nginx issue (or potential issue) in a phab ticket and make T347428 a sub-ticket? thx.
[20:05:13] T347428: cumin and cloud-vps instances not working - https://phabricator.wikimedia.org/T347428
[21:12:30] andrewbogott: will do
[21:19:57] https://phabricator.wikimedia.org/T347432
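(For the record, a rough per-instance sketch of the rollback steps discussed above around 19:10–19:13: untested, to be run only after the reverting puppet patch is merged, and it assumes the fstab entry literally contains /var/lib/nginx:)
```sh
# Only touch hosts where the toggle actually took effect:
if mount | grep -q /var/lib/nginx; then
    systemctl stop nginx                      # the mount may be busy while nginx runs
    umount /var/lib/nginx
    sed -i '\|/var/lib/nginx|d' /etc/fstab    # puppet won't remove the entry on rollback
    puppet agent --test                       # re-apply the reverted default
    systemctl start nginx
fi
```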