[06:43:48] morning. I'm starting the cloudcontrol1007 reimage
[07:02:04] I'm hitting T347375 it seems
[07:02:05] T347375: First Puppet run of a cloud_private connected node fails - https://phabricator.wikimedia.org/T347375
[08:08:50] FYI, I'll merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/959894 in ~ 10m
[08:09:12] morning
[08:11:00] there's lots of puppet alerts for toolforge, looking
[08:13:16] it seems already fixed, puppetdb seemed to be down for a bit earlier
[08:13:42] several hours ago though
[08:14:08] hmm
[08:14:17] cloudcontrol1007 is back online! \o/
[08:14:21] \o/
[08:15:08] found a "fun" chicken-and-egg issue with that, which is that the first puppet run by default doesn't have the required data to provision the cloud-private subnet, as the netbox hiera sync runs later by default. (https://phabricator.wikimedia.org/T347375)
[08:15:48] how did you break through it?
[08:16:22] saw a Sep 26 08:15:02 cloudcontrol1007 nova-api-wsgi[2007]: 2023-09-26 08:15:02.578 2007 ERROR oslo.messaging._drivers.impl_rabbit [-] [e2bfb3c5-88fc-40af-bc60-98d2637be27e] AMQP server on rabbitmq02.eqiad1.wikimediacloud.org:5671 is unreachable: [Errno 104] Connection reset by peer. Trying again in 0 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
[08:16:28] only one though
[08:17:11] manually changed the server from 'staged' to 'active' in netbox (which the cookbook would do right after) and ran the sync-netbox-hiera cookbook
[08:18:55] 👍
[09:00:26] taavi: did you find some non-puppetized manual steps beyond the netbox hiera thing?
[09:01:43] rejoining the galera cluster comes to mind, for example
[09:02:41] arturo: I had to manually start the timer to sync the keystone fernet keys, but otherwise it seemed to go relatively smoothly. although I found one annoying puppet ordering issue, sent https://gerrit.wikimedia.org/r/c/operations/puppet/+/961018 to fix it
[09:02:52] it automatically joined the galera cluster for example
[09:03:24] ok
[09:34:49] * taavi lunch
[09:51:54] mcrouter is now in the bookworm wikimedia apt repository (T346762) :)
[09:51:55] T346762: Package mcrouter for Debian Bookworm - https://phabricator.wikimedia.org/T346762
[09:52:21] real quick +1 here? https://gerrit.wikimedia.org/r/c/operations/debs/mcrouter/+/961060
[09:53:55] dhinus: +1'd. It can be `sudo -i` too
[09:54:32] don't need to, unless it uses anything from the interactive login setup (bash profile and such)
[09:54:32] true that! I'll just leave it with "#", which matches what's in the wiki as well
[09:55:00] I think it's maybe coming from bash_profile?
[09:55:38] dcaro: it uses some environment variables to do the gpg signing stuff
[09:55:49] then it might need the -i
[09:55:52] I always forget the rules for env vars, but reprepro was definitely failing
[09:56:15] or using the --preserve-env option
[09:57:26] isn't preserve-env to get the user's env vars, and not root's env vars?
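(For reference, a minimal sketch of the sudo behaviour being discussed here; GNUPGHOME is only an illustrative variable, not necessarily what reprepro actually reads:)
```sh
# Plain sudo resets most of the environment and does not read root's login
# files, so a variable exported only in root's ~/.bash_profile is missing:
sudo env | grep GNUPGHOME            # likely prints nothing

# sudo -i starts a login shell as root, so root's profile is sourced and
# anything exported there (e.g. gpg signing settings) becomes available:
sudo -i env | grep GNUPGHOME

# --preserve-env only forwards the *calling* user's variables through sudo,
# which does not help when the values live in root's profile:
GNUPGHOME=/srv/keys sudo --preserve-env=GNUPGHOME env | grep GNUPGHOME
```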
[10:00:02] to pass them through sudo yes
[10:00:45] but in this case, the vars are not there if you're not root, so I don't think they can be "preserved"
[10:00:54] ahhh, okok
[10:01:01] then sudo -i something something should do it yes
[10:01:22] I would then at least add 'as root, run:'
[10:02:03] though the previous commands will only work as root anyhow xd
[10:02:25] but that's on a different host, I would just add it in case
[10:02:27] yes, I think "sudo -i" would work, I thought of just copying what is in the wiki (linked two lines below) so there's no confusion
[10:03:14] I thought I was smart and modified it to use "sudo" instead, but that didn't work :)
[10:03:15] it does not mention root there either
[10:03:34] it has the "#" that to me is a good indication (but others might not notice it)
[10:04:01] that will depend on your os/prompt setting
[10:04:15] (and what you are used to)
[10:04:28] there's a mention after, if you get an error xd
[10:04:47] (kind of like it, go straight, and if after 10 min you see a hotel, that was not the direction)
[10:04:48] Note: if you get an error like Error opening config file './conf/distributions': No such file or directory(2), then you forgot to do sudo -H bash
[10:05:01] haha I didn't read that line in the wiki
[10:05:20] I personally think the current version of the readme (plus the link to the wiki) is enough, but if you want to improve it I'm happy to +1
[10:05:53] it's ok, whatever works
[10:05:56] topranks: hi, trying to deploy https://gerrit.wikimedia.org/r/c/operations/homer/public/+/961055/ but I don't see 1007 in the diff - is there some cache that needs to be updated?
[10:06:58] ah, I think I need to run the netbox script first
[10:07:03] taavi: hey
[10:07:14] yep think you got it - as that server got updated yesterday/today
[10:07:37] this needs to run probably:
[10:07:37] https://netbox.wikimedia.org/extras/scripts/capirca.GetHosts/
[10:08:00] yep, doing that now
[10:15:08] hmmm
[10:23:20] "hmmm" ??
[10:24:04] that was about an unrelated thing, sorry
[10:24:44] arturo: I think we're going to eventually need to replace `::openstack_controllers` with a structure that has the host names in both realms, since the current method of overriding it on some hosts is quite messy
[10:24:59] taavi: I agree
[10:25:13] something similar is what andrew started yesterday for designate
[10:27:38] in the meantime, https://gerrit.wikimedia.org/r/c/operations/puppet/+/961068/
[10:28:11] * dcaro lunch
[10:28:53] taavi: LGTM
[10:30:42] do you know how to deploy the grants for the new hosts? (and drop the old grants)
[10:33:12] I don't :-(
[10:33:28] I would ask in #-data-persistence
[10:33:43] or create a phab ticket and send it to folks
[11:33:50] taavi: they are added manually, the sql there is for reference only iirc
[11:34:32] but yeah, they will know better in data persistence
[12:20:18] dcaro: what if I add TLS to harbor in lima-kilo? with self-signed certs. Would the insecure-registry allow the buildpack lifecycle to work normally?
[12:20:38] self-signed certs are not supported in harbor
[12:20:46] (that was the first try we did back in the day)
[12:21:14] the docs say something different
[12:21:16] To configure HTTPS, you must create SSL certificates. You can use certificates that are signed by a trusted third-party CA, or you can use self-signed certificates.
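(As an aside, a minimal sketch of the docker-side "insecure-registry" setting referred to above; the hostname and port are made-up examples, not the real lima-kilo values:)
```sh
# The registry address listed in daemon.json has to match what the builder
# pushes to. With this in place, docker will talk plain HTTP to that registry.
cat /etc/docker/daemon.json
# {
#   "insecure-registries": ["harbor.local:80"]
# }
sudo systemctl restart docker                  # the daemon only re-reads this on restart
docker info | grep -A2 'Insecure Registries'   # confirm the entry was picked up
```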
[12:21:41] https://github.com/buildpacks/lifecycle/issues/524
[12:22:57] specifically https://github.com/buildpacks/lifecycle/issues/1077
[12:23:05] https://github.com/buildpacks/lifecycle/issues/1077 <--- wow isn't this the exact same problem?
[12:24:11] (sorry I'm on a meeting and can't focus on this atm)
[12:25:00] similar yes, old issue https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/blob/main/utils/parse_harbor_config.py#L86
[12:44:42] dcaro have you already tried the workaround described at https://github.com/buildpacks/lifecycle/issues/1077#issuecomment-1531731329 ?
[12:46:16] not really no, that would require us to inject the self-signed certs from harbor into k8s and then mount them in the pods, though
[12:46:21] (and really, this is working locally)
[12:46:38] so I never felt the need to continue trying
[12:48:29] but if there's no way to fix this on lima-kilo, it might be the way to go
[12:50:14] ack
[12:51:08] I think that some version of https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/12 will be enough
[12:52:20] dhinus: btw. have you tried the vagrant lima-kilo setup?
[12:52:30] not yet no
[12:54:52] ack, if it works for everyone, we can greatly simplify the lima-kilo code
[14:26:20] dcaro: I wonder how your local setup works without TLS. In other words, how is it possible you are not finding the same issue
[14:26:40] essentially, the docker config
[14:26:58] allowing insecure registries allows it to push/pull from http repositories
[14:27:30] but that is the setting lima-kilo is using
[14:28:02] I mean, there is nothing hidden or convoluted in the lima-kilo setup. The config used is in plain sight, and mostly it comes from the builds-api/builds-builder repos
[14:28:08] it has to match exactly (proto, ip and port)
[14:28:33] and there's a bit of a mess on how lima-kilo does that, as it injects the port in places where the helm charts don't expect it
[14:28:54] (as locally I use the standard port 80 for helm)
[14:28:57] *harbor
[14:30:42] I think the setup in general should be made robust enough to support running stuff on different ports, if possible!
[14:31:40] that's this no? https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/12
[14:31:59] there's many places where we have hardcoded ports
[14:32:04] (3003 and such)
[14:32:08] 30003?
[14:32:10] whatever xd
[14:32:25] but yes, if you need to start something on a different port, support to do so is needed
[14:32:58] I do think though that it's better to address on a need basis, instead of trying to support every configuration option from the start
[14:33:06] (for this at least)
[15:03:00] (sorry, very busy with meetings, I guess we can follow up tomorrow)
[16:52:05] hmm, it might be a runtime issue
[16:52:10] it seems lima-kilo uses containerd
[16:52:13] https://www.irccloud.com/pastebin/7x9kBNyi/
[16:52:17] while my local uses docker
[16:52:26] https://www.irccloud.com/pastebin/AsSU6zHn/
[16:52:37] I think containerd ignores dockerconfig
[17:02:36] * dcaro off
[19:01:35] Hey folks I made a mistake and changed a default for nginx in wmcs, while trying to fix another mistake :(, would love some advice on how to repair the damage. https://gerrit.wikimedia.org/r/c/operations/puppet/+/960708
[19:02:47] huh, I just pinged you in a different channel, we can swap tasks for a bit :)
[19:04:29] jhathaway: I'm still trying to figure out what that change actually does. Do you see evidence of issues on wmcs so far?
[19:06:10] was that causing the weird nginx "no such file or directory" issues we saw yesterday evening?
[19:07:07] taavi: probably
[19:08:20] the change flips a toggle in our puppetry to mount /var/lib/nginx on tmpfs, sorry for the lack of context, here is the first patch, https://gerrit.wikimedia.org/r/c/operations/puppet/+/959226
[19:08:51] prior to my mistake this feature was gated to production, but it has now been applied in wmcs
[19:08:52] yep I saw the first patch as that was breaking puppet on some VMs, but didn't connect the two issues together
[19:09:23] unfortunately it is not a trivial rollback, since flipping the toggle will not umount the volume
[19:09:39] I don't think I follow why we can't just switch the default patch to the proper default (and restart nginx everywhere)
[19:10:02] ok, so patch, unmount, restart nginx
[19:10:11] the volume to unmount would be the same everywhere right?
[19:10:18] andrewbogott: right
[19:10:55] those steps should work, but I'm unfamiliar with how to do that in wmcs, and whether I have the proper permissions
[19:10:56] so a rollback might cause service interruption to hosts that /want/ to use the temp volume, but puppet will correct it on a subsequent run
[19:11:05] and we may have service interruption happening already
[19:11:26] * andrewbogott waits for taavi to agree or disagree with that plan
[19:11:55] we may need to stop nginx prior to unmounting if linux considers the volume busy
[19:12:03] true
[19:12:45] yeah. and I don't think the current puppetization takes the possibility of a rollback into account, so we would need to clean fstabs manually
[19:13:25] ugh, that is also true, puppet's fstab handling is embarrassingly poor
[19:13:25] wait, it adds an fstab entry too?
[19:13:43] my understanding is that mount{} does
[19:13:45] is there a specific reason we would want to have the directory (not) on a tmpfs on cloud vps?
[19:13:55] ok, for starters, what's a good command to run to detect whether or not that flag has done something on a given VM?
[19:14:09] mount|grep /var/lib/nginx
[19:14:45] Oh, this is nginx, so I guess there's no state there to be lost on a reboot
[19:14:57] so maybe we can just ignore this and pretend like this was a change for the better :)
[19:15:26] restarting nginx on the affected instances where that wasn't already done by hand might be a good idea
[19:16:15] I'm also happy to help revert, just to reduce complexity for no obvious gain
[19:17:01] the main downside is probably more memory usage, but I haven't tried to quantify how much
[19:17:06] * andrewbogott runs a query to find affected instances
[19:18:47] * andrewbogott once again remembers how much time https://gerrit.wikimedia.org/r/c/operations/software/cumin/+/869332 would save
[19:22:48] damnit does cumin work on vms anywhere now?
[19:22:52] taavi: used cumin lately?
[19:23:00] I'm getting different failure messages no matter where/what I try
[19:23:17] I haven't tried it recently
[19:26:13] Today may be the straw that breaks my back, cuminwise
[19:27:00] :)
[19:37:51] apparently I am going to spend my day fixing cumin rather than using cumin to fix this issue.
[19:51:05] the true mark of a good sysadmin ;)
[20:05:07] jhathaway: can you please document the nginx issue (or potential issue) in a phab ticket and make T347428 a sub-ticket? thx.
[20:05:13] T347428: cumin and cloud-vps instances not working - https://phabricator.wikimedia.org/T347428
[21:12:30] andrewbogott: will do
[21:19:57] https://phabricator.wikimedia.org/T347432
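(For the record, a rough per-instance sketch of the rollback steps discussed above around 19:10–19:13: untested, to be run only after the reverting puppet patch is merged, and it assumes the fstab entry literally contains /var/lib/nginx:)
```sh
# Only touch hosts where the toggle actually took effect:
if mount | grep -q /var/lib/nginx; then
    systemctl stop nginx                      # the mount may be busy while nginx runs
    umount /var/lib/nginx
    sed -i '\|/var/lib/nginx|d' /etc/fstab    # puppet won't remove the entry on rollback
    puppet agent --test                       # re-apply the reverted default
    systemctl start nginx
fi
```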