[06:39:18] greetings
[07:34:49] I want to get to the bottom of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184792 today, I'm using the toolsbeta puppetserver as a playground to test, FYI
[08:13:31] morning
[08:21:49] welcome back!
[09:09:30] FYI I'll take a look at tools-k8s-worker-nfs-66 shortly
[09:10:58] ACK
[09:30:51] mmhh, the alert went away, but anyways I did poke a little and I'm opening a task re: NFS getting stuck with what I found
[09:31:15] because it is causing so much grief it isn't even funny
[09:32:30] I acked it
[09:32:37] (so nobody would act on it)
[09:33:22] oh thank you dcaro ! ack that makes sense
[09:33:29] see what I did there? ack
[09:33:57] xd
[10:05:05] oh, something is happening
[10:05:06] looking
[10:06:43] it might be prometheus not getting stats
[10:07:42] nodes are up and running
[10:07:58] things look ok, I think it's just monitoring/o11y
[10:08:17] wait, it's toolsbeta
[10:09:51] things look up and running, so probably prometheus/alertmanager
[10:13:29] there's something wrong with the ssl key
[10:13:38] `err="error creating HTTP client: unable to use specified client cert (/etc/ssl/localcerts/toolsbeta-k8s-prometheus.crt) & key (/etc/ssl/private/toolsbeta-k8s-prometheus.key): tls: failed to find any PEM data in key input" scrape_pool=k8s-cadvisor`
[10:13:54] yep
[10:13:57] https://www.irccloud.com/pastebin/LcOlARD1/
[10:14:23] that sounds like the private repo is missing some values
[10:15:43] it seems it was reset to main
[10:15:59] reverting with `root@toolsbeta-puppetserver-1:/srv/git/labs/private# git reset --hard 3c9e010f` (that's `HEAD@{1}`)
[10:20:32] hmm.... this might have messed up some things (harbor credentials, etc.), might have to do some tests around
[10:55:56] got a fix for the lima-kilo logging issues, until there's a fix upstream: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/963
[11:24:00] dcaro: I apologise, that was me re: the toolsbeta private reset
[11:24:19] please let me know how I can help fix it
[11:25:14] No. You can try running the functional tests, and see what fails, I'll do so after lunch too
[11:25:20] *np
[11:25:34] ok! will try running the functional tests dcaro
[11:26:11] 👍
[11:26:15] Thanks!
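(For reference, a minimal sketch of the recovery flow used above on toolsbeta-puppetserver-1, assuming the reflog still held the entry from before the accidental reset to main; 3c9e010f is the hash quoted at 10:15:59:)

    # find and restore the state before the accidental reset to main
    cd /srv/git/labs/private
    git reflog -3                  # HEAD@{0} is the reset to main, HEAD@{1} the previous state (3c9e010f here)
    git reset --hard 'HEAD@{1}'    # same effect as the `git reset --hard 3c9e010f` above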
[11:30:25] mhh, looks like bats is missing in the shared venv and/or the venv needs recreation
[11:30:35] ModuleNotFoundError: No module named 'bats_core_pkg'
[11:30:45] I'll stop here as I'm not sure what to do next
[11:57:13] * dcaro back
[11:57:17] looking
[11:59:56] I'll just force it to be recreated: `sudo become test` (the user in toolsbeta is `test`, from the run_functional_tests.sh script), then `rm -rf ~/venv`
[12:00:13] ack
[12:00:33] it will take a minute to recreate
[12:00:43] hmm, it failed
[12:01:10] oh, this rings a bell
[12:02:16] hmm, the issue is that the venv is created with python 3.11, but the bastion is using 3.13
[12:03:33] I had this issue when testing the trixie bastion, looking for a fix
[12:04:15] tools is also using trixie, so I'll just swap the python version used to generate the venv
[12:05:46] `| * b946a27 (origin/use_python_313) (2 weeks ago) David Caro run_functional_tests: use the python3.13 image for venv creation`
[12:05:46] xd
[12:07:08] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/964
[12:07:21] let me double check that the bastion selected for the component deployments is the latest, not the oldest
[12:07:28] paws is down
[12:07:43] and back up
[12:12:18] and https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1188325
[12:14:01] okok, with the first of those patches I'm able to run the functional tests in toolsbeta, the first error seems to be a DNS issue of sorts: `Could not resolve host: gitlab.wikimedia.org`
[12:14:28] not what I would have expected xd
[12:16:24] heheh
[12:22:03] it seems to happen in many pods, I'll restart coredns
[12:23:03] no errors in its logs though, but maybe some network things got botched?
[12:24:07] that seemed to help
[12:24:11] https://www.irccloud.com/pastebin/I7cnt3Ag/
[12:24:20] from a shell inside the components-api api container
[12:24:49] ack
[12:25:05] rerunning the functional tests, no idea why that failed though :/
[12:25:26] or if it was related to the secrets thing
[12:26:14] huh. it seems we are pulling kube-state-metrics directly from quay, and it's failing
[12:26:27] `│ stream logs failed container "kube-state-metrics" in pod "kube-state-metrics-58cdf6c5d-zfv4p" is waiting to start: trying and failing to pull image for kube-system/kube-state-metrics-58cdf6c5d-zfv4p (kube-state-metrics)`
[12:26:30] that's not good
[12:28:15] indeed
[12:32:01] ok.... so it seems this was moved to a different repo at some point, and even the old versions don't work anymore https://github.com/prometheus-community/helm-charts/commit/095e25e0f381080abf68de45a1e5e8257d186253
[12:32:12] so we might have to migrate too.... I'll open a task
[12:32:20] this might be failing in prod already
[12:33:54] T404585
[12:33:54] T404585: [kube-state-metrics,wmcs-k8s-metrics] the images from quay don't work anymore - https://phabricator.wikimedia.org/T404585
[12:34:14] on the bright side, tests seem to be passing so far :)
[12:44:38] all tests passing in toolsbeta :)
[12:44:58] \o/ nice, sorry again for the private.git reset
[12:49:23] np :), stuff happens
[12:49:30] paws is down again
[12:50:15] 3 nodes are not ready
[12:51:01] trying to get a debugger container in one of them
[12:52:05] nah... the container is not starting
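(A sketch of the kind of node debugging attempted here, assuming `kubectl debug` is available against the paws cluster (Kubernetes/kubectl 1.20 or newer); the node name is hypothetical:)

    # list nodes that are not Ready, then try to get a debugging shell on one of them
    kubectl get nodes | grep NotReady
    kubectl debug node/paws-worker-example -it --image=busybox -- chroot /host /bin/sh
    # if the debug pod itself never starts (as happened above), rebooting the node is the fallback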
[12:52:53] I'll reboot the nodes
[12:56:57] the nodes are coming back online
[13:00:02] the ingress-nginx controller is down (and the metrics services)
[13:00:10] trying a rolling reboot to force recreating the pods
[13:00:32] the service seems to be responsive again
[13:01:25] okok, all services green in paws :S, have to find a way to debug those issues
[13:04:39] back to kube-state-metrics....
[13:05:02] tools is using an image from our registry, and lima-kilo too :/, so not sure where the quay image in toolsbeta comes from, looking
[13:10:21] oh... we have two kube-state-metrics deployments in toolsbeta, one in the kube-system namespace and one in the metrics namespace
[13:11:49] I think it's a leftover, it was deployed in 2019, I'll just delete it, then we don't need to upgrade it as we are using a new enough version :)
[13:14:58] ack
[13:42:51] quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/963 and https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961 fix the lima-kilo logging setup, and the test that was not failing
[13:43:36] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/962 and https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/214 fix/improve the collecting of logs from short-lived pods
[13:43:55] (those last two might take a bit longer to review, and are less urgent)
[13:44:05] but they affect users the most
[15:33:53] "Your process `(sd-pam)` has been killed on tools-bastion-14 by the Wheel of Misfortune script." -- that sounds like a new system process that hasn't been exempted from the script.
[15:45:21] Filed as T404601
[15:45:22] T404601: (sd-pam) killed by Wheel of Misfortune on Toolforge bastion - https://phabricator.wikimedia.org/T404601
[15:53:09] bd808: are you still running intel hardware these days? I have a weird django build issue which I could use your help with, but it will be much easier for you to reproduce on intel
[15:57:29] this is for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy and the issue is approximately 'why no build?'. There's a dependency issue which is weird but easy to fix, and then a static-resources thing that I can't get past. Ever since upstream moved from having MANIFEST.in to using pbr, everything is wonky.
[15:58:47] andrewbogott: I've been on an M3 MacBook for quite a while. Most things work for me under rosetta in amd64 containers though
[15:59:10] ok. That's approximately what I'm doing too (running an emulated VM in UTM)
[15:59:18] It's just a matter of taking 20 minutes to build vs 5.
[15:59:23] * bd808 peeks at ci failure
[15:59:57] unfortunately the change I just pushed will fail with the boring issue (wrong dependency version in requirements.txt) rather than the interesting issue.
[16:00:17] I /really/ don't want to form the requirements repo but I might have to, to get to the interesting one
[16:00:21] *fork
[16:00:54] is "ERROR: No matching distribution found for openstacksdk>=4.5.0" the more interesting failure?
[16:01:15] hm, no, I thought I fixed that long ago...
[16:01:26] is the job you're looking at recent or from a few days ago?
[16:01:36] September 5th
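(On the "boring" dependency failure above: a sketch, assuming shell access to the build environment, of checking whether the `openstacksdk>=4.5.0` requirement can actually be satisfied from the index the build uses; both commands are standard pip, though `pip index` is still marked experimental:)

    # which openstacksdk versions does the configured index actually offer?
    pip index versions openstacksdk
    # resolve the requirement without installing anything
    pip install --dry-run 'openstacksdk>=4.5.0'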
[16:02:23] The actual thing that's stumping me comes at the very end of the build, and it's from compiling static resources
[16:04:45] andrewbogott: i noticed there were missing zones in the new gitlab-runners-staging project so i reopened https://phabricator.wikimedia.org/T404386
[16:04:51] sorry bd808, you'll have to patch and then build locally to see what I'm seeing I think.
[16:04:53] let me know if you'd rather i file a new task
[16:05:12] dduvall: sorry, I saw that but then forgot to make the zones, I'll try to do that shortly. I'm a bit distracted today though, so you might have to nudge again
[16:05:34] no prob. thanks for doing that
[16:05:43] bd808, here is what I'm building with:
[16:05:49] https://www.irccloud.com/pastebin/LFaaEeVp/
[16:08:32] andrewbogott: which branch are you on? main or something else?
[16:08:35] main
[16:09:24] thank you! I probably should have got this cleaned up a bit better before asking, but I /think/ it's just that one-line change in order to see what I'm seeing. I'm double-checking but... 15 minutes away from results.
[16:09:54] git is not loving updating my existing clone...
[16:11:05] * bd808 nukes this old clone
[16:11:17] I did a bunch of drastic branch reform so yeah, you'll need a fresh checkout
[16:11:33] I'm trying to get us down to just a main branch and an upstream branch because my previous process was baffling to everyone but me
[16:17:37] (a possibly related mystery is: why did I have to add --use-pep517 to the pip stage to get templates installed when everyone upstream asserts that it works properly by default?)
[16:22:37] "compressor.exceptions.OfflineGenerationError: No 'compress' template tags found in templates.Try running compress command with --follow-links and/or--extension=EXTENSIONS"
[16:22:57] that failure came out of installpanels.sh for me
[16:31:48] yep, that's what I saw on my last run too. It's from the very last line of installpanels.sh, 'run_manage compress --verbosity 3 --force'
[16:32:22] I'm pretty sure it's a result of some sort of dependency failure where the 'horizon' module isn't actually installing all needed files during the 'pip install' stage.
[16:32:35] But I would love it if it turned out to just be a syntax error instead :)
[16:33:01] The tedious thing I've been doing is alternating builds with that 'compress' line commented out so I can look at what files are actually copied into the container.
[16:34:42] * andrewbogott doing that again, now
[16:35:06] Raymond_Ndibe: I think I found a way to handle dynamic defaults "nicely" in fastapi, tweaking `model_fields_set`
[17:09:22] yeah, there are 0 .html files installed in /opt/lib/python/site-packages/openstack_dashboard -- I don't understand how that can be unless pbr is disabled in some secret way
[17:18:23] dcaro: paws is down again I think?
[17:23:18] andrewbogott: yep
[17:23:56] I rebooted the workers and it's back. Seems like that fix is working less and less though
[17:25:47] yep :/
[17:26:51] could it just be a drive space thing with /tmp filling up? That's one thing I know of that fits the 'gets better with rebooting but still degrades over time' pattern
[17:27:30] maybe not /tmp, but a combo, /tmp + /var/log, so every reboot it gets better for less time (/tmp clears, but /var/log still grows)
[17:27:59] right, that fits
[17:28:13] that means we should just redeploy more often :)
[17:29:34] maybe
[17:35:37] * dcaro off
[17:35:40] cya tomorrow
[17:51:57] andrewbogott: I fell into a meeting-shaped hole, but I'm briefly out now before lunch. I think that you are correct that the template compressor is not finding the templates from the top-level horizon package. The various "Error parsing template ..." messages just before the stack trace are the templates that reference some other template that is not being found.
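(A sketch of the check being done here, run from a shell inside the built container; the paths and package names are the ones quoted in this conversation, and the loop itself is just illustrative:)

    # count how many Django templates each dashboard package actually shipped with
    for pkg in openstack_dashboard octavia_dashboard designatedashboard; do
        printf '%s: ' "$pkg"
        find "/opt/lib/python/site-packages/$pkg" -name '*.html' 2>/dev/null | wc -l
    done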
[17:53:06] I will make some time after lunch to try and figure out if the needed content is in the container but not in the right search paths, or if it is missing altogether.
[17:54:23] it's missing altogether. Some of the submodules get the resources installed and some don't
[17:55:26] like /opt/lib/python/site-packages/octavia_dashboard has lots of templates but designatedashboard has none
[17:56:02] this all seems related to https://review.opendev.org/c/openstack/octavia-dashboard/+/950205
[17:56:08] but I hope you have a good lunch!
[18:30:01] (now I have to run errands, but with luck I'll be back in a couple hours)
[20:04:47] * dduvall nudges andrewbogott about zones
[20:56:30] andrewbogott: I got bin/installpanels.sh to run to completion in a local container. Now to try and figure out which of the changes I made on the way were necessary...
[20:58:51] dduvall: barring typos you should have your dns zones now.
[20:59:07] bd808: that sounds promising! Assuming that you didn't fix it by just skipping the hard parts :)
[21:00:00] I think the big things I tweaked were requirements/upper-constraints.txt (`horizon>=25.4.0`), horizon/setup.cfg (`version = 25.4.0` in [metadata]), and bin/installpanels.sh (changed run_manage to run from /srv/app/horizon/manage.py)
[21:01:27] the horizon package uses pbr, which wants to guess a version number based on git tags. it was guessing something that didn't match other package expectations.
[21:01:53] ooh, the path where we run manage.py was one of my prime suspects
[21:02:42] the git tag thing also makes sense; I was figuring on waiting to release this until they tag the next release in a couple weeks, but I didn't think that would fully break the build in the meantime...
[21:02:43] yeah, I wonder if I can roll back everything other than that. That was the last thing I changed before it all worked
[21:03:31] other than that == manage.py location
[21:05:06] * andrewbogott is considering the possibility that pbr is terrible
[21:05:35] pbr is all about convention over configuration and doing things exactly like upstream expects
[21:05:57] when it works it is great, and when it breaks it is mysterious and spooky
[21:06:44] andrewbogott: i see them. thank you!
[21:07:02] I'm now seeing that there are two lines installing the horizon project and one looks like it was already trying to work around pbr version guessing issues
[21:07:35] yeah
[21:07:56] sloppy, I have the un-versioned one removed in my local checkout but I haven't committed that yet apparently
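(A sketch of inspecting the version pbr derives from git tags, and of overriding it; `PBR_VERSION` is pbr's documented override knob, 25.4.0 is just the value pinned in setup.cfg above, and whether this build honours the override is an assumption:)

    # inside the horizon checkout: what would pbr guess?
    git describe --tags                  # the tag/commit pbr's guess is derived from
    python setup.py --version            # the version pbr actually computes (setup.py use is deprecated, but handy here)
    # bypass the guessing entirely by pinning the version at install time
    PBR_VERSION=25.4.0 pip install --use-pep517 .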
[21:20:38] andrewbogott: https://phabricator.wikimedia.org/P83350 makes the container build for me. I haven't tried running it or checking anything inside it to make sure things are correct, but at least no build time crash.
[21:21:28] promising! Thank you, I'll give it a try (and of course ~an hour will elapse before I know anything).
[21:22:02] That's the whole diff vs the 40c730b treeish
[21:22:31] my local build command is `DOCKER_DEFAULT_PLATFORM=linux/amd64 docker build --tag horizon/production --target production -f .pipeline/blubber.yaml .`
[21:25:11] After it builds you can use `DOCKER_DEFAULT_PLATFORM=linux/amd64 docker run --rm -it --mount type=bind,src=.,dst=/srv/app --hostname horizon --entrypoint /bin/bash horizon/production` to poke around inside the container if you want.
[21:34:25] * andrewbogott prepares to wait
[22:05:09] bd808: that has moved me on to runtime dependency failures, which is maybe progress.
[22:05:11] mostly
[22:05:18] https://www.irccloud.com/pastebin/SgyRuMET/
[22:07:00] uhhhhh.... the rust crypto bindings don't support multiprocessing? Is that what the upstream bug is saying?
[22:09:48] seems like it, but that can't be the actual problem since presumably this code works for someone
[22:10:23] can I ask what things you have changed since this last worked?
[22:10:44] * bd808 has asked so apparently he *can* ask
[22:11:11] Merged w/upstream head
[22:11:15] so, too many things
[22:11:34] I can certainly try to unwind if needed (especially if I have a way to build, which I now have thanks to you)
[22:11:54] ok, so you jumped 1+ Horizon release versions?
[22:12:30] yeah, from 2024.1
[22:12:50] and the head is approximately 2025.2, so that's... three releases.
[22:18:16] I need to finish prepping for a meeting tomorrow morning so I'm going to back slowly away from this tarpit, but shout if you end up needing quick help andrewbogott. Hopefully I can make some time to check in with you tomorrow to see how things are going.
[22:19:17] thanks bd808, you've gotten me past an issue I didn't know how to approach and on to issues I do know how to approach, so I think I'm good now apart from the verrrry slow test/dev cycle
[22:20:02] progress! ;)
[22:20:38] I left some notes at https://phabricator.wikimedia.org/P83350#334471 that may or may not be helpful.
[22:22:08] thx