[00:02:44] trying again with --use-pep517 back in [00:03:56] * bd808 kicks off `DOCKER_DEFAULT_PLATFORM=linux/amd64 docker build --tag horizon/production --target production -f .pipeline/blubber.yaml .` [00:09:53] nah, still no templates. [00:10:19] I need to go make dinner. Thank you for looking, but please stop when you're tired of looking :) [00:11:21] andrewbogott: it works for me. I'm going to post my local commands on that paste again. Maybe you will spot the difference... [07:55:05] morning [07:56:04] there's a bunch of workers with D processes, I'll leave them for a few minutes so godo.g can have a look [08:04:40] greetings [08:04:46] dcaro: thank you! taking a look [08:05:46] I restarted kubelet on tools-k8s-worker-nfs-17, as one of its threads was in 'D' state and I wanted to see if it would hang; it was able to stop and come back online. From lsof I did not see any open files under the NFS directories for it though [08:07:17] it looks like the usual "there was a blip, a few tools got stuck (usually lighttpd/php webservices) on a file, then wmf-auto-restart starts piling up stuck lsofs" [08:08:58] there are some logs about nfs not being around [08:09:00] https://www.irccloud.com/pastebin/RiL65GON/ [08:09:52] ack, yeah I'm rebooting the affected nfs workers [08:10:06] ack [08:10:38] I note that there was an OOM event that killed a process doing NFS stuff (from the stack trace) ~15h before those two logs [08:13:39] not sure it's related to anything, just dumping info xd [08:14:52] hehe ok! thank you, definitely good to know [08:15:26] feel free to update T404584 too if you want, more info defo helps [08:15:27] T404584: Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584 [08:43:07] godog: done, mentioned also a few old tasks, I think there are some missing though, as I remember writing about the wmf-auto-restart thingie [08:46:12] thank you dcaro ! [08:53:06] hmm... tools-k8s-worker-nfs-82 has had 3 instances in the last few days in which it got clearly stuck for a long time [08:53:30] tonight actually [08:53:32] https://usercontent.irccloud-cdn.com/file/B2jyYFKj/image.png [08:53:44] it's the green one there with the three serrated spikes [08:54:41] but that one was not rebooted, so probably real usage? maybe the NFS server is getting too loaded? [08:57:18] could be yeah [08:59:43] oh, if you have some time, can you test https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/963 ? [08:59:49] it should fix the loki logs in lima-kilo [09:00:25] yep will test in a little bit [09:02:52] thanks :) [09:16:18] dcaro: how would I go about testing the patch in lima-kilo locally? I don't know the toolforge-deploy and lima-kilo relationship FWIW [09:16:34] sure! [09:16:46] so inside your lima-kilo, there's a clone of toolforge-deploy under ~/toolforge-deploy [09:17:23] that repository is what the script `toolforge_deploy_mr.py` uses to deploy MRs (it changes the charts in the files, you might see the diffs if you deployed anything) [09:17:35] it's also where ansible deployed the components from during the install [09:18:05] so there you can `git fetch --all` and `git reset --hard origin/<branch>`, then manually deploy with `./deploy.sh <component>` [09:18:26] where `<component>` is one of the dirs under `~/toolforge-deploy/components` [09:19:13] in this case, it's `logging` [09:19:59] ok! 
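
[Editor's note: for the D-state worker debugging earlier in this morning's scroll, a minimal sketch of how one might list processes stuck in uninterruptible sleep and see where they are blocked. The PID and mount path are placeholders, not values from this incident.]

    # Show processes stuck in uninterruptible sleep (D) and the kernel symbol they wait in.
    ps axo pid,state,wchan:30,comm | awk '$2 ~ /^D/'

    # For one stuck PID, the kernel stack usually makes an NFS hang obvious
    # (frames like rpc_wait_bit_killable or nfs_wait_on_request). Needs root.
    sudo cat /proc/<PID>/stack

    # lsof itself can hang on a dead NFS mount; -b avoids the blocking kernel calls.
    sudo lsof -b -w /mnt/nfs/<share> 2>/dev/null | head
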
all clear, thank you [09:20:00] so, in short, inside lima-kilo, `cd ~/toolforge-deploy && git fetch --all && git reset --hard origin/fix_alloy_local && ./deploy.sh logging` [09:20:16] note that it will not revert to main unless you do so manually [09:20:21] and then try your reproducer in T404226 [09:20:21] T404226: [logging,lima-kilo] loki setup fails to start on linux - https://phabricator.wikimedia.org/T404226 [09:20:30] yep :) [09:20:56] there's also https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/963#note_164313 [09:21:29] (to test) [09:21:42] neat, thank you, ok I had destroyed the lima-kilo VM on the last test so it is recreating it, which will take some time [09:21:48] taking a break in the meantime [09:22:10] https://xkcd.com/303/ [09:22:21] xd [09:22:26] hahaha! [09:33:57] oh, tools-prometheus-9 died again [09:48:40] :( [09:49:30] my workstation froze and I'm not sure why, anyways I've resumed lima-kilo bootstrap and will test after lunch [09:49:45] okok, bon appetite! [09:50:31] cheers [10:31:51] I opened T404833 for puppet failures for worker nfs-17, not sure why it's failing to mount dumps 1001 [10:31:52] T404833: [infra,pupppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833 [10:32:12] * dcaro lunch [10:32:25] if anyone is interested in debugging you'll be very welcome :0 [10:32:27] :) [11:26:09] dcaro: I tested locally just now with https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/963 applied and then ./deploy logging though toolforge jobs -f still hangs for me [11:27:00] I'll take a look at the puppet failure [11:30:09] godog: oh, interesting, what do the logs for the alloy pod show? [11:32:00] nothing of note (to me anyways) https://phabricator.wikimedia.org/P83403 [11:34:19] hmm, it's not complaining that it can't send the batch, but it did not pick up any logs either [11:34:25] what job are you running? [11:36:26] it should show logs like [11:36:33] `│ ts=2025-09-17T11:35:09.396642531Z level=info msg="tail routine: started" component_path=/ component_id=loki.source.file.pod_logs component=tailer path=/var/log/pods/tool-tf-test_test-17110-29301815-ld52v_d30e9d7e-de8e-4f36-ac82-e28a106ebe44/job/0.log` [11:41:26] oh, the job is in the paste [11:41:39] I see, that should create the logs and alloy should try to pick them up :/ [11:42:21] oh yeah the job didn't start [11:42:27] | testlogs | continuous | Not running | [11:43:40] not sure why tbh, anyways I'll keep looking at the dumps1001 issue [12:06:41] we're back on with nfs-17, not sure what happened to get the client in the status reported in the task [12:06:58] oh, so now it works again? [12:07:04] yes [12:07:12] :/ [12:07:17] you saw it fail also right? [12:07:32] (just double checking I'm not seeing things) [12:07:34] I did yes, I kicked the nfs-17 client from clouddumps1001 nfs server [12:07:46] so maybe some stuck session of sorts? [12:08:00] btw. I found why jobs don't work on lima-kilo I think [12:08:03] `MountVolume.SetUp failed for volume "etcopenstack-clouds" : hostPath type check failed: /etc/openstack/clouds.yaml is not a file ` [12:08:05] looking [12:15:19] this should fix it https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/275 [12:15:42] testing it (requires recreating the kind cluster... 
as the file will have to be inside the k8s controller node) [13:08:18] okok, that does fix the hostPath issue, merged [13:08:32] with a clean lima-kilo, I'm getting `│ ts=2025-09-17T13:07:50.193253582Z level=warn msg="error sending batch, will retry" component_path=/ component_id=loki.write.loki_tools component=client host=loki-tools.loki.svc.cluster.local:3100 status=-1 tenant=tool-tf-test error="Post \"http://loki-tools.loki.svc.cluster.local:3100/loki/api/v1/push\": context deadline exceeded"` from alloy logs [13:11:19] godog: I went ahead and merged the logs patch, as I re-reproduced in a clean lima-kilo install again, let me know if you find more issues next time [13:12:51] godog: when you have 5 minutes free I could use your brain for some tracing-related monitoring questions ;) [13:38:16] dcaro: ack! [13:38:22] volans: yes ready when you are [13:38:35] in a meeting now, will ping you when finished [13:38:37] thanks [13:39:49] ack [13:41:16] I need to update hiera 'bastion_hosts' with v6 addresses, and I'm failing to find the cloud vps v6 subnet that covers all instances [13:41:39] https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_IP_space? [13:43:49] thank you, that's indeed what I was after [13:58:18] godog: tools-k8s-worker-nfs-32 is having the same mount issue that 17 had [13:58:18] somewhat surprised that the /56 isn't in puppet anywhere [13:58:54] mmhhh I did reboot 32 earlier today [14:00:33] dcaro: thank you I'll take a look [14:08:45] ok dcaro, I have one immediate (probably unrelated) question, which is that 'pip install' dumps all of its working files (e.g. .dist-info things) in /srv/app and I absolutely cannot configure it to do that someplace else. I've tried changing cwd, setting TMP, setting --cache-dir, nothing works [14:08:50] that ring any bells? [14:10:05] which environment are you running this in? [14:10:10] blubber build? [14:10:12] yeah [14:10:36] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/blob/main/bin/installpanels.sh?ref_type=heads [14:10:39] do you have the blubberfile thingie? (I'm suspecting that blubber overwrittes the env stuff) [14:10:43] yep that :) [14:11:00] that's actually not the blubberfile, that's the thing that's invoked by the blubber file [14:11:15] blubber.yaml is https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/blob/main/.pipeline/blubber.yaml?ref_type=heads [14:12:07] (also, that installpanels.sh does /not/ include my attempts to change where the build files go, those are all in local changes. But you can see the context) [14:13:58] * dcaro trying to build [14:14:36] it's going to be a long trip :( [14:14:46] Here's me trying to get pip to use someplace else: [14:14:49] https://www.irccloud.com/pastebin/VHVpQSep/ [14:17:44] nice, I left a podman-builder script [14:19:24] dcaro, the real problem is trying to figure out why I don't get .html templates in /opt/lib/python/site-packages/horizon. The workdir issue is just me speculating that maybe there's some kind of dir clobbering happening during the build step since in theory it would be making working trees with the same name as the source trees. [14:20:05] ack [14:20:07] So I think, oh, I'll just move the build dir someplace else so I can see what's happening and that's when I learned that I have no influence over the course of events [14:20:40] was it installing them in that path before? 
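
[Editor's note: a minimal sketch of the hostPath behaviour behind the MountVolume.SetUp error discussed before lunch. With hostPath `type: File` the kubelet checks that the path already exists as a regular file on the node's own filesystem, and in lima-kilo "the node" is the kind controller container, hence the fix of putting clouds.yaml inside it. Container and pod names below are made up; this is not the actual lima-kilo change.]

    # The failing volume is declared roughly like this (hypothetical fragment):
    #
    #   volumes:
    #   - name: etcopenstack-clouds
    #     hostPath:
    #       path: /etc/openstack/clouds.yaml
    #       type: File    # kubelet requires an existing regular file on the node
    #
    # Because the kind node is a container, the file has to exist inside that container:
    docker exec <kind-node-container> mkdir -p /etc/openstack
    docker cp clouds.yaml <kind-node-container>:/etc/openstack/clouds.yaml

    # Until then, the pod stays pending with an event like the one quoted above:
    kubectl describe pod <pod-name> | grep -A2 'hostPath type check failed'
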
(iirc blubber is tricky when trying to put things inside /) [14:21:48] I believe that if I specify the template files in MANIFEST.in that they show up in /opt where I want them. [14:22:47] during the build it runs as `runuser` I think, so might fail to put them there? [14:22:51] got the container built, looking [14:25:13] hmpf... I saw a warning about installing packages in that directory, but it went away too quick [14:26:41] I think the build runs as `someuser`, manually running the install script from within the container... [14:27:15] this is the warning I saw (there's a bunch similar) [14:27:17] `WARNING: Target directory /opt/lib/python/site-packages/python_glanceclient-4.9.0.dist-info already exists. Specify --upgrade to force replacement.` [14:31:03] this exists /opt/lib/python/site-packages/wikimediapuppetdashboard/templates/puppet/prefix_panel.html [14:33:00] andrewbogott:can you give me a specific dashboard and the path it should end up in? [14:33:08] (a specific html file) [14:41:11] find . -name "*.html" /opt/lib/python/site-packages/horizon [14:42:10] templates/horizon/common/_detail_tab_group.html [14:54:15] godog: sorry, meeting went longer than expected [14:54:34] ain't that the truth of meetings [14:55:18] so for the tracing stuff of nfs usage (file access) we will have sparse data like in this graph [14:55:23] https://grafana.wmcloud.org/goto/feoRuhCNg?orgId=1 [14:56:06] but what we really care about is what talked to what in the last X days with X that could potentially be something like 7 or 30, dunno yet [14:56:40] and I was wondering if prometheus is actually a good backend for them or maybe we could do something easier/better [14:57:16] it does feel more like logging [14:59:05] yeah, I wonder also what's the timeline, could be that a tool depends on another tool but only once a year? [15:00:20] no idea how likely that is tbh [15:01:33] I guess we will have to gather some data to know what the data looks like ;) [15:01:58] almost anything is possible in tool interdependencies, but I would generally expect that very few tools are actually interdependent. Magnus is known to have a bunch of older tools that use one tool as a shared library, but that has always felt like an outlier. [15:02:16] there's pywikibot tool used by others [15:02:47] do you mean the /shared stuff for pywikibot? [15:04:00] from the paths they are using they are doing something that touches also git files in /mnt/nfs/labstore-secondary-tools-project/pywikibot/public_html/core/ [15:04:42] my bet some git command on that directory [15:04:53] yeah, that's the /shared thing. Let me see if I can find the docs on it. [15:15:06] https://github.com/pywikibot/Pywikibot-nightly-creator/blob/master/nightly looks to be the thing happening in pywikibot/public_html/core. That is different than the thing I was remembering, but would very much be git on NFS every night. [15:22:58] * godog off, will read backscroll later [15:23:10] The files in /data/project/shared/pywikibot are the thing I was thinking about. That was once the recommended place to run pywikibot from -- https://wikitech.wikimedia.org/w/index.php?title=Help:Toolforge/Running_Pywikibot_scripts_(advanced)&direction=prev&oldid=2025687#Using_the_shared_Pywikibot_files_(recommended_setup) [15:26:44] bd808: David suggested this change which makes everything behave reasonably. We don't have a complete theory for why it matters, but... it does. 
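
[Editor's note: a hedged sketch of the two pip behaviours this part of the log circles around; the target path and the pinned version come from the warnings quoted above, everything else is illustrative. pip's scratch/build directories come from Python's tempfile module, so TMPDIR is the conventional override on Linux; and a `--target` install does not resolve against what is already in the target directory, so a repeat install only prints the "Target directory ... already exists" warning unless `--upgrade` is passed, in which case pip replaces the existing package directory outright (apparently what the commit linked just below adds).]

    # Point pip's temporary build/working directories somewhere else:
    export TMPDIR=/tmp/pip-work && mkdir -p "$TMPDIR"

    # Re-installing an already-present distribution into a --target dir only warns...
    pip install --target /opt/lib/python/site-packages 'python-glanceclient==4.9.0'
    # ...while --upgrade forces pip to replace the existing package directory:
    pip install --target /opt/lib/python/site-packages --upgrade 'python-glanceclient==4.9.0'
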
https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/commit/8a63ded02e778e563185c80d79e1bb099aca8665 [15:28:11] It shouldn't hurt anything. I saw things whining about needing upgrade at some point, but I thought it went away when the double install of the horizon package was removed. [15:29:14] thx for the pointer for pywikibot [15:30:58] iirc there's also a few other tools that share the home (having one tool for frontend, one for backend and stuff like that) [15:31:04] but not many [15:31:16] bd808: the half-a-theory is that pip was doing some dependency nonsense where it installed horizon from an upstream repo, then tried to install our horizon over it but failed because of no --upgrade. But of course that wouldn't explain why /upstream/ horizon wouldn't have templates... [15:32:01] ...but now I'm worried that with --upgrade it's installing upstream Horizon over OUR horizon, better check that [15:38:55] yeah, dcaro, I think all we did was --upgrade our local Horizon install with a downloaded install. So it's the other way 'round: the install we /want/ doesn't have the templates but then that gets overwritten (--upgrade'd) with a downloaded one that does. [15:39:01] So back to the drawing board I think :( [15:39:16] ack [15:40:04] need to run a couple more tests to see for sure [15:40:22] export PIP_NO_CACHE_DIR=1 [15:40:31] that might be the one forcing pulling all deps every time [15:42:08] ok, will try that next [15:42:21] right now I'm trying with --upgrade but installing horizon /last/ just to see [15:45:01] how did you check that the package installed was not the custom one? [15:49:43] just grepped for a code comment in a custom change. "When this is allowed the panel render looks like garbage with" [15:50:20] when I --upgrade the horizon install at the end I'm back to getting our custom code installed but w/out templates [15:56:36] I get the same yep [15:58:56] did you try using `data_files`? https://static.opendev.org/docs/pbr/2.1.0/#files [16:04:22] it tries something [16:04:24] error: can't copy 'horizon/templates': doesn't exist or not a regular file [16:05:56] maybe because dir? [16:06:55] yep, I used the wrong thing, used `horizon/templates = horizon/templates` and I think it wants `horizon/templates = horizon/templates/*` [16:07:48] I'm sure that I can enumerate every file (e.g in MANIFEST.in) but that leaves me wondering why upstream doesn't have to [16:08:55] good point [16:09:50] but we can always fall back on that! [16:10:07] I'm trying the same build with the 'upstream' branch which should == real upstream [16:10:17] (for horizon submodule) [16:11:07] do you have the upstream git url? [16:11:29] it's the submodule already ack [16:11:38] https://review.opendev.org/openstack/horizon [16:11:41] (it's in the readme) [16:12:05] Not Found [16:13:08] huh, https://opendev.org/openstack/horizon/src/branch/refs/heads/master also not found [16:15:29] hmm.... the commit 95e68e89c is not found [16:15:56] oh, I see, we are not just fetching, we are merging upstream [16:16:38] I would not claim that my git process is optimal [16:17:21] just trying to find the upstream code [16:18:39] git clone https://review.opendev.org/openstack/horizon [16:19:21] * andrewbogott replacing our horizon with a raw upstream checkout and rebuilding, just in case... [16:24:12] https://docs.openstack.org/horizon/latest/install/from-source.html#static-assets [16:24:22] maybe they are not included with the python package? 
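
[Editor's note: a small verification sketch along the lines of the checks described above, for confirming inside the built image whether the installed horizon is the patched fork and whether its templates made it in. The grep marker is the code comment quoted earlier; the target path matches the one in the log.]

    # Did any template files land in the install target?
    find /opt/lib/python/site-packages/horizon -name '*.html' | wc -l

    # Is it the WMF fork? Look for a string that only exists in a local patch:
    grep -rl 'panel render looks like garbage' /opt/lib/python/site-packages/horizon

    # What does pip think is installed in that target directory?
    python3 -m pip list --path /opt/lib/python/site-packages | grep -i horizon
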
[16:27:14] it is in pypi though [16:27:19] (the templates I mean) [16:29:09] running `python setup.py bdist` under `horizon` in our fork, gets the templates in the tarball [16:29:10] `copying horizon/templates/horizon/common/_workflow_base.html -> build/bdist.linux-x86_64/dumb/usr/local/horizon/templates/horizon/common` [16:29:24] ok, building with raw upstream code produces the same results (not templates) [16:29:33] So do not think the issue is in the horizon submodule [16:32:33] maybe you have to `python3 setup.py install` instead of `pip`? [16:33:11] worth a try, although I don't think that pulls dependencies [16:33:12] or create the package with setup.py, then install the tarball with pip? [16:35:00] does setup.py install take a target dir? [16:36:37] I guess, looking [16:37:22] --install-lib [16:38:13] doing `python3 setup.py sdist` and then `python3 -m pip install ... /path/to/tarball` works [16:38:51] (if I'm not mistaken) [16:39:06] huh, ok, trying... [16:40:03] so something like [16:40:11] bah, can't copy/paste from this VM for some reason [16:40:19] xd [16:40:45] * andrewbogott laboriously retypes [16:40:50] function pip_install { [16:41:00] cd "$@" && python3 setup.py sdist [16:41:15] python3 -m pip install --use-pep517 -c etc. etc. [16:41:17] } [16:41:58] I think so yes (make sure to remove the --upgrade) [16:46:23] hm.... running from the container build did not work though [16:46:54] I'm confused [16:49:10] hmm. the sdist run inside the build step did not get in the static files [16:50:56] oh... I think I might have found something [16:50:57] https://docs.openstack.org/pbr/latest/user/packagers.html#tarballs [16:51:04] SKIP_GIT_SDIST=1 set in the script [16:52:12] https://www.irccloud.com/pastebin/wde4DF7F/ [16:52:44] yeah, didn't work for me either [16:52:47] and removing that from the script, installs the html files for me using the setup.py trick... [16:52:56] let me retry from scratch [16:53:28] well I will feel very silly if that's it [16:53:59] it might even work with the old method xd [16:54:41] I'll try that while you try the incremental approach [16:55:08] templates are there [16:55:24] this definitely fits as an explanation [16:55:35] and grep finds the string [16:55:36] https://www.irccloud.com/pastebin/o5g9lNGO/ [16:55:49] * andrewbogott tries not to wonder how this ever worked before [16:56:37] I did not try to install all though, but installing just horizon gets something there [16:56:39] https://www.irccloud.com/pastebin/wsgqlerD/ [17:04:43] the upgrade I think messes up the horizon install, trying now with `--upgrade` on the horizon install [17:07:16] yeah, I got upstream horizon rather than local horizon, trying again with slightly different settings (and fresh base images) [17:07:46] the grep hits on the utils.py file in my local build too. [17:08:18] I'm really confused about why it works for me locally now, but apparently not anyone else [17:14:42] andrewbogott: I see in the paste that you are doing builds using `make restart` which is then using `docker compose up --build --detach`. I wonder if that is making something strange happen by reusing old layers from prior wmcs/horizon builds that you have cached locally. [17:15:08] could be, I'm doing a clean build now [17:16:06] I guess you reproduced the missing templates in the CI pipeline too though correct? 
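
[Editor's note: a quick way to confirm the SKIP_GIT_SDIST finding above. pbr normally builds the sdist file list from what git tracks, which is how non-Python files such as the templates get included; with SKIP_GIT_SDIST set it falls back to the plain setuptools/MANIFEST behaviour and they get dropped. Output directories are illustrative.]

    cd horizon
    # sdist with pbr's git-based manifest (the default):
    python3 setup.py sdist --dist-dir /tmp/sdist-git
    # sdist with the git manifest skipped, as the build script was doing:
    SKIP_GIT_SDIST=1 python3 setup.py sdist --dist-dir /tmp/sdist-skip
    # Compare how many template files each tarball carries:
    tar tzf /tmp/sdist-git/horizon-*.tar.gz  | grep -c '\.html$'
    tar tzf /tmp/sdist-skip/horizon-*.tar.gz | grep -c '\.html$'
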
[17:16:18] yeah [17:16:26] I'm running the checked-in code now except for [17:16:27] -export SKIP_GIT_SDIST=1 [17:16:46] and as best I can tell it isn't installing the patched Horizon, just one from pip upstream [17:17:39] the upstream pip one has the templates in though, or at least did in the test I did with a venv yesterday... [17:17:47] yep, that's correct [17:18:02] we're consistent, we can install our horizon w/out templates, or the upstream one with [17:18:15] and every change that I've tried today just flips between those two options [17:18:28] ooooh wait [17:18:33] I'm making a silly mistake, hang on [17:18:36] oh, so you think you are seeing the upstream installed and then overwritten by the local and the local not gathering the staic files [17:18:42] * andrewbogott going to push so as to avoid the silly mistake again [17:18:52] I still have --upgrade in place [17:19:07] without the upgrade though it also skips it's own package I think :/ [17:19:12] though it retains the first installed one [17:19:55] ooook trying again [17:20:01] going to walk around a bit while it builds [17:20:20] when it gets to octavia-dashboard for example, it shows: `WARNING: Target directory /opt/lib/python/site-packages/octavia_dashboard already exists. Specify --upgrade to force replacement.` [17:20:51] maybe from all scratch making sure the next package is not installed by the previous might work xd [17:21:03] or installing all at the same time? [17:21:10] I don't know where we would've gotten octavia-dashboard from before though? [17:21:23] It's not a dep of anything else as far as I know... [17:21:31] other than horizon we shouldn't be installing anything that is pulled in as a dep [17:23:45] * bd808 deletes local images and kicks off another build to see if local state is reproducable [17:24:11] * andrewbogott plans to bring the entire WMF to a halt with this one build issue [17:25:11] it's one of the things in the script that gets installed [17:28:27] dcaro: right, it's only installed once (from the script, but not as a dep elsewhere) so why would it hit an upgrade attempt? [17:28:47] that might be me retrying, let me test with a clean container [17:30:00] ok, now I have a build that has templates /and/ wmf horizon fixes. Going to turn this back over to gitlab [17:30:17] fresh local build. the `grep` test passes. /opt/lib/python/site-packages/horizon/templates/ exists. And yes lots and lots of "Specify --upgrade to force replacement." warnings [17:32:13] thinking about the long history here and wondering if there are bits of this left over from when it was building wheels to be installed later that is making things more confusing. [17:32:35] could be, although I did start over from scratch when I moved to blubber [17:32:54] following a blubber how-to guide [17:33:24] the blubber part here is almost nothing though. "run this shel script" [17:34:27] true... [17:37:15] rerunning from scratch (with a clean /opt/lib/python/site-packages), with sdist + pip install */dist/* seems to work for me [17:37:47] have not checked though if any of the dashboards does not get installed but pulled before getting to it [17:38:10] bd808: except when I moved to blubber I also moved to docker, right? And that was a real change. [17:38:54] yeah, it had been scap3 before which was where the wheel build step existed [17:45:07] ok, progress. Static resources still aren't aggregated properly, but at least they're there to be aggregated. 
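
[Editor's note: for the "is it stale docker layers?" question above, a sketch of forcing a genuinely clean rebuild; the tag, target and blubber file follow the build command quoted at the top of the log.]

    # Ignore all cached layers and re-pull the base images:
    DOCKER_DEFAULT_PLATFORM=linux/amd64 docker build --no-cache --pull \
        --tag horizon/production --target production -f .pipeline/blubber.yaml .

    # The compose equivalent, if the image is normally built via `docker compose up --build`:
    docker compose build --no-cache --pull

    # And to make sure no dangling layers or build cache survive between attempts:
    docker image prune -f && docker builder prune -f
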
[17:45:07] * andrewbogott goes back to the hot laptop [17:45:32] just commenting the SKIP_GIT_DIST also works for me (without sdist) [17:45:57] yeah, that's what I'm trying [17:46:10] now I need to get back to fixing the manage.py invocation [17:46:17] which hopefully works better now that it has things to manage [17:46:22] xd [17:50:24] anyhow, vme off [17:50:26] * dcaro off [17:50:29] cya tomorrow! [17:50:35] good luck with package wrangling [17:50:46] thanks, later! [17:51:40] combining all of the `pip_install` invocations into one also seems to work for me and gets rid of the "X already exists" warnings by having the solver do all of the work once [17:53:57] oh, that's a good idea. Will have to rearrange things quite a bit but that's fine... [18:25:25] last build actually deploys and runs! https://labtesthorizon.wikimedia.org/project/ [18:25:44] thank you bd808 and dcaro [18:26:40] bd808: how did you do 'pip install' for everything at once when you need to specify PBR_VERSION for horizon but not the other packages? [18:40:02] andrewbogott: I removed the default PBR_VERSION from the top of the script and just set PBR_VERSION to what the constraints picked for horizon. This means that all the custom installs end up with that horizon version number. [18:40:21] ok, will try! [19:53:32] it works :) [20:10:31] time for a happy dance? [21:09:43] 💃
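
[Editor's note: closing with a rough sketch of the "single pip install" plus PBR_VERSION arrangement described above. The constraints file name, target path, version number and package list are illustrative, not the actual installpanels.sh; per the log, every locally-built pbr package ends up stamped with the horizon version chosen here.]

    # One resolver run for all the local packages avoids the repeated
    # "Target directory ... already exists. Specify --upgrade" warnings.
    export PBR_VERSION=23.2.0   # hypothetical: whatever the constraints file pins horizon to
    python3 -m pip install --use-pep517 \
        -c upper-constraints.txt \
        --target /opt/lib/python/site-packages \
        ./horizon ./octavia-dashboard ./wikimediapuppetdashboard   # ...plus the other panel checkouts
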