[00:19:14] I have just tried 7 times in a row to get `webservice buildservice shell --buildservice-image containers/bnc:latest --mount none` to work. It fails with a timeout every time, which makes me think that either the registry is sad or the cluster is saturated.
[00:20:21] PEBCAK! The image name is wrong. It should be `tool-containers/bnc:latest`
[00:37:03] I have now submitted 4 separate patch revisions, each of them to correct the word 'hearbeat' to 'heartbeat'. Sounds like neither of us should really be trying to do things tonight.
[01:14:47] andrewbogott: agreed! I'm off to eat supper :)
[07:43:21] btullis: morning! we're getting constant puppet change alerts from the clouddumps boxes and they seem to be about file permissions on /srv/dumps/xmldatadumps_airflow_temp/xmldatadumps/public, is that something you know about?
[08:20:36] taavi: just pinged in slack #data-platform-sre about the same thing :)
[08:23:38] this is the sshd config change i mentioned yesterday: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146661
[08:28:57] +1d
[08:34:02] good idea
[09:01:32] hmm, the tools project is almost out of quota
[09:02:06] i was about to create new prometheus nodes for T393697 but that's not going to work
[09:02:06] T393697: Rebuild Toolforge Prometheus nodes in v6-dualstack network - https://phabricator.wikimedia.org/T393697
[09:30:49] you know you're writing too much puppet when you try to put opentofu files in a manifests/ subdirectory
[09:37:54] anyway, https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/29
[09:51:31] taavi: LGTM, may need a rebase (I just merged a previous MR)
[09:51:42] already done :-)
[09:52:44] great
[11:45:32] next: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/30
[11:46:45] taavi: for your consideration https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/68
[11:50:46] taavi: in your patch, what would be the workflow to reuse the volume for another VM, let's say use volume 2 for VM 3?
[11:51:23] I see, they are detached, never mind
[11:51:28] (detached in the map)
[11:52:24] arturo: moving it to some other vm declaration but keeping the name the same
[11:52:43] i was also thinking about creating a separate volumes = {} argument, and then referencing them just by name
[11:52:59] arturo: did you intentionally merge that for me?
[11:53:12] Oh sorry, I meant to click 'approve' :-(
[11:53:35] is it OK that it merged? otherwise I can revert
[11:53:59] i was about to rename the volume from -2 to -a to make the naming independent of VM numbering, but otherwise it's fine
[11:54:13] I didn't run the apply stage
[11:54:29] please send the follow-up, then run the apply stage
[11:54:31] sorry for the noise
[11:56:40] https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/31
[11:57:16] let's see if I can hit the right button this time
[11:57:25] done
[12:53:13] arturo: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/68#note_9765
[12:53:24] also https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/32
[12:58:21] LGTM
[13:30:30] I'm working on cleaning up our sshd config. There is an option profile::ssh::server::enable_hba which seems to have been added by Yuvi a decade ago for some infrastructure hosts of the legacy Toolserver stack. cloud.yaml also defaults to false, could you please check if there is any remaining place in cloud where this is needed?
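(One hedged way to do the check being asked for above — a minimal sketch assuming a local clone of operations/puppet and that any remaining users of the option would show up in Hiera or the module tree; the grep paths are assumptions, not confirmed locations:)
    git clone https://gerrit.wikimedia.org/r/operations/puppet && cd puppet
    # list any remaining references to the option outside the sshd templates themselves
    grep -rn 'profile::ssh::server::enable_hba' hieradata/ modules/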
[13:30:42] otherwise I'd remove it from the sshd templates
[13:31:36] moritzm: it's still set as true in toolforge, but I'm 99% sure that's no longer actually in use after the grid engine decom
[13:32:47] https://phabricator.wikimedia.org/T98714 seems to align with that
[13:33:05] could we disable it next week and see if anything breaks?
[13:33:31] I'm working on https://phabricator.wikimedia.org/T393762 and before splitting this up, I'd like to remove all the historical baggage
[13:34:36] yeah, I'll flip it off in toolsbeta now and we can do the same on tools next week?
[13:34:37] maybe you could discuss this in your next weekly team meeting or so?
[13:34:43] sgtm
[13:35:57] speaking of sshd config, do we still have some legacy authentication methods or similar that we need to disable at some point? I have a vague memory of something like that but can't find a task for it
[13:36:15] yeah that was used for grid engine and is no longer needed
[13:36:27] (the hba thing)
[13:40:39] there are two more options that I'd like to phase out soon as well:
[13:41:12] the custom setting for "MACs" (it was only for some legacy clients)
[13:42:31] and we should disable agent forwarding for cloud as well
[13:42:41] it's been disabled in prod for many years already
[13:43:06] and then going forward, starting with trixie, I'll switch the config to a scheme where we follow the sshd defaults
[13:43:34] and only ship the WMF-specific settings via /etc/ssh/sshd_config.d/wikimedia.config
[14:07:44] sounds good
[15:40:08] * arturo offline
[15:52:09] bd808: have a few minutes to help me understand a thing about pip/setup.py behavior? The repo in question is https://review.opendev.org/openstack/octavia-dashboard
[15:52:19] (I can't remember if you're working Fridays these days, ignore me if not)
[16:21:27] asking the question itself usually helps with getting answers faster
[16:21:43] The question is: why do I need to do this? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/commit/417078e942770d29c959277279aba48bb3aad020
[16:22:53] pip totally fails to install the static dir. But if I do a local pip install and then run the docker build process it works properly -- so I assume something is out of order, but I don't understand setup.py well enough to even know where the script is that is out of order.
[16:24:24] generally I would expect to see a MANIFEST.in file specifying any non-*.py files that need to be included, like in wmf-proxy-dashboard https://gerrit.wikimedia.org/r/plugins/gitiles/openstack/horizon/wmf-proxy-dashboard/+/refs/heads/main/MANIFEST.in
[16:25:28] yep, you're right, all the other submodules that I'm using have that
[16:26:00] I will see if I can get that submitted upstream
[16:27:29] andrewbogott: are you covering clinic duty this week? I think a.rturo ended yesterday and you're next on the list
[16:27:48] yes! Or at least I plan to. Are there immediate/pending issues?
[16:27:58] not super urgent, but T394520
[16:27:58] T394520: Request increased quota for campwiz Toolforge tool - https://phabricator.wikimedia.org/T394520
[16:28:34] there's already a patch, it needs merging and "deploying", which can be confusing if you've never done it before
[16:28:47] so perhaps a good chance to test the docs? :)
[16:29:20] sure, I'll look at that in a bit
[16:29:36] thanks!
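(A minimal sketch of the MANIFEST.in approach described above, assuming a setuptools-style layout where the static assets live inside the Python package; the package and directory names here are illustrative guesses, not taken from octavia-dashboard:)
    # hypothetical MANIFEST.in entries declaring non-*.py files to ship with the package
    printf '%s\n' \
      'recursive-include octavia_dashboard/static *' \
      'recursive-include octavia_dashboard/templates *' > MANIFEST.in
    # rebuild and reinstall, then check whether the static dir now lands in the install
    pip install .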
also I see a couple of tools-nfs workers with procs in D state, they just need a restart
[16:30:02] (when I say "I see" I mean there are alerts in https://alerts.wikimedia.org/?q=team%3Dwmcs)
[16:30:20] I meant to reboot them myself but I was sidetracked reviewing T394516
[16:30:21] T394516: [builds-api] Main branch is used even when a different "ref" is specified - https://phabricator.wikimedia.org/T394516
[16:31:46] yeah, I was eyeballing those workers too
[16:36:33] dhinus: is there more to it than 'merge the patch, run the cookbook'?
[16:38:17] thanks taavi, this seems to fix the build (although I don't yet know if it actually makes the dashboard work) https://review.opendev.org/c/openstack/octavia-dashboard/+/950205
[16:41:24] andrewbogott: no, running the cookbook should be enough, I also haven't done it in a while
[16:42:25] andrewbogott: you can easily check if it worked with "sudo become campwiz; toolforge build quota"
[16:46:03] hm, I imagine the 'component' is maintain-harbor but that does not work at all
[16:47:35] taavi, chuckonwu, is this somehow wrong? "sudo cookbook wmcs.toolforge.component.deploy --cluster-name tools --component maintain-harbor --task-id T394520"
[16:47:35] T394520: Request increased quota for campwiz Toolforge tool - https://phabricator.wikimedia.org/T394520
[16:48:48] much easier to say based on the error you're getting?
[16:49:20] The error is not great
[16:49:25] https://www.irccloud.com/pastebin/JOpIfjjt/
[16:49:51] I can debug, just wondering if I'm somehow missing the point of that cookbook entirely
[16:49:58] > INFO: git checkout branch 'bump_maintain-harbor' on /tmp/cookbook-toolforge-k8s-component-deploy-ilrajlojgr/toolforge-deploy
[16:50:06] i don't see a branch called that in toolforge-deploy
[16:50:50] > "git branch in the source repository, will use 'bump_{component}' by default (force it to be 'main' "
[16:50:50] > "if you want to deploy main)"
[16:51:02] seems like someone has tried to make the script a bit too clever
[16:51:17] huh
[16:51:47] * andrewbogott adds --git-branch main
[16:52:04] ah, I think the script expects you to run the cookbook first, merge later. I always forget!
[16:52:33] but also the branch will be "bump_{component}" only if it's an automatic branch, not if it's a manual one like here
[16:52:44] adding "--git-branch main" should do it
[16:52:55] the quota seems to have updated as intended. Hopefully none of these other things the cookbook is doing are damaging
[16:53:03] as a side note, https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/branches/all could use some cleanup
[18:03:49] yep, the flow usually is "create MR on component" -> "on merge, a `bump_{component}` branch is created in toolforge-deploy" -> deploy the toolforge-deploy branch (toolsbeta -> tools) -> if it works ok, merge the toolforge-deploy branch
[18:03:58] that ensures that the merged commit is a commit that has been tested to work
[18:38:57] >500 deploys already with the toolforge-deploy repo \o/
[18:39:01] https://www.irccloud.com/pastebin/zknlO0zx/
[18:50:16] that's ~1 deploy/day since we started with the `bump` commits, not bad :)
[20:18:53] In case anyone wants to mess with it, there's now a network -> Load Balancers panel in labtesthorizon. And I hacked up a proxy to point to a round-robin loadbalancer that will balance between an Apache welcome page and an Nginx welcome page: https://roundrobin.codfw1dev.wmcloud.org/
[20:49:09] andrewbogott: sorry I missed your ping until now.
It looks like t.aavi helped you find something and now you are in the upstream "but why is this needed" hell. Sorry :/
[20:49:30] lbaas is exciting :)
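(For completeness, a sketch of the commands that ended up working for the campwiz quota bump, pieced together from the messages above; the usual flow deploys to toolsbeta before tools, and --git-branch main is only needed here because the toolforge-deploy change had already been merged manually rather than through an automatic bump_{component} branch:)
    # deploy the updated maintain-harbor component to the tools cluster
    sudo cookbook wmcs.toolforge.component.deploy --cluster-name tools --component maintain-harbor --git-branch main --task-id T394520
    # then, as the tool, confirm the new build quota is visible
    sudo become campwiz
    toolforge build quota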