[07:45:39] andrewbogott: thank you for the review/merge of the labs_lvm make-instance-vol script ( was https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054916 )
[08:15:01] hmm... we still have tools-sgebastion-10 running
[08:16:11] that one should have been shut off, no?
[08:19:29] oh wait, that's the login-buster one
[08:19:39] that's the controversial one :-P
[08:21:17] yep, it has not gotten any toolforge-* package upgrades in a bit
[08:21:19] :/
[08:24:09] hmm... apt update does not show the latest packages
[08:24:10] https://www.irccloud.com/pastebin/Y1otI25i/
[08:24:17] the package is there in the repo
[08:24:25] (did the publish also)
[08:25:45] it seems it does not like anything that does not start with 0.* ?
[08:26:00] I'm confused xd
[08:27:35] hmm... if I curl from the bastion for the Packages file, the new packages are there
[08:27:36] https://www.irccloud.com/pastebin/kfT25iYS/
[08:27:42] root@tools-sgebastion-10:~# curl http://tools-services-05.tools.eqiad1.wikimedia.cloud/repo/dists/buster-tools/main/binary-all/Packages | vim -
[08:28:30] any idea why it would not show up in `apt policy toolforge-jobs-framework-cli` or why it would not upgrade it?
[08:29:23] hmmm
[08:29:29] forcing the version says it's a downgrade
[08:29:31] https://www.irccloud.com/pastebin/kR1oEkRg/
[08:30:41] oh, now it shows all of them
[08:30:46] and upgrades to 16.0.5 :/
[08:31:45] *16.0.15
[08:31:53] https://www.irccloud.com/pastebin/8B0sQE9T/
[08:32:09] something weird happening there :/
[08:33:43] seems all upgraded now, 🤷
[08:35:17] I have no idea what happened!
[09:14:29] * arturo back in a bit
[10:02:07] there was an OOMkill alert for cloudcontrol2006-dev and I created T370401
[10:02:08] T370401: cloudcontrol2006-dev struggling with memory - https://phabricator.wikimedia.org/T370401
[10:36:09] 👍
[10:36:11] * dcaro lunch
[10:50:40] dhinus: I gave https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/13 another spin.
[10:50:51] and I replied to your comment
[10:51:42] I'm tempted to refactor, but maybe we should add more code, to see if a pattern surfaces more clearly, then do the refactor
[10:53:00] I agree, let's start lean, then refactor in a separate MR
[10:54:02] if this import works, I may add subnets next
[10:54:08] I have another thought (but that's also a refactor): maybe we should keep networking in a separate directory, so we can run "tofu plan" and "tofu apply" just for networks and it will be faster. having one "tofu plan" for all of codfw (or all of eqiad) is probably not going to scale
[10:54:18] we can keep everything together for now and split later
[10:55:32] yeah, we can solve scale problems when they arise
[10:56:01] I think we should solve them "just a little before", to avoid the refactor being too painful :)
[10:56:08] but I agree now is too early
[10:56:20] ok, I'll merge & apply now then, and see how it goes
[10:56:26] sgtm
[10:56:39] thanks
[10:58:24] it applied cleanly! \o/
[11:07:32] importing into the state was surprisingly smooth
[11:44:00] dcaro: is there a reason builds-api has `/v1/tool/{toolname}/clean` instead of `/v1/tool/{toolname}/builds/clean` ? it cleans up both build pipelines and old harbor images, but I think both of these can be thought of as 'build' resources
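For the apt puzzle from 08:24–08:35 above, a rough checklist of commands that usually narrow down why a freshly published package is not visible to apt. The host, repo URL, package name and version are taken from the log; the rest is a generic sketch, not a record of what was actually run on tools-sgebastion-10:

    # refresh the package lists and see what apt considers the candidate version
    apt-get update
    apt-cache policy toolforge-jobs-framework-cli   # installed vs candidate, and which repo each comes from

    # compare with what the repository itself publishes
    curl -s http://tools-services-05.tools.eqiad1.wikimedia.cloud/repo/dists/buster-tools/main/binary-all/Packages \
      | grep -A2 '^Package: toolforge-jobs-framework-cli'

    # look for pins or holds that could mask the newer version
    apt-cache policy         # pin priorities per source
    apt-mark showhold        # any packages held back?

    # if the candidate looks right, request the version explicitly (apt refuses silent downgrades)
    apt-get install toolforge-jobs-framework-cli=16.0.15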
[11:48:11] dhinus: this is a tiny refactor to remove the `_set` indirection and the `cloudvps` keyword https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/16
[12:09:55] blancadesal: not a strong reason, at the time it was meant to be a generic clean, for anything and everything to free space. For example, when you delete a build, it does not flush the images from harbor, but clean does (as a build can have an image, or not, or an image can have a build, or not, currently at least)
[12:16:00] arturo: "data" as a top-level folder can be slightly confusing, because tofu has a concept of "data resources"
[12:16:07] but not a big deal, we can refactor later
[12:19:51] dhinus: ack
[12:21:14] I tried running "plan" in eqiad and it seems to be broken. I'm not sure if it will also trigger an alert
[12:21:26] yes, it is broken, I'm working on a fix
[12:21:30] ok!
[12:21:33] the refactor was not clean :-(
[12:22:56] we should probably document in wikitech how alerting works, I'm not finding it
[12:23:00] I think there's a systemctl timer?
[12:23:27] I haven't looked yet
[12:23:54] yep, it's in cloudcontrol1007
[12:24:08] opentofu-infra-diff.timer
[12:24:16] dhinus: ok, the state is fixed, running plan now should work
[12:24:43] OnCalendar=*-*-* 3:10:00
[12:25:36] so it only runs once per day
[12:25:47] that should be good enough, no?
[12:27:10] I think so
[12:27:28] maybe every 12 hours? but I'd leave it at 24h for now
[12:27:46] 12 sounds also fine
[12:27:57] hmm tofu plan is still broken I think
[12:28:15] ?
[12:28:18] I just got
[12:28:19] No changes. Your infrastructure matches the configuration.
[12:28:30] eqiad?
[12:28:30] oh, wait, I'm running it for codfw1dev
[12:29:06] patch !17 is not merged yet
[12:29:46] ok I see the error now
[12:29:48] https://www.irccloud.com/pastebin/IaCMXpjv/
[12:30:02] let me fix it real quick
[12:32:04] that's where a CI running "tofu plan" on both envs would be useful
[12:32:17] and it shouldn't be hard to set up
[12:32:45] I thought this would be the fix, but apparently not
[12:32:46] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/19
[12:33:21] yeah I was thinking of something similar
[12:33:27] same error?
[12:33:34] yes
[12:34:22] ok, now I got
[12:34:24] https://www.irccloud.com/pastebin/psBLx1gm/
[12:34:51] maybe you can use eqiad1-r: []
[12:35:13] yeah, let's try
[12:35:29] now
[12:35:30] https://www.irccloud.com/pastebin/wOnM8Rft/
[12:35:37] LOL
[12:35:58] my bad, it needs to be P{
[12:36:00] {}
[12:37:02] ok
[12:37:23] No changes. Your infrastructure matches the configuration.
[12:37:25] that's it!!
[12:37:28] 🎉
[12:37:58] I'm merging the patch
[12:38:15] yay!
[12:38:25] re: CI, there's https://gitlab.com/components/opentofu
[12:39:04] maybe even before CI we need 2 cookbooks:
[12:39:32] 1) cookbook to run git rebase main; init / plan / apply on both deploys
[12:39:45] 2) cookbook that given a MR, tests an init / plan / apply
[12:39:58] (on both deployments)
[12:40:18] so, before CI, a similar workflow to what we have with puppet-merge in that sense
[12:40:20] you just gave me an idea: we can leverage the new "locking" functionality in spicerack!
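The locking idea here boils down to making sure only one apply runs at a time, since the backend has no state locking. A minimal sketch of that same mutual exclusion with flock on a cloudcontrol; the wrapper name and lock path are made up for illustration, and the real plan (discussed next) is to do this via a spicerack cookbook rather than a shell wrapper:

    #!/bin/bash
    # hypothetical wrapper, e.g. /usr/local/bin/tofu-apply-locked (name and path are illustrative)
    # serializes "tofu apply" runs on this host so two operators cannot apply concurrently
    set -euo pipefail

    LOCKFILE=/var/lock/tofu-infra.lock   # assumed location, not an existing file on the cloudcontrols

    exec 9>"$LOCKFILE"
    if ! flock --nonblock 9; then
        echo "another tofu run holds ${LOCKFILE}, try again later" >&2
        exit 1
    fi

    tofu plan -out=plan.out
    tofu apply plan.out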
[12:40:30] because we don't have state locking at the moment, and that can be a pain
[12:40:47] I see
[12:41:08] so if we tried to never use "tofu" directly and always go through the cookbook
[12:41:15] we ensure that 2 people are not applying at the same time
[12:41:23] (plan should be safe)
[12:41:30] that works for me
[12:41:45] to test an MR I'd kinda prefer the CI
[12:42:05] and given `tofu` is a thin wrapper, we could have a confirmation prompt if called with plan directly without the cookbook
[12:42:12] true
[12:43:03] the cookbook to test the MR can be created in a couple hours, can the CI be enabled also in a couple hours?
[12:43:19] actually, it can be the same cookbook, just with an additional branch parameter
[12:43:25] well it seems all is implemented in that gitlab component
[12:43:33] but I'm not sure if/how to add it to our gitlab instance
[12:43:48] dhinus: but we cannot just pull stuff across gitlab instances :-( we need to manually import stuff
[12:44:14] yeah that's what I meant, how does that work? do you need to clone that repo? or do you also have to "enable" it in some way?
[12:44:21] see also https://gitlab.com/components/opentofu#usage-on-self-managed
[12:45:08] we would need to mirror the repo (a clone, seems easy) then refresh the yaml files to point to the new URL
[12:45:36] I wonder if we did it already for some other components
[12:48:06] https://docs.gitlab.com/ee/ci/components/#use-a-gitlabcom-component-in-a-self-managed-instance <-- seems to be for premium gitlab
[12:48:08] do we have that?
[12:48:37] "Tier: Free, Premium, Ultimate "
[12:49:04] ah no you're right
[12:49:23] I see this
[12:49:25] https://usercontent.irccloud-cdn.com/file/6Rd2nsSQ/image.png
[12:50:00] yes sorry I was reading at the top of the page, rather than in the section you linked
[12:50:18] I'll create a ticket to track the cookbook work
[12:50:20] so "components" are available in Free, but "Use a gitlab.com component" is not :/
[12:50:28] * dhinus yells at gitlab
[12:50:49] I'm not exactly sure how they are blocking that, given the component is open source?
[12:51:15] I guess the scheduled mirror function would be part of the premium offering
[12:51:38] if that's the only difference, I'm fine with updating manually every few weeks/months
[12:52:01] we should probably ask in releng or slack #developer-experience
[12:52:21] as it seems a feature others will want to use (not the tofu one specifically, but components in general)
[12:53:26] * dhinus starts to think this will take more than 2 hours :P
[12:53:36] maybe let's do the cookbook first :)
[12:54:04] yeah :-P
[12:54:34] I will start a thread on slack anyway to get the ball rolling on the components thing
[12:54:35] T370414
[12:54:35] T370414: tofu-infra: create a cookbook automation to run tofu - https://phabricator.wikimedia.org/T370414
[13:00:27] * arturo food time
[14:00:42] arturo: the pipeline might be easier than I thought https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/pipelines/65359
[14:01:15] I skipped the "component" and just created a super-simple pipeline using the official OpenTofu docker image
[14:01:32] "tofu fmt" is working, and "tofu validate" is only failing because it's missing some vars
[14:02:04] for "tofu plan" the problem is how to access the openstack API from a GitLab runner
[14:02:33] but if we use the cloud-vps runners, that should not be hard
[14:14:28] cool!
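For reference, the checks in that simple pipeline come down to a couple of tofu invocations that can also be run locally before pushing. A sketch of roughly what such a job script runs (the actual .gitlab-ci.yml in the repo may differ), with the backend disabled so no state or credentials are needed:

    # formatting check only: non-zero exit if any file would be rewritten
    tofu fmt -check -recursive -diff

    # validate needs an init, but the backend (and thus the state) can be skipped
    tofu init -backend=false
    tofu validate

    # "tofu plan" is the step that needs real OpenStack credentials and network access,
    # which is why it is left to runners that can reach the APIs (with read-only creds)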
[14:14:39] the openstack APIs should be open to the internet
[14:15:30] ah ok, I thought they were not
[14:15:50] There are pending opentofu changes (which eventually cause puppet alerts). arturo, is that you?
[14:16:12] andrewbogott: where?
[14:16:13] andrewbogott: that should be fixed now, not sure why the alert triggered
[14:16:31] puppet on cloudcontrol1006 says:
[14:16:34] https://www.irccloud.com/pastebin/oZS0QfHm/
[14:17:03] oh, OK, it's not a pending change it's a merge conflict
[14:17:13] seems to be present on 1007 as well
[14:17:25] hm, and 2005-dev
[14:17:59] * arturo cleans by hand
[14:18:19] ah ok that's not the "tofu plan" that failed, but the git sync
[14:18:41] oh I know what happened here
[14:18:48] this is the force-rebase from the other day
[14:19:23] right
[14:19:26] makes sense :)
[14:19:53] we may get puppet to do `git pull --rebase` and other required commands to just ignore and flush local changes
[14:27:03] andrewbogott dhinus this should do it https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055241
[14:29:59] arturo: +1d
[14:30:05] thanks!
[14:41:35] arturo: "tofu fmt" and "tofu validate" working correctly! https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/pipelines/65370
[14:41:55] "plan" will probably work but needs credentials, I think we should create read-only ones for now
[14:43:49] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/20
[14:45:37] +1'd
[16:14:02] the task I mentioned during the meeting is probably better discussed here in IRC: T360488
[16:14:03] T360488: Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488
[16:15:00] dhinus: my latest experiments for that problem are at https://gitlab.wikimedia.org/toolforge-repos/bd808-buildpack-perl-bastion/-/merge_requests/1
[16:15:30] bd808: thanks. do you think that Anomie is still relying on login-buster?
[16:15:39] because it looks like login-buster will be shut off soon
[16:15:42] I 100% know they are, yes
[16:16:08] if you shut it off without a fix for Brad he will probably shut down AnomieBOT :((
[16:16:39] https://en.wikipedia.org/wiki/User:AnomieBOT -- 6 million edits on the main bot
[16:16:58] I will add it as a subtask so if we do break their use case at least we do it knowingly and not because we forgot about it :)
[16:18:32] The short term cheap fix would be putting all of the Perl back on the bastions. The mid term cheap fix would be T363033 and then I'll make a container to do the needful for Brad.
[16:18:33] T363033: [builds-builder] Support using custom buildpacks - https://phabricator.wikimedia.org/T363033
[16:19:23] I'm working on the container solution already, but it is looking more and more like I will end up stuck because of limitations of the current buildpacks.
[16:20:46] I don't think that supporting custom buildpacks is the way to go, that makes things even more complicated (and more tangled with the internals of the build process)
[16:21:17] dcaro: what's the alternative? re-engineering the bot?
[16:21:46] that's one
[16:23:11] as in, instead of using perl libraries to run things, use something else
[16:23:30] Brad's bot is like 15 years old. He's not likely to rebuild it from the start to accommodate your choice of runtime changes. It's just one bot, but it will end and end loudly if you force him into a strict workflow that you dictate
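Stepping back to the puppet git-sync breakage from around 14:17–14:19: the "ignore and flush local changes" idea amounts to resetting the checkout to the remote branch before pulling. A sketch of the manual cleanup only, assuming the tofu-infra clone lives somewhere like /srv/tofu-infra on the cloudcontrols (the path is a guess; the actual automation is in the gerrit patch linked above):

    cd /srv/tofu-infra                        # assumed checkout path
    git status                                # confirm it's only a stale merge/rebase, not real local work
    git rebase --abort 2>/dev/null || true    # clear any half-finished rebase left behind
    git fetch origin
    git reset --hard origin/main              # throw away local changes and conflict markers
    git pull --rebase origin main             # now a no-op, but matches what puppet would run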
[16:23:52] it does not need to be rebuilt from the start
[16:24:08] I wasn't thinking that /brad/ would re-engineer it :)
[16:24:08] and I'm not dictating xd
[16:24:41] another option is to offer an image with those libraries (or similar)
[16:24:50] that is what I think might be the best route
[16:25:45] that's what I'm trying to build in that MR way up there
[16:26:32] bd808: and that's greatly appreciated
[16:26:35] it needs perl bits, sshd bits, and the suite of `toolforge ...` cli interfaces
[16:27:16] this also links with a "bastion container" idea that has been floating around, as it's not only the image needed, but "direct" ssh access from outside
[16:27:46] Sorry that I'm so far behind -- our existing buildpacks setup supports arbitrary installation of packages into a container, right? Is that what that patch is doing, or is this something else?
[16:28:09] bd808: btw. you can retry https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/49 for webservice
[16:28:20] (that enables webservice to be run within a container)
[16:28:23] andrewbogott: not arbitrary. only packages from a specific subset of ubuntu 22
[16:28:42] yep ^
[16:28:47] which doesn't include the perl version anomie needs?
[16:29:14] it has perl libs yes. but not toolforge cli tools or an ability to configure sshd
[16:29:29] why does it need sshd?
[16:29:31] (curious)
[16:29:59] it's a remote workflow feature for him. let me find the short description...
[16:30:02] ah, I see
[16:30:47] dcaro: see https://phabricator.wikimedia.org/T360488#9654517 and the bit about the DBI driver
[16:31:28] that magic ends up needing an sshd to terminate the tunnel. `sshd -i` (inetd mode) can do that
[16:32:24] I remember diving into the DBI code at some point, I remember not finding support for what I wanted to check
[16:32:28] dcaro: thanks for pointing out your latest webservice fixes! I will try to test that out tonight/tomorrow.
[16:32:54] the DBI issue could be replaced by an ssh tunnel right?
[16:32:58] dcaro: dbi:Gofer is the bit that does the sshd tunneling
[16:33:29] yeah, a more complicated local setup of tunnels could replace that dbi:Gofer bit I think
[16:33:31] yep, I was checking how it did start the other side, and it did need the libraries installed on the remote end (unlike ansible for example)
[16:34:35] * dcaro slowly regaining the memories of that debugging session
[17:01:03] andrewbogott: I just noticed your comment to the effect that someone other than Brad could work on his bot. Technically, yes, but in practice I'd say not likely. If only for the reason that you would need to find a perl wizard who matches Brad's personal flavor of Perl development. ;)
[17:01:47] yeah, that makes sense. Although surely we have folks with latent perl skills kicking around.
[17:03:17] one of the most interesting things (to me) about AnomieBOT -- https://en.wikipedia.org/wiki/User:AnomieBOT/source
[17:03:25] it self publishes to wiki
[17:04:17] https://en.wikipedia.org/wiki/User:AnomieBOT/source/ChangeLog goes back to 2008
[17:08:21] huh, cool
[17:11:03] The bot's core also includes its own cron-like scheduler and job watchdog service. It's really a whole platform unto itself. I guess as one might expect from a 16 year old project built by someone like Brad.
[17:14:55] * dhinus offline
[17:15:28] bd808: have you also considered just porting it intact to a cloud-vps project?
[17:15:50] Or does it need external scheduling things?
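On the 16:32 tangent about replacing the dbi:Gofer magic with plain tunnels: the rough idea is to forward a local port to the database through a bastion and point DBI at localhost, so nothing Perl-specific has to run on the remote side. A sketch only; the ToolsDB hostname and port are assumptions, and Brad's actual setup (per T360488) relies on Gofer behaviour that a raw tunnel does not replicate:

    # forward local port 33061 to ToolsDB via the Toolforge bastion
    # (tools.db.svc.wikimedia.cloud:3306 is an assumption about the DB endpoint)
    ssh -N -L 33061:tools.db.svc.wikimedia.cloud:3306 <shell-user>@login.toolforge.org &

    # the bot would then use a plain local DSN instead of dbi:Gofer, something like
    #   dbi:mysql:host=127.0.0.1;port=33061;database=...
    # with the usual tool credentials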
[17:16:49] andrewbogott: I haven't talked with Brad about that, no. I don't know if he would find admining a Cloud VPS project worth his time or not.
[17:17:56] I would give it a 50/50 chance that he would just leave WMCS if he felt pushed out of Toolforge.
[17:18:55] 'k
[17:28:51] * andrewbogott drifts lunchward
[17:39:54] btw. we almost have a stable API; that means that if Brad wants to use that instead of the clients + kubectl, it's a really good option
[17:43:07] * dcaro off
[17:43:12] cya on monday
[17:56:43] thx for https://phabricator.wikimedia.org/T364761#9996014 . q: if I want to access the mysql tables of toolviews, should I just use the creds in the access-controlled yaml on the proxy servers, and is access via login.toolforge.org with its mysql executable an acceptable way to connect? or is something else preferred?
[18:37:06] dr0ptp4kt: `sudo become toolviews; sql tools` and then `use s53734__toolviews_p;`
[18:37:39] Or actually any tool can do that. The important bit is `use s53734__toolviews_p;`
[18:37:51] thanks bd808
[18:37:54] the `_p` is for public
[18:38:23] dr0ptp4kt: if you want to take over that tool I think I'd be glad to be rid of the responsibility.
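For anyone following along, the access path bd808 describes looks like this from a Toolforge bastion; the table names themselves are not in this log, so list them before writing any queries:

    # on login.toolforge.org (any tool account can read the public "_p" database)
    sudo become toolviews
    sql tools
    # then, at the mysql prompt:
    #   USE s53734__toolviews_p;
    #   SHOW TABLES;    -- the _p suffix marks the database as publicly readable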