[00:02:27] T354579 is ready when someone has a chance to run a quota increase
[00:02:28] T354579: Increase disk quota for math - https://phabricator.wikimedia.org/T354579
[01:08:11] * bd808 off
[14:53:44] blancadesal: Raymond_Ndibe dhinus taavi komla I'll use https://docs.google.com/document/d/12fzFPE96KpHMXqdZzrGH6WApqDa5ZxNjL3iQE6KSAPc/edit as a template for the new document, I think that's where the component idea started (please forward me any other places where we might have had some discussion about it if you remember/find them)
[14:56:23] looks good, I think it would be great if we could define the next user story we want to implement. is it "push to deploy"? what does the MVP version look like from the user perspective? how does the component API help in achieving that?
[15:03:46] that was kinda defined, yes (push to deploy): the user side was just configuring it on gitlab, and having a webhook that, when called, triggers the build + deploy. The problem is that we don't have anything that stores the build configuration (though you can build different git urls/images) and we don't have any API that's capable of deploying (and we probably don't want the build service API to do it directly)
[15:04:00] so there are some dependencies there (in my mind at least)
[15:04:47] nope
[15:04:51] some discussion here T341065
[15:04:52] T341065: [builds-api] Automatically deploy the webservice when the image is built - https://phabricator.wikimedia.org/T341065
[15:05:44] it's the one that triggered the toolforge re-architecture decision request xd
[15:31:32] taavi: once moved, will the cloudrabbit node be able to cluster with the others or will it be a new cluster?
[15:33:14] andrewbogott: both are possible, which one is easier/better?
[15:33:56] hmmm it'll have a different fqdn? So clients will need to know about the change?
[15:34:06] (I assume that's the point of the move)
[15:36:28] yeah, but the `rabbitmqXX.eqiad1.wikimediacloud.org` aliases are not changing
[15:37:08] Oh! In that case I'd say just join it back to the original cluster after the move
[15:37:33] If that goes poorly we can do something less elegant with incremental puppet patches adding/removing nodes
[15:40:22] taavi: some openstack clients will likely freak out any time you touch a rabbitmq node, but we can cross that bridge when we come to it. And that's what wmcs.openstack.restart_openstack is for.
[15:43:39] https://gerrit.wikimedia.org/r/c/operations/dns/+/989196/
[15:45:04] might work! Let's try it :)
[15:57:34] rabbitmq_detect_partition.service didn't seem to like that :/
[15:58:39] started the decom cookbook
[16:53:06] quick review to unblock .net? https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/27
[16:54:02] it's adapting, for the dotnet one, the same "shim" that we use to wrap heroku-type buildpacks in the cloud-native API (I was using a modified version before, which had some bugs)
[16:58:04] is cloud/toolforge/buildpacks/dotnetcore-buildpack a modified version of the upstream one?
[16:58:19] and what is the move_to_api_0.10 branch?
[16:59:18] I trust that it works but I'm a bit confused by the layers :D
[17:00:31] it's a clone of it, not modified (we were using branch move_to_cnb, which modifies it to adapt to the cloud-native API). The move_to_api_0.10 branch is the same code as upstream (under `target`) with some scripts that adapt from the cloud-native API to the heroku one (got it from https://buildpack-registry.heroku.com/cnb/jincod/dotnetcore, which is what heroku uses)
[17:02:25] So the main issue is that most community heroku buildpacks are built using the heroku buildpack API (bin/compile, etc.) and not the cloud-native API (bin/build); there are some more differences in the parameters and envvars passed around. So we can't use the community heroku buildpacks directly, we need that "shim" layer that heroku adds when you download them from that url
[17:03:53] what do you mean "(under target)"?
[17:04:53] in the `move_to_api_0.10` branch, there's a directory called `target` that has a copy of all the files in the master branch of the upstream buildpack
[17:05:00] oh ok
[17:05:41] the bin/build script just sets the envvars properly, calls target/bin/compile, and then fixes a couple of other things afterwards (it's the same for all the custom buildpacks that we have)
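To make the layering concrete: a minimal sketch of what such a shim bin/build could look like, based only on the description above. This is not the actual script from the move_to_api_0.10 branch; the cache-layer handling in particular is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical shim sketch -- not the real repo contents.
# The cloud-native lifecycle invokes bin/build <layers> <platform> <plan>
# with the app source as the working directory; heroku-style buildpacks
# expect bin/compile <build-dir> <cache-dir> <env-dir> instead.
set -euo pipefail

layers_dir="$1"
platform_dir="$2"
# plan_path="$3"  # unused by the heroku-style buildpack

# Reuse a layer directory as the heroku cache dir, and pass the platform's
# env dir through as the heroku ENV_DIR.
mkdir -p "${layers_dir}/cache"
"$(dirname "$0")/../target/bin/compile" "$PWD" "${layers_dir}/cache" "${platform_dir}/env"
```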
[17:06:00] I see, and those files in bin/ come from the buildpack-registry.heroku.com url
[17:06:24] if you could add a note explaining this to the repo (maybe a README?) it would be great :)
[17:06:47] but maybe let's wait to see if this actually works :D
[17:06:52] yes, though I had to modify them a bit, as the ones coming from buildpack-registry.heroku.com are only compatible with cloud-native API 0.4 (and the builder supports >= 0.6)
[17:07:19] there's some info here https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Build_Service#Adding_a_new_buildpack
[17:08:42] cool, didn't see that page!
[17:08:51] hmm, I'll add that to the project description :)
[17:10:21] I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/988669 is finally working :P
[17:15:07] great!
[17:15:13] just updated the repo descriptions, not sure I'm happy with the wording though, feel free to rephrase
[17:15:26] \o/
[17:16:59] I think it's fine! Are all repos in https://gitlab.wikimedia.org/repos/cloud/toolforge/buildpacks shimmed in exactly the same way?
[17:17:42] the apt one is a bit different
[17:17:46] the others are the same
[17:18:16] (I might do the same with the apt one, but we need some fixes that are not upstream, see https://phabricator.wikimedia.org/T348746)
[17:18:16] I'm also confused by the fact that the dotnet repo has both "move_to_010" and "move_to_cnb" branches.
[17:19:44] I'll remove the move_to_cnb one once I deploy the fix I just merged
[17:21:54] ok I will ignore it then :)
[17:48:33] * dcaro off
[17:48:36] cya tomorrow
[19:22:44] * bd808 lunch
[21:08:20] dhinus, taavi, either of you still working by chance? I'm trying to add a new prometheus metric and having the same problem I always have...
[21:08:37] https://www.irccloud.com/pastebin/i5RLqiiM/
[21:08:44] but can't find it in grafana
[21:09:01] Rook, have you done prometheus/grafana things recently?
[21:09:16] no
[21:10:14] ok! Me neither apparently
[21:11:48] andrewbogott: Jan 09 21:11:16 cloudservices1006 prometheus-node-exporter[310540]: ts=2024-01-09T21:11:16.654Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=designateleaks.prom err="failed to parse textfile data from \"/var/lib/prometheus/node.d/designateleaks.prom\": text format parsing error in line 1: invalid metric name in comment"
[21:12:26] I think `# HELP https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks` needs to be `# HELP cloudvps_designateleaks https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks`
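The suggested fix follows the node-exporter textfile format, where the `# HELP` comment must name the metric before the free-form help text. A minimal valid designateleaks.prom along those lines would be (the `# TYPE` line and the sample value are illustrative, not from the real file):

```
# HELP cloudvps_designateleaks https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks
# TYPE cloudvps_designateleaks gauge
cloudvps_designateleaks 3
```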
[21:12:38] are you using the python library or hand-crafting the .prom file?
[21:13:31] hand-crafting at the moment
[21:13:35] where is that error message?
[21:14:10] systemd journal for prometheus-node-exporter.service
[21:15:34] * andrewbogott tries regenerating that file
[21:15:47] ...which takes forever because openstack apis are so slow :(
[21:16:26] taavi: what am I doing wrong now?
[21:16:34] root@cloudcontrol1006:~# journalctl -u prometheus-node-exporter.service
[21:16:34] -- No entries --
[21:17:09] try cloudservices1006
[21:17:32] yep, just caught that
[21:24:15] well, still can't find it in grafana but at least the exporter isn't complaining anymore
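A hand-crafted .prom file can also be validated before the exporter picks it up, assuming promtool is available on the host (the path is the one from the error above):

```bash
# Lints the file against the Prometheus text exposition format,
# catching errors like the malformed HELP line before node-exporter does.
promtool check metrics < /var/lib/prometheus/node.d/designateleaks.prom
```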
[22:48:41] Anybody around who knows how to debug "BuildClientError: The build service seems to be down – please retry in a few minutes."
[22:49:08] --debug?
[22:49:53] "Error: no such option: --debug" :)
[22:50:49] It does dump out some logging before that, but nothing that looks useful.
[22:51:39] hm indeed, I don't think the debug flag is parsed by builds-cli correctly
[22:52:09] and `[tools.majavah-test@tools-sgebastion-11 ~] $ toolforge build list` works fine for me
[22:53:10] `toolforge build start https://gitlab.wikimedia.org/toolforge-repos/gitlab-account-approval` crashes for me. And Hawkeye7 reports this has been happening to him for 30 minutes or so
[22:53:46] maybe https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Build_Service has something to help...
[22:54:40] managed to get some debug at least https://phabricator.wikimedia.org/P54583
[22:56:23] the build-api logs just show a request that returned a 503, but don't give any extra details
[22:57:22] my best guess is that david's last builds-builder deploy broke something, but I have no clue how to verify that
[23:00:27] "No resources found in buildpack-admission namespace." sounds bad if it is up to date
[23:04:52] is it straightforward to revert the last deploy?
[23:05:10] probably reverting https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/commit/d5c4af0e2867e5d17f62dbd480ab08e029b2b4de ?
[23:05:34] I honestly don't know how any of this new k8s stuff works. I have no cookbook fu
[23:05:36] should be as simple as reverting the toolforge-deploy commit and running the deploy cookbook.
[23:05:41] let's do it unless taavi is on the trail of a fix
[23:06:20] nope, it's late enough that I'm just getting offline and worrying about it tomorrow I think
[23:06:25] taavi: do you have time + attention to revert + deploy?
[23:06:42] no :/
[23:07:21] ok! I will figure it out
[23:08:48] taavi: how about a +1, have time for that? https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/167
[23:09:20] or do you think I should leave it alone until tomorrow? I'm assuming that if it's totally broken then I won't make it worse by reverting
[23:10:31] * andrewbogott self-merges
[23:11:42] bd808: I'm trying to roll back, will be relying on you to tell me if it helps
[23:12:04] andrewbogott: okey doke. I can try a build when you think you have deployed things
[23:12:34] hm, apparently the deploy cookbook doesn't work
[23:12:36] for me at least
[23:13:28] "Error: pulling from host tools-harbor.wmcloud.org failed with status code [manifests 0.0.84-20240105143530-d419fb15]: 401 Unauthorized"
[23:13:45] unless the actual problem is harbor and not this component...
[23:15:33] andrewbogott: you may have hit the problem. I can't log in to https://tools-harbor.wmcloud.org/. It dies with a "core service is not available" message
[23:16:14] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Build_Service#Harbor
[23:16:26] I'm trying to kick harbor but systemd says it doesn't know what nginx is...
[23:17:05] * andrewbogott tries docker-compose restart
[23:17:38] can you log in now?
[23:17:47] * andrewbogott remembers that ta.avi was doing ldap things earlier
[23:18:32] still tells me 'core service is not available'
[23:18:42] andrewbogott: no joy. same "core service is not available" response in the web gui
[23:18:54] and that's new behavior, I take it?
[23:19:08] I logged into it sometime last week
[23:20:07] https://www.irccloud.com/pastebin/VR2VWSoY/
[23:20:18] don't know if it's stuck or just slow
[23:20:46] going to do a proper stop/start
[23:23:16] something like `docker-compose logs --tail=100 -f harbor-core` might tell you what's happening in the container
[23:24:10] meanwhile trying to figure out how to revert a revert...
[23:24:16] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/167/commits
[23:25:16] lots of harbor services are trying to start, failing, and restarting. including nginx
[23:25:21] nothing in logs
[23:25:43] I mean, there aren't logs
[23:25:51] https://www.irccloud.com/pastebin/Z9RLvkDU/
[23:27:25] bd808: think I should stop and reboot the VM? I don't have a lot of good ideas here :)
[23:28:45] andrewbogott: I don't know anything about this stuff really, so your guess is as good as mine. You can try reboots, or we can just write up a task and wait for d.caro I guess.
[23:29:30] I'll give it one more go.
[23:29:51] Are you any handier with the gitlab ui? I don't understand why the original commit could be reverted in the UI but the revert commit can't
[23:30:26] these admin docs could use some more info, but that's sort of common until others are helping fix things
[23:30:32] andrewbogott: I'll look :)
[23:30:37] thanks
[23:31:37] andrewbogott: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/168 -- I made that from https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/168 and then options -> revert
[23:32:02] bah. second link should have been https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/commit/12cbadc5f8225b244a4d73a2f5771756fdebcf89
[23:32:23] yep, that's the page I couldn't find...
[23:32:25] thank you!
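The same revert-of-a-revert is also possible from the git CLI when the UI doesn't offer the option; a generic sketch, not the exact commands that were run (the SHA is the revert commit linked above, and the branch name is hypothetical):

```bash
git clone https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy.git
cd toolforge-deploy
# Reverting the revert commit re-applies the original change.
git revert 12cbadc5f8225b244a4d73a2f5771756fdebcf89
git push origin HEAD:revert-the-revert  # then open an MR from this branch
```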
[23:33:13] harbor is behaving slightly differently post-reboot but still not starting
[23:33:20] So I think it's time to write this up and leave it for tomorrow
[23:33:45] I'll start a task
[23:36:46] "Jan 9 23:35:32 172.18.0.1 core[628]: 2024-01-09T23:35:32Z [FATAL] [/core/main.go:180]: failed to initialize database: register db Ping `default`, failed to connect to `host=5ujoynvlt5c.svc.trove.eqiad1.wikimedia.cloud user=harbor database=harbor`: server error (FATAL: the database system is starting up (SQLSTATE 57P03))"
[23:36:57] hmmmm
[23:36:58] That's from /var/log/harbor/core.log
[23:37:43] * bd808 found that log via crumbs on https://goharbor.io/docs/2.0.0/install-config/troubleshoot-installation/
[23:37:53] ok, I'm kicking the database...
[23:38:11] which is postgres running on Trove
[23:38:55] joy! So a docker container in a vm managed by hopes and prayers :)
[23:40:16] 'the database system is starting up'... what does that mean I wonder
[23:40:52] oh, hm, it means
[23:40:52] psql: error: FATAL: the database system is starting up
[23:41:30] I for sure don't know how to debug postgres
[23:44:02] I'm seeing things on SO that sound like this is a normal postgres client message while the server is actually starting up, but like you I don't know how to actually tell whether it really is starting
[23:46:30] ok, yes, according to the logs I've just made things worse by kicking it
[23:46:33] https://www.irccloud.com/pastebin/vezNi5Gx/
[23:46:42] but that doesn't explain why it freaked out in the first place
[23:47:03] OH! The drive is full
[23:47:10] Ok, that I think I know how to fix, stay tuned
[23:47:49] * andrewbogott really counting on Trove to not screw this up
[23:54:14] bd808: ok, I'm not seeing what I expected. lsblk says
[23:54:16] https://www.irccloud.com/pastebin/5zHnGiTp/
[23:54:34] But that sdb should be a 10G drive with a 1G partition, shouldn't lsblk show the whole thing?
[23:55:24] * bd808 manifests as a rubber duck in this scenario
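A sketch of how the two open questions (is postgres really still starting up, and what exactly filled up) might be checked from a shell, assuming access to the Trove guest, which Trove may not provide; the hostname is the one from the harbor log above:

```bash
# Reports "accepting connections" once postgres finishes recovery,
# vs "rejecting connections" while it is still starting up.
pg_isready -h 5ujoynvlt5c.svc.trove.eqiad1.wikimedia.cloud -p 5432
# Confirm which mounted filesystem actually filled up.
df -h
# Compare raw device sizes against the partitions lsblk reports.
lsblk --bytes
```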