[08:14:40] \o/ the pre-commit autoupdate worked, I might make it run every month instead though, so we don't have to regenerate the ci image once a week
[08:16:06] \o/
[08:16:22] once a month sounds good
[08:26:40] hmmm... I think we are having issues connecting to github from cloud, a build on toolsbeta is hanging on `[step-inject-buildpacks] 2024-04-15T08:25:24.891597419Z Connecting to github.com (140.82.112.4:443)`
[08:27:50] ci has also been having issues lately with getting rate-limited by github
[08:28:14] shall we be caching in our registry as much as possible for buildpacks stuff?
[08:28:37] it's when downloading a toml parsing tool
[08:29:43] we could use some kind of generic caching server
[08:35:27] hmpf, gitlab does not allow uploading artifacts to a release xd
[09:11:50] harbor is giving errors on tools
[09:11:55] looking
[09:17:25] need any help?
[09:20:22] harbor is behaving weird, I can't get to the UI (loading, some NS_ERROR_NET_INTERRUPT in the network tab on ffx), just restarted with docker-compose but the UI is still stuck, the build was able to push though
[09:21:43] not the trove database this time: /dev/sdb 9.9G 4.1G 5.4G 44% /var/lib/postgresql
[09:21:59] chrome says `ERR_HTTP2_PROTOCOL_ERROR`
[09:22:24] images are pushing ok so far, 3 in a row
[09:24:07] curl: `HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)`
[09:24:42] does it work when bypassing the front proxy and ssh tunneling to the harbor host directly?
[09:25:52] I was about to try that, yep :)
[09:27:00] yep, it does, I think it might be one of the options we tweaked on the instance proxy
[09:28:08] the UI is less urgent, so I'm less worried
[09:29:54] the tmpfs on proxy-03 has filled up again
[09:30:51] did you clean it?
[09:30:55] I restarted nginx, and now it works again
[09:31:00] ah, okok
[09:31:21] hmmm, interesting
[09:33:23] what changes did you do on the front proxy?
[09:33:53] I only remember the support for arbitrary domains happening recently
[09:39:38] the /tmp filling up is an old issue (alleviated by trying to avoid buffering in the proxy), but we have not yet found a long-lasting solution, it seems it still buffers stuff :/
[09:41:04] previous work on that: T354116
[09:41:04] T354116: Harbor uploads sometimes fail due to tmpfs space on project-proxy - https://phabricator.wikimedia.org/T354116
[09:42:51] I see
[09:45:17] taavi: were you able to check which subdirectory was taking the most space? I'm thinking that it might not just be the proxy buffering, but some other sort of cache that fills up
[09:46:25] i restarted it without checking, sorry
[09:46:36] np
[11:20:24] we have a new spicerack release, which means we can do this: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1018272
[11:22:32] \o/
[11:23:21] taavi: nice
[11:23:36] I just noticed this: T362521
[11:23:36] T362521: toolforge jobs api logs internal datetime error - https://phabricator.wikimedia.org/T362521
[11:24:57] arturo: that rings a bell, I saw that at some point when k8s would not send a date-prefixed log line (I think it was an empty line)
[11:25:12] bet that was because the pod was failing or something else, as in it was not expected to do that
[11:25:16] *but
[11:25:20] quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/74
[11:26:35] there might be a task around for that one (I don't remember if I opened one or not)
[11:26:53] dcaro: LGTM
[11:29:30] * arturo brb
[11:34:01] hm, seems like the new spicerack version updated click, which forces us to update black
[11:34:19] that's https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1019727 now
[11:36:15] oh ffs, there's a circular dependency
[11:36:20] we are running python >=3.9, right?
[11:36:35] cloudcumins are on bullseye, so 3.9 everywhere, yes
[11:37:28] so both https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1019727 and https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1019724 are needed for the CI to pass. I'll force merge the first one once both have been approved, as I don't want to squash them into one
[11:37:42] there are many errors on CI for the black change though
[11:38:39] essentially 11:35:11 pylint: no-name-in-module / No name 'ALERTMANAGER_URLS' in module 'spicerack.alertmanager'
[11:38:43] yes, those are all fixed by https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1019724 (which would not pass CI without the black change, that's the circular dependency I just mentioned)
[11:39:05] ah, okok, I did not see the circular dependency message xd
[11:41:23] we could have merged them together imo, I prefer each commit to be "big and usable by itself" rather than "small but breaking by itself", personal preference I guess
[11:42:13] the first commit did not break the CI
[11:42:40] the CI was broken due to the new spicerack release, which was immediately pulled in because our dependency in setup.py does not have an upper bound
[11:42:50] and you needed both of those commits to unbreak it after the release
[11:46:15] I understand that the failure did not come from a change in the code, but you still have one commit that would not work with the previous deps nor with the new deps, essentially one that will not work in any way
[11:55:33] fair
[11:56:35] it's not a big issue though
[12:07:43] why do we require https for each backend api? The api gateway handles tls termination for external traffic. wouldn't it be easier to do http internally, then restrict traffic to/from whichever internal services need it using network policies?
(not critiquing, honest question)
[12:10:02] blancadesal: using client certificate authentication with that is an easy way to block any arbitrary tool in the k8s cluster from impersonating the api gateway and bypassing the authentication that way. end-to-end encryption within the cluster is a nice bonus too
[12:24:46] how would a random tool impersonate the gateway?
[12:25:10] by crafting the HTTP headers
[12:28:24] I believe each individual API only checks the http headers to validate the user. So we need a robust way to ensure that only a known source has injected the header
[12:29:38] ssl-client-subject-dn: CN=user, O=toolforge
[12:29:49] ^^^ I think that's the HTTP header
[12:34:13] Is there no way to see that the request comes from an unauthorized pod in that case (assuming there are network policies restricting pod-to-pod communication)?
[12:56:09] the ideal would be to do both, tls + firewall rules, or even better, using a completely different network (a vlan, for example)
[13:03:38] * arturo food time
[14:05:36] dcaro: do you remember much about when you wrote all this backy2 wrapper code? I'm confused about the expired backup cleanup process... I'm seeing two different code paths for cleaning up expired backups.
[14:05:48] There's wmcs-purge-backups (which just tells backy2 "ok, now remove your expired backups" and is about three lines long), and then there's a lot of logic in wmcs-backup (which I think you added) which queries backy2 for expiration dates and then deletes things manually
[14:05:59] Any idea why we have both? Did the backy2 auto clean not work right?
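A minimal sketch of what a backend validating the `ssl-client-subject-dn` header quoted above might look like. The header format (`CN=user, O=toolforge`) is taken from the chat; the function name and the rejection rules are assumptions for illustration, not Toolforge's actual code.

```python
from typing import Optional

# Hedged sketch: extract the user from an ssl-client-subject-dn header
# such as "CN=user, O=toolforge". This is why the header must only be
# injectable by a trusted source (mTLS client certs): any pod that can
# set it freely can claim to be any user. Illustrative code only.
def user_from_subject_dn(headers: dict) -> Optional[str]:
    dn = headers.get("ssl-client-subject-dn")
    if dn is None:
        return None
    fields = dict(
        part.strip().split("=", 1) for part in dn.split(",") if "=" in part
    )
    # Only trust the DN if the organization matches the expected one
    # (an assumed check, mirroring the O=toolforge value in the chat).
    if fields.get("O") != "toolforge":
        return None
    return fields.get("CN")
```

With plain http and no mTLS, nothing stops another pod from sending that header itself, which is the impersonation concern raised above.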
[14:08:30] I think we wanted to delete more than just backy2 stuff (probably), it might also just be that we never got around to deduplicating the code (either because one was older and the new one never got to replace it, or because they were written in parallel)
[14:09:01] wmcs-backups was written with the idea of wrapping backy2 and anything else, and of caring about images/VMs specifically; backy2 is focused on rbd volumes, but has no notion of VMs/projects/quotas/etc.
[14:09:40] hmmmm that seems possible, I'll re-read the manual bits
[14:12:47] let me know if you want me to give it a read, it might refresh my memory xd
[15:06:24] this is what we just discussed: T362539
[15:06:25] T362539: [api-gateway] Explore using network policies to further secure traffic between toolforge api-gateway router and backends - https://phabricator.wikimedia.org/T362539
[15:08:53] taavi: thanks
[15:16:33] 👍
[15:21:01] 👍
[15:26:57] From the video that dhinus shared about harbor, they are thinking about it, but not sure if it makes sense https://usercontent.irccloud-cdn.com/file/pVn6AywG/image.png
[15:27:23] ok, so a "maybe" future thing
[15:28:13] they should add .deb support too :-P
[15:29:03] xd
[15:35:53] I think though that they mean "storing other artifacts in OCI registries", not creating other types of registries (e.g. helm supports pulling charts from OCI registries or from helm repos)
[15:54:56] * arturo offline
[16:56:39] this should improve the golang checks considerably https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/35, getting them under 1m :)
[16:56:44] * dcaro off
[16:56:45] cya tomorrow
[17:59:43] * bd808 lunch
[23:57:38] * bd808 off
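The "manual" expiry path discussed above (query backy2 for expiration dates, then delete expired backups yourself) can be sketched roughly as follows. `purge_expired`, its inputs, and the `remove` callback are hypothetical stand-ins, not backy2's or wmcs-backup's real API.

```python
from datetime import datetime, timezone

# Hedged sketch of a manual expired-backup purge: walk (uid, expire_at)
# pairs, delete the ones whose expiry has passed, and report what was
# removed. A wrapper like this can also hook in non-backy2 cleanup
# (VM/project bookkeeping) that backy2's own auto-cleanup knows nothing
# about, which may be why both code paths exist. Illustrative only.
def purge_expired(backups, now=None, remove=lambda uid: None):
    """backups: iterable of (uid, expire_at) pairs; expire_at is a
    timezone-aware datetime or None (never expires).

    Returns the list of uids that were removed."""
    now = now or datetime.now(timezone.utc)
    removed = []
    for uid, expire_at in backups:
        if expire_at is not None and expire_at <= now:
            remove(uid)
            removed.append(uid)
    return removed
```

The three-line wmcs-purge-backups path, by contrast, just delegates all of this to backy2's own expiry handling.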