[09:38:05] elukey: I see you went down a nice rabbit hole with APUs and "blob unknown"
[09:42:04] elukey: Overall OK, all of this is needed, but I disagree with the backend for the /ml part being S3/Ceph from the beginning.
[09:42:31] There's a task that Ben authored where I explain my reasoning; it's about mitigating the risk of ending up with a 2nd-system effect
[09:43:03] https://phabricator.wikimedia.org/T413080
[09:44:07] put simply, we should avoid the eventuality that we fail to migrate other things to APUs but end up keeping it just for the one blocked use-case, because in that case we end up with duplicate maintenance costs and an inability to migrate away from the two-system situation
[09:45:10] the mitigation being simply that we prove, from the get-go, that this works with the MediaWiki images, which are the other big space consumer we have.
[10:00:06] akosiaris: o/
[10:00:28] I've read the task, I had missed your comments until now
[10:04:32] I partially agree with what you proposed, but my reasoning for APUs and ML stems from these points:
[10:04:32] 1) We are not adding a big new system; it is effectively the same one (offering the same API/endpoints thanks to your refactoring work), but for some prefixes we are choosing a different storage backend.
[10:04:32] 2) Swift is already deprecated and highly problematic across the board; we have spent countless hours debugging its issues.
[10:04:32] 3) Not trying a new solution for ML would probably mean blocking them for a long time, with the high risk of the team rightfully choosing to try their own solutions.
[10:05:59] so it was more a way to properly test APUs with two different use cases, and then decide
[11:05:36] For 2) I am aware and in agreement (it's a fact after all), but we also haven't proven that S3/Ceph APUs will actually work fine. And if it doesn't, we can still fall back to Swift for the time being. For 1) I'd argue that while we do our best to abstract away aspects of the system, it does remain a new system. As for 3) I am aware of that risk, but I am more worried about the risk of ending up maintaining two backends instead.
[11:21:27] yeah, so part of my reasoning for including ML is to test bigger images and see how it goes, as an extended testbed together with MediaWiki. The vLLM image is designed so that the layers are not more than 4 GB compressed, so that it can go to Swift as well if needed. New images may not respect this constraint, so in that case, once pushed and used in production, we may become stuck, but I count on having some form of answer about APUs/S3 before that (in agreement with ML)
[11:22:04] so at any given point there is the possibility of falling back to Swift
[11:22:28] this is why I didn't worry too much about adding ML, it was just to have a broader test
[11:22:41] (they will use it with docker-pkg etc.)
[11:23:17] And for the two backends, I think we'll end up having to use both for some time in any case
[11:23:41] IIUC APUs is a limited cluster for now, so we'll have to talk with Data Persistence about its prioritization etc.
[11:23:59] and we'll surely have to keep both Swift and APUs in parallel for some months
[11:24:21] I don't see it as problematic, but if you do I can back off and just revert ML to Swift
[11:26:07] Sorry, if vLLM is not problematic currently and even fits in Swift, where does the urgency stem from?
[11:26:39] I was under the impression there are already workloads that do not fit in the registry. Is that a wrong impression?
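
(Editor's note: a minimal sketch of the layer-size check implied by the 4 GB constraint mentioned at [11:21:27], assuming the standard Docker Registry v2 manifest API. The registry URL and the "ml/vllm" repository/tag below are hypothetical placeholders, not the actual WMF values.)

    # Sketch: flag image layers whose compressed size exceeds the
    # ~4 GB per-layer ceiling that keeps an image Swift-compatible.
    # REGISTRY, repo and tag are placeholders (assumptions).
    import requests

    REGISTRY = "https://registry.example.org"  # placeholder endpoint
    LIMIT = 4 * 1024**3  # 4 GiB, compressed layer size

    def oversized_layers(repo: str, tag: str):
        # Fetch the v2 manifest; each entry in "layers" carries the
        # compressed (on-the-wire) blob size in its "size" field.
        resp = requests.get(
            f"{REGISTRY}/v2/{repo}/manifests/{tag}",
            headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
            timeout=30,
        )
        resp.raise_for_status()
        return [(layer["digest"], layer["size"])
                for layer in resp.json()["layers"]
                if layer["size"] > LIMIT]

    for digest, size in oversized_layers("ml/vllm", "latest"):  # placeholder repo/tag
        print(f"{digest}: {size / 1024**3:.2f} GiB exceeds the Swift-safe limit")

(Running such a check against a candidate image before pushing would preserve the "we can always fall back to Swift" escape hatch discussed above.)
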
[11:28:29] But otherwise, I'll again point out that MediaWiki is a far better 1st client for this. It pushes daily via a well-defined and exercised process, pulls a ton, and we can iron out kinks more easily. I am not sure what the benefit for ML is if they are not getting unblocked on something (I was under the impression they would be unblocked, but was that wrong?)
[11:28:30] yes and no: if you take the "vanilla" vLLM image/config, for example, it does exceed the layer limit, but Kevin Bazira in ML worked with docker-pkg to adjust the layers to respect the limits, and on paper it should be good for Swift (but we haven't tested it yet due to the ml-build work; vLLM needs a GPU and a ton of RAM to be built).
[11:29:02] and in the medium term the ML team will likely find a use case that exceeds the limit
[11:29:45] yes. But also, medium term here means that it can be the 2nd client after MediaWiki to onboard onto APUs.
[11:29:52] not the first.
[11:32:51] it seemed to me easier to couple this work with their efforts on ml-build, but the code review is still in flight, so we can leave the S3 ML backend running and just use Swift for the moment. I still don't see the problem in letting them try APUs, but I trust your judgement and I'll ask them to just use Swift
[15:55:20] thanks
[15:56:41] I have no problem btw with ML really being the 2nd client. Once MW+scap is proven to work, it's the perfect 2nd thing. The rest is a long tail of infrequently deployed services which are pushed by the pipeline anyway, making the migration entirely contained within SRE.
[15:57:44] akosiaris: totally unrelated question - IIUC we are using the same Redis instance for the caching. I don't see problems, but today I wanted to test the GC with the restricted registry instance (so delete an image, and see if it worked with S3/APUs etc.), and I wondered whether I could have affected the other instances or not (via Redis)
[16:01:23] elukey: Swift would be more likely to suffer a bit if the caching Redis isn't around at all
[16:01:37] but even that, only during a large deployment and for a short amount of time
[16:05:36] akosiaris: I am asking for two reasons - 1) possible cache pollution happening across instances (is it a thing? just raising it to discuss) 2) I have no idea if the GC acts on the Redis cache; I assumed so, but I didn't find traces of it in the code (I am starting to think that they just leave the cache entries until they fall out of the LRU)
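
(Editor's note: a small sketch of how the [16:05:36] question could be checked empirically, assuming docker/distribution's Redis blob-descriptor cache. The "blobs::*" key pattern and the connection details are assumptions based on one version of that implementation and may differ per version and deployment.)

    # Sketch: after running the registry GC, check whether the digests it
    # deleted still have descriptor entries in the shared Redis cache.
    # Key pattern and connection details are assumptions (see note above).
    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    def cached_digests(pattern: str = "blobs::*"):
        # Assumed key shape: "blobs::sha256:<hex>"; keep only the digest part.
        for key in r.scan_iter(match=pattern, count=1000):
            yield key.split("::", 1)[1]

    # Digests reported as deleted by the GC run (fill in from its output).
    gc_deleted = {"sha256:<digest-from-gc-output>"}

    stale = sorted(d for d in cached_digests() if d in gc_deleted)
    print(f"{len(stale)} GC'd blob(s) still have cache entries in Redis")

(If stale entries show up, that supports the theory above that the GC does not touch the cache and entries simply age out via Redis LRU/maxmemory eviction.)
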