[06:40:47] 10Machine-Learning-Team: Containerize Content Translation Recommendation API - https://phabricator.wikimedia.org/T338805 (10kevinbazira) Managed to install front-end dependencies that rely on bower (an old package manager), then pointed the front-end resources path to bower components. The `404 (NOT FOUND)` error...
[06:51:47] 10Machine-Learning-Team: Containerize Content Translation Recommendation API - https://phabricator.wikimedia.org/T338805 (10elukey) @kevinbazira some info in https://wikitech.wikimedia.org/wiki/Deployment_pipeline/Migration/Tutorial#Migrating_a_service_to_Kubernetes The first step is probably to create a blubbe...
[07:52:17] (03PS5) 10Ilias Sarantopoulos: llm: add the ability for facilitate various Open Source LLMs [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930847 (https://phabricator.wikimedia.org/T333861)
[07:52:31] good morning folks!
[07:58:56] (03CR) 10CI reject: [V: 04-1] llm: add the ability for facilitate various Open Source LLMs [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930847 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos)
[08:12:11] kalimera :)
[08:14:34] We get the same message again in CI: "No space left on device"..
[08:19:26] yeah I know, usually with a recheck it works
[08:21:03] (03CR) 10Elukey: "recheck" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930847 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos)
[08:23:48] ah ok, I thought we needed to clean out some space
[08:31:43] 10Machine-Learning-Team, 10Automoderator, 10Moderator-Tools-Team: Retrain revert risk models on a regular basis via moderator false positive reports - https://phabricator.wikimedia.org/T337501 (10Samwalton9)
[08:40:07] Grüeziwohl :)
[08:42:10] o/
[09:34:22] (03PS11) 10Ilias Sarantopoulos: feat: add Response Models in ores-legacy API [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929743 (https://phabricator.wikimedia.org/T330414)
[09:39:15] (03CR) 10Elukey: [C: 03+1] events: propagate the event time with the dt field [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929735 (https://phabricator.wikimedia.org/T267648) (owner: 10DCausse)
[09:39:36] (03CR) 10Elukey: [C: 03+1] Unify the meta subfield in events [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929690 (owner: 10DCausse)
[10:29:57] * elukey lunch
[11:15:45] gah, ISP maintenance at my POP. Might be sporadically offline today
[11:29:11] I'm thinking of deploying the nllb model with and without a GPU and running some tests
[11:29:18] * isaranto going for lunch
[11:44:09] (03CR) 10Elukey: "Left some comments but I like the idea!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930847 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos)
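
A minimal sketch of the CPU-vs-GPU comparison isaranto floats at [11:29:11], assuming the distilled 600M NLLB-200 checkpoint from Hugging Face; the checkpoint, language pair, and prompt are illustrative, not the service's actual configuration:

    # Hypothetical benchmark: time one translation on CPU and, if present, GPU.
    import time

    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    MODEL = "facebook/nllb-200-distilled-600M"  # assumed checkpoint

    def time_translation(device: str) -> float:
        tokenizer = AutoTokenizer.from_pretrained(MODEL, src_lang="eng_Latn")
        model = AutoModelForSeq2SeqLM.from_pretrained(MODEL).to(device)
        inputs = tokenizer("Machine learning is fun.", return_tensors="pt").to(device)
        start = time.perf_counter()
        model.generate(
            **inputs,
            # NLLB needs the target language forced as the first generated token
            forced_bos_token_id=tokenizer.convert_tokens_to_ids("ell_Grek"),
            max_new_tokens=64,
        )
        return time.perf_counter() - start

    print(f"cpu : {time_translation('cpu'):.2f}s")
    if torch.cuda.is_available():
        print(f"cuda: {time_translation('cuda'):.2f}s")

A real test would warm the GPU up first and average over many runs, since the first CUDA call pays one-off initialization costs.
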
[12:11:54] isaranto: o/ (when you are back) - I was wondering why the auto load for our LLM class didn't work, and why the GPU's VRAM got maxed out
[12:13:35] and then I re-thought about the current preprocess hack, namely loading as late as possible
[12:13:52] IIRC we tried it since we didn't exactly know what was broken, but now it seems not right
[12:14:33] check_gpu() should really be after load, I don't see any reason not to have it in there
[12:16:48] ah no wait, my bad, the fork stuff
[12:16:52] totally forgot it
[12:16:53] sigh
[12:21:48] elukey: regarding the first question about GPU VRAM, are you referring to falcon? if yes then it makes sense, as it is trying to use more RAM, perhaps doing something like copying the model
[12:22:10] isaranto: yes correct, but it saturates all the GPU RAM at once and then it fails
[12:22:20] I am reading https://huggingface.co/docs/accelerate/usage_guides/big_modeling and in theory it should be more conservative
[12:22:21] I'm going to do a bit more profiling in the transformers pytorch code
[12:22:50] what do you mean by "more conservative"?
[12:23:39] I mean that AutoModelForCausalLM.from_pretrained shouldn't max out all the GPU memory and then fail
[12:23:48] aa ok
[12:24:09] because the weird thing IIRC was that it failed while loading the model
[12:24:15] something that doesn't happen with the current code
[12:24:20] but if it uses 14GB and then tries to use 28GB, wouldn't that make sense?
[12:24:44] maybe I am missing something, why 28G?
[12:25:09] we are currently maxing out the VRAM when uploading the tokens to the GPU, IIUC
[12:25:14] let me find the reference above or in phab
[12:25:17] but the model is already safely loaded
[12:26:48] due to this https://phabricator.wikimedia.org/T334583#8934759
[12:27:04] it is loaded but there is a big spike in memory usage
[12:28:49] I'm going to do more profiling now, to figure out if we can bypass this
[12:28:54] isaranto: yeah, but IIUC from that post the double memory usage that you pointed out happened during inference
[12:29:01] like preprocess etc., not loading the model
[12:29:03] right?
[12:29:31] right...
[12:29:45] ok so then I'm confused and we're talking about different stuff
[12:29:47] sorries
[12:29:50] https://phabricator.wikimedia.org/T334583#8939025 is very weird, it allocates more than 15G of RAM
[12:30:07] isaranto: nono sorry, I am trying to understand as well, too many new things, good to brainbounce :)
[12:30:38] device_map auto must be more aggressive and/or not really paying attention to the available VRAM
[12:31:13] there are two issues, if I'm not mistaken:
[12:31:14] 1. load the model on CPU and then move it to GPU, but we max out GPU VRAM on inference (the issue I mentioned)
[12:31:14] 2. set device_map to auto and can't even load the model
[12:31:16] right?
[12:34:09] correct, this is my understanding
[12:38:40] ack
[12:39:18] I'm going to try to deal with the first one at the moment + continue with nllb. wdyt?
[12:41:22] isaranto: we can concentrate on nllb, the other one is less of a priority, don't worry
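
For the second issue above, the accelerate big-modeling guide linked at [12:22:20] lets the caller cap what device_map="auto" may place on each device via max_memory, rather than letting it fill the card. A hedged sketch of that loading path; the dtype and the GiB caps are illustrative, not the team's actual settings:

    # Hypothetical loading path: constrain device_map="auto" with max_memory so
    # accelerate leaves VRAM headroom instead of claiming the whole GPU.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "tiiuae/falcon-7b-instruct",
        torch_dtype=torch.bfloat16,               # half-precision weights, ~half the footprint
        device_map="auto",                        # let accelerate split layers across devices
        max_memory={0: "10GiB", "cpu": "30GiB"},  # assumed caps; tune to the actual card
        trust_remote_code=True,                   # falcon ships custom modelling code
    )

Layers that do not fit under the GPU cap stay on the CPU (or are offloaded), trading inference speed for not OOM-ing at load time.
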
[12:42:07] ok. I'm trying some stuff in the background.. What is weird is that this seems to happen with falcon and not with bloom (the 2x memory usage in inference)
[13:03:52] I posted here as well -> https://huggingface.co/tiiuae/falcon-7b-instruct/discussions/36
[13:05:16] And then I just noticed something else which is odd. I used the function model.get_memory_footprint(), which returns 24GB, which again is weird... More later..
[13:10:00] weird
[13:10:15] I am pretty sure we are going through the unknown unknowns
[13:14:36] also it's a really tight fit with this specific model, so it makes sense to experience issues others don't have
[14:00:30] Finally I have Internet again \o/
[14:00:44] latency is all over the place, though :-/
[14:13:55] 10Machine-Learning-Team, 10serviceops, 10Patch-For-Review: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (10elukey) I filed some code changes for Restbase, but to the wrong repo since we have gerrit mirroring github. The CI settings are broken...
[14:13:59] (03PS6) 10Ilias Sarantopoulos: llm: add the ability for facilitate various Open Source LLMs [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930847 (https://phabricator.wikimedia.org/T333861)
[15:10:59] * elukey taking a break
[15:12:09] (03CR) 10Ilias Sarantopoulos: llm: add the ability for facilitate various Open Source LLMs (035 comments) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930847 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos)
[16:00:48] (03CR) 10Elukey: [C: 03+1] llm: add the ability for facilitate various Open Source LLMs (032 comments) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930847 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos)
[16:00:57] going afk folks!
[16:01:00] have a nice rest of the day
[16:02:36] o/ bye luca!
[16:06:34] (03CR) 10Ilias Sarantopoulos: [C: 03+2] llm: add the ability for facilitate various Open Source LLMs [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930847 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos)
[16:07:00] I'm going afk as well, cu tomorrow!
[16:07:36] (03Merged) 10jenkins-bot: llm: add the ability for facilitate various Open Source LLMs [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930847 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos)
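
Circling back to the get_memory_footprint() surprise at [13:05:16]: a minimal profiling sketch, on the assumption that comparing the model's reported weight footprint against the peak VRAM actually allocated during generate() would localize the ~2x inference spike. The checkpoint, dtype, and prompt are illustrative:

    # Hypothetical profiling snippet: reported weight footprint vs. peak VRAM
    # allocated while generating, to see where the extra memory goes.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    NAME = "tiiuae/falcon-7b-instruct"  # assumed checkpoint from the thread above

    tokenizer = AutoTokenizer.from_pretrained(NAME)
    model = AutoModelForCausalLM.from_pretrained(
        NAME, torch_dtype=torch.bfloat16, trust_remote_code=True
    ).to("cuda")

    print(f"weights: {model.get_memory_footprint() / 2**30:.1f} GiB")

    inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
    torch.cuda.reset_peak_memory_stats()
    with torch.inference_mode():  # ensure no autograd state is kept around
        model.generate(**inputs, max_new_tokens=32)
    print(f"peak   : {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")

If the peak far exceeds the weight footprint, the spike comes from inference-time allocations (attention buffers, KV cache, intermediate activations) rather than from model loading itself.
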