[05:46:25] Machine-Learning-Team: Containerize Content Translation Recommendation API - https://phabricator.wikimedia.org/T338805 (kevinbazira) It's good to know that we have an existing scaffolding for fastapi-apps. The recommendation-api project is a good example of projects we are likely to be handed to host on Lift...
[05:54:34] Machine-Learning-Team: Containerize Content Translation Recommendation API - https://phabricator.wikimedia.org/T338805 (elukey) Sure, I am fine with the approach, the only thing that I asked earlier on was if you had thoughts/time to figure out how long would it take to migrate to fastapi (if even possible),...
[05:56:48] (PS3) Elukey: llm: add clean up steps when GPU errors are raised [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930622 (https://phabricator.wikimedia.org/T334583)
[05:57:18] (CR) Elukey: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930622 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[06:03:15] (CR) CI reject: [V: -1] llm: add clean up steps when GPU errors are raised [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930622 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[06:17:50] (CR) Elukey: [C: +2] "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930622 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[06:23:54] (CR) CI reject: [V: -1] llm: add clean up steps when GPU errors are raised [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930622 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[06:27:08] (Merged) jenkins-bot: llm: add clean up steps when GPU errors are raised [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930622 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[07:13:25] (CR) Elukey: [C: +1] ores-legacy: Change message in RevisionNotFound error [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[07:21:29] (CR) Elukey: "Really nice! I left some comments to better understand the patch :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929743 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[07:32:19] (PS1) Elukey: llm: fix call to empty_cache() [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930747
[07:49:12] (CR) Ilias Sarantopoulos: [C: +1] llm: fix call to empty_cache() [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930747 (owner: Elukey)
[07:49:32] Machine-Learning-Team: Containerize Content Translation Recommendation API - https://phabricator.wikimedia.org/T338805 (kevinbazira) Rebuilding a project of this scale using a different framework requires careful planning as we would have to rethink the implementation architecture to keep the current app fun...
[07:50:09] (CR) Elukey: [C: +2] llm: fix call to empty_cache() [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930747 (owner: Elukey)
[07:51:10] (Merged) jenkins-bot: llm: fix call to empty_cache() [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930747 (owner: Elukey)
[07:51:16] \o Morning!
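For reference on the Flask-to-FastAPI move discussed in T338805 above: a minimal sketch of what such a port could look like, assuming FastAPI served by uvicorn as mentioned in the task. The route path and parameters below are illustrative only, not the recommendation-api's actual endpoints.
```python
# Hypothetical sketch of a Flask-style endpoint rewritten for FastAPI;
# the path and query parameters are made up for illustration.
from fastapi import FastAPI

app = FastAPI()

@app.get("/api/v1/translation/recommendations")
async def recommendations(source: str, target: str, count: int = 12) -> dict:
    # A real port would call the existing recommendation logic here.
    return {"source": source, "target": target, "count": count, "articles": []}

# Served with uvicorn, e.g.: uvicorn main:app --host 0.0.0.0 --port 8080
```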
[07:52:38] (PS8) Ilias Sarantopoulos: feat: add Response Models in ores-legacy API [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929743 (https://phabricator.wikimedia.org/T330414)
[07:52:57] (PS5) Ilias Sarantopoulos: ores-legacy: Change message in RevisionNotFound error [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414)
[07:53:11] Morning as well!
[07:56:14] Machine-Learning-Team: Containerize Content Translation Recommendation API - https://phabricator.wikimedia.org/T338805 (elukey) Totally get your point but I don't agree 100%, in this case we don't really need a complete design doc nor roadmaps, it would just be moving the API from Flask to fast-api and uvico...
[08:01:01] (CR) Ilias Sarantopoulos: feat: add Response Models in ores-legacy API (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929743 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[08:02:04] (PS6) Ilias Sarantopoulos: ores-legacy: Change message in RevisionNotFound error [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414)
[08:02:23] (PS9) Ilias Sarantopoulos: feat: add Response Models in ores-legacy API [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929743 (https://phabricator.wikimedia.org/T330414)
[08:05:17] (CR) Elukey: [C: +1] "great work!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929743 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[08:31:58] elukey: any objections to me deploying the replicacount change immediately?
[08:32:16] nope
[08:32:30] Alright, will do so once Jenkins merges
[08:48:10] Machine-Learning-Team, Spike: [Spike] Run models and frameworks on AMD GPU and identify challenges - https://phabricator.wikimedia.org/T334583 (elukey) Challenge with Falcon 7b, this is the first call to the model server (model tensors loaded to the GPU, plus tokens related to features): ` 2023-06-16 08...
[08:48:40] isaranto: added some thoughts to --^ I think that the model itself doesn't fit into the GPU's VRAM, so we cannot do much
[08:49:17] the gpu is left in an inconsistent state, I think that this is why we get the second error msg (all the way to predict())
[08:51:31] So I watched the logs of the predictor-kserve container as I updated the setup in eqiad (using kubectl logs -f --tail=30 -n articletopic-outlink outlink-topic-model-predictor-default-XXXX for both old and new) and you could see the traffic migrating from one to the other over a few seconds, and then at 90s, the old pods started to terminate. Very nice.
[08:52:22] the complete termination and cleanup of the old pods usually happens after 5m or so (I suspect we specifically configured that or it's a k8s default)
[08:52:55] yep it is all knative doing the hard work
[08:53:41] It's nice to see that deployment with transparent cutover has made it to the outside world. I was completely amazed when I first saw this kinda stuff in 2010 after joining the Goo
[08:55:22] it can also do more, like canary deployments etc...
[08:55:49] Yeah, I suspected as much, once you have versioned config with Helm, rollbacks and all that become a lot more feasible.
[08:58:17] isaranto: when you talked about falcon-7b-8bit, was it something like https://huggingface.co/legendhasit/falcon-7b-instruct-8bit ?
[08:58:41] that is not from https://huggingface.co/tiiuae but it may work for us
[09:04:02] elukey: I'll poke Hugh/Kamila about how they feel about updating changeprop on a Friday
[09:05:39] klausman: I'd suggest to not rush it, if we have problems with the firehose we'll need to do more deployments etc..
[09:05:45] it is fine to wait for monday
[09:06:29] Yeah, there also was some oddity with 1-2 of the changeprop Grafana graphs (processing time increasing quite a bit). But Hugh mentioned that might just be an artifact of how changeprop does sharding.
[09:07:24] Also, for watching traffic in multiple pods etc, `kubetail` in my homedir on deploy1002 is great.
[09:07:46] it's from https://github.com/johanhaleby/kubetail
[09:08:23] elukey: I was referring to just loading the same model in 8bit (perhaps the link you posted has done this and then saved the model). https://huggingface.co/docs/transformers/main_classes/quantization
[09:08:23] This is what I was referring to
[09:08:23] ```
[09:08:23] model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)
[09:08:23] ```
[09:08:24] but didn't work out of the box. I'll add this info on phab
[09:08:42] btw huggingface has amazing resources
[09:10:58] ahhh nice!
[09:11:02] I was reading https://huggingface.co/tiiuae/falcon-40b-instruct/discussions/15
[09:11:15] "in bfloat16 it takes ~65GB of VRAM on A1000 80GB, in 8bit ~46GB"
[09:11:28] so bigger llms are probably out of range for us
[09:11:57] As an 80s kid, I am a big fan of 8bit :)
[09:12:52] isaranto: what is the diff between setting the load_in_8bit param and running torch_dtype=torch.bfloat8 in transformers.pipeline?
[09:13:49] ah right we have the issue while loading, so the latter is already at predict time
[09:13:53] okok self answered :)
[09:14:06] :)
[09:14:36] but probably if we want to load the model in 8bit integers then we'd also need inference to run in 8bit right?
[09:14:41] to preserve memory
[09:14:49] (sorry for all the qs, trying to get the code :)
[09:15:07] once you load it in 8bit, it also runs in 8 bit
[09:15:18] perfect, without any extra settings
[09:15:34] Also, can this transformation be done offline, i.e. to the on-disk version and reduce its footprint?
[09:15:37] I mean you lose the extra information by downcasting the model weights at loading time
[09:17:47] klausman: I think so https://huggingface.co/docs/transformers/main_classes/quantization#push-quantized-models-on-the-hub
[09:19:07] That would also help with the /var/lib/{docker,kubelet} thing
[09:19:20] yes, we could do that! the only downside is the extra step required, but we already download the models and upload them manually to swift
[09:21:46] Machine-Learning-Team, Spike: [Spike] Run models and frameworks on AMD GPU and identify challenges - https://phabricator.wikimedia.org/T334583 (elukey) We are discussing https://huggingface.co/docs/transformers/main_classes/quantization#load-a-large-model-in-8bit, as a way to reduce the model's footprint...
[09:22:36] klausman: did you see https://phabricator.wikimedia.org/T339231? Ok for you?
[09:23:19] LGTM
[09:23:24] we could also think about using less disk space for the partition, to leave something out for emergencies, but not sure if worth it
[09:23:31] Do we also need to change the partman recipe?
[09:23:43] later yes
[09:24:19] I wonder about the spinning rust as well. How fast it is etc.
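For context on the 8-bit loading discussed above (09:08-09:15): a sketch of the two options being compared, based on the transformers quantization docs linked in the chat. This is not the exact code used on Lift Wing; load_in_8bit needs the bitsandbytes package and a supported GPU backend, and the model id is just the one mentioned in the conversation.
```python
# Sketch only: comparing bf16 loading vs 8-bit quantized loading of Falcon.
import torch
from transformers import AutoModelForCausalLM

model_id = "tiiuae/falcon-7b-instruct"

# Option 1: half-precision weights (~2 bytes per parameter).
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # Falcon shipped custom modelling code at the time
)

# Option 2: 8-bit weights (~1 byte per parameter); once loaded in 8-bit,
# inference also runs in 8-bit, no extra settings needed.
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
)
```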
[09:24:59] I would be on the fence about using them, the ssd vs hdd latency could be hard to diagnose
[09:26:13] yeah, I would not want to have them in the same VG
[09:26:28] but I wonder what uses they might be suitable for.
[09:27:16] yep I meant latency in general, we'd need to be aware of what disk calls are slow and what not
[09:27:34] for example, adding the kubelet partition or the docker one on hdds may be a recipe for big trouble
[09:28:11] and I am very scared about having things possibly hitting different latency-class partitions
[09:28:21] Yep. I was wondering if there is anything we're downloading from somewhere that would be faster to load from a spinning disk, but I can't think of anything
[09:28:24] (then we'd need to add that variable when debugging)
[09:30:23] I think if we actually run out of disk space for the VGs we already have, replacing the existing rust with more SSDs would be another option. I suspect DCops wouldn't mind having spare disks
[09:30:50] And SSD prices have plummeted in the last 6-9 months
[09:31:08] yeah but we need approved budget for those in the CapEx
[11:01:50] * elukey lunch!
[11:07:48] same
[13:07:43] (CR) Ilias Sarantopoulos: [C: +2] ores-legacy: Change message in RevisionNotFound error [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[13:13:59] (Merged) jenkins-bot: ores-legacy: Change message in RevisionNotFound error [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[13:19:40] (PS10) Ilias Sarantopoulos: feat: add Response Models in ores-legacy API [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929743 (https://phabricator.wikimedia.org/T330414)
[13:33:23] (PS1) Elukey: llm: wipe VRAM memory when an out of memory event occurs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930797 (https://phabricator.wikimedia.org/T334583)
[13:34:20] (CR) CI reject: [V: -1] llm: wipe VRAM memory when an out of memory event occurs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930797 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[13:34:32] (PS2) Elukey: llm: wipe VRAM memory when an out of memory event occurs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930797 (https://phabricator.wikimedia.org/T334583)
[13:35:36] (PS3) Elukey: llm: wipe VRAM memory when an out of memory event occurs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930797 (https://phabricator.wikimedia.org/T334583)
[13:41:51] really nice https://vilsonrodrigues.medium.com/run-your-private-llm-falcon-7b-instruct-with-less-than-6gb-of-gpu-using-4-bit-quantization-ff1d4ffbabcc
[13:43:15] maybe a bit aggressive
[13:46:32] It's obviously a tradeoff between memory usage and quality
[13:47:00] Then again, isn't "so-so predictions" better than "none at all, because we don't have a GPU big enough"?
[13:50:03] no idea yet..
[14:00:35] (CR) Ilias Sarantopoulos: [C: +1] "I'm curious if we need to (un)set anything in the torch model. I'm wondering if it expects the model to still be on GPU." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930797 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
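The 4-bit approach from the Medium post linked at 13:41 roughly follows the pattern below. This is a sketch, not something tested on Lift Wing: BitsAndBytesConfig needs a recent transformers/bitsandbytes, and at the time bitsandbytes primarily targeted CUDA, so whether it works on the AMD/ROCm GPUs is an open question.
```python
# Sketch of 4-bit (NF4) loading as described in the linked post; untested here.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # weights in 4-bit, compute in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b-instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```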
[14:09:44] (CR) Elukey: [C: +2] llm: wipe VRAM memory when an out of memory event occurs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930797 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[14:10:49] (Merged) jenkins-bot: llm: wipe VRAM memory when an out of memory event occurs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930797 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[14:12:08] elukey: this is a good read as well https://huggingface.co/docs/accelerate/usage_guides/big_modeling
[14:12:08] Using device_map="auto" prioritizes GPU, then CPU, then disk according to their capacity
[14:18:41] isaranto: ah wow nice, should we test it?
[14:21:54] seems supported by AutoModelForCausalLM.from_pretrained as well
[14:22:03] sure! at the moment I'm working on the model registry
[14:22:12] ok lemme file the code change
[14:22:30] I am also trying to download falcon from hf's website and build the llm image locally
[14:22:33] not easy
[14:22:41] I'm trying to think of a faster way to try out stuff instead of going through ci/cd..
[14:23:01] perhaps attaching to the pod and changing the code could even work
[14:24:26] you can do this if you want to run falcon locally https://phabricator.wikimedia.org/P49441
[14:25:22] replace model_path with 'tiiuae/falcon-7b' and set local_files_only to false
[14:25:35] ah nice
[14:26:01] do you think that we should still use .to(device) in the tokenizer when using device auto?
[14:26:04] or just remove it?
[14:26:38] maybe AutoTokenizer has the same option
[14:26:53] I think it doesn't, because the tokenizer is only on cpu
[14:27:22] the tokenizer remains on cpu but we load the inputs on the device where the model exists
[14:28:23] btw when you download huggingface models they go to a cache directory with separated blobs and symlinks, if you want to use a specific dir you can do a snapshot download using this script https://phabricator.wikimedia.org/P49442
[14:29:11] mmm so the tokenizer runs on cpu, then the inputs go on the GPU otherwise they cannot be computed?
[14:29:22] or can we do model on gpu and inputs in regular memory?
[14:33:01] the latter will result in an error. both need to be on the same device for the computation (inference) to take place
[14:35:07] (PS1) Elukey: llm: test device_auto functionality in AutoModelForCausalLM [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930806
[14:36:16] (PS2) Elukey: llm: test device_auto functionality in AutoModelForCausalLM [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930806
[14:36:58] (PS3) Elukey: llm: test device_auto functionality in AutoModelForCausalLM [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930806
[14:38:32] (PS4) Elukey: llm: test device_auto functionality in AutoModelForCausalLM [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930806
[14:38:56] there you go --^
[14:39:00] this is the idea right?
[14:42:50] (CR) Ilias Sarantopoulos: [C: +1] "Hopefully!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930806 (owner: Elukey)
[14:42:56] yep!
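To illustrate the tokenizer/device point made at 14:27-14:33: the tokenizer stays on CPU, and the encoded inputs are moved to wherever device_map placed the model before generate() runs. A minimal sketch; the prompt and generation settings are illustrative, and this is not the contents of P49441/P49442.
```python
# Sketch: tokenization happens on CPU, inputs are moved to the model's device.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"  # or a local snapshot directory
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # accelerate also moves activations between devices as needed
    trust_remote_code=True,
)

# The tokenizer itself has no device; only its output tensors are moved.
inputs = tokenizer("Wikipedia is", return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```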
[14:44:29] (CR) CI reject: [V: -1] llm: test device_auto functionality in AutoModelForCausalLM [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930806 (owner: Elukey)
[14:53:17] (CR) Elukey: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930806 (owner: Elukey)
[15:02:24] (CR) Elukey: [C: +2] llm: test device_auto functionality in AutoModelForCausalLM [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930806 (owner: Elukey)
[15:03:24] (Merged) jenkins-bot: llm: test device_auto functionality in AutoModelForCausalLM [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930806 (owner: Elukey)
[15:29:24] Machine-Learning-Team, Spike: [Spike] Run models and frameworks on AMD GPU and identify challenges - https://phabricator.wikimedia.org/T334583 (elukey) Tried to use the device_auto setting in , but this is the result: ` Explicitly passing a `revision` is encouraged when loading a model with custom code...
[15:29:30] I get this very weird error --^
[15:29:36] wondering if the GPU is in a weird state
[15:30:42] yeah this is weird
[15:30:59] seems so from https://grafana.wikimedia.org/d/ZAX3zaIWz/amd-rocm-gpu?orgId=1&var-source=eqiad%20prometheus%2Fops&var-instance=ml-serve1001:9100
[15:31:05] one gpu is completely used
[15:31:26] lol
[15:31:36] actually I mean "lol"
[15:31:49] but in theory the new pod should be scheduled on the other, free one
[15:31:50] mmmmm
[15:32:26] it makes zero sense
[15:33:21] I do see https://grafana.wikimedia.org/d/ZAX3zaIWz/amd-rocm-gpu?orgId=1&var-source=eqiad%20prometheus%2Fops&var-instance=ml-serve1001:9100&from=1686928654949&to=1686929576451&viewPanel=7
[15:33:30] so maybe the device_auto is scheduled on the new gpu but it fails
[15:34:04] I've reset both gpus from ml-serve1001, maybe it helps
[15:34:35] nope
[15:35:26] How do you reset GPUs?
[15:36:11] sudo /opt/rocm/bin/rocm-smi --gpureset -d 0
[15:36:23] but it doesn't always restore them in a good state
[15:36:41] anyway, the device_auto thing seems to lead to a worse situation
[15:36:49] ack
[15:41:09] (PS1) Elukey: Revert "llm: test device_auto functionality in AutoModelForCausalLM" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930779
[15:52:37] isaranto: I tested the previous version of the LLM image as well, and the cleanup of the GPU works
[15:52:51] I keep seeing the same message now, " HIP out of memory. Tried to allocate 80.00 MiB (GPU 0; 15.98 GiB total capacity" etc..
[15:53:00] ack
[15:57:21] Machine-Learning-Team, Spike: [Spike] Run models and frameworks on AMD GPU and identify challenges - https://phabricator.wikimedia.org/T334583 (elukey) Current status for falcon: * I deployed the last version of the docker image that boots correctly, but that leads to a consistent VRAM out of memory eve...
[15:58:50] all right heading out for the weekend folks
[15:58:53] have a nice one
[15:58:56] see you on monday :)
[15:59:30] ciao Luca!
[16:02:10] \o
[16:02:20] I'm heading out as well. Have a great weekend, everyone
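The merged "llm: wipe VRAM memory when an out of memory event occurs" change is not quoted in this log; the general pattern it refers to looks roughly like the sketch below. This is illustrative only, not the patch's actual code; on ROCm builds of PyTorch the torch.cuda API is backed by HIP, hence the "HIP out of memory" messages above.
```python
# Sketch: free cached VRAM when generation hits an out-of-memory error.
import gc
import torch

def generate_with_cleanup(model, inputs):
    try:
        with torch.no_grad():
            return model.generate(**inputs)
    except torch.cuda.OutOfMemoryError:
        # Drop dangling references and return cached blocks to the device,
        # so the GPU is not left in an inconsistent state for the next request.
        gc.collect()
        torch.cuda.empty_cache()
        raise
```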
[17:02:47] (PS1) Ilias Sarantopoulos: llm: add the ability for facilitate various Open Source LLMs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930847 (https://phabricator.wikimedia.org/T333861)
[17:03:33] (CR) Ilias Sarantopoulos: [C: +2] Revert "llm: test device_auto functionality in AutoModelForCausalLM" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930779 (owner: Elukey)
[17:08:18] (CR) CI reject: [V: -1] Revert "llm: test device_auto functionality in AutoModelForCausalLM" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930779 (owner: Elukey)
[17:09:03] (CR) Ilias Sarantopoulos: [C: +2] "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930779 (owner: Elukey)
[17:10:12] (Merged) jenkins-bot: Revert "llm: test device_auto functionality in AutoModelForCausalLM" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930779 (owner: Elukey)
[17:18:25] going away too, have a nice weekeeend o/
[22:02:11] Machine-Learning-Team, API-Portal, Platform Team Initiatives (API Gateway): Add documentation about LiftWing to the API Portal - https://phabricator.wikimedia.org/T325759 (apaskulin) Hi @elukey and @achou! Just wanted to let you know that I moved some things around in the API Portal. You can now acce...