[00:09:36] (PS2) Kevin Bazira: article-country: normalize sums using a fixed minimum sum of 1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100009 (https://phabricator.wikimedia.org/T371897)
[00:12:07] (CR) Kevin Bazira: article-country: normalize sums using a fixed minimum sum of 1 (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100009 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[00:38:40] (PS2) Kevin Bazira: article-country: return wikidata_properties as a list [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099524 (https://phabricator.wikimedia.org/T371897)
[00:40:45] (CR) Kevin Bazira: article-country: return wikidata_properties as a list (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099524 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[01:17:04] (CR) Kevin Bazira: [C:+1] "was able to build this image locally without issues." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100078 (owner: Ilias Sarantopoulos)
[04:27:44] morning folks :)
[04:42:34] Lift-Wing, Machine-Learning-Team: [LLM] quantization: allow loading model weights as int8/int4 with HF - https://phabricator.wikimedia.org/T377848#10378230 (achou) **AWQ** I was able to quantize aya-expanse-8b using AWQ. The quantized model is saved in my home directory at `/home/aikochou/aya-expanse-8b...
[04:50:57] running an example from the AWQ repo that loads Meta-Llama-3.1-8B-Instruct-AWQ-INT4 for inference, but encountered a permission error
[04:50:59] PermissionError: [Errno 13] Permission denied: '/srv/hf-cache/hub/.locks/models--hugging-quants--Meta-Llama-3.1-8B-Instruct-AWQ-INT4/db88166e2bc4c799fd5d1ae643b75e84d03ee70e.lock'
[04:53:43] has anyone seen this error? I'm using my hf token and haven't had this problem downloading other models before
[05:01:54] (PS6) Santhosh: performance: Use asynchronous iterator for fetching from collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1100055 (https://phabricator.wikimedia.org/T381366)
[05:02:25] (CR) Santhosh: performance: Use asynchronous iterator for fetching from collections (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1100055 (https://phabricator.wikimedia.org/T381366) (owner: Santhosh)
[08:23:48] o/
[08:25:14] I've run latency benchmarks for aya-expanse-8b:
[08:25:14] 1. non-quantized succeeded: https://phabricator.wikimedia.org/P71521
[08:25:14] 2. GPTQ quantized failed: https://phabricator.wikimedia.org/P71523
[08:25:14] looks like it's failing because of a memory issue:
[08:25:14] ```
[08:25:14] Memory access fault by GPU node-1 (Agent handle: 0xbd86680) on address 0x6f3ef82bd000. Reason: Unknown.
[08:25:14] Fatal Python error: Aborted
[08:25:15] ```
[08:25:18] good morning!
[08:40:01] Looking at the errors above --^
[08:43:14] kevinbazira: I encountered the same error https://phabricator.wikimedia.org/T377848#10378230
[08:49:41] aiko: o/ interesting ...
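The PermissionError above comes from the hub client trying to create a lock file under the shared `/srv/hf-cache` directory, which this user cannot write to. A minimal workaround sketch (an assumption, not the fix the team actually applied; the path is illustrative) is to point the hub cache at a per-user directory before anything downloads:

```python
import os

# Workaround sketch: redirect the Hugging Face hub cache to a writable
# per-user directory so downloads and their .lock files no longer touch
# the shared /srv/hf-cache. Must be set BEFORE importing transformers or
# huggingface_hub. The path below is illustrative.
os.environ["HF_HUB_CACHE"] = os.path.expanduser("~/.cache/huggingface/hub")

# Most loaders also accept an explicit per-call cache location, e.g.:
#   AutoModelForCausalLM.from_pretrained(model_id, cache_dir="/tmp/hf-cache")
```

Alternatively, fixing group write permissions on the shared cache would keep a single copy of the weights on disk.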
[08:51:11] let's talk about these in a bit then
[08:51:16] okok
[09:27:09] (CR) Ilias Sarantopoulos: [C:+2] llm: use torch base image and update deps [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100078 (owner: Ilias Sarantopoulos)
[09:29:57] (Merged) jenkins-bot: llm: use torch base image and update deps [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100078 (owner: Ilias Sarantopoulos)
[09:39:37] (PS4) Ilias Sarantopoulos: llm: move dir under src/models [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099166 (https://phabricator.wikimedia.org/T369344)
[09:40:23] (CR) CI reject: [V:-1] llm: move dir under src/models [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099166 (https://phabricator.wikimedia.org/T369344) (owner: Ilias Sarantopoulos)
[09:44:33] aiko: in the meeting, you asked whether I was able to run inference after the GPU restart.
[09:44:33] I've tested and both `CohereForAI/aya-expanse-8b` and `2z299/aya-expanse-8b-GPTQ-4bit` are running fine, just like before the restart.
[09:44:53] thanks Kevin!!
[09:46:09] okok
[09:48:55] ahh I think I found it!! \o/
[09:49:00] I set the device_map to {'': 'cuda:0'}, then everything works!
[09:49:19] it was set to 'auto'
[09:51:27] ok!
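The device_map fix described above can be sketched as follows. This is an illustrative snippet rather than the service's actual code; the empty-string key in the map assigns the entire module tree to that device, whereas `"auto"` lets accelerate decide placement itself (which is what hit the GPU memory-access fault here):

```python
# Pin the whole model to one GPU instead of letting device_map="auto"
# shard it. The "" key means "the entire model".
device_map = {"": "cuda:0"}

# Hypothetical usage (model id taken from the log; needs transformers
# and a GPU, so it is commented out here):
#   from transformers import AutoModelForCausalLM
#   model = AutoModelForCausalLM.from_pretrained(
#       "CohereForAI/aya-expanse-8b", device_map=device_map)
```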
[09:55:18] (PS5) Ilias Sarantopoulos: llm: move dir under src/models [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099166 (https://phabricator.wikimedia.org/T369344)
[09:56:03] (CR) CI reject: [V:-1] llm: move dir under src/models [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099166 (https://phabricator.wikimedia.org/T369344) (owner: Ilias Sarantopoulos)
[09:56:38] (CR) Ilias Sarantopoulos: [C:+1] article-country: return wikidata_properties as a list (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099524 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[09:56:41] (PS3) Kevin Bazira: article-country: return wikidata_properties as a list [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099524 (https://phabricator.wikimedia.org/T371897)
[09:57:27] (CR) Ilias Sarantopoulos: [C:+1] article-country: normalize sums using a fixed minimum sum of 1 (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100009 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[09:57:30] (PS3) Kevin Bazira: article-country: normalize sums using a fixed minimum sum of 1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100009 (https://phabricator.wikimedia.org/T371897)
[09:59:16] (PS6) Ilias Sarantopoulos: llm: move dir under src/models [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099166 (https://phabricator.wikimedia.org/T369344)
[10:07:26] finally looked at this carefully and fixed it https://gerrit.wikimedia.org/r/1099166
[10:07:40] just a move of the llm dir under src/models/
[10:22:31] device_map sure seems more and more magic to me :)
[10:23:22] (CR) Kevin Bazira: [C:+2] article-country: return wikidata_properties as a list [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099524 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[10:24:08] (Merged) jenkins-bot: article-country: return wikidata_properties as a list [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099524 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[10:29:16] (PS4) Kevin Bazira: article-country: normalize sums using a fixed minimum sum of 1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100009 (https://phabricator.wikimedia.org/T371897)
[10:31:10] (CR) Kevin Bazira: [C:+2] article-country: normalize sums using a fixed minimum sum of 1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100009 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[10:31:56] (Merged) jenkins-bot: article-country: normalize sums using a fixed minimum sum of 1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100009 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[10:39:55] I think that device_map='auto' is supposed to be magic by design :)
[10:41:15] Something about sorcerer's apprentices and buckets and brooms :)
[11:25:50] (CR) Kevin Bazira: [C:+1] llm: move dir under src/models [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099166 (https://phabricator.wikimedia.org/T369344) (owner: Ilias Sarantopoulos)
[11:27:17] (PS7) Ilias Sarantopoulos: llm: move dir under src/models [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099166 (https://phabricator.wikimedia.org/T369344)
[11:33:00] (CR) Ilias Sarantopoulos: [C:+2] llm: move dir under src/models [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099166 (https://phabricator.wikimedia.org/T369344) (owner: Ilias Sarantopoulos)
[11:33:45] (Merged) jenkins-bot: llm: move dir under src/models [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099166 (https://phabricator.wikimedia.org/T369344) (owner: Ilias Sarantopoulos)
[11:58:15] (PS1) Ilias Sarantopoulos: llm: change local and docker dir to be the same [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100432
[12:06:00] (PS2) Ilias Sarantopoulos: llm: change local and docker dir to be the same [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100432
[12:34:21] (PS1) Ilias Sarantopoulos: (WIP) llm: add aya with bitsandbytes [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100441
[12:34:49] * isaranto afk lunch o clock
[12:35:05] (CR) CI reject: [V:-1] (WIP) llm: add aya with bitsandbytes [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100441 (owner: Ilias Sarantopoulos)
[13:52:33] I ran 2 `llama-3.1-8b-instruct` models on ml-lab GPUs with the same prompt, and here are their inference speeds:
[13:52:33] 1. non-quantized: ~6s as shown in https://phabricator.wikimedia.org/P71538
[13:52:33] 2. GPTQ quantized: <25s as shown in https://phabricator.wikimedia.org/P71539
[13:52:33] these results are similar to what I got with `aya-expanse`
[14:25:47] I see that uses a version of the model available from huggingface. did you try to quantize the llama/aya model yourself using gptq and the c4 dataset?
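The self-quantization flow being suggested (per the Hugging Face transformers GPTQ integration) can be sketched roughly as below. The settings, model id, and output path are illustrative, and the heavy calls are commented out because a real run needs a GPU plus the optimum/auto-gptq extras installed:

```python
# Sketch of quantizing a model yourself with GPTQ via transformers:
# 4-bit weights, calibrated on the "c4" dataset as discussed above.
gptq_settings = {"bits": 4, "dataset": "c4", "group_size": 128}

# Quantization happens during from_pretrained when a GPTQConfig is passed
# (illustrative; requires transformers, optimum, auto-gptq, and a GPU):
#   from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
#   model_id = "CohereForAI/aya-expanse-8b"
#   tokenizer = AutoTokenizer.from_pretrained(model_id)
#   config = GPTQConfig(tokenizer=tokenizer, **gptq_settings)
#   model = AutoModelForCausalLM.from_pretrained(
#       model_id, device_map={"": "cuda:0"}, quantization_config=config)
#   model.save_pretrained("aya-expanse-8b-GPTQ-4bit")
```

Quantizing with your own calibration data, rather than downloading a pre-quantized checkpoint, also makes the benchmark comparison apples-to-apples.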
[14:27:39] I mean to follow the process documented here https://huggingface.co/docs/transformers/main/quantization/gptq#gptq
[14:30:54] please add your findings in the task https://phabricator.wikimedia.org/T377848
[14:38:28] (PS5) Ilias Sarantopoulos: llm: add aya with bitsandbytes [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100441 (https://phabricator.wikimedia.org/T379052)
[16:57:36] (CR) Kevin Bazira: [C:+1] llm: change local and docker dir to be the same [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100432 (owner: Ilias Sarantopoulos)
[16:58:26] (CR) Ilias Sarantopoulos: [C:+2] llm: change local and docker dir to be the same [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100432 (owner: Ilias Sarantopoulos)
[16:59:50] logging off today! have a nice evening folks :)
[17:06:25] (Merged) jenkins-bot: llm: change local and docker dir to be the same [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100432 (owner: Ilias Sarantopoulos)
[17:14:58] night Aiko! logging off as well folks, cu tomorrow o/
[17:45:53] night all!
[18:11:30] (PS1) Nik Gkountas: Use strategy pattern to support different recommendation usecases [research/recommendation-api] - https://gerrit.wikimedia.org/r/1100512