[02:35:32] morning folks :)
[03:28:56] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: [LLM] Use vllm for ROCm in huggingface image - https://phabricator.wikimedia.org/T370149#10369892 (10achou) I successfully built Triton flash attention using a miniconda env. However, when attempting to build vllm from source, I encountered an error r...
[05:16:49] ----^ so HIP_ROOT_DIR is a cmake variable, not an env var. that's why it does not work when I set it as an env var
[05:25:33] 06Machine-Learning-Team, 13Patch-For-Review: Run unit tests for the inference-services repo in CI - https://phabricator.wikimedia.org/T360120#10369946 (10kevinbazira)
[05:30:38] 06Machine-Learning-Team, 13Patch-For-Review: Run unit tests for the inference-services repo in CI - https://phabricator.wikimedia.org/T360120#10369947 (10kevinbazira) 05In progress→03Resolved All test images that previously had `tox.ini` in a specific dir have been updated. The CI entrypoint now assume...
[07:12:45] (03PS1) 10Kevin Bazira: article-country: return wikidata_properties as a list [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1099524 (https://phabricator.wikimedia.org/T371897)
[08:08:18] good morning!
[08:32:47] 10Lift-Wing, 06Machine-Learning-Team: [LLM] Allow loading model weights as int8 with HF - https://phabricator.wikimedia.org/T377848#10370115 (10isarantopoulos)
[08:37:47] 10Lift-Wing, 06Machine-Learning-Team: [LLM] Allow loading model weights as int8 with HF - https://phabricator.wikimedia.org/T377848#10370130 (10isarantopoulos)
[09:44:40] 10Lift-Wing, 06Machine-Learning-Team: [LLM] quantization: allow loading model weights as int8/int4 with HF - https://phabricator.wikimedia.org/T377848#10370321 (10isarantopoulos)
[09:49:07] 06Machine-Learning-Team, 10Recommendation-API, 06SRE, 10SRE-Access-Requests: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108#10370350 (10elukey) @klausman Hi!
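[Editor's note on the HIP_ROOT_DIR remark above: CMake cache variables are passed at configure time with `-D`; CMake does not pick them up from the shell environment unless the project explicitly reads `$ENV{...}`. A minimal sketch, assuming a ROCm install under `/opt/rocm` and a hypothetical `build` directory:]

```shell
# Setting it as an environment variable does NOT reach CMake's cache
# (ignored unless the CMakeLists.txt reads $ENV{HIP_ROOT_DIR}):
export HIP_ROOT_DIR=/opt/rocm

# Pass it as a CMake cache variable instead:
cmake -S . -B build -DHIP_ROOT_DIR=/opt/rocm
cmake --build build
```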
Could ML take care of this request?
[09:51:10] ^^^ yes, once today's planned power outage is over
[09:52:09] aiko kevin the 1st GPU in ml-lab is occupied so if you want to use the second one inside a notebook make sure to set the env var `export CUDA_VISIBLE_DEVICES=1`
[09:52:39] otherwise you will only see the first GPU which has 90+% of VRAM occupied
[10:31:55] (03PS3) 10Ilias Sarantopoulos: llm: move dir under src/models [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1099166 (https://phabricator.wikimedia.org/T369344)
[10:31:59] (03CR) 10Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1099166 (https://phabricator.wikimedia.org/T369344) (owner: 10Ilias Sarantopoulos)
[10:33:03] (03CR) 10Ilias Sarantopoulos: [C:03+1] article-country: handle empty country results gracefully [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1099158 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira)
[10:33:17] (03CR) 10CI reject: [V:04-1] llm: move dir under src/models [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1099166 (https://phabricator.wikimedia.org/T369344) (owner: 10Ilias Sarantopoulos)
[10:45:12] (03PS2) 10Kevin Bazira: article-country: handle empty country results gracefully [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1099158 (https://phabricator.wikimedia.org/T371897)
[10:45:48] 10Lift-Wing, 06Machine-Learning-Team: [LLM] quantization: allow loading model weights as int8/int4 with HF - https://phabricator.wikimedia.org/T377848#10370632 (10isarantopoulos)
[10:46:33] (03CR) 10Kevin Bazira: [C:03+2] article-country: handle empty country results gracefully [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1099158 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira)
[10:47:12] 10Lift-Wing,
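[Editor's note on the GPU-selection tip above: `CUDA_VISIBLE_DEVICES` must be set before the CUDA runtime initializes (i.e. before importing torch or similar), and the visible devices are renumbered from zero, so the second physical card becomes `cuda:0` inside the process. A framework-agnostic sketch; the parsing helper is illustrative, not a real CUDA API:]

```python
import os

# Must be set before importing torch/tensorflow, i.e. before CUDA initializes.
# Physical GPU 1 (the second card) becomes the only visible device.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

def visible_gpu_indices() -> list[int]:
    """Parse CUDA_VISIBLE_DEVICES the way the CUDA runtime does (simplified)."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(x) for x in raw.split(",") if x.strip()]

# Inside the process the selected card is addressed as device 0 ("cuda:0"),
# regardless of its physical index on the host.
print(visible_gpu_indices())
```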
06Machine-Learning-Team: [LLM] quantization: allow loading model weights as int8/int4 with HF - https://phabricator.wikimedia.org/T377848#10370626 (10isarantopoulos)
[10:50:34] (03Merged) 10jenkins-bot: article-country: handle empty country results gracefully [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1099158 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira)
[10:54:40] aiko: kevinbazira sry I forgot to attach the AMD docs on quantization https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/model-quantization.html
[10:54:51] I'm adding it to the task as well
[10:55:57] thanks Ilias. I had found this one from HF: https://huggingface.co/docs/transformers/en/quantization/gptq
[10:57:23] exactly! that is the one from the hf docs I mentioned this morning
[10:57:46] rocm docs don't mention AWQ so I'm curious why
[11:01:15] thanks
[11:02:27] 10Lift-Wing, 06Machine-Learning-Team: [LLM] quantization: allow loading model weights as int8/int4 with HF - https://phabricator.wikimedia.org/T377848#10370734 (10isarantopoulos)
[11:03:13] that could be because of the lack of an 8-bit option
[11:16:59] 10Lift-Wing, 06Machine-Learning-Team: [LLM] quantization: allow loading model weights as int8/int4 with HF - https://phabricator.wikimedia.org/T377848#10370814 (10isarantopoulos) a:03achou
[11:27:34] let's try to keep track of the things that we try so that we have clear and reproducible installation instructions
[11:28:15] you can think of the above message as me talking to myself :P
[11:37:02] ack!
[11:54:44] FIRING: LiftWingServiceErrorRate: ...
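[Editor's note: the HF GPTQ docs linked above boil down to passing a quantization config to `from_pretrained`. As a GPU-free stand-in, here is a small helper that assembles the keyword arguments one would pass; the `bits` and `dataset` fields follow `transformers.GPTQConfig`, while the helper itself and the plain-dict shape are assumptions for illustration:]

```python
def gptq_load_kwargs(bits: int = 4, dataset: str = "c4") -> dict:
    """Assemble kwargs mirroring transformers' GPTQConfig-based loading.

    In real code this would be roughly:
        from transformers import AutoModelForCausalLM, GPTQConfig
        cfg = GPTQConfig(bits=bits, dataset=dataset)
        AutoModelForCausalLM.from_pretrained(model_id, device_map="auto",
                                             quantization_config=cfg)
    """
    if bits not in (2, 3, 4, 8):
        raise ValueError("GPTQ supports 2/3/4/8-bit weights")
    return {
        "device_map": "auto",
        "quantization_config": {"bits": bits, "dataset": dataset},
    }

kwargs = gptq_load_kwargs(bits=4)
```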
[11:54:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-backend=ruwiki-goodfaith-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:59:44] RESOLVED: LiftWingServiceErrorRate: ...
[11:59:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-backend=ruwiki-goodfaith-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[12:10:51] isaranto: klausman should we deploy rec-api in production as well? I can come back in an hour and ping you back if you're available. Need to step out a bit for last minute errands.
[12:11:30] kart_: o/ let's do it in an hour! ping me
[12:12:07] we can probably do that. my electricity company said they'll turn off the power for ~1h today "sometime between 0730 and 1700", which hasn't happened yet. Giving it a go in 1h sounds like a good plan
[12:14:52] 10Lift-Wing, 06Machine-Learning-Team: [LLM] quantization: allow loading model weights as int8/int4 with HF - https://phabricator.wikimedia.org/T377848#10371069 (10isarantopoulos) **bitsandbytes** There are two pages with similar installation instructions. The [[ https://huggingface.co/docs/bitsandbytes/main/...
[12:15:09] cool :)
[12:15:53] I'll be here, so we'll ping you Tobias in case things don't go as planned :)
[12:16:05] * isaranto afk lunch!
[12:57:35] and power is gone! now to wait for it to come back.
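[Editor's note: the int8 work being tracked above (bitsandbytes et al.) is built around absmax quantization: scale a float tensor into the int8 range [-127, 127] and keep the scale factor for dequantization. A toy pure-Python sketch of that arithmetic, not the bitsandbytes implementation:]

```python
def absmax_quantize(xs: list[float]) -> tuple[list[int], float]:
    """Quantize floats to int8 via absmax scaling; returns (ints, scale)."""
    scale = max(abs(x) for x in xs) / 127.0
    q = [round(x / scale) for x in xs]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the int8 values and the stored scale."""
    return [x * scale for x in q]

weights = [0.5, -1.2, 3.4, -0.01]
q, scale = absmax_quantize(weights)
restored = dequantize(q, scale)
# Each restored value differs from the original by at most half a
# quantization step (scale / 2).
```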
[13:19:51] 10Lift-Wing, 06Machine-Learning-Team: [LLM] quantization: allow loading model weights as int8/int4 with HF - https://phabricator.wikimedia.org/T377848#10371241 (10achou) **AWQ** Building from source according to the [[ https://github.com/casper-hansen/AutoAWQ_kernels?tab=readme-ov-file#build-from-source | rep...
[13:32:02] I'm back. klausman isaranto are you back?
[13:35:26] i'm here!
[13:37:43] kart_: shall I give it a go?
[13:41:35] sure!
[13:41:47] going with codfw first...
[13:42:39] done!
[13:44:13] Nice. Watching logstash..
[13:45:39] checking the logs on the cluster: I see requests coming in (completed successfully)
[13:45:44] FIRING: LiftWingServiceErrorRate: ...
[13:45:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-backend=ruwiki-goodfaith-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[13:46:07] unrelated alert --^ will check it afterwards
[13:46:12] going for eqiad as well
[13:47:50] logs seem fine..
[13:50:28] isaranto: is it done?
[13:50:44] RESOLVED: LiftWingServiceErrorRate: ...
[13:50:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-backend=ruwiki-goodfaith-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[13:50:50] yes all seems good
[13:50:55] Nice!
[13:51:09] Thanks a lot!
[13:51:23] 🙌
[15:06:29] I ran 3 LLMs on ml-lab GPUs with the same prompt, and here are their inference speeds:
[15:06:30] 1. non-quantized: `aya-23-8B` >60s and `aya-expanse-8B` >6s as shown in https://phabricator.wikimedia.org/P71472
[15:06:30] 2. quantized: `aya-23-8B-GPTQ` <40 seconds as shown in https://phabricator.wikimedia.org/P71473
[15:06:30] It will be interesting to see the results of the GPTQ quantized `aya-expanse`, since inference speed improved for GPTQ `aya-23-8B`
[15:07:50] (03CR) 10Sbisson: Filter out articles in other NS for IW links (031 comment) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098130 (owner: 10Sbisson)
[15:21:28] kevinbazira: nice work!
[15:21:55] (also: still no power ;_; the tech said "someone" let out the magic smoke from the meter he was installing)
[15:27:35] kevinbazira: nice work! how about if you load it on the GPU?
[15:27:44] use cuda:0 or cuda:1 as the device map
[15:27:59] okok
[15:34:14] good morning all
[15:37:21] \o
[15:38:33] isaranto: `sudo nvtop` should now work for you and others.
[15:39:57] I've been hammering my macbook's GPU (or whatever the M2 has) for a week now. ~20000 LLM responses so far
[15:41:39] Are any of them _good_? ;)
[15:42:11] I've made 6702 llama3:70b responses, with an average of 19.58s per response
[15:42:37] The rest are Aya-Expanse:32b 4bit
[16:03:29] o/ Chris
[16:03:39] klausman: I was in a meeting, thaaanks
[16:04:16] works like a charm!
[16:05:45] klausman: re: rocm config and hipcc setup. I found some stuff last week https://phabricator.wikimedia.org/T371344#10368656
[16:06:17] shall I open a new task for this so we can work it out? if you have time we can pair tomorrow after the standup
[16:06:26] yeah, sounds good
[16:06:28] or ping me whenever we can have a chat about it
[16:06:33] thank u!
[16:06:39] I got power back a few mins ago \o/
[16:06:45] how is it having power again?
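[Editor's note: latency comparisons like the aya-23-8B numbers above come down to wall-clock timing around a generate call. A minimal sketch of such a helper; the `model.generate(**inputs, ...)` usage shown in the comment is the standard HF pattern, but the model and inputs names are hypothetical, and the live example times a cheap stand-in function instead:]

```python
import time
from typing import Any, Callable

def timed(fn: Callable[[], Any]) -> tuple[Any, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

# Hypothetical usage with a Hugging Face model loaded on a specific GPU:
#   out, secs = timed(lambda: model.generate(**inputs, max_new_tokens=256))
# Stand-in so the sketch runs anywhere:
out, secs = timed(lambda: sum(range(100_000)))
```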
[16:06:48] \o/
[16:17:37] You don't realize how much of your life depends on electricity until it's gone
[16:26:44] FIRING: LiftWingServiceErrorRate: ...
[16:26:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-backend=ruwiki-goodfaith-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[16:27:55] Same alert as 3h ago, maybe try restarting the pods?
[16:28:50] I'm looking at the logs as we speak
[16:36:58] I see the same errors on logstash multiple times https://logstash.wikimedia.org/goto/636962662af8caad9a856617329f548e
[16:37:31] https://phabricator.wikimedia.org/P71476 (The paste is private, I don't know if it needs to be, I did it just in case)
[16:38:14] ack
[16:38:33] any idea what that particular code is trying to do?
[16:40:12] feature extraction - I don't recall anything more than that. digging into it now
[16:46:44] RESOLVED: LiftWingServiceErrorRate: ...
[16:46:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-backend=ruwiki-goodfaith-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[16:46:45] nevermind, that is a 400 error and it is handled properly
[16:48:50] 10Lift-Wing, 06Machine-Learning-Team: [LLM] quantization: allow loading model weights as int8/int4 with HF - https://phabricator.wikimedia.org/T377848#10372269 (10achou) **AWQ** I was able to build and install AutoAWQ using a miniconda env with [[ https://phabricator.wikimedia.org/P71477 | these steps ]]. How...
[16:50:22] ----^ at least making some progress
[16:50:31] logging off for today! have a nice evening folks :)
[16:51:48] \o
[16:53:00] nighty Aiko!
[17:24:32] night all!
[17:30:49] going afk folks, cu tomorrow!
[23:18:08] night