[08:13:34] good morning folks!
[09:42:35] Machine-Learning-Team, Observability-Metrics, SRE Observability (FY2024/2025-Q2): Gap in metrics rendered from Thanos Rules - https://phabricator.wikimedia.org/T352756#10440046 (elukey) Something interesting while checking Pyrra today: * https://w.wiki/Ceyc * https://w.wiki/Ceyk The above are examp...
[10:51:52] (PS1) Gkyziridis: logo_detection: add model card link Bug: T370759 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1109040 (https://phabricator.wikimedia.org/T370759)
[10:53:26] Georgios \o/
[10:55:35] (CR) Gkyziridis: "First commits!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1109040 (https://phabricator.wikimedia.org/T370759) (owner: Gkyziridis)
[10:58:24] (CR) Kevin Bazira: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1109040 (https://phabricator.wikimedia.org/T370759) (owner: Gkyziridis)
[11:00:15] (CR) Gkyziridis: [C:+2] "Thnx guys!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1109040 (https://phabricator.wikimedia.org/T370759) (owner: Gkyziridis)
[11:00:57] (Merged) jenkins-bot: logo_detection: add model card link Bug: T370759 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1109040 (https://phabricator.wikimedia.org/T370759) (owner: Gkyziridis)
[11:01:15] (CR) Ilias Sarantopoulos: "Congrats on your first commit!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1109040 (https://phabricator.wikimedia.org/T370759) (owner: Gkyziridis)
[12:00:11] * klausman lunch
[12:21:16] * isaranto lunch!
[13:02:30] (PS4) Nik Gkountas: Group functionality into classes based on the recommendation usecase [research/recommendation-api] - https://gerrit.wikimedia.org/r/1100512
[13:25:51] Did someone reboot ml-serve--ctrl001?
[13:30:16] slyngs: o/ uptime shows a ton of days, why do you ask? Anything wrong with it?
[13:30:35] Just got a DownProbe on it
[13:30:49] It's back now, but 2002 had the same issue yesterday
[13:31:06] Sorry, 2001
[13:31:50] those are all ganeti VMs, not sure if there was maintenance on the underlying nodes or not
[13:32:25] Just thought it was interesting that I had 2001 yesterday and 1001 today.
[13:32:54] Hm, that is a lot of uptime.
[13:37:28] Ah, okay, this the kube-apiserver that was restarted
[13:42:29] okok
[13:42:51] klausman: while checking https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DSmartNotHealthy I noticed that there is an alert for ml-serve2001, not sure if it is known or not
[13:43:39] elukey: I was not aware, thank you for the heads-up
[13:51:51] elukey: created ticket and silenced the alert (for ml-serve2001)
[13:53:13] ack
[14:11:18] o/ I've shared the benchmark results we ran and the examples we used for quantized LLM text generation inference in the model cards published on HF:
[14:11:19] gptq: https://huggingface.co/kevinbazira/aya-expanse-8b-gptq-4bit
[14:11:19] awq: https://huggingface.co/kevinbazira/aya-expanse-8b-awq-4bit
[14:11:19] If you would like to try the examples on ml-lab, feel free to use the wheels from: https://gitlab.wikimedia.org/repos/machine-learning/huggingface-optimum-benchmark-automation/-/tree/main/wheels
[14:14:13] nice work Kevin!
[14:14:40] we can talk about this in the meeting afterwards
[14:14:51] sure sure, np! :)
[14:16:52] we can do the same for flash-attention 2 (with and without fa2)
[14:18:24] yes we can.
[14:25:36] good morning
[14:26:17] Lift-Wing, Machine-Learning-Team: [draft] Update ROCm driver version on Lift Wing nodes - https://phabricator.wikimedia.org/T383230 (isarantopoulos) NEW
[14:27:11] Lift-Wing, Machine-Learning-Team: [draft] Update ROCm driver version on Lift Wing nodes - https://phabricator.wikimedia.org/T383230#10441054 (isarantopoulos)
[14:27:28] o/ Chris
[16:59:56] going afk folks, have a nice evening/rest of day!