[00:43:25] (CR) Eamedina: [C:+1] fix return type for all __hash__ methods to be int [research/recommendation-api] - https://gerrit.wikimedia.org/r/1088377 (owner: Nik Gkountas)
[00:44:42] (CR) Eamedina: [C:+1] remove level 1 and 2 pages from "Vital articles" default collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1088382 (https://phabricator.wikimedia.org/T374597) (owner: Nik Gkountas)
[05:05:31] (PS4) Kevin Bazira: article-country: update response schema [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1088214 (https://phabricator.wikimedia.org/T371897)
[05:06:53] (CR) CI reject: [V:-1] article-country: update response schema [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1088214 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[05:09:35] (PS5) Kevin Bazira: article-country: update response schema [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1088214 (https://phabricator.wikimedia.org/T371897)
[05:12:33] (CR) Kevin Bazira: article-country: update response schema (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1088214 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[08:32:45] mooorning o/
[08:49:23] (CR) Ilias Sarantopoulos: [C:+1] "LGTM, thanks!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1088214 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[08:59:08] (CR) Kevin Bazira: [C:+2] "Thanks for the reviews :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1088214 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[08:59:54] (Merged) jenkins-bot: article-country: update response schema [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1088214 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[11:55:08] * isaranto afk lunch
[14:05:45] (PS10) Sbisson: API Continue support [research/recommendation-api] - https://gerrit.wikimedia.org/r/1061713 (https://phabricator.wikimedia.org/T379037) (owner: Santhosh)
[14:06:10] (CR) Sbisson: API Continue support (2 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1061713 (https://phabricator.wikimedia.org/T379037) (owner: Santhosh)
[16:15:40] klausman: o/ Is there any possibility that we could increase memory in experimental ml-staging-codfw to 64GB?
[16:16:08] I want to deploy a bigger model and it is failing (getting OOMKilled)
[16:16:27] otherwise we can do it next week :D
[16:21:58] a single pod? :D
[16:23:35] Machine-Learning-Team, Data-Platform-SRE, Goal: Goal 2: People outside the ML team can ssh into an ml-lab machine, run a Jupyter Notebook, and run PyTorch powered by a GPU. - https://phabricator.wikimedia.org/T371396#10304516 (Ottomata)
[16:31:06] yes, a single pod :D
[16:36:45] it should be possible, it's just a matter of adding the pod limit ranges for experimental
[16:37:07] assuming you have a host with 64G of RAM to allocate
[16:37:15] if not, the kube scheduler will be really sad
[16:47:09] * isaranto nods
[16:48:29] I know, I was just asking if we could do it "on the fly" for experimental, but I see the values are in admin_ng, so I guess it would be best to go through CI/CD
[16:51:36] I'll check the resources - since I don't have permission to view the nodes, is there any way to tell the allocatable memory of a node? From Grafana I can see the sum (which is 644GB) https://grafana.wikimedia.org/d/pz5A-vASz/kubernetes-resources?orgId=1&var-ds=thanos&var-site=codfw&var-prometheus=k8s-mlstaging
[16:52:56] you have the new pod already scheduled, right?
[16:53:16] aya23-predictor?
[16:53:50] I just did - I had another one that was failing
[16:53:56] yes, aya23-predictor
[16:54:44] I mean I just created a new revision (0007) which has 64Gi
[16:55:22] bumped limitranges to 70GB
[16:55:27] for experimental, I mean
[16:55:47] that is awesome, thank you
[16:55:49] seems to have found a home at ml-staging2001.codfw.wmnet
[16:56:25] I guess the k8s devs never expected this would be a use case when they first thought of k8s :D
[16:57:52] I am curious to see how much time it takes to bootstrap
[16:58:06] does it take so much memory because it loads a huge model?
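[Editor's note: the question above — how to read a node's allocatable memory without `kubectl` node-view permission — can be answered from kube-state-metrics via the Prometheus endpoint backing the Grafana dashboard linked above. The sketch below is a minimal illustration, not a reviewed tool: `prometheus_url` is a placeholder, and the `kube_node_status_allocatable` metric name is the standard kube-state-metrics one, assumed to be exported in this cluster.]

```python
import re

# Multipliers for Kubernetes quantity suffixes (binary and decimal).
K8S_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}

def parse_quantity(q: str) -> int:
    """Parse a Kubernetes quantity string like '64Gi' or '70G' into bytes."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]*)", q.strip())
    if not m:
        raise ValueError(f"unparseable quantity: {q!r}")
    value, suffix = m.groups()
    return int(float(value) * K8S_SUFFIXES.get(suffix, 1))

def allocatable_memory_by_node(prometheus_url: str) -> dict:
    """Per-node allocatable memory in bytes, read from kube-state-metrics
    via the Prometheus HTTP API (no `kubectl get node` permission needed)."""
    import requests  # third-party; assumed available
    resp = requests.get(
        f"{prometheus_url}/api/v1/query",
        params={"query": 'kube_node_status_allocatable{resource="memory"}'},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {r["metric"]["node"]: float(r["value"][1]) for r in results}

# The pure helper can be checked locally:
print(parse_quantity("64Gi"))  # 68719476736
```

Summing the returned dict should reproduce the ~644GB total visible on the Grafana dashboard, while the per-node values show whether any single host can fit a 64Gi pod.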
[16:58:53] yes, the model is 61GB on disk
[16:59:28] ahahahah
[16:59:42] I guess that maybe 70GB might not be enough
[17:00:02] https://huggingface.co/CohereForAI/aya-expanse-32b
[17:00:05] it's this one
[17:00:33] TIL Cohere For AI
[17:06:58] readiness probe failed, it probably needs the longer probes
[17:07:59] btw, in the end I don't think we'll need that much pod memory, as we'd work on loading directly to GPU (or streaming the weights from CPU to GPU)
[17:08:29] that would be great, yes
[17:09:43] yes, there is no reason to occupy resources that are used just for model load
[17:10:01] also -> https://www.amd.com/en/developer/resources/technical-articles/introducing-the-first-amd-1b-language-model.html
[17:10:30] I remember you shared the first OLMo model (by AllenAI). I guess these models will work great on AMD GPUs :P
[17:11:26] \o/
[17:13:20] it seems that the server has an error and is restarting: `2024-11-08 17:09:38.478 7 kserve ERROR [__main__.py:():259] Failed to start model server: You can't move a model that has some modules offloaded to cpu or disk`
[17:13:32] I'll look into it, thanks for helping, Luca!
[17:18:18] ack! good luck :)
[17:20:24] tbh for now I think I'll just revert everything
[17:37:22] https://m.mediawiki.org/wiki/Wikimedia_Hackathon_2025 !!
[17:46:48] Machine-Learning-Team: Test the feasibility of deployment of Aya-23 model in LiftWing - https://phabricator.wikimedia.org/T379052#10304826 (isarantopoulos) I made a first attempt to deploy the 32B model on LiftWing and I'm dumping some notes for future reference: It seems that the model couldn't fit on the...
[18:38:15] I uploaded the latest 8B Aya model to deploy that one instead of Aya-23: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1088609
[18:38:49] going afk, have a nice weekend folks o/
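[Editor's note: the sizing being debated above follows from simple arithmetic — weight footprint is roughly parameter count times bytes per parameter, so a 32B-parameter model in bf16/fp16 needs about 64GB for weights alone, consistent with the 61GB on-disk size and with why a 70GB pod limit is marginal. The `kserve` error quoted at 17:13:20 is the message Hugging Face accelerate raises when `.to(device)` is called on a model whose modules were offloaded to CPU/disk (e.g. after loading with `device_map="auto"`); that interpretation is an inference from the message text, not from the service's code. A back-of-envelope sketch:]

```python
def model_footprint_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough weight footprint: parameters times bytes per parameter
    (2 for fp16/bf16, 4 for fp32). Ignores activations and KV cache."""
    return n_params * bytes_per_param

aya_32b = model_footprint_bytes(32e9)  # ~64 GB in bf16
aya_8b = model_footprint_bytes(8e9)    # ~16 GB in bf16
print(f"32B model: ~{aya_32b / 1e9:.0f} GB, 8B model: ~{aya_8b / 1e9:.0f} GB")
```

This also explains the 18:38 decision: the 8B variant at roughly 16GB of weights fits comfortably within existing staging limits.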