[04:32:32] o/
[05:55:16] (PS1) Santhosh: performance: Use background task for logging [research/recommendation-api] - https://gerrit.wikimedia.org/r/1099849
[06:34:31] (PS1) Kevin Bazira: article-country: normalize sums using a fixed minimum sum of 1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100009 (https://phabricator.wikimedia.org/T371897)
[08:18:54] \o
[09:31:12] aiko both GPUs on ml-lab are available now!
[09:32:27] thanks!
[09:38:33] (PS1) Santhosh: performance: Use asynchronous iterator for fetching from collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1100055 (https://phabricator.wikimedia.org/T381366)
[09:39:00] quantizing the model now..
[09:59:39] ack!
[10:16:27] isaranto: o/ you should probably be added as approver for ML groups (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1100057 for example)
[10:17:59] ack
[10:18:28] if there are multiple approvers, is an OR or AND condition applied?
[10:18:28] yes, I was planning on suggesting that
[10:24:16] isaranto: I think it's OR'ed
[10:24:47] e.g. the ops group (SREs) has Joanna and Mark as approvers, but I am pretty sure you don't need approval from both
[10:25:16] Similarly, statsbox roles have four approvers :)
[10:25:20] yes it is OR-based
[10:25:27] so people can have backups etc..
[10:25:43] but I think Ilias should be there, so we have both timezones covered
[10:25:48] ok, thanks. I will add myself then :)
[10:26:43] already have a patch
[10:26:50] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1100063
[10:27:52] argh, that was a brainfart
[10:29:29] There, fixed it
[10:30:35] nice, thanks!
[10:32:08] Do you want to wait for Chris's ok on the change?
[10:32:26] Add Moritz to the change so he is aware
[10:32:32] acl
[10:32:46] yes, since his approval is required
[10:33:26] done
[10:33:35] (it didn't seem that urgent)
[10:36:37] quantizing the model is done!
[10:36:38] Machine-Learning-Team, Recommendation-API, SRE, SRE-Access-Requests, Patch-For-Review: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108#10375065 (klausman) a:klausman
[10:36:54] did you end up using both GPUs?
[10:40:28] I ended up setting device_map to cuda:0 since setting auto resulted in an error
[10:40:41] saying tensors in different devices
[10:41:06] that is interesting
[10:41:36] did that error occur during inference?
[10:42:10] I'm going to use the GPUs to try to load aya32b and then the 4bit version; will let you know once I'm done
[10:52:21] (PS2) Nik Gkountas: performance: Use asynchronous iterator for fetching from collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1100055 (https://phabricator.wikimedia.org/T381366) (owner: Santhosh)
[10:52:23] aiko: do you need both GPUs? If yes I can stop (I am just using cuda:0)
[10:52:31] I just saw you spawned sth
[10:54:56] oooh I am trying inference, I can do it later after you're done
[10:57:51] I tried flash attention with bitsandbytes 4bit aya-expanse32b. got down to 10s from 36s for a single request!
[10:58:22] I'm sure there are improvements to be done over there to properly handle input/output types
[11:01:07] anyway we'd need to run the benchmark - not a single query. At least we are moving in the correct direction.
[11:01:20] aiko: I'm done you can have em!
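(Editor's note on the "run the benchmark - not a single query" point above.) A minimal sketch of what timing several requests, rather than one, might look like. This is not the team's benchmark: `benchmark`, `speedup`, and the `run_query` callable are hypothetical names introduced here for illustration, and the stand-in model in the demo only sleeps. The only figures taken from the log are the 36s and 10s single-request timings.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(run_query: Callable[[str], object],
              prompts: List[str],
              repeats: int = 3) -> Dict[str, float]:
    """Time run_query over every prompt, `repeats` times each.

    run_query is a hypothetical callable wrapping whatever inference
    call is being measured; it is an assumption, not a real API.
    """
    latencies = []
    for prompt in prompts:
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(prompt)
            latencies.append(time.perf_counter() - start)
    return {
        "count": len(latencies),
        "mean_s": statistics.mean(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


def speedup(before_s: float, after_s: float) -> float:
    """Ratio of old latency to new latency (higher means faster)."""
    return before_s / after_s


if __name__ == "__main__":
    # Stand-in for a real model call: sleeps instead of running inference.
    def fake_model(prompt: str) -> None:
        time.sleep(0.01)

    print(benchmark(fake_model, ["short prompt", "a longer prompt"], repeats=2))
    # Single-request timings quoted in the log: 36s before, 10s after.
    print(speedup(36.0, 10.0))
```

Reporting the median and maximum alongside the mean is one way to keep a single slow or fast request from dominating the comparison, which is the concern raised about judging from one query.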
[11:08:18] (PS2) Nik Gkountas: performance: Use background task for logging [research/recommendation-api] - https://gerrit.wikimedia.org/r/1099849 (owner: Santhosh)
[11:08:22] (CR) Nik Gkountas: [C:+2] performance: Use background task for logging [research/recommendation-api] - https://gerrit.wikimedia.org/r/1099849 (owner: Santhosh)
[11:08:31] (PS3) Nik Gkountas: performance: Use asynchronous iterator for fetching from collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1100055 (https://phabricator.wikimedia.org/T381366) (owner: Santhosh)
[11:08:42] * aiko afk ~30m
[11:09:02] (Merged) jenkins-bot: performance: Use background task for logging [research/recommendation-api] - https://gerrit.wikimedia.org/r/1099849 (owner: Santhosh)
[11:22:18] (PS1) Ilias Sarantopoulos: llm: use torch base image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100078
[11:24:48] (PS2) Ilias Sarantopoulos: llm: use torch base image and update deps [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100078
[11:54:02] * isaranto afk lunch
[11:55:47] notebook just died when running inference.. without any error msg
[11:58:09] I'm going to try quantizing the model using a different setting
[11:58:14] yeah, the GPU crashed
[11:59:58] are there any logs?
[12:00:25] dmesg.txt in my homedir, but I doubt it's useful
[12:02:45] I can reboot the machine if there's any suspicion the GPUs might be in a bad state
[12:04:16] ahh ok that'd be great
[12:04:52] thanks!
[12:41:47] aiko: machine is back and ready
[12:50:43] ack!
[13:19:44] tried inference using aya-expanse-32b-AWQ, gpu crashed again.. :(
[13:26:15] nice, that is progress!
[13:26:49] was it the prebuilt one or the one you created?
[13:27:53] you could save the model you created so that we can have it ready to test if needed
[13:46:30] I'm fighting with some python dependencies while updating the llm image
[13:50:10] (CR) Sbisson: performance: Use asynchronous iterator for fetching from collections (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1100055 (https://phabricator.wikimedia.org/T381366) (owner: Santhosh)
[14:01:40] (CR) Sbisson: [C:-1] performance: Use asynchronous iterator for fetching from collections (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1100055 (https://phabricator.wikimedia.org/T381366) (owner: Santhosh)
[14:16:07] both. aya-expanse-8b-AWQ I created and aya-expanse-32b-AWQ that you shared
[14:16:41] yeah I saved the model in my home dir
[15:11:30] Machine-Learning-Team, Recommendation-API, SRE, SRE-Access-Requests, Patch-For-Review: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108#10375993 (klausman) This should work now. Stephane, if you could verify that it does and then resolve...
[15:42:49] Machine-Learning-Team, DC-Ops, ops-eqiad: Q2:install SSD (hot swap additions) to ml-lab100[12] - https://phabricator.wikimedia.org/T381394 (RobH) NEW
[15:43:08] Machine-Learning-Team, DC-Ops, ops-eqiad: Q2:install SSD (hot swap additions) to ml-lab100[12] - https://phabricator.wikimedia.org/T381394#10376139 (RobH)
[15:57:33] Machine-Learning-Team, Recommendation-API, SRE, SRE-Access-Requests: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108#10376308 (SBisson) >>! In T381108#10375993, @klausman wrote: > This should work now. Stephane, if you could verify that it d...
[15:59:00] isaranto: can you add a command to try for Stephane in https://phabricator.wikimedia.org/T381108 All my commands depend on me being an SRE :D
[15:59:19] sure!
[15:59:24] ty :)
[16:01:05] (CR) Ilias Sarantopoulos: article-country: return wikidata_properties as a list (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099524 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[16:07:09] Machine-Learning-Team, Recommendation-API, SRE, SRE-Access-Requests: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108#10376383 (isarantopoulos) You could verify that you can deploy recapi in ml-staging-codfw check if there is any diff ` cd /...
[16:31:20] (PS3) Ilias Sarantopoulos: llm: use torch base image and update deps [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100078
[16:31:42] (CR) CI reject: [V:-1] llm: use torch base image and update deps [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100078 (owner: Ilias Sarantopoulos)
[16:32:39] (PS4) Ilias Sarantopoulos: llm: use torch base image and update deps [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100078
[16:33:11] going afk folks (running to a doc appt) - see you tomorrow o/
[16:34:58] \o
[16:37:30] (CR) Ilias Sarantopoulos: [C:+1] "Just a suggestion for additional info in the docstring. other than that LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1100009 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[16:55:36] Machine-Learning-Team, Recommendation-API, SRE, SRE-Access-Requests: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108#10376638 (SBisson) The diff produced: ` skipping missing values file matching "values-main.yaml" Comparing release=main, cha...
[17:47:41] (PS5) Nik Gkountas: performance: Use asynchronous iterator for fetching from collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1100055 (https://phabricator.wikimedia.org/T381366) (owner: Santhosh)
[17:48:22] (CR) CI reject: [V:-1] performance: Use asynchronous iterator for fetching from collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1100055 (https://phabricator.wikimedia.org/T381366) (owner: Santhosh)
[17:53:18] (CR) Nik Gkountas: performance: Use asynchronous iterator for fetching from collections (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1100055 (https://phabricator.wikimedia.org/T381366) (owner: Santhosh)
[18:38:32] Machine-Learning-Team, Recommendation-API, SRE, SRE-Access-Requests: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108#10377155 (SBisson) Open→Resolved I guess the results of the diff and sync commands above confirm that I do have...
[19:51:41] (CR) Sbisson: performance: Use asynchronous iterator for fetching from collections (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1100055 (https://phabricator.wikimedia.org/T381366) (owner: Santhosh)