[08:40:18] Hey!
[08:40:31] Ouphh, had a long commute this morning
[09:12:18] https://github.com/TimDettmers/bitsandbytes/issues/107#issuecomment-1873414309
[09:12:48] ROCm support seems to be a high priority for most libraries nowadays!
[09:27:13] Morning!
[09:27:35] Why was your commute long, Ilias? Too much snow? ;)
[09:30:12] o/
[09:30:25] hehe we don't have snow yet, although winter finally came to Athens
[09:30:58] Calling what we have snow is stretching it quite a bit. It's basically a sprinkling on lawns and other surfaces that don't get warm
[10:19:54] (PS1) Ilias Sarantopoulos: locust: first example [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T348850)
[10:20:34] don't pay attention to --^ it refers to some work I did last week. I'll share my findings with the team though!
[10:23:32] Machine-Learning-Team, Patch-For-Review: Establish a standard load testing procedure - https://phabricator.wikimedia.org/T348850 (isarantopoulos) Indeed locust can support our needs. We can define all load tests in one file and get one final report like in the image below where I have just 2 models, but...
[10:30:18] hello! :)
[10:31:46] (CR) AikoChou: [C: +2] revert-risk: add missing force_http arg in batch model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/985347 (https://phabricator.wikimedia.org/T353622) (owner: AikoChou)
[10:36:43] 你好! Aiko! ("Hello, Aiko!")
[10:37:06] hope I got it right :)
[10:37:20] (Merged) jenkins-bot: revert-risk: add missing force_http arg in batch model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/985347 (https://phabricator.wikimedia.org/T353622) (owner: AikoChou)
[10:39:11] cooollll yeah you got it right 😁
[10:42:36] New year's resolution #2: learn new languages
[10:48:39] Ας το κάνουμε! ("Let's do it!")
[10:50:32] (PS3) AikoChou: revert-risk: refactor checking empty result for wikidata model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/985349
[10:51:09] (CR) AikoChou: [C: +2] "Thanks for the review! :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/985349 (owner: AikoChou)
[10:56:53] (Merged) jenkins-bot: revert-risk: refactor checking empty result for wikidata model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/985349 (owner: AikoChou)
[11:01:57] (PS1) Kevin Bazira: test: add shared module for wrk load tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989766 (https://phabricator.wikimedia.org/T354722)
[11:04:36] (PS2) Kevin Bazira: test: add shared module for wrk load tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989766 (https://phabricator.wikimedia.org/T354722)
[11:08:19] aiko: nice!
[11:08:36] (PS3) AikoChou: revertrisk-batch: async fetching data and batch prediction [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/985350 (https://phabricator.wikimedia.org/T352987)
[11:10:04] ---^ just rebased the patch
[11:14:15] I'm on it before lunch!
[11:14:24] was already reviewing!
[11:32:42] * klausman lunch
[11:32:49] (and an errand)
[11:34:28] (CR) Ilias Sarantopoulos: [C: +1] "Nice work! LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/985350 (https://phabricator.wikimedia.org/T352987) (owner: AikoChou)
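
To illustrate the single-locustfile idea from the T348850 comment above (all load tests defined in one file, one aggregated report), a minimal sketch might look like the following. The endpoint paths, payloads, and class names are assumptions for illustration, not the team's actual Locust setup.

    # Hypothetical locustfile covering two model servers at once; Locust runs
    # both user classes together and rolls the results into a single report.
    from locust import HttpUser, task, between


    class RevertRiskUser(HttpUser):
        wait_time = between(1, 2)

        @task
        def predict(self):
            # Endpoint path and payload are illustrative KServe-style values.
            self.client.post(
                "/v1/models/revertrisk-language-agnostic:predict",
                json={"rev_id": 12345, "lang": "en"},
            )


    class ArticleQualityUser(HttpUser):
        wait_time = between(1, 2)

        @task
        def predict(self):
            self.client.post(
                "/v1/models/articlequality:predict",
                json={"rev_id": 12345, "lang": "en"},
            )

Running something like locust -f locustfile.py --host <target-url> (with a host of your choosing) would then exercise both models in one run and produce a single combined report.
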
[11:34:42] aiko: iirc in ores-legacy we chose to open a new session each time instead of reusing it because we had failing connections that remained open.
[11:35:54] reusing the same session makes sense and should be faster
[11:36:36] I can't recall the exact reason, but perhaps some kind of throttling was happening because the connection handling wasn't great
[11:36:44] * isaranto lunch!
[13:28:06] I'm trying to download mixtral-7b from HF and it takes 2h :(
[13:37:06] klausman: how can I check the version of ROCm that we are using? I remember we have 5.4.2 but don't remember where to find it
[13:38:15] since I don't have access to the node, I remember I'm supposed to look somewhere in the puppet repo, but I can't find it or I don't recall correctly
[13:44:12] There may not be an easy way to check without logging into the machine (or teasing apart the puppet conf)
[13:45:54] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/hosts/ml-staging2001.yaml is the current file where that machine is configured
[13:48:57] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/aptrepo/files/updates#188 then lists what e.g. "rocm54" means in more detail
[13:50:06] great, thank you!
[13:51:31] Note that in the future we'll likely not configure it machine-by-machine but more broadly; then the first file would be a different one
[13:55:20] ack
[13:56:18] (PS1) Ilias Sarantopoulos: llm: update transformers module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989831
[13:58:33] (CR) CI reject: [V: -1] llm: update transformers module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989831 (owner: Ilias Sarantopoulos)
[14:07:07] (CR) Ilias Sarantopoulos: [C: +1] "Nice!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989766 (https://phabricator.wikimedia.org/T354722) (owner: Kevin Bazira)
[14:10:31] (CR) Kevin Bazira: [C: +2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989766 (https://phabricator.wikimedia.org/T354722) (owner: Kevin Bazira)
[14:14:11] (PS2) Ilias Sarantopoulos: llm: update transformers module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989831
[14:15:13] (CR) Kevin Bazira: [V: +2 C: +2] test: add shared module for wrk load tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989766 (https://phabricator.wikimedia.org/T354722) (owner: Kevin Bazira)
[14:39:21] Machine-Learning-Team: Deploy 7b parameter models from HF - https://phabricator.wikimedia.org/T354870 (isarantopoulos)
[14:40:19] Machine-Learning-Team: Deploy 7b parameter models from HF - https://phabricator.wikimedia.org/T354870 (isarantopoulos) a: isarantopoulos
[15:09:06] Morning all
[15:09:09] I slept in
[15:10:36] deservedly! morning :)
[15:23:35] Morning Chris!
[16:28:33] hi Chris! :D
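
A minimal sketch of the session reuse Ilias and Aiko discuss above (11:34-11:36), assuming aiohttp; the URL and function names are hypothetical. The point is that a single ClientSession shares its connection pool across all requests and is closed cleanly on exit, instead of opening a fresh session per request and risking connections that linger after failures.

    # Hypothetical sketch: one shared aiohttp ClientSession for all requests.
    import asyncio

    import aiohttp


    async def fetch_rev(session: aiohttp.ClientSession, rev_id: int) -> dict:
        # All calls share the session's connection pool, which is what makes
        # reuse faster than creating a session (and new connections) per call.
        async with session.get(
            "https://api.example.org/revision", params={"rev_id": rev_id}
        ) as resp:
            resp.raise_for_status()
            return await resp.json()


    async def main() -> None:
        timeout = aiohttp.ClientTimeout(total=10)
        # The context manager closes the session and its connections cleanly,
        # so failed requests don't leave connections hanging open.
        async with aiohttp.ClientSession(timeout=timeout) as session:
            results = await asyncio.gather(*(fetch_rev(session, r) for r in (1, 2, 3)))
            print(results)


    asyncio.run(main())
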
[17:02:18] (PS4) AikoChou: revertrisk-batch: async fetching data and batch prediction [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/985350 (https://phabricator.wikimedia.org/T352987)
[17:04:24] (CR) AikoChou: [C: +2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/985350 (https://phabricator.wikimedia.org/T352987) (owner: AikoChou)
[17:06:44] (Merged) jenkins-bot: revertrisk-batch: async fetching data and batch prediction [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/985350 (https://phabricator.wikimedia.org/T352987) (owner: AikoChou)
[17:09:47] Machine-Learning-Team: Refactor wrk load tests to make them DRY - https://phabricator.wikimedia.org/T354722 (kevinbazira)
[17:10:20] Machine-Learning-Team: Refactor wrk load tests to make them DRY - https://phabricator.wikimedia.org/T354722 (kevinbazira) Common functionalities that are used in multiple load tests have been moved to a [[ https://github.com/wikimedia/machinelearning-liftwing-inference-services/blob/main/test/wrk/utils.lua |...
[18:08:25] Machine-Learning-Team: Deploy 7b parameter models from HF - https://phabricator.wikimedia.org/T354870 (isarantopoulos)
[18:09:18] (PS3) Ilias Sarantopoulos: llm: update transformers module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989831
[18:10:27] in the end I'll be deploying the falcon 7b model to continue where we left off a couple of months ago with the old GPU
[18:10:45] I will upload the mistral/mixtral models but that would be a next step
[18:14:25] Machine-Learning-Team: Deploy 7b parameter models from HF - https://phabricator.wikimedia.org/T354870 (isarantopoulos) We start by deploying the falcon 7b model so that we can continue where we left this work a couple of months ago https://phabricator.wikimedia.org/T334583. At the time we hit a wall with our...
[18:15:50] (PS4) Ilias Sarantopoulos: llm: update transformers module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989831 (https://phabricator.wikimedia.org/T354870)
[18:21:40] going afk folks, more stuff tomorrow!
[18:27:41] night isaranto!
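
For the falcon 7b deployment mentioned above (T354870), a minimal sketch of loading the model with the transformers library might look like the following; the dtype, device placement, and prompt are assumptions rather than the actual Lift Wing configuration, and device_map="auto" additionally requires the accelerate package.

    # Hypothetical sketch: load falcon-7b from Hugging Face and run a short
    # generation. Settings below are illustrative, not the Lift Wing setup.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "tiiuae/falcon-7b"  # roughly 14 GB of weights in float16

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # halves memory use compared to float32
        device_map="auto",          # place layers on the available GPU(s)
    )

    inputs = tokenizer("Wikipedia is", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Depending on the transformers version, loading Falcon may also require trust_remote_code=True; newer releases support the architecture natively.
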