[06:27:31] Good morning!
[07:44:42] I figured out we (I did it!) put the wrong image for langid, so updating to the latest correct one https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/983214
[08:20:12] morning :)
[08:27:23] o/
[08:33:38] (PS1) Ilias Sarantopoulos: llm: set number of threads for ctranslate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/983353
[08:34:18] --^ work in progress
[08:39:45] isaranto: one thing that I wanted to discuss with you - since we specifically set the threads, I'd use a more specific environment variable, like content translation uses for mint
[08:39:57] (so we'll be even more "compatible" with their standards)
[08:40:17] OMP_NUM_THREADS may be beneficial if we don't set threads at all, in theory
[08:40:30] but the more I look into this the more I like explicitly setting threads
[08:40:40] it seems way more reliable and consistent
[08:40:53] I am not 100% sure if OpenMP is always used
[08:41:18] meanwhile setting threads should guarantee that whatever technology/library is used, it is limited
[08:41:24] not sure if it makes sense
[08:43:09] Yes it does! I'll look into that then
[08:43:15] <3
[08:43:29] * isaranto afk commuting and errand
[08:43:43] CT2_INTER_THREADS: 4 # Match available CPUs
[08:43:44] CT2_INTRA_THREADS: 0 # Set to 0 so that CTranslate2 use a default
[08:43:55] these are the two used in deployment-charts
[10:02:00] * isaranto back!
[10:05:13] good morning :)
[10:09:25] Morning :)
[10:33:46] hey hey!
[10:50:03] elukey: I _am_ checking the CI output...
[10:50:24] klausman: it is the third time that I am seeing the opposite :)
[10:50:42] It's faster to comment than it is for me to check after getting a glass of water.
[11:18:20] Machine-Learning-Team: Improving error message for Revertrisk models - https://phabricator.wikimedia.org/T351278 (achou) The data models for revertrisk in knowledge_integrity are likely to be changed for work on task T352987. We will review this ticket afterwards.
[11:36:26] deployed the new langid image everywhere
[11:38:44] I'm having issues building the llm image locally because it is too big. I was wondering whether we would benefit from having 2 production variants (1 for cpu and 1 for gpu). the main difference is that the cpu one doesn't have the rocm version of pytorch
[11:39:23] perhaps it is a discussion to have further down the road, just opening the discussion on this matter
[12:10:59] yeah makes sense
[12:11:08] we'll have these problems more often in the future
[12:18:04] isaranto: o/
[12:18:04] based on your recommendation this morning, I have tried to run the article descriptions model with lower precision (float16) but it looks like this approach is not entirely achievable without making changes to the descartes package.
[12:18:04] I remember in T353127#9401025 you had advised against going down the road of making changes to this package. :)
[12:18:20] * elukey lunch!
[12:20:49] kevinbazira: yes, we'd have to change how the model is loaded. but it is a test that can be run locally without having to commit code or anything else
[12:22:39] in the same comment you pasted I suggest exactly this though: to use torch.half() to use a dtype with lower precision
[12:24:34] by changes to the package I meant mostly not changing how code is organized - trying to modify and improve for loops etc
[12:26:51] let's discuss here what are the options that we have and let's prioritize them. We can't try everything and it would be best to first try things that would take a small amount of time to implement and would bring great benefit (if they are successful).
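For context on the float16 option being weighed above, here is a minimal sketch of the kind of local test isaranto suggests and the dtype pitfall it can hit. The Linear layer is a stand-in, not real descartes code:

```
import torch

# Stand-in for a descartes submodule; weights cast to float16 as in the test.
module = torch.nn.Linear(768, 512).half()

x = torch.randn(1, 768)        # model inputs default to float32
try:
    module(x)                  # float32 input against float16 weights
except RuntimeError as err:
    print(err)                 # dtype mismatch, as in the error pasted below

# Casting the inputs too (x.half()) removes the mismatch, but whether the
# float16 CPU kernels then exist depends on the installed torch build
# (see the later messages in this log).
```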
[12:28:12] yes, I have tried loading the model with the float16 precision, but at the point of generating a prediction it looks like the descartes package expects float32 and the error below is thrown:
[12:28:25] https://www.irccloud.com/pastebin/2u6xmoIj/
[12:28:51] Here is the utils.py code I was using:
[12:29:11] https://www.irccloud.com/pastebin/OtijbC7c/
[12:35:34] (PS3) Ilias Sarantopoulos: llm: set number of threads for ctranslate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/983353 (https://phabricator.wikimedia.org/T351740)
[12:42:27] kevinbazira: this is the same issue I encountered when I tested this. The error message `RuntimeError: mat1 and mat2 must have the same dtype` gives a hint that the tensors don't have the same dtype
[12:44:17] * klausman lunch
[12:45:17] I'm looking into it as well
[12:50:01] isaranto: okok
[12:51:56] in the meantime we can also discuss what options we have on the quantization front. did u have anything particular in mind?
[12:54:51] I had started looking into it but ran into memory issues on the ml sandbox
[12:58:32] * isaranto late lunch!
[13:02:53] kevinbazira: it seems that cpu operations for float16 aren't supported in torch at the moment. will take a look after lunch if we have other options
[13:03:44] isaranto: okok .... enjoy your lunch!
[13:04:22] Thank u <3
[13:15:56] (CR) Elukey: [C: +1] llm: set number of threads for ctranslate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/983353 (https://phabricator.wikimedia.org/T351740) (owner: Ilias Sarantopoulos)
[14:03:23] elukey: any objections to me pushing the resource increase today (it being Friday)? First staging, soak, etc. of course.
[14:04:54] (CR) Ilias Sarantopoulos: [C: +2] llm: set number of threads for ctranslate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/983353 (https://phabricator.wikimedia.org/T351740) (owner: Ilias Sarantopoulos)
[14:05:53] (Merged) jenkins-bot: llm: set number of threads for ctranslate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/983353 (https://phabricator.wikimedia.org/T351740) (owner: Ilias Sarantopoulos)
[14:06:02] klausman: it is fine yes, be careful
[14:06:27] Always on Fridays ;)
[14:06:39] Morning all!
[14:06:48] heyo Chris
[14:06:58] I'm still dragging butt because I'm sick
[14:07:02] Tobias!
[14:07:52] morning!
[14:08:15] Ok, resource increase applied in staging. Keep an eye on things
[14:11:42] er, keeping*. I meant me doing that, not me telling you to :D
[14:11:50] Morning Chris!
[14:13:43] morning! o/
[14:26:12] Change looks good, pushing to codfw
[14:28:01] maybe eqiad first nowadays, but ok anyway :)
[14:28:30] (just for extra precaution, somehow less traffic, even if WME folks hit us in eqiad IIRC)
[14:30:01] Yeah, I guess prod is everywhere these days
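On the quantization option raised at 12:51, the generic off-the-shelf route would be dynamic int8 quantization of the Linear layers. This is a sketch only, with a stand-in model; the log later concludes this route is not worth pursuing for the custom descartes model:

```
import torch

# Stand-in model; the real descartes model is custom, which is what makes a
# converted/quantized artifact non-trivial to serve (as noted later in the log).
model = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.ReLU(),
    torch.nn.Linear(768, 2),
).eval()

# Dynamic int8 quantization of the Linear layers; runs on CPU.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Same forward() interface, so it could be dropped into a local latency test.
print(quantized(torch.randn(1, 768)))
```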
[14:35:48] Hm. On codfw, nllb-200-gpu is not scheduling, because: "0/10 nodes are available: 1 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 8 Insufficient amd.com/gpu."
[14:36:16] Oh hang on, I misunderstood that
[14:36:42] I thought the one machine with not enough CPU had a GPU, but it's 8 without GPU (i.e. all of them)
[14:36:56] and one _also_ wouldn't have had enough CPU
[14:39:04] we shouldn't have nllb gpu in ml-serve-codfw, we don't have gpus in there
[14:39:12] if we have a pod there it is surely a misconfig
[14:40:11] I suspect it's because we moved it from staging to prod, and used an all-of-prod chart
[14:43:15] chrisalbon: not sure if you got the news from the IRC log but we have one MI100 running in staging
[14:43:39] Holy shit I didn't!
[14:43:45] yeah I figured :D
[14:43:55] you seemed too quiet about it :D
[14:44:35] I'm still really sick lol
[14:45:01] Is there a screenshot or something
[14:45:27] Like of nvidia-smi (but you know, for amd)
[14:46:14] Try running this on a statbox or deployment machine: curl https://inference-staging.svc.codfw.wmnet:30443/v1/models/nllb-200:predict -X POST -d '{"prompt": "Spanish text here", ", "tgt_lang": "deu_Latn"}' -i -H "Host: nllb-200-gpu.llm.wikimedia.org" --header "Content-type: application/json"
[14:47:00] screenshot in a few secs
[14:47:14] arg, that curl command is broken because of a mispaste
[14:48:14] https://phabricator.wikimedia.org/F41601767 <- screenshot of radeontop
[14:50:12] this command line will work: time curl https://inference-staging.svc.codfw.wmnet:30443/v1/models/nllb-200:predict -X POST -d '{"prompt": "Donde esta la biblioteca", "tgt_lang": "eng_Latn"}' -i -H "Host: nllb-200-gpu.llm.wikimedia.org" --header "Content-type: application/json"
[14:51:26] there are also metrics in https://grafana.wikimedia.org/d/ZAX3zaIWz/amd-rocm-gpu?orgId=1&var-source=codfw%20prometheus%2Fops&var-instance=ml-staging2001:9100
[14:51:35] so far all good it seems
[14:51:51] holy shit that is fast
[14:51:52] side note, working on a patch to remove the eqiad nllb200-gpu pod
[14:52:30] klausman: what do you mean?
[14:52:45] er. codfw.
[14:52:50] we do have the old gpus on eqiad. the issue we have is codfw
[14:52:53] aa
[14:52:57] clear!
[14:54:07] isaranto: let's deploy the ctranslate image if you have time, really curious about the intra/inter thread thing
[14:54:40] kevinbazira: seems like we can't do anything with float16 on cpu. it isn't supported by pytorch!
[14:56:14] chrisalbon: bigger chunks of text are of course slower, but the GPU-based one is still super fast compared to the CPU version
[14:56:33] I'll write an update on the task later on. And I also don't think quantization is an option we can afford to work on. The descartes model is a custom one so running inference for a converted model is not going to be trivial at all
[14:56:56] elukey: ack!
[14:56:58] I'd love to know how much faster, but no hurry on that
[14:57:06] 10x or so, IIRC
[14:57:49] we do have both deployed in staging in theory
[14:58:18] 10x is for the general case when we compare vanilla CPU to vanilla GPU (by vanilla I mean no optimization or tweaks)
[14:58:28] really great
[14:58:34] that is awesome
[14:58:39] on an MI100 though :)
[14:58:46] not sure with slower ones
[14:58:53] I am just so happy to see an MI100
[14:59:15] let's think about what tests we want to run before we make a bigger order
[14:59:28] we do have the cpu optimized (ctranslate2) on staging, which is faster; e.g. at the moment CPU runs in ~5s and GPU in ~3s
[15:00:21] Pushing resource update in eqiad in a moment
[15:00:51] the comparison I'm mentioning is for a request that translates a paragraph.
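For the CPU-vs-GPU comparison tests discussed above, a small timing sketch that mirrors the working curl command. The endpoint, GPU Host header, and payload come from the log; the request count, the CPU host name, and TLS trust for the internal CA are assumptions:

```
import json
import time

import requests

URL = "https://inference-staging.svc.codfw.wmnet:30443/v1/models/nllb-200:predict"
PAYLOAD = {"prompt": "Donde esta la biblioteca", "tgt_lang": "eng_Latn"}


def mean_latency(host_header: str, runs: int = 5) -> float:
    """Average wall-clock latency over `runs` requests to one model host."""
    headers = {"Host": host_header, "Content-Type": "application/json"}
    timings = []
    for _ in range(runs):
        start = time.monotonic()
        # Assumes the internal CA is trusted on the machine running this.
        resp = requests.post(URL, data=json.dumps(PAYLOAD), headers=headers, timeout=120)
        resp.raise_for_status()
        timings.append(time.monotonic() - start)
    return sum(timings) / len(timings)


if __name__ == "__main__":
    print("gpu:", mean_latency("nllb-200-gpu.llm.wikimedia.org"))
    # Hypothetical host name for the CPU deployment; only the GPU host appears in the log.
    print("cpu:", mean_latency("nllb-200.llm.wikimedia.org"))
```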
[15:02:31] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/983424 (make sure nllb-200-gpu is running only in codfw) is ready for review
[15:05:28] ooh, and of course brainfart
[15:14:04] elukey: I opened the patch https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/983436. However, I think that to speed up a single request we need to set intra_threads and not inter_threads (which applies to multiprocessing/batching), all according to https://opennmt.net/CTranslate2/parallel.html
[15:14:10] lemme know if I'm wrong here
[15:16:20] cause now we're actually setting workers
[15:16:45] can we check how many threads we have in the service that is running now?
[15:25:50] nllb-200-gpu in serve-codfw has been removed.
[15:31:16] isaranto: so IIUC from the comments made for mint, they set inter=0 to use the default, that may be 4?
[15:35:12] in the ctranslate2 code I see
[15:35:12] ```
[15:35:12] inter_threads: Maximum number of parallel generations.
[15:35:12] intra_threads: Number of OpenMP threads per generator (0 to use a default value).
[15:35:12] ```
[15:35:13] trying to dig up what "default" means
[15:36:35] shall I flip them to see how much a single request would benefit?
[15:36:59] sure! if you want I can add the variable quickly via kubectl edit
[15:37:42] I'll update the patch https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/983436 as the current image deployed doesn't have the environment variable
[15:42:03] I updated the patch to include 1 worker and 4 threads
[15:42:43] +1ed
[15:52:18] going afk earlier folks, will read later in case I miss any ping
[15:52:27] have a nice rest of the day and weekend :)
[15:55:08] ciao, have a nice weekend!
[15:55:26] \o have a nice evening
[15:55:34] I deployed the threads change. don't see any difference performance-wise. I get 5s latency
[17:02:48] Machine-Learning-Team, Research: Allow to set Catboost's threads in readability-liftwing - https://phabricator.wikimedia.org/T353461 (isarantopoulos) Since the number of threads used can be specified directly in the [[ https://catboost.ai/en/docs/concepts/python-reference_catboostclassifier_predict_proba...
[17:03:39] logging off folks! have a nice weekend!
[17:28:51] have a great weekend!
[21:38:35] Machine-Learning-Team: Optimize response performance for the article-descriptions model-server - https://phabricator.wikimedia.org/T353127 (Isaac) Thanks @kevinbazira for these tests! Does bumping the RAM help at all or is that largely out of the question for other reasons?
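A minimal sketch of how the inter_threads / intra_threads knobs quoted above map onto the CTranslate2 Python API. The model path and tokens are placeholders, and the CT2_* environment variable names are taken from the MINT deployment-charts values quoted earlier in the log; the names used by the LiftWing patch may differ:

```
import os

import ctranslate2

translator = ctranslate2.Translator(
    "/path/to/nllb-200-ct2",  # placeholder model path
    device="cpu",
    # inter_threads: how many generations can run in parallel ("workers").
    inter_threads=int(os.environ.get("CT2_INTER_THREADS", "1")),
    # intra_threads: OpenMP threads used by a single generation; this is the
    # knob that speeds up one request.
    intra_threads=int(os.environ.get("CT2_INTRA_THREADS", "4")),
)

# Illustrative NLLB-style tokens (source language code first, </s> last).
source_tokens = ["spa_Latn", "▁Donde", "▁esta", "▁la", "▁biblioteca", "</s>"]
results = translator.translate_batch([source_tokens], target_prefix=[["eng_Latn"]])
print(results[0].hypotheses[0])
```

With the 1 worker / 4 threads split from the updated patch, one in-flight request can use up to 4 threads for its matrix operations, while concurrent requests queue behind the single generator.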