[07:41:11] isaranto: o/
[07:41:19] shall we merge the GPU support?
[07:58:18] (CR) Elukey: [C: +2] fix: add missing requirements for falcon-7b model and enable GPU support [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927733 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[07:58:32] going to merge and test :)
[08:06:36] o/
[08:07:09] Yes, I'll also add separate images for CPU and GPU - unless I can figure out something else
[08:08:09] (Merged) jenkins-bot: fix: add missing requirements for falcon-7b model and enable GPU support [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927733 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[08:08:33] isaranto: kalimera (good morning)! I thought that the last fix was kinda agnostic about CPU vs GPU, namely that it auto-recognizes when a GPU is present.. why do we need separate images?
[08:13:56] lol
[08:14:27] yes you're right, it is agnostic. that was the whole reason
[08:14:31] 🤦
[08:18:56] ahhh okok, I thought I had missed something in the middle :D
[08:23:06] I'll work on the profiling issue though - it takes 2xMODEL_SIZE memory to generate predictions
[08:27:56] lovely
[08:29:19] starting bloom-560 with the new image
[08:30:34] I ran some stuff in colab yesterday with/without GPU to get some approximate latencies we can compare against
[08:31:54] with GPU 8s - without 43s (although 43s is too much, that was just in colab)
[08:33:27] 2023-06-07 08:31:23.746 1 root INFO [__init__():22] Using device: cuda
[08:33:30] looks promising :)
[08:34:10] took 9s for me (bloom-560)
[08:34:12] isaranto: --^
[08:34:26] that is a huge improvement
[08:34:56] and we just ran an LLM on an AMD GPU \o/
[08:36:37] I am curious to see if 3b works with the new image
[08:36:40] 🎉
[08:37:15] 🎉
[08:38:47] trying
[08:39:59] still having network issues, hence the duplicate messages --^
[08:42:08] isaranto: something weird with 560
[08:42:08] {"error":"RuntimeError : Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method"}
[08:43:09] I noticed some warnings though, we may also need a debian package added to the image (with GPU firmware etc..)
[08:43:36] namely libdrm-amdgpu1
[08:43:52] but it doesn't crashloop anymore
[08:45:01] https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing
[08:49:59] (PS1) Elukey: blubber: add libdrm-amdgpu1 to bloom's docker image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927983 (https://phabricator.wikimedia.org/T333861)
[08:50:35] I get this error all the time indeed
[08:50:50] weird that on 560 I got it only after the first time
[08:52:37] it makes sense right?
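The "Using device: cuda" log line above reflects the device-agnostic behaviour being discussed: the same image picks up the GPU when one is visible and falls back to CPU otherwise. The actual model-server code isn't quoted in this log; a minimal sketch of that pattern in PyTorch (the model variable is a placeholder for however the server loads the transformer) could look like this:

```python
import torch

# Select the device once at startup: use the GPU if the runtime can see one,
# otherwise fall back to CPU, so a single image serves both cases.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# model = ...              # placeholder: load the transformer model here
# model = model.to(device) # move the weights to the selected device
```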
[08:52:48] it ran once and then it keeps failing
[08:53:57] taking a 5 minute break and will rejoin with mobile data - for some reason the network is really slow
[08:58:31] yes yes, but for 3b it started with those errors straight away
[08:58:34] that is the confusing bit
[08:59:11] (CR) CI reject: [V: -1] blubber: add libdrm-amdgpu1 to bloom's docker image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927983 (https://phabricator.wikimedia.org/T333861) (owner: Elukey)
[08:59:33] ah ok
[09:00:43] mmmm "failed to copy files: copy file range failed: no space left on device"
[09:00:52] (CR) Elukey: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927983 (https://phabricator.wikimedia.org/T333861) (owner: Elukey)
[09:11:18] ok now it works :)
[09:17:13] (CR) Ilias Sarantopoulos: [C: +1] blubber: add libdrm-amdgpu1 to bloom's docker image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927983 (https://phabricator.wikimedia.org/T333861) (owner: Elukey)
[09:18:44] (CR) Elukey: [C: +2] blubber: add libdrm-amdgpu1 to bloom's docker image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927983 (https://phabricator.wikimedia.org/T333861) (owner: Elukey)
[09:20:02] (Merged) jenkins-bot: blubber: add libdrm-amdgpu1 to bloom's docker image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927983 (https://phabricator.wikimedia.org/T333861) (owner: Elukey)
[09:29:32] (PS1) Ilias Sarantopoulos: feat: add spawn method for cpu and gpu [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927993 (https://phabricator.wikimedia.org/T333861)
[09:30:21] I added the spawn method in the patch above --^
[09:32:27] afaiu if a subprocess fails, all subprocesses fail due to shared memory. setting spawn ensures a separate python interpreter for each subprocess
[09:32:36] (CR) Elukey: [C: +1] feat: add spawn method for cpu and gpu [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927993 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[09:37:13] (CR) Ilias Sarantopoulos: [C: +2] feat: add spawn method for cpu and gpu [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927993 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[09:37:45] I am restarting bloom-3b with the new image, to see if the warnings are gone
[09:38:22] (Merged) jenkins-bot: feat: add spawn method for cpu and gpu [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927993 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[09:40:40] klausman: o/ I checked https://wikitech.wikimedia.org/w/index.php?title=Machine_Learning%2FLiftWing%2FUsage&diff=2082465&oldid=2072695, but I think we'd need to add a little more generic context about tokens and how to change their tier/class
[09:40:45] * isaranto sighs as network issues are finally resolved
[09:40:55] for example, if I am a bot owner what should I do? And if I am WME?
[09:41:12] otherwise we'll need to repeat this info in tasks a lot and people will get confused
[09:41:19] does it make sense?
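For reference on the "spawn" fix above (change 927993): the patch itself isn't quoted in this log, but the general pattern from the PyTorch multiprocessing notes linked earlier is to switch the start method so that each worker process gets a fresh interpreter and initializes CUDA on its own, roughly:

```python
import torch.multiprocessing as mp

# CUDA state cannot be re-initialized in a process created with fork();
# the "spawn" start method launches a brand-new Python interpreter per
# worker, so each one sets up CUDA independently.
if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    # ... start the model server / worker processes here ...
```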
[09:41:23] elukey: will do a writeup
[09:41:47] More like a step-by-step thing and a "what do I do if my use case is X" section
[09:43:58] yep thanks, sounds good
[09:45:05] isaranto: first gotcha of the day
[09:45:06] Warning FailedScheduling 85s (x6 over 8m47s) default-scheduler 0/10 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 8 Insufficient amd.com/gpu
[09:45:19] and it makes sense, we already have two pods using the gpu
[09:45:46] but if you want to create a new one that uses the gpu as well, it is a problem
[09:46:03] ack
[09:46:13] this is a bit of a problem though
[09:46:21] how can we deploy new versions then?
[09:46:26] if all the gpus are booked
[09:46:36] hmm
[09:47:08] I am going to remove a gpu tag to unblock 3b
[09:47:14] (removing it from 560)
[09:47:27] so far we'd need to keep a GPU free for any deployment, that is a little crazy
[09:47:34] maybe we can ask kserve upstream
[09:48:04] this means that the previous deployment should be shut down before the new one is deployed
[09:49:02] right? I don't see any other way to overcome this for now..
[09:49:07] yeah but then the traffic that hits the service will be dropped
[09:49:14] it is like an impactful deployment
[09:51:21] we could leave one GPU free for any deployment, but it is a big waste
[09:51:29] ofc, I mean just for now until we figure out how to do multi-model serving
[09:51:54] even with multi-model serving we'll have the same issue
[09:52:51] ah yes
[09:52:59] because if pods use all the gpus, we cannot spawn new ones etc..
[09:54:04] this probably happens with nvidia GPUs as well, surely it is not only our problem
[09:54:18] well if people run on AWS or similar they probably don't care
[09:54:44] the joy of running on bare metal
[10:03:00] Or they have so many GPUs that having 1-2 idling is fine
[10:04:01] isaranto: 3b is crashlooping again sigh
[10:04:11] yes I saw
[10:10:34] everything seems related to ephemeral storage. do we need to set ephemeral storage in all these pods? (istio, kserve etc)
[10:12:17] it is weird because sometimes it works and sometimes it doesn't
[10:12:26] I am trying to see the current limits
[10:13:01] because IIUC kserve also sets some ephemeral storage values
[10:13:46] but the error msg says zero, which is not possible
[10:14:03] we wouldn't see any logs in any pod if it was like that, in theory
[10:14:17] nor use the storage initializer
[10:15:07] is the currently deployed model server using the "spawn" patch?
[10:15:19] mmm I think it isn't
[10:15:49] applying it
[10:18:29] also the ephemeral storage problem comes only with pods marked as ContainerStatusUnknown
[10:18:52] I think for now we can focus on a smaller deployment (bloom-560m) where we know that we don't have any memory issues.
[10:19:29] there seems to be high memory usage when generating samples, which could be causing issues https://phabricator.wikimedia.org/T333861#8895646
[10:26:49] isaranto: interesting, maybe we can double 3b's memory availability and see how it goes
[10:27:03] just to understand if that is the issue
[10:29:47] {{done}}
[10:30:29] it would be good to know this since we have 128G of ram on every ml-serve node
[10:30:32] maybe we need more
[10:33:17] ok I am seeing different errors now
[10:34:08] maybe we need more
[10:34:43] err sorry
[10:34:55] I am a little worried about our forecast for node memory
[10:35:12] we ordered nodes with 128G of ram but given what we see they are not enough
[10:35:14] klausman: --^
[10:35:29] we should ask if we can get at least 256
[10:35:43] going afk for lunch! ttl
[10:35:47] ok!
[10:36:05] now it is not showing logs again, but before it failed I got this stack trace https://phabricator.wikimedia.org/P49067
[10:36:33] the new pod is still not up though
[10:36:38] bloom-3b-predictor-default-00008-deployment-5c68ffff-pssnq 2/3 Running 0 6m42s
[10:38:15] ack
[10:39:41] ok I see the same issues with ephemeral storage
[10:41:04] (PS4) AikoChou: revert-risk: handle unsupported edit types for wikidata model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/924912 (https://phabricator.wikimedia.org/T333125)
[10:49:14] elukey: Phew. I was not aware that these models were that RAM hungry. VRAM, I get, but host RAM?
[10:52:59] (CR) AikoChou: [C: +2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/924912 (https://phabricator.wikimedia.org/T333125) (owner: AikoChou)
[10:59:50] (Merged) jenkins-bot: revert-risk: handle unsupported edit types for wikidata model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/924912 (https://phabricator.wikimedia.org/T333125) (owner: AikoChou)
[11:14:15] I have switched to working on the ores-legacy app for the moment
[11:41:54] * isaranto afk lunch time
[12:06:40] <- late lunch and errands
[12:33:40] Machine-Learning-Team, Continuous-Integration-Infrastructure, Release-Engineering-Team: Python torch fills disk of CI Jenkins instances - https://phabricator.wikimedia.org/T338317 (hashar)
[12:33:50] Machine-Learning-Team, Continuous-Integration-Infrastructure, Release-Engineering-Team: Python torch fills disk of CI Jenkins instances - https://phabricator.wikimedia.org/T338317 (hashar) Looking at the diff overlay which is at `/var/lib/docker/overlay2/yzaiei2cl172qsj37gazfomm2/diff` gives fun: 1...
[12:43:18] Machine-Learning-Team, Continuous-Integration-Infrastructure, Release-Engineering-Team: Python torch fills disk of CI Jenkins instances - https://phabricator.wikimedia.org/T338317 (hashar) The `Build Cache` is for Buildkit which is "hidden" from regular docker but actable on via `docker buildx`. Fro...
[12:56:51] Machine-Learning-Team, Continuous-Integration-Infrastructure, Release-Engineering-Team: Python torch fills disk of CI Jenkins instances - https://phabricator.wikimedia.org/T338317 (isarantopoulos) Some info related to the above change: We have switched to this specific pytorch build (the one defined...
[13:02:55] isaranto: how urgent is change 927620 (bloom3b GPU)? I can +2 now if needed, but maybe Luca wants to take a peek, too
[13:04:02] klausman: not urgent at all, I'll mark it as WIP as luca is testing it manually at the moment
[13:04:10] Alright
[13:04:18] I should have done it earlier
[13:09:45] no worries
[13:17:41] klausman: the main issue with those models, IIUC, is that they need to load themselves into memory and it is a very expensive step.. they probably consume less once the bootstrap is done
[13:17:52] the same was happening with the content translation models IIRC
[13:21:09] Machine-Learning-Team, Patch-For-Review: Host open source LLM (bloom, etc.) on Lift Wing - https://phabricator.wikimedia.org/T333861 (elukey) We have various errors at the moment, but this one seems to be the issue when bootstrapping bloom-3b: ` Traceback (most recent call last): File "/srv/bloom/model-serv...
[13:31:59] there is also an AMD inference server https://kserve.github.io/website/0.10/modelserving/v1beta1/amd/ but I see that it poses restrictions on the pytorch version (at least the CPU one, which uses ZenDNN and supports torch only up to 1.12, whereas we have started using 2.0)
[13:32:41] Machine-Learning-Team, Continuous-Integration-Infrastructure, Release-Engineering-Team: Python torch fills disk of CI Jenkins instances - https://phabricator.wikimedia.org/T338317 (hashar) Open→Resolved a: hashar I do not know how large the layer was before that change installing pytorch f...
[13:43:38] (PS1) Ilias Sarantopoulos: ores-legacy: return features in response [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/928055 (https://phabricator.wikimedia.org/T330414)
[13:44:18] folks I added some info to https://wikitech.wikimedia.org/wiki/ORES about the deprecation of revscoring models
[13:44:31] I tried to add what Diego mentioned in the WME slack thread
[13:44:32] (PS2) Ilias Sarantopoulos: ores-legacy: return features in response [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/928055 (https://phabricator.wikimedia.org/T330414)
[13:46:29] Machine-Learning-Team, ORES, Documentation, User-AKlapper: Update docs that ORES will be replaced by Lift Wing - https://phabricator.wikimedia.org/T305963 (elukey) Added some info about revscoring models being deprecated in https://wikitech.wikimedia.org/wiki/ORES
[13:47:24] elukey: I wonder if some kind of streamed loading could be done, i.e. loading the models in chunks/steps
[13:50:23] no idea
[14:01:54] sorry, joining the meeting, okta sigh
[14:57:32] isaranto: removed the gpu from 3b
[14:59:04] ack
[14:59:24] (CR) Elukey: [C: +1] ores-legacy: return features in response [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/928055 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[15:00:24] I suggest the following:
[15:00:24] deploy 560m with and without GPU as we discussed; after we make sure it works, we can go and use the image with only the CPU version of pytorch to check if that would work
[15:00:59] lemme know if you have a different plan in mind
[15:04:02] we can do it, yes, I was eager to see if we could have avoided the extra big image
[15:04:46] I mean, I am almost positive that they don't maintain a separate codebase for torch, it is just that they add the huge list of libraries
[15:05:02] I'm not going to build any other image, just the one we built earlier
[15:05:15] yes it makes sense
[15:07:17] isaranto: do you want to update https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/927620/1/helmfile.d/ml-services/experimental/values.yaml with 560 or should I?
[15:07:44] I'm doing that now
[15:07:50] actually I did it in a new patch
[15:08:01] ack!
[15:10:45] done https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/928076
[15:10:56] free to go
[15:14:23] just synced it
[15:17:19] ok I am getting the same stack trace..
[15:19:44] and perhaps I need to use a different inference-service name
[15:29:17] isaranto: mmm weird, it seems related to the image though
[15:30:18] shall we deploy with the following env var? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/928085
[15:32:10] it is definitely related to spawn and multiprocessing, but at the moment I can't think of anything out of the box that would fix it
[15:34:01] added it, let's see if it gives us any info
[15:34:55] mmm a ton of Evicted pods, isaranto
[15:35:07] it may generate too much spam
[15:36:07] my bad, I applied it to the wrong pod
[15:36:08] let's see
[15:36:13] but I think it will be the same
[15:42:56] it is the same...
[15:44:20] can u also rename the model_name to bloom-560m-gpu? it won't fix the issue, but the current model is unreachable otherwise
[15:46:46] sure, done
[15:47:56] grazie (thanks)
[15:49:08] weird that bloom-3b is still failing
[15:53:40] hmm the new pod is stuck in Pending as there seem to be 2 other pods for the same model that use a gpu
[15:54:46] I am going afk for the day folks, cu tomorrow
[15:54:58] o/
[15:56:56] going afk as well, will not be able to join the kserve community meeting
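On the "streamed loading" idea raised above: assuming the bloom model server loads its checkpoint via Hugging Face transformers (the actual loading code isn't shown in this log), one possible way to trim peak host RAM at bootstrap is to materialize the weights shard by shard and keep them in half precision; a hedged sketch, where the model id and dtype are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM

# low_cpu_mem_usage loads checkpoint shards into the model incrementally
# instead of first building a full fp32 copy of all weights in memory,
# and float16 halves the footprint of the weights that are kept around.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-3b",   # assumption: the 3b checkpoint from the HF hub
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
```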