[06:59:07] Good morning o/ [07:27:13] (03PS1) 10Ilias Sarantopoulos: readability: bump catboost to 1.2.3 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1008766 (https://phabricator.wikimedia.org/T353461) [08:12:20] * isaranto afk be back in 1h [08:47:45] morning o/ [09:26:36] Hey Aiko! [09:50:54] (03CR) 10Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1008766 (https://phabricator.wikimedia.org/T353461) (owner: 10Ilias Sarantopoulos) [10:47:38] Morning! [11:02:15] o/ Tobias! [11:41:43] (03PS1) 10AikoChou: revertrisk-batch: add env var CLASSIFIER_BATCH_SIZE to batch model [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1008837 (https://phabricator.wikimedia.org/T355656) [11:42:43] * isaranto lunch! [11:46:42] kevinbazira: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1008840 One minor tweak to the APIGW config change [11:50:13] 10Lift-Wing, 06Machine-Learning-Team: Discuss caching strategies for Lift Wing - https://phabricator.wikimedia.org/T349180#9600287 (10klausman) 05Open→03Resolved [11:50:15] 06Machine-Learning-Team, 05Goal: Goal: Decide on an optional Lift Wing caching strategy for model servers - https://phabricator.wikimedia.org/T348155#9600288 (10klausman) [11:50:28] 06Machine-Learning-Team: Set SLO for the recommendation-api-ng service hosted on LiftWing - https://phabricator.wikimedia.org/T347262#9600289 (10klausman) 05Open→03Resolved [11:50:31] 06Machine-Learning-Team: Deploy the recommendation-api-ng on LiftWing - https://phabricator.wikimedia.org/T347015#9600290 (10klausman) [11:51:02] 06Machine-Learning-Team, 06SRE, 13Patch-For-Review: Requesting write access to ml-staging-codfw for ML team - https://phabricator.wikimedia.org/T354516#9600294 (10klausman) 05Open→03Resolved [11:51:05] 06Machine-Learning-Team, 05Goal: Goal: Inference Optimization for Hugging face/Pytorch models - 
https://phabricator.wikimedia.org/T353337#9600295 (10klausman) [11:51:24] https://github.com/kserve/kserve/pull/3374 PR for pydantic v2 in kserve is merged! \o/ [11:51:45] Yay! [11:52:22] I would have preferred it to be in 0.12, but oh well. [11:54:07] that needs some time, but at least we can test it now! [12:01:17] 06Machine-Learning-Team, 13Patch-For-Review: Create external endpoint for article-descriptions isvc hosted on LiftWing - https://phabricator.wikimedia.org/T358654#9600368 (10klausman) And the external endpoint is live: `lang=json $ curl -s "https://api.wikimedia.org/service/lw/inference/v1/models/article-desc... [12:01:32] kevinbazira: external access to art-desc is working now! [12:22:15] great! [12:22:36] too bad that pydantic v2 isn't in 0.12 but it was expected :( [12:23:23] we can use the current commit for rr-ml/knowledge-integrity [12:28:05] (03CR) 10AikoChou: [C: 03+1] readability: bump catboost to 1.2.3 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1008766 (https://phabricator.wikimedia.org/T353461) (owner: 10Ilias Sarantopoulos) [12:58:14] * klausman late lunch [13:12:04] (03PS1) 10AikoChou: revertrisk: add env var USE_GPU [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045) [13:40:42] klausman: thanks! the external endpoint for article-descriptions isvc works like a charm :) [13:45:43] isaranto: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1008514 just saw that Luca had a patch for catboost yesterday, maybe you missed it XD [13:46:27] (03CR) 10Ilias Sarantopoulos: [C: 03+1] readability: update the catboost version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1008514 (https://phabricator.wikimedia.org/T353461) (owner: 10Elukey) [13:46:39] definitely missed it! 
[13:46:42] thanks [13:47:40] aiko: I'm currently reviewing https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1008858 (rrml-gpu) and wanted to discuss it further [13:48:43] Although this would work, it involves modifying the class as well as doing an import inline. [13:48:43] In order to avoid adding an if/else inside the class as well as an environment variable we could do one of the following: [13:48:43] - add a rrml_utils.py file (that way we would have the import torch in that file only and not inline) [13:48:43] - Alternatively and even better I would create an RRMLGPU object that inherits the same class but loads the model on GPU. Then we would check for the GPU in the main file [13:49:36] BUT I don't have a clear suggestion as in the latter case we still need to detect the GPU and not all model servers will have torch [13:51:21] so perhaps a combination would work. bottom line is that I'd like us to avoid having many environment variables if possible. The issue would be that you can supply the USE_GPU env var but not actually utilize a gpu [13:51:55] if we can't avoid it I'm cool with it just wanted to discuss it though [13:53:58] hello folks! [13:54:17] yes I got your point. both suggestions sound good.. we can discuss it further later in our meeting! 
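[Editor's note] The subclass approach discussed above could look roughly like the following sketch. All class and function names here are hypothetical stand-ins, not the actual inference-services code, and the `/dev/kfd` existence check is just one possible way to detect an AMD GPU on the host:

```python
# Sketch: keep the base model server CPU-only and move GPU handling into a
# subclass, so no USE_GPU env var or inline torch import is needed.
# Names are illustrative, not the real revert-risk code.
import os


class RevertRiskModel:
    """CPU-only base model server (illustrative stand-in, not the real class)."""

    def __init__(self, model_path: str):
        self.model_path = model_path
        self.device = "cpu"
        self.load()

    def load(self):
        # The real server would load the knowledge-integrity model here;
        # note there is no torch import and no USE_GPU env var in this class.
        self.model = f"{self.model_path} loaded on {self.device}"


class RevertRiskModelGPU(RevertRiskModel):
    """GPU variant: same interface, only the device selection differs."""

    def load(self):
        # ROCm builds of torch expose the GPU through the "cuda" device name.
        self.device = "cuda"
        self.model = f"{self.model_path} loaded on {self.device}"


def make_model(model_path: str) -> RevertRiskModel:
    # GPU detection lives in the entrypoint, not inside the model class.
    # Checking /dev/kfd (the ROCm compute device node) is one hypothetical
    # way to spot an AMD GPU without importing torch.
    if os.path.exists("/dev/kfd"):
        return RevertRiskModelGPU(model_path)
    return RevertRiskModel(model_path)
```

This keeps the if/else out of the model class: only the factory in the main file decides which variant to instantiate.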
[13:54:37] hi Luca o/ [13:56:30] (03CR) 10Elukey: [C: 03+1] readability: bump catboost to 1.2.3 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1008766 (https://phabricator.wikimedia.org/T353461) (owner: 10Ilias Sarantopoulos) [13:56:38] (03CR) 10Kevin Bazira: [C: 03+1] revertrisk-batch: add env var CLASSIFIER_BATCH_SIZE to batch model [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1008837 (https://phabricator.wikimedia.org/T355656) (owner: 10AikoChou) [13:56:54] (03Abandoned) 10Elukey: readability: update the catboost version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1008514 (https://phabricator.wikimedia.org/T353461) (owner: 10Elukey) [13:58:12] elukey: o/ sorry I missed your patch! [13:58:23] shouldn't have abandoned yours [13:59:21] (03PS2) 10Ilias Sarantopoulos: readability: bump catboost to 1.2.3 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1008766 (https://phabricator.wikimedia.org/T353461) [13:59:33] nono it is fine! [13:59:36] well, now I copied your commit msg (mine was empty) [14:02:33] (03CR) 10Ilias Sarantopoulos: [V: 03+2 C: 03+2] readability: bump catboost to 1.2.3 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1008766 (https://phabricator.wikimedia.org/T353461) (owner: 10Ilias Sarantopoulos) [14:03:28] 06Machine-Learning-Team: Investigate InfServiceHighMemoryUsage for article-descriptions - https://phabricator.wikimedia.org/T358742#9601035 (10klausman) This was indeed caused by using the wrong metric. We have chosen to move to using the existing k8s alerts. [14:05:32] 06Machine-Learning-Team: Move the article-descriptions model server from staging to production - https://phabricator.wikimedia.org/T358467#9601040 (10klausman) >>! In T358467#9582988, @kevinbazira wrote: > The article-descriptions model server was firing `InfServiceHighMemoryUsage` alerts. 
This usually happens w... [14:13:08] elukey: I reviewed the kserve upgrade guide, it is nice. Although I need to try it out to be sure [14:14:03] I was also thinking about sth we discussed a while ago: to use the kserve chart as a base and add our custom things in a values file. not sure if it would work but I'd like to try it [14:19:58] Good morning all [14:23:00] isaranto: definitely, we could try the chart, but remember that we'll need to add customization to it as well (we'll not be able to use it as vanilla) [14:23:17] or we could work with upstream to be able to add our own custom parameters (like annotations, etc..) [14:23:24] and that may be the way to go [14:23:28] but it requires some time [14:27:23] aiko: o/ [14:27:27] not sure if you saw https://phabricator.wikimedia.org/T359067 [14:27:39] lemme know if I missed anything (I didn't forget :D) [14:28:33] when you have a moment do you mind showing me the output of "docker history --no-trunc $id-of-the-rr-ml-image-with-gpu-stuff" ? [14:28:39] in a phab paste for example [14:28:45] just to understand the layers [14:29:09] klausman: o/ ok if I upgrade kserve on ml-serve-eqiad? [14:30:00] o/ Chris! [14:30:37] regarding kserve chart: I think we can just override some values (but I have to check, not sure if it is possible) [14:31:11] there are some custom annotations etc.. that we cannot override iirc [14:31:21] ack [14:31:31] and there is a container that needs to be removed (to proxy port 8443 for metrics etc..) [14:31:40] nothing big but some work needs to be done [14:31:43] elukey: go ahead [14:31:51] I can try to see if I can send a pull request to upstream [14:32:02] klausman: ack! [14:33:41] also aiko and I were just discussing GPU detection and have the following question: has anybody used a python package that would detect if an AMD GPU exists? 
[14:34:24] can't think of one, maybe we could add a custom class that checks some rocm status [14:35:13] ml-serve-eqiad upgraded [14:35:24] tried to kill a pod and the new one spins up fine [14:35:43] doing ml-serve-codfw as well [14:37:29] 06Machine-Learning-Team: Update to KServe 0.11 - https://phabricator.wikimedia.org/T337213#9601202 (10elukey) [14:38:28] aand done [14:38:38] 06Machine-Learning-Team: Update to KServe 0.11 - https://phabricator.wikimedia.org/T337213#9601206 (10elukey) New control plane deployed to ml-serve-{eqiad,codfw}, tested if killing a pod worked, all good. [14:39:09] what is the procedure now? Should I move the task to "Done" and resolve it? [14:40:56] 06Machine-Learning-Team, 06DC-Ops, 06SRE, 10ops-codfw: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9601214 (10elukey) a:05klausman→03None [14:41:33] 06Machine-Learning-Team, 06DC-Ops, 06SRE, 10ops-codfw: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9601234 (10elukey) Removed Tobias as assignee so the new node can be initialized. [14:43:22] iirc move them to Done for a couple of days and resolve them in the team meeting so they don't get lost [14:43:38] ack! [14:43:52] when the column is smaller we can just resolve them directly [14:44:04] (no strict procedure though) [14:56:28] elukey: thank u <3 I'll read it and reply in the task! [14:58:13] aiko: ack! When you have a moment can you also pass me the docker output that I wrote above? [14:58:57] (I assume that you are able to build the rr-ml-gpu image locally but CI can't upload it to the registry) [14:59:08] just to understand how big one layer is [15:01:25] elukey: yes here it is https://phabricator.wikimedia.org/P58479 [15:01:29] <3 [15:02:14] ahahah wow 10.7GB [15:02:41] this one is around 10G. 
the one I reported last week (~4.5G) was not correct [15:03:25] okok I'll update the task, I fear that we'll need to force nginx to use disk space rather than memory (via tmpfs) [15:03:40] we can't ask serviceops to raise memory to 10+GB just for this [15:17:55] 06Machine-Learning-Team: Set SLO for the article-descriptions isvc hosted on LiftWing - https://phabricator.wikimedia.org/T358655#9601424 (10klausman) a:03klausman [15:18:08] 06Machine-Learning-Team: Investigate InfServiceHighMemoryUsage for article-descriptions - https://phabricator.wikimedia.org/T358742#9601425 (10klausman) a:03klausman [15:19:45] 06Machine-Learning-Team, 05Goal: Lift Wing Python Package - https://phabricator.wikimedia.org/T359140#9601427 (10calbon) a:03Mercelisvaughan [15:20:25] 06Machine-Learning-Team, 05Goal: Q4: Lift Wing Python Package - https://phabricator.wikimedia.org/T359140#9601433 (10calbon) [15:26:44] 06Machine-Learning-Team, 07Documentation, 07Software-Licensing: Add Licensing and Open Source requirement/strong preference to Lift Wing model deployment documentations - https://phabricator.wikimedia.org/T359066#9601467 (10calbon) [15:30:07] 06Machine-Learning-Team: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images - https://phabricator.wikimedia.org/T359067#9601483 (10calbon) a:03elukey [15:38:10] 06Machine-Learning-Team: Update to KServe 0.11 - https://phabricator.wikimedia.org/T337213#9601518 (10calbon) 05Open→03Resolved [15:43:41] 06Machine-Learning-Team, 10ORES: Inconsistent data type for articlequality score predictions on ptwiki - https://phabricator.wikimedia.org/T358953#9601534 (10calbon) [15:44:34] 06Machine-Learning-Team: Investigate why WikiGPT returns an Internal Server Error - https://phabricator.wikimedia.org/T358842#9601548 (10calbon) 05In progress→03Resolved [15:44:42] 06Machine-Learning-Team, 07Epic: WikiGPT Experiment - https://phabricator.wikimedia.org/T328494#9601549 (10calbon) [15:48:48] 
06Machine-Learning-Team: Prep work for (re)training workflow sprint - https://phabricator.wikimedia.org/T358748#9601590 (10calbon) [15:52:51] 06Machine-Learning-Team: Investigate InfServiceHighMemoryUsage for article-descriptions - https://phabricator.wikimedia.org/T358742#9601630 (10klausman) 05Open→03Resolved [15:56:22] 06Machine-Learning-Team, 06Structured-Data-Backlog: Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9601667 (10calbon) a:03kevinbazira [16:01:03] aiko: (if you have a moment) - did you build the rr-ml docker image in a special way to get the 10G layer? [16:01:11] because I tried and I see ~5G [16:01:27] with [16:01:29] DOCKER_BUILDKIT=1 docker build --target production -f .pipeline/revertrisk/multilingual.yaml --platform=linux/amd64 . -t rr-ml [16:02:42] 06Machine-Learning-Team: Move the article-descriptions model server from staging to production - https://phabricator.wikimedia.org/T358467#9601712 (10klausman) [16:03:06] 06Machine-Learning-Team, 13Patch-For-Review: Create external endpoint for article-descriptions isvc hosted on LiftWing - https://phabricator.wikimedia.org/T358654#9601709 (10klausman) 05Open→03Resolved [16:03:13] I also get a ton of nvidia stuff deployed :( [16:04:00] I used docker buildx build --target production -f .pipeline/revertrisk/multilingual.yaml --platform=linux/amd64 . 
-t revertrisk-ml-gpu:2 [16:05:18] I always build images using this command [16:05:21] :(( [16:06:02] mmm weird, should be the same [16:06:06] and I have your new commit [16:06:19] can you try "docker run --rm -it --entrypoint /bin/bash revertrisk-ml-gpu:2" [16:06:29] and then once you have the shell [16:06:58] du -hs /opt/lib/python/site-packages/* | sort -h [16:07:07] | tail -n 10 say [16:07:38] yep [16:07:45] 52M /opt/lib/python/site-packages/sympy [16:07:45] 65M /opt/lib/python/site-packages/pandas [16:07:45] 68M /opt/lib/python/site-packages/transformers [16:07:45] 72M /opt/lib/python/site-packages/cmake [16:07:45] 98M /opt/lib/python/site-packages/scipy [16:07:46] 144M /opt/lib/python/site-packages/ray [16:07:46] 153M /opt/lib/python/site-packages/plotly [16:07:46] 286M /opt/lib/python/site-packages/catboost [16:07:47] 472M /opt/lib/python/site-packages/triton [16:07:47] 8.4G /opt/lib/python/site-packages/torch [16:08:08] ok very different from me, I see nvidia and there isn't anything here [16:08:16] 8.4G for torch, wow [16:08:40] QQ huge [16:08:41] sheeeshus [16:10:49] no idea why I get a different image though [16:12:07] did you git pull latest changes? maybe it was previous version. I also got some nvidia things installed [16:12:33] in theory yes [16:14:56] hmm that's weird [16:17:00] it is probably a weird state in my docker setup [16:17:06] trying to clean everything and restart [16:34:51] aiko: to double check, you don't have any extra change applied to https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1006909/1/revert_risk_model/model_server/multilingual/requirements.txt#3 right ? 
[16:34:55] locally I mean [16:37:00] in my docker build logs I see that it tries to install transformers-4.36.2-py3-none-any.whl [16:37:11] that is fine, afaics from https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/commit/46e1b69bf124006e222c67851379571c695f89c1 [16:37:47] but I am wondering if the tool.poetry.source is not picked up [16:38:30] if you have a min, do you mind trying my command? [16:38:52] if you get the wrong image then we'll know the root cause [16:38:59] sorry but I can't find what's wrong :( [16:40:27] 06Machine-Learning-Team: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images - https://phabricator.wikimedia.org/T359067#9602091 (10akosiaris) Hi, So this is a difficult one to tackle. From what I gather images (and layers) can end up being really large, close to 10GB. I have questi... [16:43:10] no I don't have any change locally that I remember [16:43:20] no problem! let me try your command [16:43:23] <3 [16:51:51] the resulting image size is ~5GB 0.0 [16:53:04] ok right then something is starting to make sense :D [16:53:43] I don't have the buildx extension in my docker (debian's default), or at least can't find it [16:53:44] I don't understand [16:53:57] ohh [16:54:03] maybe buildx does something extra [16:57:20] aiko: can you run the du command that we tried before inside the /opt/lib/python/site-packages/torch [16:57:27] in the 10g docker image version [16:57:45] (maybe use phab's phaste for the output so we'll have it as reference) [16:57:55] in this case don't use tail, let's see the whole thing [16:58:17] or we can add it to the task for rr ml on gpus [16:58:25] ok! [17:01:56] so weird!! [17:02:11] first I got https://phabricator.wikimedia.org/P58515 [17:02:33] then I ran again inside the torch/lib [17:02:43] I got https://phabricator.wikimedia.org/P58516 [17:06:13] ok the sum is ~8G, not so weird [17:29:55] I see improvements in the image I am building that uses python 3.11 and torch 2.1.2 . 
torch lib is now 7.4 GB [17:29:59] 😛 [17:30:43] nice work aiko thanks :) [17:32:04] isaranto: lol [17:38:11] ouch the huggingface image has a layer of 12.3 GB [17:38:23] at least if I am reading the output correctly https://phabricator.wikimedia.org/P58525 [17:40:07] yep :( [17:40:19] is it the COPY third_party? [17:40:33] it really depends what ends up in there [17:42:37] aiko: one thing that I don't understand - /opt/lib/python/site-packages/torch/lib is listed as ~8G, but your du output for it is way less [17:42:57] wondering if there are any hidden files or similar not popping up [17:43:25] it is when it is copying the virtual env, perhaps I can change things in the build instructions and see what we end up with [17:43:49] what is the dockerfile? [17:44:14] I'd check what uses most of the space in the venv, maybe there are a lot of unneeded things [17:45:05] this is the dockerfile https://github.com/isaranto/kserve/blob/kserve-hf-rocm/python/huggingface_server_debian.Dockerfile#L57 [17:45:36] in Aiko's image though both outputs show the same size (~8GB) [17:45:55] (just circling back to the previous discussion) - sorry for the context switch [17:46:51] nono it is the same problem in different sauces :D [17:46:59] ah ok so the sum is 8G? [17:47:06] I thought it was less, good my bad then [17:47:59] the main question mark is if we really need rocblas for example, probably yes [17:48:04] yes the sum of all dirs inside is ~8GB (according to Aiko --^) [17:48:22] ahhh didn't see Aiko's comment, sorryyy [17:48:31] my brain is probably fried :D [17:48:43] anyway, a lot of good data for tomorrow [17:48:58] logging off, have a nice rest of the day folks! [17:54:40] bye luca :) have a nice rest of evening o/ [17:55:34] mine as well! I wanted to ask about docker-pkg but I'll leave it for tomorrow [17:55:55] have a nice rest of day/evening ! [17:55:59] going afk! 
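[Editor's note] The `du -hs /opt/lib/python/site-packages/* | sort -h` checks used throughout this conversation can be approximated with a small stdlib-only Python helper. This is a sketch; the site-packages path in the example is taken from the log and only used if it exists:

```python
# Stdlib-only equivalent of `du -sb` per subdirectory plus `sort -h | tail`,
# for finding the biggest packages inside an image's site-packages.
import os


def dir_size(path: str) -> int:
    """Total size in bytes of all regular files under path (like `du -sb`)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):  # du also skips symlink targets
                total += os.path.getsize(fp)
    return total


def top_dirs(parent: str, n: int = 10):
    """Largest n immediate subdirectories of parent, sorted ascending."""
    sizes = [
        (dir_size(os.path.join(parent, d)), d)
        for d in os.listdir(parent)
        if os.path.isdir(os.path.join(parent, d))
    ]
    return sorted(sizes)[-n:]


if __name__ == "__main__":
    target = "/opt/lib/python/site-packages"  # example path from the log
    if os.path.isdir(target):
        for size, name in top_dirs(target):
            print(f"{size / 2**20:8.1f}M  {name}")
```

Run inside the container (e.g. via `docker run --rm -it --entrypoint /bin/bash <image>`) it produces output comparable to the du listing pasted earlier.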
[17:56:04] bye Ilias :D [19:10:26] 06Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742#9602929 (10achou) The PR for pydantic v2 in kserve has been merged! We can use this commit https://github.com/kserve/kserve/commit/426fe21da0612ea6ef4a116b511427... [19:35:45] I found pyopencl can be used to detect if an AMD GPU exists [19:36:19] an example: https://phabricator.wikimedia.org/P58530
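[Editor's note] A sketch of what pyopencl-based AMD GPU detection might look like. This is illustrative and not necessarily the code in the linked paste; the vendor-string match is an assumption, and the function degrades to False when pyopencl or an OpenCL runtime is missing:

```python
# Best-effort AMD GPU detection via OpenCL platform/device enumeration.
def has_amd_gpu() -> bool:
    """Return True if an AMD GPU is visible through OpenCL, else False."""
    try:
        import pyopencl as cl
    except ImportError:
        # pyopencl not installed in this image.
        return False
    try:
        for platform in cl.get_platforms():
            for device in platform.get_devices(device_type=cl.device_type.GPU):
                # AMD devices typically report this vendor string.
                if "Advanced Micro Devices" in device.vendor:
                    return True
    except cl.Error:
        # No usable OpenCL runtime / ICD on this host.
        pass
    return False


if __name__ == "__main__":
    print(has_amd_gpu())
```

Note this adds pyopencl (and an OpenCL ICD) as an image dependency, so it trades off against the layer-size concerns discussed earlier in the day.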