[07:11:08] (CR) Kevin Bazira: [C:+1] Rename entrypoint.sh to ci_entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012701 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[07:51:49] Mooorning o/
[08:51:59] (CR) Ilias Sarantopoulos: [C:+1] Rename entrypoint.sh to ci_entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012701 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[09:11:14] (CR) Klausman: "I agree with Ilias regarding the returned error code. I think we should use one of 400 or 422 (Unprocessable Content, see https://en.wikip" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011305 (https://phabricator.wikimedia.org/T351278) (owner: AikoChou)
[09:43:30] Machine-Learning-Team, Structured-Data-Backlog: Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9645058 (kevinbazira) Thank you for providing more context @mfossati. I shared this information with the team, and they have a few more questions to clarify the imple...
[09:52:11] (PS14) Ilias Sarantopoulos: huggingface: add huggingface image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986)
[09:59:33] (CR) Ilias Sarantopoulos: [C:+1] Set most of the model servers to run a specific entrypoint.sh (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[10:00:21] morning!
[10:12:31] morning aiko ! I fixed the hf image by adding the model_dir argument and it works with a local model
[10:28:54] cool let me try it
[10:48:01] Morning!
[10:49:40] o/
[10:50:15] (CR) Klausman: [C:+1] Rename entrypoint.sh to ci_entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012701 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[11:15:48] (CR) Ilias Sarantopoulos: "I tested the patch and it works great! We just need to correct some of the instructions in the README and we're good to go" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012647 (https://phabricator.wikimedia.org/T360177) (owner: Kevin Bazira)
[11:33:02] * klausman lunch
[12:07:49] Machine-Learning-Team: Add pyopencl requirements to images that use resource_utils - https://phabricator.wikimedia.org/T360212#9645447 (isarantopoulos)
[12:09:32] (PS1) Ilias Sarantopoulos: fix: install pyopencl in llm and article-desc [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1013036 (https://phabricator.wikimedia.org/T360212)
[12:17:41] * isaranto lunch!
[13:15:50] (PS3) Kevin Bazira: articletopic-outlink: run model server as python module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012647 (https://phabricator.wikimedia.org/T360177)
[13:21:28] (CR) Kevin Bazira: "Thank you for catching these. I have fixed them." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012647 (https://phabricator.wikimedia.org/T360177) (owner: Kevin Bazira)
[13:37:52] hello folks!
[13:41:41] hey Luca!
[13:43:23] (CR) Elukey: [C:+2] Rename entrypoint.sh to ci_entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012701 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[13:48:26] (CR) Ilias Sarantopoulos: [C:+1] "Cool!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012647 (https://phabricator.wikimedia.org/T360177) (owner: Kevin Bazira)
[13:48:54] morning all
[13:48:56] o/
[13:49:17] I'm losing my voice or something, but I'll be in our meeting
[13:49:24] \o
[13:56:56] (CR) Kevin Bazira: [C:+2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012647 (https://phabricator.wikimedia.org/T360177) (owner: Kevin Bazira)
[13:58:40] (Merged) jenkins-bot: Rename entrypoint.sh to ci_entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012701 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[14:18:20] (PS4) Kevin Bazira: articletopic-outlink: run model server as python module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012647 (https://phabricator.wikimedia.org/T360177)
[14:20:48] (CR) Kevin Bazira: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012647 (https://phabricator.wikimedia.org/T360177) (owner: Kevin Bazira)
[14:25:38] (Merged) jenkins-bot: articletopic-outlink: run model server as python module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012647 (https://phabricator.wikimedia.org/T360177) (owner: Kevin Bazira)
[15:00:30] (PS3) Elukey: Set most of the model servers to run a specific entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111)
[15:01:35] (CR) Elukey: "Ran a couple of tests with `docker run --cpus 2 etc..`:" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[15:01:45] folks I ran a couple of tests for --^
[15:01:53] all seems good, fine if I go ahead and +2?
[15:01:58] any final thought/doubt/barrier?
[15:04:02] (CR) AikoChou: [C:+1] Set most of the model servers to run a specific entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[15:05:31] thanks :)
[15:05:38] isaranto, klausman, kevinbazira - ok to merge?
[15:05:50] sgtm
[15:05:55] I will update images and deploy them etc.. during the next days
[15:07:27] yes go ahead!
[15:07:37] we can split the deployments
[15:08:48] super :)
[15:12:26] elukey: the istio yaml change is back to WIP, I need to figure out the exact meaning of some of the fields. Unfortunately, the docs have basically no examples for a "dumb TCP" kinda service.
[15:14:20] klausman: ack sure, anything that I can help with?
[15:14:46] probably, I want to try a few things, and I'll probably have some more concrete questions tomorrow, if that suits you
[15:14:59] sure!
[15:17:10] (CR) Klausman: [C:+1] Set most of the model servers to run a specific entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[15:17:38] (CR) Elukey: [C:+2] Set most of the model servers to run a specific entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[15:17:53] (PS15) Ilias Sarantopoulos: huggingface: add huggingface image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986)
[15:20:19] Machine-Learning-Team, Patch-For-Review: Set automatically libomp's num threads when using Pytorch - https://phabricator.wikimedia.org/T360111#9646036 (elukey) We decided to go for https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1012711 The idea is to source common env var...
[15:20:51] Machine-Learning-Team, Patch-For-Review: Set automatically libomp's num threads when using Pytorch - https://phabricator.wikimedia.org/T360111#9646037 (elukey) Next steps: * Deploy the new images to staging and verify that everything works as expected. * Rollout to prod.
[15:28:14] (CR) CI reject: [V:-1] Set most of the model servers to run a specific entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[15:28:57] Could not install packages due to an OSError: [Errno 28] No space left on device
[15:29:00] sigh
[15:31:22] ouch
[15:33:10] isaranto: I'm testing the hf image
[15:33:32] :drumroll
[15:39:16] I still got an error, but I found kserve has fixed the issue in a commit last week https://github.com/kserve/kserve/commit/3a11f5050bae54c0e2cc160f74230df3e543056e
[15:39:28] the issue was https://github.com/kserve/kserve/issues/3423
[15:41:21] I tested with the commit and it works and it didn't need model_dir to be added
[15:42:49] (CR) Elukey: Set most of the model servers to run a specific entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[15:42:54] (CR) Elukey: [C:+2] Set most of the model servers to run a specific entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[15:44:06] aiko: what issue did you get? because I added the --model_dir cmd argument in the docker entrypoint and I tested it this morning
[15:46:08] https://phabricator.wikimedia.org/P58825 weird issue happened in the HuggingfaceModelRepository
[15:50:05] is this with bloom or bert-base-uncased?
[15:50:14] I looked at the kserve code, and I was confused that I didn't get this error msg https://github.com/kserve/kserve/blob/master/python/huggingfaceserver/huggingfaceserver/__main__.py#L72
[15:50:16] I remember having the same issue with the latter
[15:51:29] and I found that part is different from release-0.12 https://github.com/kserve/kserve/blob/release-0.12/python/huggingfaceserver/huggingfaceserver/__main__.py
[15:53:05] that is with bert-base-uncased, I didn't try with bloom
[15:54:40] (CR) CI reject: [V:-1] Set most of the model servers to run a specific entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[15:55:07] nope
[15:58:43] ok, I see several fixes on upstream so I'll clone the latest commit of the repo
[15:59:20] isaranto: I just tried bloom and I got the same error
[16:00:38] ¯\_(ツ)_/¯
[16:03:50] (PS2) Ilias Sarantopoulos: fix: install pyopencl in llm and article-desc [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1013036 (https://phabricator.wikimedia.org/T360212)
[16:04:10] yeah I tried the latest commit and the error was gone 0.0
[16:05:15] I'll use the latest commit for now since it has 2 more fixes. sorry about that. Cliche but honestly it worked for me this morning
[16:05:54] https://www.reddit.com/r/ProgrammerHumor/comments/70we66/it_works_on_my_machine/?rdt=63139 😆
[16:07:28] so true 🤣
[16:15:04] (CR) CI reject: [V:-1] fix: install pyopencl in llm and article-desc [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1013036 (https://phabricator.wikimedia.org/T360212) (owner: Ilias Sarantopoulos)
[16:16:54] hmm same here for CI ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device:
[16:17:08] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1013036 (https://phabricator.wikimedia.org/T360212) (owner: Ilias Sarantopoulos)
[16:21:44] (PS16) Ilias Sarantopoulos: huggingface: add huggingface image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986)
[16:26:46] didn't manage to do much with the load testing for article-descriptions. more tomorrow!
[16:26:54] going afk folks, have a nice rest of day!
[16:28:02] bye Ilias! have a nice evening
[16:34:29] (CR) Elukey: Set most of the model servers to run a specific entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[16:34:35] (CR) Elukey: [C:+2] Set most of the model servers to run a specific entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[16:48:13] Machine-Learning-Team, Continuous-Integration-Infrastructure, Release-Engineering-Team: Python torch fills disk of CI Jenkins instances - https://phabricator.wikimedia.org/T338317#9646494 (hashar) From what I remember about PipelineLib, the idea was to keep some kind of layers caches to speed up futu...
[16:51:17] (CR) Elukey: [C:+1] fix: install pyopencl in llm and article-desc [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1013036 (https://phabricator.wikimedia.org/T360212) (owner: Ilias Sarantopoulos)
[16:52:47] (Merged) jenkins-bot: Set most of the model servers to run a specific entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[16:56:27] (CR) AikoChou: [C:+1] fix: install pyopencl in llm and article-desc [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1013036 (https://phabricator.wikimedia.org/T360212) (owner: Ilias Sarantopoulos)
[17:07:53] mmm IIRC article-descriptions needed OMP_NUM_THREADS no?
[17:08:11] I don't find any reference of it in deployment-charts
[17:09:27] ah wow it is in the model server's code
[17:09:55] in this case, torch.set_num_threads seems to work
[17:10:50] yep was going to share it: https://github.com/wikimedia/machinelearning-liftwing-inference-services/blame/main/article_descriptions/model_server/utils.py#L79
[17:11:29] kevinbazira: thanks! From now on we'll always set OMP_NUM_THREADS so it should be ok
[17:11:36] I'll file a change to update the Docker image
[17:11:38] so we can re-test it
[17:11:44] okok
[17:27:36] logging off folks!
[17:27:42] have a nice rest of the day
[18:00:17] night elukey
[20:15:18] Machine-Learning-Team, Research: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455#9647326 (Isaac)
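Note on the 17:07-17:11 exchange: the sketch below illustrates one way a model server could derive its thread count from OMP_NUM_THREADS before calling torch.set_num_threads. It is not the code in article_descriptions/model_server/utils.py; the helper names `resolve_num_threads` and `apply_to_torch`, and the fallback to `os.cpu_count()`, are assumptions made for illustration only.

```python
import os


def resolve_num_threads(env=None, default=None):
    """Pick a thread count (hypothetical helper, not from the repo).

    Order of preference: a positive integer in OMP_NUM_THREADS,
    then the caller's default, then the CPU count the OS reports.
    """
    env = os.environ if env is None else env
    raw = env.get("OMP_NUM_THREADS", "").strip()
    try:
        value = int(raw)
    except ValueError:
        value = 0  # unset or not a number: fall through to the defaults
    if value > 0:
        return value
    if default is not None:
        return default
    return os.cpu_count() or 1


def apply_to_torch(num_threads):
    """Call torch.set_num_threads when torch is importable.

    Returns True if the setting was applied, False if torch is missing.
    """
    try:
        import torch
    except ImportError:
        return False
    torch.set_num_threads(num_threads)
    return True
```

A server could call `apply_to_torch(resolve_num_threads())` once at startup. `os.cpu_count()` reports the host's CPUs rather than a container CPU quota such as `docker run --cpus 2`, which is one reason an explicit OMP_NUM_THREADS is the safer input.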