[07:35:07] o/ Good morning!
[09:00:56] Machine-Learning-Team: Support building and running of langid model-server via Makefile - https://phabricator.wikimedia.org/T357382 (kevinbazira)
[09:02:46] Machine-Learning-Team: Support building and running of langid model-server via Makefile - https://phabricator.wikimedia.org/T357382 (kevinbazira) Open→In progress p:Triage→Medium a:kevinbazira
[09:02:48] Machine-Learning-Team: Add a script for running the Revert Risk model server locally - https://phabricator.wikimedia.org/T352689 (kevinbazira)
[09:05:55] Hello folks!
[09:06:25] I think it was already mentioned in the past but IIUC Hugging Face offers LLM models in ONNX format (https://onnx.ai/)
[09:06:29] looks really nice
[09:14:40] Machine-Learning-Team: Support building and running of langid model-server via Makefile - https://phabricator.wikimedia.org/T357382 (kevinbazira) Trying to build the langid model-server locally throws the error below. This seems to be caused when pip is installing `fasttext==0.9.2` and can't find the `pybind...
[09:19:16] Machine-Learning-Team: Support building and running of langid model-server via Makefile - https://phabricator.wikimedia.org/T357382 (kevinbazira) The error above has been fixed by installing the `wheel` package before installing `fasttext`. The langid requirements.txt that I used has: ` kserve==0.11.2 wheel=...
[09:22:09] Hi Luca!
The kserve-triton server also supports ONNX but I'm not sure if we could use it with a custom model server
[09:24:40] (PS1) Kevin Bazira: langid: fix pybind11 missing issue [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1002424 (https://phabricator.wikimedia.org/T357382)
[09:29:43] isaranto: o/ it should be possible to load it in theory, but not sure if it will improve anything for us
[09:29:50] anyway, good to know :)
[09:30:00] maybe in the future if we create new models we could think about it
[09:34:40] good morning o/
[09:36:20] Morning!
[09:38:42] hey hey
[11:38:34] aiko: fyi I started working on gpu on statbox 1005
[11:39:18] initially I started on statbox08 but somehow torch couldn't find the gpu after a while (although it did at first)
[11:39:27] * isaranto going for lunch
[11:40:09] isaranto: nice!
[11:43:56] (PS3) AikoChou: revertrisk: use GPU for revertrisk-multilingual [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995214 (https://phabricator.wikimedia.org/T356045)
[11:47:56] \o/ Cass POC now transparently handles a Cassandra cluster going away and coming back.
[11:48:10] Basically, connection pooling
[11:50:24] * klausman lunch
[12:02:59] (CR) Kevin Bazira: [C: +1] revertrisk: use GPU for revertrisk-multilingual [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995214 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[12:32:22] great stuff Tobias!
[12:33:56] (PS6) AikoChou: Makefile: add support for revertrisk-multilingual [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995198 (https://phabricator.wikimedia.org/T356501)
[12:34:04] (CR) CI reject: [V: -1] Makefile: add support for revertrisk-multilingual [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995198 (https://phabricator.wikimedia.org/T356501) (owner: AikoChou)
[12:39:30] (PS7) AikoChou: Makefile: add support for revertrisk-multilingual [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995198 (https://phabricator.wikimedia.org/T356501)
[12:43:00] (PS8) AikoChou: Makefile: add support for revertrisk-multilingual [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995198 (https://phabricator.wikimedia.org/T356501)
[12:52:24] (CR) AikoChou: Makefile: add support for revertrisk-multilingual (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995198 (https://phabricator.wikimedia.org/T356501) (owner: AikoChou)
[13:08:45] running an errand - brb in 30'
[13:16:00] (CR) Kevin Bazira: [C: +1] "LGMT!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995198 (https://phabricator.wikimedia.org/T356501) (owner: AikoChou)
[13:48:35] Back
[14:07:22] Good morning!
[14:08:01] morning!
[14:08:46] kevinbazira: regarding https://phabricator.wikimedia.org/T357382#9536821 and pybind: what is your local os version?
[14:10:30] cause we don't have any issue in the production image, so it is likely a package missing.
I think installing `python3-dev` should fix the issue, though I'm not 100% sure
[14:11:15] isaranto: o/
[14:11:42] the os version I tested this on is Bullseye:
[14:12:10] ```
[14:12:10] $ cat /etc/os-release
[14:12:10] PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
[14:12:10] NAME="Debian GNU/Linux"
[14:12:10] VERSION_ID="11"
[14:12:10] VERSION="11 (bullseye)"
[14:12:10] VERSION_CODENAME=bullseye
[14:12:11] ID=debian
[14:12:11] HOME_URL="https://www.debian.org/"
[14:12:12] SUPPORT_URL="https://www.debian.org/support"
[14:12:12] BUG_REPORT_URL="https://bugs.debian.org/"
[14:12:13] ```
[14:15:05] installing `python3-dev`, `pybind11`, and `fasttext-wheel` didn't work.
[14:15:05] installing `wheel` is what worked. it seems to be a known issue with fasttext: https://github.com/facebookresearch/fastText/issues/512
[14:17:23] I see python3-dev mentioned in one of the proposals: https://github.com/facebookresearch/fastText/issues/512#issuecomment-515724686
[14:19:06] I am curious cause we don't face this issue in our production image https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/.pipeline/langid/blubber.yaml
[14:19:06] which is using bullseye as well
[14:19:52] in our blubber/docker images install commands are run as root. perhaps you could try the same
[14:24:02] I am running as root.
[14:24:04] the solution that worked was installing `wheel` before `fasttext`.
[14:24:04] here is an explanation on why this works: https://github.com/facebookresearch/fastText/issues/512#issuecomment-1718315762
[14:28:37] I'm not saying it doesn't work, I'm just saying we need to understand what the issue is, especially since it doesn't exist in the deployed service, but only on your local setup
[14:29:09] it isn't a good practice to add additional packages if we don't need them
[14:29:18] which python version are you running?
[14:30:00] I am running python 3.9.2
[14:32:00] ok i think I figured it out.
Blubber by default will install wheel. If we extract the Dockerfile we'll see this among the instructions
[14:32:00] `RUN python3 "-m" "pip" "install" "-U" "setuptools!=60.9.0" && python3 "-m" "pip" "install" "-U" "wheel" "tox" "pip"`
[14:32:00] which would install the latest compatible wheel and pip versions. So we don't need it in the requirements.txt
[14:41:50] I had added wheel to support users who aren't building a docker image via blubber but are building a model-server locally in a python virtual env using the Makefile.
[14:41:50] isaranto: would adding wheel to the requirements without pinning it work for both use cases, Blubber and Makefile?
[14:43:17] kevinbazira: yeah, let's go with that, it wouldn't affect the production image at all
[14:43:38] great! pushing the change in a bit ...
[14:45:51] (PS2) Kevin Bazira: langid: fix pybind11 missing issue [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1002424 (https://phabricator.wikimedia.org/T357382)
[14:50:44] (CR) Ilias Sarantopoulos: [C: +1] langid: fix pybind11 missing issue [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1002424 (https://phabricator.wikimedia.org/T357382) (owner: Kevin Bazira)
[14:52:40] (CR) AikoChou: [C: +2] "Thanks for the review!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995198 (https://phabricator.wikimedia.org/T356501) (owner: AikoChou)
[14:52:48] (CR) Kevin Bazira: [C: +2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1002424 (https://phabricator.wikimedia.org/T357382) (owner: Kevin Bazira)
[14:53:39] nice! happy we figured it out!
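A minimal sketch of a pre-flight check for the local virtual-env case discussed above, where `wheel` had to be present before `fasttext` was installed. It assumes that a package being importable is a reasonable proxy for it being installed; the helper name `missing_build_helpers` is hypothetical and is not part of inference-services or its Makefile.

```python
import importlib.util


def missing_build_helpers(modules=("wheel",)):
    """Return the top-level module names in `modules` that cannot be found.

    Hypothetical helper: it only checks importability in the current
    environment and does not install anything.
    """
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_build_helpers()
    if missing:
        # Nothing is installed here; the caller decides how to react.
        print("install these before fasttext:", ", ".join(missing))
    else:
        print("build helpers present")
```

Run inside the virtual env before `pip install`, this would only flag the gap; it does not explain why the production image is unaffected, which the Blubber `RUN` line above already covers.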
[14:53:50] <3
[14:53:56] (Merged) jenkins-bot: langid: fix pybind11 missing issue [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1002424 (https://phabricator.wikimedia.org/T357382) (owner: Kevin Bazira)
[15:03:21] Machine-Learning-Team: Deploy 7b parameter models from HF - https://phabricator.wikimedia.org/T354870 (isarantopoulos)
[15:13:15] Machine-Learning-Team, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415 (calbon) p:Medium→High
[15:13:17] Machine-Learning-Team, Patch-For-Review: Support building and running of langid model-server via Makefile - https://phabricator.wikimedia.org/T357382 (calbon) p:Medium→Triage
[15:13:26] Machine-Learning-Team, artificial-intelligence, Bad-Words-Detection-System, revscoring: Gather language assets for Occitan - https://phabricator.wikimedia.org/T354702 (calbon) p:High→Triage
[15:19:16] Machine-Learning-Team, Epic: Epic: Implement prototype inference service that uses Cassandra for request caching - https://phabricator.wikimedia.org/T356256 (klausman)
[15:19:31] Machine-Learning-Team, Epic: Epic: Implement prototype inference service that uses Cassandra for request caching - https://phabricator.wikimedia.org/T356256 (klausman)
[15:43:11] Machine-Learning-Team, Goal: Goal: A plan for a training infrastructure - https://phabricator.wikimedia.org/T353814 (calbon) - Training servers ordered. - GCP credits likely.
[15:44:56] Machine-Learning-Team, Goal: Goal: A plan for a training infrastructure - https://phabricator.wikimedia.org/T353814 (calbon) Aiko to work on spike about GPU on Hadoop workflow and end to end airflow pipeline (data prep pipeline, training pipeline, model evaluation).
[15:49:26] Machine-Learning-Team, Goal: Goal: Expand Lift Wing Cluster and add GPU capacity to production - https://phabricator.wikimedia.org/T353338 (calbon) Hosts have GPUs
[15:49:48] Machine-Learning-Team, Goal: Goal: Expand Lift Wing Cluster and add GPU capacity to production - https://phabricator.wikimedia.org/T353338 (calbon) Procured fewer but larger hosts
[15:58:45] Machine-Learning-Team: Do a rolling drain/undrain of LW k8s nodes in eqiad and codfw - https://phabricator.wikimedia.org/T356867 (klausman) Open→Resolved
[15:58:58] Machine-Learning-Team: Drain and silence ml-serve2002.codfw.wmnet - https://phabricator.wikimedia.org/T355759 (klausman) Open→Resolved
[15:59:05] Machine-Learning-Team: Drain & shutdown ml-serve2005.codfw.wmnet for physical move - https://phabricator.wikimedia.org/T355757 (klausman) Open→Resolved
[16:31:41] (PS9) AikoChou: Makefile: add support for revertrisk-multilingual [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995198 (https://phabricator.wikimedia.org/T356501)
[16:31:56] (CR) AikoChou: [V: +2] Makefile: add support for revertrisk-multilingual [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995198 (https://phabricator.wikimedia.org/T356501) (owner: AikoChou)
[16:34:32] (PS4) AikoChou: revertrisk: use GPU for revertrisk-multilingual [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995214 (https://phabricator.wikimedia.org/T356045)
[16:40:36] isaranto: o/ could you review https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/995214 when you have time? :)
[16:41:21] thank u
[16:41:41] on it!
[16:46:43] after I jump out of a meeting :)
[17:28:56] (CR) Ilias Sarantopoulos: revertrisk: use GPU for revertrisk-multilingual (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995214 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[17:52:09] (PS5) AikoChou: revertrisk: use GPU for revertrisk-multilingual [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995214 (https://phabricator.wikimedia.org/T356045)
[17:53:59] (CR) AikoChou: revertrisk: use GPU for revertrisk-multilingual (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995214 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[18:07:13] (CR) Ilias Sarantopoulos: [C: +1] revertrisk: use GPU for revertrisk-multilingual (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995214 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[18:07:32] logging off folks o/ cu tomorrow!
[18:14:45] Night Ilias!