[04:41:14] (CR) Kevin Bazira: [V:+2 C:+2] logo-detection: add KServe custom model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1017453 (https://phabricator.wikimedia.org/T361803) (owner: Kevin Bazira)
[04:45:53] Machine-Learning-Team: Prepare docker image for hosting the logo-detection model-server on LiftWing - https://phabricator.wikimedia.org/T362598 (kevinbazira) NEW
[05:09:24] Machine-Learning-Team, Patch-For-Review: Create logo-detection model-server to be hosted on LiftWing - https://phabricator.wikimedia.org/T361803#9716257 (kevinbazira) A custom KServe model-server was created for the logo-detection isvc and can be run locally using instructions in this [[ https://github.c...
[06:08:04] Machine-Learning-Team: Prepare docker image for hosting the logo-detection model-server on LiftWing - https://phabricator.wikimedia.org/T362598#9716319 (kevinbazira) Successfully built the logo-detection model-server docker image locally. Below are the image layers with the largest layer size being ~2.37GB....
[06:34:48] Machine-Learning-Team: Prepare docker image for hosting the logo-detection model-server on LiftWing - https://phabricator.wikimedia.org/T362598#9716364 (kevinbazira) The `tensorflow` python package is the main contributor to the largest layer size indicated above. The installation of this package includes se...
[06:50:48] Machine-Learning-Team: Prepare docker image for hosting the logo-detection model-server on LiftWing - https://phabricator.wikimedia.org/T362598#9716403 (kevinbazira) Since this model-server will be running on CPU until the GPU procurement is complete, `tensorflow` has been replaced with `tensorflow-cpu` whic...
[06:57:19] hello!
[08:01:14] morning Ilias!
[08:11:51] hey Aiko o/
[08:34:33] Machine-Learning-Team: Prepare docker image for hosting the logo-detection model-server on LiftWing - https://phabricator.wikimedia.org/T362598#9716716 (kevinbazira) The largest layer size for the logo-detection model-server docker image has been reduced from ~2.37GB to ~1.61GB as shown below: ` $ docker hi...
[08:37:21] Morning!
[09:40:04] (PS1) Klausman: gitignore: Ignore my_venv/ and models/ directories [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020184
[09:40:40] (CR) Ilias Sarantopoulos: [C:+1] gitignore: Ignore my_venv/ and models/ directories [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020184 (owner: Klausman)
[09:40:53] that was quick :)
[09:41:01] :)
[09:41:28] making up for not saying morning o/
[09:42:04] (CR) Klausman: [C:+2] gitignore: Ignore my_venv/ and models/ directories [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020184 (owner: Klausman)
[09:42:54] During my rebase today, I accidentally added the whole venv when removing conflicts. So I had an immediate itch to scratch :)
[09:48:42] Do the g&s jobs from inf-services show up on Zuul?
[09:49:48] what is g&s?
[09:49:54] gate&submit
[09:51:41] iirc they do
[09:52:16] ah, then I guess it just takes some time.
[10:00:06] * klausman lunch
[10:14:55] * isaranto lunch and errand!
[10:20:30] (PS1) Kevin Bazira: logo-detection: containerize model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598)
[10:53:52] (CR) Kevin Bazira: "I built the model-server image locally and the largest layer is ~1.61GB as shown here: https://phabricator.wikimedia.org/T362598#9716716" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598) (owner: Kevin Bazira)
[11:01:21] (CR) Klausman: [V:+2 C:+2] gitignore: Ignore my_venv/ and models/ directories [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020184 (owner: Klausman)
[11:01:55] artificial-intelligence, Machine-Learning-Team, Bad-Words-Detection-System, revscoring: Gather language assets for Occitan - https://phabricator.wikimedia.org/T354702#9717378 (Aklapper)
[11:02:25] artificial-intelligence, Machine-Learning-Team, Bad-Words-Detection-System, revscoring: Gather language assets for Occitan - https://phabricator.wikimedia.org/T354702#9717380 (Aklapper) @Lhanars: Hi! This task has been assigned to you a while ago. Could you maybe share an update? Do you still pla...
[12:39:22] hello folks!
[12:42:21] o/
[12:46:43] (CR) Elukey: "Looks good! Left a note about the OS version, lemme know!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598) (owner: Kevin Bazira)
[13:08:19] \o hey Luca
[13:12:42] elukey: re: api-ro move: want/need any help? I'm reviewer on most of the changes from c_laime, but you probably have all the context.
[13:13:47] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9717932 (isarantopoulos) At the moment we have a 7B model deployed on ml-staging that uses the CPU and gets a response in ~30s. I am experimenting loading various model si...
[13:14:37] klausman: everything is already in staging, trying to figure out if there is a way to have both domains configured at the same time, once/if I have it I'll send a proposal for the move
[13:15:02] Ack.
[13:15:10] (PS2) Kevin Bazira: logo-detection: containerize model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598)
[13:15:14] I presume it will mean a roll-restart of roughly every isvc?
[13:15:29] there are two problems:
[13:16:13] 1) the virtual service config for the *.wikimedia.org domains, not sure if it is possible to have both endpoints configured at the same time (api-ro and mw-api-ro-int)
[13:16:34] 2) the mw api host that we configure on each isvc, that needs to be changed with a deploy
[13:17:15] if 1) resolves with "we can use only one virtual service" then we'll need to depool one DC at a time from inference.discovery.wmnet and apply the patches, then repol
[13:17:18] *repool
[13:17:56] should be relatively easy to do
[13:18:44] I think it would be the first time we do cross-DC prod traffic
[13:19:42] in the past a DC from inference.d.w was depooled by serviceops for $reasons, it shouldn't cause any trouble
[13:19:57] Ah, I see
[13:20:27] maybe some latency will go up a little, it will be interesting to see effects on isvcs
[13:21:10] I noticed the other day that there were isvcs which were a few dot releases behind on the charts (like x.y.z vs x.y.z+3), so that'd be cleaned up by a redeploy as well
[13:31:54] Machine-Learning-Team, Research: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#9718042 (achou) @kostajh @XiaoXiao-WMF thanks for tagging. Sorry I was unaware of the discussion here. The ML team is currently i...
[13:35:39] Machine-Learning-Team, Goal: Goal: Inference Optimization for Hugging face/Pytorch models - https://phabricator.wikimedia.org/T353337#9718064 (isarantopoulos) [[ https://phabricator.wikimedia.org/T357986#9717932 | Current status from relevant subtask ]] At the moment we are working on how to better serve...
[13:38:30] (CR) Kevin Bazira: logo-detection: containerize model-server (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598) (owner: Kevin Bazira)
[13:47:34] \o/ I got Puppet to actually extract the Cassandra IPs from the profile of the nodes and put them in a place for the deployment server network policy thingamajig to use
[13:47:42] \o/
[13:48:22] (CR) Elukey: [C:+1] "The config looks good to me, let's wait for either Aiko or Ilias to review and then we are ready to go!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598) (owner: Kevin Bazira)
[13:51:48] Machine-Learning-Team, ORES: Create basic alerts for isvcs to catch outages - https://phabricator.wikimedia.org/T362661 (elukey) NEW
[13:58:09] Machine-Learning-Team, ORES: Add slow-logs for ML isvcs - https://phabricator.wikimedia.org/T362663 (elukey) NEW
[13:59:24] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9718192 (elukey) Created two follow ups: * Basic alerting - T362661 (in place of the SLO dashboard etc.. that we can't use right now). * Add slow logs - T362663 (log slow requests ver...
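The basic-alerting follow-up (T362661) comes down to watching the share of non-200 responses per isvc. A minimal Python sketch of that check, under stated assumptions: the service names, the sample counts and the 5% threshold are hypothetical, and dividing errors by total requests is an assumed reading of the idea rather than the team's actual alert rule.

```python
from collections import defaultdict


def error_ratios(samples):
    """Return the non-200 share of requests per service.

    samples: iterable of (service, response_code, request_count) tuples,
    a hypothetical stand-in for per-service request counters.
    """
    total = defaultdict(int)
    errors = defaultdict(int)
    for service, code, count in samples:
        total[service] += count
        if code != "200":
            errors[service] += count
    # Skip services with no traffic to avoid dividing by zero.
    return {s: errors[s] / total[s] for s in total if total[s] > 0}


def services_to_alert(samples, threshold=0.05):
    """List services whose error ratio exceeds the (assumed) threshold."""
    ratios = error_ratios(samples)
    return sorted(s for s, ratio in ratios.items() if ratio > threshold)


# Hypothetical sample data: names and counts are illustrative only.
samples = [
    ("revertrisk", "200", 97),
    ("revertrisk", "500", 3),
    ("logo-detection", "200", 80),
    ("logo-detection", "503", 20),
]

print(services_to_alert(samples))
```

Grouping by service keeps one noisy isvc from hiding behind the healthy traffic of the others, which is the point of a per-service ratio over a cluster-wide one.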
[14:04:51] Machine-Learning-Team, ORES: Create basic alerts for isvcs to catch outages - https://phabricator.wikimedia.org/T362661#9718221 (klausman) Probably something like: ` (sum by (destination_canonical_service) (rate(istio_requests_total{response_code!="200"}[5m])))/ (sum by (destination_canonical_service) (...
[14:45:22] Machine-Learning-Team, Goal: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU - https://phabricator.wikimedia.org/T362670 (calbon) NEW
[14:45:27] Machine-Learning-Team: ------ - https://phabricator.wikimedia.org/T362671 (calbon) NEW
[14:45:57] Machine-Learning-Team: ------ - https://phabricator.wikimedia.org/T362671#9718504 (calbon) Open→Declined
[14:51:10] Machine-Learning-Team, Goal: 2024 Q4 Goal: Revert Risk models are supported by caching in production - https://phabricator.wikimedia.org/T362672 (calbon) NEW
[14:51:47] Machine-Learning-Team, Goal: Q3 2024 Goal: A plan for a training infrastructure - https://phabricator.wikimedia.org/T353814#9718544 (calbon)
[14:51:53] Machine-Learning-Team, Goal: Q3 2024 Goal: Expand Lift Wing Cluster and add GPU capacity to production - https://phabricator.wikimedia.org/T353338#9718546 (calbon)
[14:52:02] Machine-Learning-Team, Goal: Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models - https://phabricator.wikimedia.org/T353337#9718547 (calbon)
[14:52:05] Machine-Learning-Team, Goal: Q3 2024 Goal: Implement caching for revertrisk-language-agnostic - https://phabricator.wikimedia.org/T353333#9718548 (calbon)
[14:52:10] Machine-Learning-Team, Goal: Q3 2024 Goal: Lift Wing users can request multiple predictions using a single request. - https://phabricator.wikimedia.org/T348153#9718549 (calbon)
[14:57:42] Machine-Learning-Team, Goal: 2024 Q4: Lift Wing Python Package - https://phabricator.wikimedia.org/T359140#9718601 (calbon)
[14:57:48] Machine-Learning-Team, Goal: 2024 Q4 Goal: Revert Risk models are supported by caching in production - https://phabricator.wikimedia.org/T362672#9718602 (klausman) a:klausman
[14:57:58] Machine-Learning-Team: 2024 Q4 Goal: Operational Excellence - https://phabricator.wikimedia.org/T362674 (calbon) NEW
[14:58:17] Machine-Learning-Team, Goal: 2024 Q4 Goal: Operational Excellence - https://phabricator.wikimedia.org/T362674#9718616 (calbon)
[14:59:37] Machine-Learning-Team, Goal: 2024 Q4: Users can "pip install liftwing" and access 20% of models - https://phabricator.wikimedia.org/T359140#9718619 (calbon)
[15:12:04] kevinbazira: o/ the patch for the docker image looks great! I'm just reviewing it at the moment and building it locally so will paste an update on the patch
[15:12:22] great work creating the CI pipelines and everything!
[15:15:35] yep!
[15:15:47] also the service will run on py3.11 and bookworm
[15:19:14] yeah that's great!
[15:26:06] thanks for the reviews elukey and isaranto :)
[15:39:42] hi folks! Anybody currently working on staging? If not I'll test something
[15:41:45] o/ no
[15:42:55] nope, not working on staging either
[15:45:06] ack testing
[16:00:54] not now, but will use it again in the morning
[16:16:00] service is restored, but I think that T353622 is impacting the testing of the migration to the mw k8s endpoint
[16:16:51] still not sure why though
[16:19:27] I'll try to restart the testing tomorrow :)
[16:19:38] going afk for today folks! Have a nice rest of the day!
[16:21:27] ack. have a nice evening!
[16:22:59] bye Luca!
[16:35:19] (CR) Ilias Sarantopoulos: [C:+1] "Nice work!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598) (owner: Kevin Bazira)
[16:37:14] I faced an issue while running the logo detection model related to docker and m1 mac. `qemu: uncaught target signal 11 (Segmentation fault) - core dumped`. I'll update docker and try to fix it.
[16:37:15] But I did put a +1 not to block this work since the image built fine etc.
[16:43:07] Heading out now. Have a nice evening everyone! \o
[16:43:11] going afk folks, have a nice evening!
[16:43:22] Guten Abend Tobias!
[16:43:55] Καλό βράδυ, Ηλία!
[16:48:43] isaranto: I remember I got the same issue before. I'll try it again
[16:49:39] bye Tobias and Ilias! see u tomorrow o/
[18:24:15] night all