[04:41:14] <wikibugs>	 (CR) Kevin Bazira: [V:+2 C:+2] logo-detection: add KServe custom model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1017453 (https://phabricator.wikimedia.org/T361803) (owner: Kevin Bazira)
[04:45:53] <wikibugs>	 Machine-Learning-Team: Prepare docker image for hosting the logo-detection model-server on LiftWing - https://phabricator.wikimedia.org/T362598 (kevinbazira) NEW
[05:09:24] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Create logo-detection model-server to be hosted on LiftWing - https://phabricator.wikimedia.org/T361803#9716257 (kevinbazira) A custom KServe model-server was created for the logo-detection isvc and can be run locally using instructions in this [[ https://github.c...
[06:08:04] <wikibugs>	 Machine-Learning-Team: Prepare docker image for hosting the logo-detection model-server on LiftWing - https://phabricator.wikimedia.org/T362598#9716319 (kevinbazira) Successfully built the logo-detection model-server docker image locally. Below are the image layers with the largest layer size being ~2.37GB....
[06:34:48] <wikibugs>	 Machine-Learning-Team: Prepare docker image for hosting the logo-detection model-server on LiftWing - https://phabricator.wikimedia.org/T362598#9716364 (kevinbazira) The `tensorflow` python package is the main contributor to the largest layer size indicated above. The installation of this package includes se...
[06:50:48] <wikibugs>	 Machine-Learning-Team: Prepare docker image for hosting the logo-detection model-server on LiftWing - https://phabricator.wikimedia.org/T362598#9716403 (kevinbazira) Since this model-server will be running on CPU until the GPU procurement is complete, `tensorflow` has been replaced with `tensorflow-cpu` whic...
[06:57:19] <isaranto>	 hello!
[08:01:14] <aiko>	 morning Ilias!
[08:11:51] <isaranto>	 hey Aiko o/
[08:34:33] <wikibugs>	 Machine-Learning-Team: Prepare docker image for hosting the logo-detection model-server on LiftWing - https://phabricator.wikimedia.org/T362598#9716716 (kevinbazira) The largest layer size for the logo-detection model-server docker image has been reduced from ~2.37GB to ~1.61GB as shown below: ` $ docker hi...
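For a sense of scale, the saving reported above can be worked out from the two approximate sizes quoted in the task comments (~2.37GB before, ~1.61GB after). This is only a small arithmetic sketch; the function name is made up for illustration and the inputs are the rounded figures from the log, not measured values.

```python
def layer_reduction(before_gb: float, after_gb: float) -> tuple[float, float]:
    """Return (GB saved, percent saved) for a docker layer that shrank."""
    saved = before_gb - after_gb
    return saved, saved / before_gb * 100


# Rounded sizes quoted in T362598: ~2.37GB with tensorflow, ~1.61GB with tensorflow-cpu.
saved_gb, saved_pct = layer_reduction(2.37, 1.61)
print(f"saved ~{saved_gb:.2f} GB (~{saved_pct:.0f}% of the original layer)")
```

With those rounded inputs the largest layer shrinks by roughly a third.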
[08:37:21] <klausman>	 Morning!
[09:40:04] <wikibugs>	 (PS1) Klausman: gitignore: Ignore my_venv/ and models/ directories [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020184
[09:40:40] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] gitignore: Ignore my_venv/ and models/ directories [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020184 (owner: Klausman)
[09:40:53] <klausman>	 that was quick :)
[09:41:01] <isaranto>	 :)
[09:41:28] <isaranto>	 making up for not saying morning o/
[09:42:04] <wikibugs>	 (CR) Klausman: [C:+2] gitignore: Ignore my_venv/ and models/ directories [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020184 (owner: Klausman)
[09:42:54] <klausman>	 During my rebase today, I accidentally added the whole venv when removing conflicts. So I had an immediate itch to scratch :)
[09:48:42] <klausman>	 Do the g&s jobs from inf-services show up on Zuul?
[09:49:48] <isaranto>	 what is g&s?
[09:49:54] <klausman>	 gate&submit
[09:51:41] <isaranto>	 iirc they do
[09:52:16] <klausman>	 ah, then I guess it just takes some time.
[10:00:06] * klausman lunch
[10:14:55] * isaranto lunch and errand!
[10:20:30] <wikibugs>	 (PS1) Kevin Bazira: logo-detection: containerize model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598)
[10:53:52] <wikibugs>	 (CR) Kevin Bazira: "I built the model-server image locally and the largest layer is ~1.61GB as shown here: https://phabricator.wikimedia.org/T362598#9716716" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598) (owner: Kevin Bazira)
[11:01:21] <wikibugs>	 (CR) Klausman: [V:+2 C:+2] gitignore: Ignore my_venv/ and models/ directories [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020184 (owner: Klausman)
[11:01:55] <wikibugs>	 artificial-intelligence, Machine-Learning-Team, Bad-Words-Detection-System, revscoring: Gather language assets for Occitan - https://phabricator.wikimedia.org/T354702#9717378 (Aklapper)
[11:02:25] <wikibugs>	 artificial-intelligence, Machine-Learning-Team, Bad-Words-Detection-System, revscoring: Gather language assets for Occitan - https://phabricator.wikimedia.org/T354702#9717380 (Aklapper) @Lhanars: Hi! This task has been assigned to you a while ago. Could you maybe share an update? Do you still pla...
[12:39:22] <elukey>	 hello folks!
[12:42:21] <isaranto>	 o/
[12:46:43] <wikibugs>	 (CR) Elukey: "Looks good! Left a note about the OS version, lemme know!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598) (owner: Kevin Bazira)
[13:08:19] <klausman>	 \o hey Luca
[13:12:42] <klausman>	 elukey: re: api-ro move: want/need any help? I'm reviewer on most of the changes from c_laime, but you probably have all the context.
[13:13:47] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9717932 (isarantopoulos) At the moment we have a 7B model deployed on ml-staging that uses the CPU and gets a response in ~30s.   I am experimenting loading various model si...
[13:14:37] <elukey>	 klausman: everything is already in staging, trying to figure out if there is a way to have both domains configured at the same time, once/if I have it I'll send a proposal for the move
[13:15:02] <klausman>	 Ack.
[13:15:10] <wikibugs>	 (PS2) Kevin Bazira: logo-detection: containerize model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598)
[13:15:14] <klausman>	 I presume it will mean a roll-restart of roughly every isvc?
[13:15:29] <elukey>	 there are two problems:
[13:16:13] <elukey>	 1) the virtual service config for the *.wikimedia.org domains, not sure if it is possible to have both endpoints configured at the same time (api-ro and mw-api-ro-int)
[13:16:34] <elukey>	 2) the mw api host that we configure on each isvc, that needs to be changed with a deploy
[13:17:15] <elukey>	 if 1) resolves with "we can use only one virtual service" then we'll need to depool one DC at a time from inference.discovery.wmnet and apply the patches, then repool
[13:17:56] <elukey>	 should be relatively easy to do
[13:18:44] <klausman>	 I think it would be the first time we do cross-DC prod traffic
[13:19:42] <elukey>	 in the past a DC from inference.d.w was depooled by serviceops for $reasons, it shouldn't cause any trouble
[13:19:57] <klausman>	 Ah, I see
[13:20:27] <elukey>	 maybe some latency will go up a little, it will be interesting to see effects on isvcs 
[13:21:10] <klausman>	 I noticed the other day that there were isvcs which were a few dot releases behind on the charts (like x.y.z vs x.y.z+3), so that'd be cleaned up by a redeploy as well
[13:31:54] <wikibugs>	 Machine-Learning-Team, Research: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#9718042 (achou) @kostajh @XiaoXiao-WMF thanks for tagging. Sorry I was unaware of the discussion here. The ML team is currently i...
[13:35:39] <wikibugs>	 Machine-Learning-Team, Goal: Goal: Inference Optimization for Hugging face/Pytorch models - https://phabricator.wikimedia.org/T353337#9718064 (isarantopoulos) [[ https://phabricator.wikimedia.org/T357986#9717932 | Current status from relevant subtask ]] At the moment we are working on how to better serve...
[13:38:30] <wikibugs>	 (CR) Kevin Bazira: logo-detection: containerize model-server (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598) (owner: Kevin Bazira)
[13:47:34] <klausman>	 \o/ I got Puppet to actually extract the Cassandra IPs from the profile of the nodes and put them in a place for the deployment server network policy thingamajig to use
[13:47:42] <isaranto>	 \o/
[13:48:22] <wikibugs>	 (CR) Elukey: [C:+1] "The config looks good to me, let's wait for either Aiko or Ilias to review and then we are ready to go!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598) (owner: Kevin Bazira)
[13:51:48] <wikibugs>	 Machine-Learning-Team, ORES: Create basic alerts for isvcs to catch outages - https://phabricator.wikimedia.org/T362661 (elukey) NEW
[13:58:09] <wikibugs>	 Machine-Learning-Team, ORES: Add slow-logs for ML isvcs - https://phabricator.wikimedia.org/T362663 (elukey) NEW
[13:59:24] <wikibugs>	 Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9718192 (elukey) Created two follow ups:  * Basic alerting - T362661 (in place of the SLO dashboard etc.. that we can't use right now). * Add slow logs - T362663 (log slow requests ver...
[14:04:51] <wikibugs>	 Machine-Learning-Team, ORES: Create basic alerts for isvcs to catch outages - https://phabricator.wikimedia.org/T362661#9718221 (klausman) Probably something like:  ` (sum by (destination_canonical_service) (rate(istio_requests_total{response_code!="200"}[5m])))/ (sum by (destination_canonical_service) (...
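The expression quoted above is cut off in the log, so its denominator is not visible here. As a rough illustration of the idea it appears to express (non-200 responses as a share of requests, grouped per service), here is a minimal Python sketch. It assumes the denominator is the total request count per service, which the truncated text does not confirm; the function name, the tuple layout and the sample services are hypothetical and not taken from Lift Wing.

```python
from collections import defaultdict


def non_200_ratio(samples):
    """Share of non-200 responses per service for one time window.

    samples: iterable of (service, response_code, request_count) tuples.
    Assumption: the ratio is taken over all requests of that service.
    """
    errors = defaultdict(float)
    totals = defaultdict(float)
    for service, code, count in samples:
        totals[service] += count
        if code != "200":
            errors[service] += count
    # Skip services with no traffic to avoid dividing by zero.
    return {svc: errors[svc] / total for svc, total in totals.items() if total > 0}


# Made-up sample window with two hypothetical services.
window = [
    ("svc-a", "200", 90),
    ("svc-a", "500", 10),
    ("svc-b", "200", 50),
]
ratios = non_200_ratio(window)
print(ratios)
```

An alert built on such a ratio would fire when the value for a service stays above some threshold; the threshold and duration are not given in the log.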
[14:45:22] <wikibugs>	 Machine-Learning-Team, Goal: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU - https://phabricator.wikimedia.org/T362670 (calbon) NEW
[14:45:27] <wikibugs>	 Machine-Learning-Team: ------ - https://phabricator.wikimedia.org/T362671 (calbon) NEW
[14:45:57] <wikibugs>	 Machine-Learning-Team: ------ - https://phabricator.wikimedia.org/T362671#9718504 (calbon) Open→Declined
[14:51:10] <wikibugs>	 Machine-Learning-Team, Goal: 2024 Q4 Goal: Revert Risk models are supported by caching in production - https://phabricator.wikimedia.org/T362672 (calbon) NEW
[14:51:47] <wikibugs>	 Machine-Learning-Team, Goal: Q3 2024 Goal: A plan for a training infrastructure  - https://phabricator.wikimedia.org/T353814#9718544 (calbon)
[14:51:53] <wikibugs>	 Machine-Learning-Team, Goal: Q3 2024 Goal: Expand Lift Wing Cluster and add GPU capacity to production  - https://phabricator.wikimedia.org/T353338#9718546 (calbon)
[14:52:02] <wikibugs>	 Machine-Learning-Team, Goal: Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models - https://phabricator.wikimedia.org/T353337#9718547 (calbon)
[14:52:05] <wikibugs>	 Machine-Learning-Team, Goal: Q3 2024 Goal: Implement caching for revertrisk-language-agnostic - https://phabricator.wikimedia.org/T353333#9718548 (calbon)
[14:52:10] <wikibugs>	 Machine-Learning-Team, Goal: Q3 2024 Goal: Lift Wing users can request multiple predictions using a single request. - https://phabricator.wikimedia.org/T348153#9718549 (calbon)
[14:57:42] <wikibugs>	 Machine-Learning-Team, Goal: 2024 Q4: Lift Wing Python Package - https://phabricator.wikimedia.org/T359140#9718601 (calbon)
[14:57:48] <wikibugs>	 Machine-Learning-Team, Goal: 2024 Q4 Goal: Revert Risk models are supported by caching in production - https://phabricator.wikimedia.org/T362672#9718602 (klausman) a:klausman
[14:57:58] <wikibugs>	 Machine-Learning-Team: 2024 Q4 Goal: Operational Excellence - https://phabricator.wikimedia.org/T362674 (calbon) NEW
[14:58:17] <wikibugs>	 Machine-Learning-Team, Goal: 2024 Q4 Goal: Operational Excellence - https://phabricator.wikimedia.org/T362674#9718616 (calbon)
[14:59:37] <wikibugs>	 Machine-Learning-Team, Goal: 2024 Q4: Users can "pip install liftwing" and access 20% of models - https://phabricator.wikimedia.org/T359140#9718619 (calbon)
[15:12:04] <isaranto>	 kevinbazira: o/ the patch for the docker image looks great! I'm just reviewing it at the moment and building it locally so will paste an update on the patch
[15:12:22] <isaranto>	 great work creating the CI pipelines and everything!
[15:15:35] <elukey>	 yep!
[15:15:47] <elukey>	 also the service will run on py3.11 and bookworm
[15:19:14] <isaranto>	 yeah that's great!
[15:26:06] <kevinbazira>	 thanks for the reviews elukey and isaranto :)
[15:39:42] <elukey>	 hi folks! Anybody currently working on staging? If not I'll test something
[15:41:45] <aiko>	 o/ no
[15:42:55] <klausman>	 nope, not working on staging either
[15:45:06] <elukey>	 ack testing
[16:00:54] <isaranto>	 not now, but will use it again in the morning
[16:16:00] <elukey>	 service is restored, but I think that T353622 is impacting the testing of the migration to the mw k8s endpoint
[16:16:51] <elukey>	 still not sure why though
[16:19:27] <elukey>	 I'll try to restart the testing tomorrow :)
[16:19:38] <elukey>	 going afk for today folks! Have a nice rest of the day!
[16:21:27] <isaranto>	 ack. have a nice evening!
[16:22:59] <aiko>	 bye Luca!
[16:35:19] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] "Nice work!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598) (owner: Kevin Bazira)
[16:37:14] <isaranto>	 I faced an issue while running the logo detection model related to Docker on an M1 Mac: `qemu: uncaught target signal 11 (Segmentation fault) - core dumped`. I'll update docker and try to fix it.
[16:37:15] <isaranto>	 But I did put a +1 not to block this work since the image built fine etc.
[16:43:07] <klausman>	 Heading out now. Have a nice evening everyone! \o
[16:43:11] <isaranto>	 going afk folks, have a nice evening!
[16:43:22] <isaranto>	 Guten abend Tobias!
[16:43:55] <klausman>	 Καλό βράδυ, Ηλία!
[16:48:43] <aiko>	 isaranto: I remember I got the same issue before. I'll try it again
[16:49:39] <aiko>	 bye Tobias and Ilias! see u tomorrow o/
[18:24:15] <chrisalbon>	 night all