[07:57:39] Good morning, I'm back!
[08:02:02] hi Ilias, welcome back! :) I'm taking a day off today
[08:06:50] <3
[09:11:00] Machine-Learning-Team, Wikipedia-Android-App-Backlog: Investigate increased preprocessing latencies on LW of article-descriptions model - https://phabricator.wikimedia.org/T358195#9595064 (isarantopoulos) >>! In T358195#9580807, @calbon wrote: > Can we investigation reducing the computational need to jus...
[11:56:06] * isaranto lunch
[13:07:07] Machine-Learning-Team, Research: Allow to set Catboost's threads in readability-liftwing - https://phabricator.wikimedia.org/T353461#9595816 (elukey) @Trokhymovych thank youuu! Looks good! :)
[13:07:11] hello folks!
[13:07:12] isaranto: o/
[13:08:02] elukey: \o/
[13:10:15] how are things?
[13:10:29] missed ya!
[13:11:04] note for the team - I am rolling out a change to the storage-initializer's s3 secret in prod, basically I changed the ca-bundle file before the paternity leave and it wasn't updated. The change doesn't trigger any deployment, but it impact the kserve upgrade (pre-requisite)
[13:11:07] at the moment I'm giving a battle with poetry and python dependencies for the Hugging face model server, other than that good
[13:11:22] ack
[13:20:46] Machine-Learning-Team: Update to KServe 0.11 - https://phabricator.wikimedia.org/T337213#9595849 (elukey)
[13:21:46] Machine-Learning-Team: Update to KServe 0.11 - https://phabricator.wikimedia.org/T337213#9595850 (elukey) The new control plane is running in staging, I took the opportunity to upgrade all Docker images to Debian Bookworm.
[13:29:26] ok the kserve control plane 0.11 is ready for prod
[13:31:57] nice!
[13:32:21] btw https://github.com/kserve/kserve/releases/tag/v0.12.0
[13:32:26] kserve 0.12 is out!
[13:33:28] yes yes :D
[13:33:34] and there is a plan for kserve 1.0 to come out at some point (although I don't know if there is a specific timeline for that)
[13:33:41] IIUC they moved to a quarter release
[13:33:51] so in theory we could think about upgrading twice per year
[13:45:25] Good morning all
[13:45:54] Good morning Chris o/
[13:48:54] Machine-Learning-Team, Research: Allow to set Catboost's threads in readability-liftwing - https://phabricator.wikimedia.org/T353461#9595938 (Trokhymovych) Merged. Thank you!
[13:53:49] chrisalbon: o/
[13:58:43] folks I created https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/KServe#Admin_only_-_Upgrade_KServe_to_a_new_version
[13:58:46] Machine-Learning-Team: Update to KServe 0.11 - https://phabricator.wikimedia.org/T337213#9595988 (elukey) Added https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/KServe#Admin_only_-_Upgrade_KServe_to_a_new_version to list the steps that we do now.
[13:58:55] that lists the steps to do for a complete Kserve upgrade
[13:59:06] I'll also post it to slack for visibility
[13:59:09] please review it :)
[14:41:41] lovely I don't see logs for our clusters in logstash
[14:42:42] mmm maybe the k8s app dashboard changed
[14:49:21] I'll review in a bit! thanks for creating it!
[14:49:38] no rush!
[15:24:22] Machine-Learning-Team: Update to KServe 0.11 - https://phabricator.wikimedia.org/T337213#9596411 (elukey) Next steps: deploy to prod
[15:25:07] folks qq - I recall that there was some discussion about our docker images being too big, and Aiko is seeing some issues with the layers size
[15:25:08] (CR) Bartosz Dziewoński: [C: +2] Use a sql IN clause in DatabaseQueryBuilder::buildDiscreteQuery [extensions/ORES] - https://gerrit.wikimedia.org/r/1008046 (https://phabricator.wikimedia.org/T358961) (owner: Umherirrender)
[15:25:19] do we have a task? Can't find anything
[15:25:29] I recall that there was some discussion about a base image to extend/use
[15:25:34] anything done on that front?
[15:25:52] There is a task, but I think its named something else
[15:27:06] of course, now I cant find it
[15:27:16] ahhaha yes
[15:28:42] the model she was working on is rr-multilingual-gpu
[15:29:12] if it is taking this long for me to find it, maybe make a new ticket
[15:29:21] * elukey nods
[15:30:30] Tajh wants us to revisit our deployment documentation and add a requirement (or strong preference) for open source code and licensing. I'll make a ticket.
[15:31:54] we are already enforcing it, any specific bit?
[15:32:05] like models, frameworks, etc..?
[15:32:12] I guess python libs etc..
[15:34:03] I think models themselves, although I need to check because it is mostly for legal reasons because we are a very large online platform (VLOP) in Europe, which comes with some requirements
[15:34:19] ack ack
[15:35:52] Machine-Learning-Team: Add Licensing and Open Source requirement/strong preference to Lift Wing model deployment documentations - https://phabricator.wikimedia.org/T359066 (calbon)
[15:46:29] (Merged) jenkins-bot: Use a sql IN clause in DatabaseQueryBuilder::buildDiscreteQuery [extensions/ORES] - https://gerrit.wikimedia.org/r/1008046 (https://phabricator.wikimedia.org/T358961) (owner: Umherirrender)
[15:50:46] Machine-Learning-Team, Software-Licensing: Add Licensing and Open Source requirement/strong preference to Lift Wing model deployment documentations - https://phabricator.wikimedia.org/T359066#9596573 (Reedy)
[15:51:02] Machine-Learning-Team: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images - https://phabricator.wikimedia.org/T359067 (elukey)
[15:51:36] chrisalbon: --^
[15:56:27] Machine-Learning-Team: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images - https://phabricator.wikimedia.org/T359067#9596616 (elukey) Previous discussion with Service Ops on IRC: ` 15:32 o/ hi from ml-team, I need some help with a 500 error when CI is pushing the model...
[15:57:16] Machine-Learning-Team: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images - https://phabricator.wikimedia.org/T359067#9596618 (elukey)
[16:00:43] Machine-Learning-Team: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images - https://phabricator.wikimedia.org/T359067#9596663 (elukey) Relevant (previous) related task: https://phabricator.wikimedia.org/T288198
[16:03:38] Machine-Learning-Team: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images - https://phabricator.wikimedia.org/T359067#9596711 (elukey)
[16:04:21] asked some thoughts from serviceops, let's see
[16:16:15] ack
[16:17:05] I was trying to find a way if we could have a generic image for rocm, but haven't found a solution yet
[16:17:25] at least for the open source llms and huggingface this seems doable
[16:24:05] I added an idea to the task, we could add in production images a base bookworm image with pytorch[rocm] already copied to the system py libs
[16:24:30] not 100% sure if it would work, but even if it does we'd have the same problem with the big layers
[16:26:38] What is the problem we are solving? Sorry I am dumb
[16:28:55] the main trouble is that when we install via pip torch+rocm we end up with a ton of things installed, and pushing the docker image to our docker registry may fail due to some resource constraints
[16:29:50] and also the more images we have with torch+rocm the more we risk to duplicate layers, ending up with k8s workers pulling the same thing (in different sauces) over and over from the registry
[16:29:56] (instead of sharing for example)
[16:30:01] etc..
[16:30:34] the more we hammer the docker registry the more we may end up in troubles when we try to pull from it
[16:30:51] not sure if I pictured the issue clearly
[16:34:07] ah got it
[16:34:08] thanks
[16:35:27] a good topic to discuss tomorrow or on Wednesday!
[16:35:47] as I also want to pick everyone's brain on the HF docker image
[16:36:08] going to a doc appointment and afk for the day folks, cu tomorrow!
[16:36:15] o/
[16:55:21] (PS1) Elukey: readability: update the catboost version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008514 (https://phabricator.wikimedia.org/T353461)
[17:22:15] Night all. I’m just going to be sitting here alone, writing ITCs
[17:22:54] :D
[17:36:12] * elukey afk!