[07:33:00] good morning folks o/
[08:31:30] Machine-Learning-Team: Investigate kserve 0.13.0 upgrade - https://phabricator.wikimedia.org/T367048#10287468 (isarantopoulos) encountered the following issue when running revscoring: ` Traceback (most recent call last): File "/srv/rev/revscoring_model/model.py", line 5, in import kserve Fi...
[08:34:26] (PS3) Ilias Sarantopoulos: revscoring: upgrade kserve to 0.13.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1085625
[08:35:02] (PS4) Ilias Sarantopoulos: revscoring: upgrade kserve to 0.13.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1085625 (https://phabricator.wikimedia.org/T367048)
[08:56:33] (CR) Kevin Bazira: [C: +1] revscoring: upgrade kserve to 0.13.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1085625 (https://phabricator.wikimedia.org/T367048) (owner: Ilias Sarantopoulos)
[09:01:13] Machine-Learning-Team, Goal: Goal 2: People outside the ML team can ssh into an ml-lab machine, run a Jupyter Notebook, and run PyTorch powered by a GPU. - https://phabricator.wikimedia.org/T371396#10287519 (isarantopoulos)
[09:01:14] Machine-Learning-Team: [ml-lab] Use a (jupyter) notebook and load a LLM from huggingface - https://phabricator.wikimedia.org/T377574#10287521 (isarantopoulos)
[09:01:15] Machine-Learning-Team: ml-lab should have documentation - https://phabricator.wikimedia.org/T376974#10287522 (isarantopoulos)
[09:11:32] I'll be deploying the revscoring changes to ml-staging for more testing (run the httpbb tests etc.)
[09:11:38] (CR) Ilias Sarantopoulos: [C: +2] revscoring: upgrade kserve to 0.13.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1085625 (https://phabricator.wikimedia.org/T367048) (owner: Ilias Sarantopoulos)
[09:12:20] (Merged) jenkins-bot: revscoring: upgrade kserve to 0.13.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1085625 (https://phabricator.wikimedia.org/T367048) (owner: Ilias Sarantopoulos)
[09:29:48] here it is -> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1087130
[09:35:10] isaranto: o/ as FYI https://phabricator.wikimedia.org/T279621
[09:35:40] "apus" is what was previously called MOSS, basically a proper object storage outside thanos
[09:36:10] I'd follow up with Data Persistence to migrate the ML models over
[09:36:27] o/ elukey thanks for bringing this up!
[09:36:35] in theory the space requested is not big atm, but they will probably ask for some ballpark numbers for the future etc.
[09:37:04] and since it offers an S3 API, all the model servers should behave just fine after the move
[09:44:21] elukey: I need to catch up on my reading and updates on ceph. iiuc apus is the layer on top of ceph that provides the S3-compatible API. is that right?
[09:44:54] exactly, yes
[09:45:10] thanos has swift behind the scenes, apus has ceph
[09:45:21] but you'll deal with S3 APIs, so it shouldn't change anything
[09:46:17] ack, thank you!
[09:55:33] morning!
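A minimal sketch of the model-fetch pattern behind the apus discussion above (09:35-09:46): because both thanos-swift and apus expose an S3 API, a model server only needs the endpoint, bucket, and credentials to change. The endpoint URL, bucket, and object key below are hypothetical placeholders, not the actual production values.

    # Sketch only: fetch a model binary over an S3-compatible API with boto3.
    # Endpoint/bucket/key names are hypothetical; only the endpoint URL
    # (thanos-swift vs. apus/ceph) would differ after a migration.
    import os
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ.get("STORAGE_ENDPOINT", "https://apus.example.wmnet"),  # hypothetical
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    )

    # Download the model file to the path the model server reads it from.
    s3.download_file(
        Bucket="wmf-ml-models",                    # hypothetical bucket
        Key="revscoring/editquality/model.bin",    # hypothetical object key
        Filename="/mnt/models/model.bin",
    )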
[09:59:34] o/ aiko
[10:40:50] Machine-Learning-Team: Run unit tests for the inference-services repo in CI - https://phabricator.wikimedia.org/T360120#10287964 (kevinbazira) Open→In progress a: kevinbazira
[10:44:25] (PS1) Kevin Bazira: test: automate running unit tests in the LW repo [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1087146 (https://phabricator.wikimedia.org/T360120)
[10:45:30] (CR) CI reject: [V: -1] test: automate running unit tests in the LW repo [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1087146 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[10:45:42] Morning!
[10:48:19] Morning!
[11:09:29] Machine-Learning-Team: Update output schema for reference risk model - https://phabricator.wikimedia.org/T378939 (achou) NEW
[11:13:24] (CR) Kevin Bazira: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1087146 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[11:19:58] (PS2) Kevin Bazira: test: automate running unit tests in the LW repo [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1087146 (https://phabricator.wikimedia.org/T360120)
[11:20:53] Machine-Learning-Team: Update output schema for reference risk model - https://phabricator.wikimedia.org/T378939#10288085 (achou) Hi @FNavas-foundation, I would like to confirm if the enterprise team is fine with this option. Thank you!
[11:27:14] (PS1) Nik Gkountas: Use random sorting only for topic-based recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1087156 (https://phabricator.wikimedia.org/T377124)
[11:28:14] (PS2) Nik Gkountas: Use random sorting only for topic-based recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1087156 (https://phabricator.wikimedia.org/T377124)
[11:35:26] (CR) Kevin Bazira: "I tested this locally by running:" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1087146 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[12:05:31] (CR) Ilias Sarantopoulos: "I think it would be better that we figure out a solution for all model servers and have one command that can be enabled/disabled and take " [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1087146 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[12:21:27] articlequality on ml-staging overrides the prod image, so I filed another patch to update it: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1087157
[12:21:42] I'm deploying all the revscoring services in staging to test them
[13:44:30] Machine-Learning-Team: Update output schema for reference risk model - https://phabricator.wikimedia.org/T378939#10288525 (FNavas-foundation) @achou yes that's sensible! thank you for alerting me
[14:01:44] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Expose revision revert risk scores in EventStreams - https://phabricator.wikimedia.org/T326179#10288798 (Ottomata)
[14:48:17] good morning all
[14:52:33] hi Chris!
[15:30:24] kevinbazira: thanks for working on automating the CI tests!
I'm here if you want to discuss my review above so that we can figure out a way to do this, if possible
[15:39:00] (CR) Eamedina: [C: +1] Use random sorting only for topic-based recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1087156 (https://phabricator.wikimedia.org/T377124) (owner: Nik Gkountas)
[15:39:44] isaranto: thanks for the review. happy to discuss this. I've added my thoughts in a comment: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1087146/comments/b1a52242_837aa900
[15:40:37] thanks, I missed that!
[15:42:01] I'm trying to see if there is a way to avoid having different configuration per model server, in order to make onboarding new models easier
[15:42:34] yeah, that would be great. I thought about it too.
[15:44:06] given that each model-server has its own dependencies and likely its own tests, it might help to configure separate tox test envs for each model-server.
[15:47:11] yes, but the requirements file is defined in blubber, so we could use a common structure in blubber to define one command for running tests in all envs
[15:47:46] then we could add specific tox configuration on a per-need basis, so this would be the exception
[15:54:31] ok, in the blubber test variant we would have to:
[15:54:31] 1. install a specific model-server's dependencies
[15:54:31] 2. use entrypoint.sh to run tests, e.g. tox -e new_model_server
[15:54:31] wdyt?
[15:58:43] 1. yes
[15:58:43] 2. it could just be one command for all model servers, e.g. tox -e run-ci-with-tests, that runs pre-commit along with the tests
[16:00:12] since pytest searches directories, we won't need to specify dirs, and any customization can happen via blubber (copy the required files to the required dir)
[16:01:10] it has really been a while since I tried this on this patch https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/982396 and I don't remember the steps needed for it to work
[16:02:38] if you also agree on the approach, you could try something out and let me know if you encounter any issues.
[16:15:29] this approach would be great.
[16:15:29] 1. ok
[16:15:29] 2.1. if we don't specify dirs, won't tests for other model-servers fail, or do we plan to copy only the tests for a specific model-server into the test variant?
[16:15:29] 2.2. in case we continue with the path of specifying dirs, we could achieve it in blubber by passing a dir_name to tox, e.g. pytest test/unit/{env:DIR_NAME}
[16:17:42] (PS21) Nik Gkountas: Support Default collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1072175 (https://phabricator.wikimedia.org/T374597) (owner: Santhosh)
[16:18:25] (CR) CI reject: [V: -1] Support Default collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1072175 (https://phabricator.wikimedia.org/T374597) (owner: Santhosh)
[16:20:23] (CR) Sbisson: [C: +2] Use random sorting only for topic-based recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1087156 (https://phabricator.wikimedia.org/T377124) (owner: Nik Gkountas)
[16:20:35] for 2.1, we only copy specific stuff into our blubber images, so that would be ok.
[16:21:03] (Merged) jenkins-bot: Use random sorting only for topic-based recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1087156 (https://phabricator.wikimedia.org/T377124) (owner: Nik Gkountas)
[16:21:53] ok, I'll implement 2.1 tomorrow.
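A minimal sketch of the single tox entry point discussed above (15:54-16:15), assuming hypothetical env-variable names (REQUIREMENTS_FILE, DIR_NAME) that the blubber test variant would set per model server; this is not the repo's actual configuration.

    # Hypothetical tox.ini sketch: one env that runs pre-commit plus whatever
    # tests blubber copied into the image. DIR_NAME and REQUIREMENTS_FILE are
    # placeholder names; with DIR_NAME unset, pytest discovers everything
    # present under test/unit/.
    [testenv:run-ci-with-tests]
    skip_install = true
    deps =
        pre-commit
        pytest
        -r {env:REQUIREMENTS_FILE:requirements-test.txt}
    commands =
        pre-commit run --all-files
        pytest test/unit/{env:DIR_NAME:}

With something like this in place, every image's entrypoint could invoke the same command, e.g. tox -e run-ci-with-tests, regardless of which model server it was built for.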
[16:21:56] there is the issue of what happens with common libraries, e.g. if you change something in the python utils.
[16:22:07] 🙌
[16:22:43] sure, let's iterate on the idea, I'm pretty sure some issues will appear.
[16:22:53] have a nice evening o/
[16:23:13] yeah ... the python utils will have to be copied to the test variant as well
[16:23:29] np! have a good evening too!
[16:31:23] on another topic: I noticed huggingface has updated the quantization page https://huggingface.co/docs/transformers/v4.46.0/quantization/overview
[16:31:46] it has a nice table that includes rocm and apple silicon availability for each library
[16:34:23] I stumbled into some issues trying to use them on the ml-lab machines today. I'll try to do it in a more organized way in the following days so that I can report concrete findings
[16:34:40] going afk, have a nice evening folks o/
[17:30:14] Lift-Wing, Machine-Learning-Team, OKR-Work: Request to host article-country model on Lift Wing - https://phabricator.wikimedia.org/T371897#10290006 (Isaac) > We have: removed support for QID input, initialized claims as a dict, added support for async API calls @kevinbazira thanks! Schemas are looki...
[18:32:04] (PS22) Nik Gkountas: Support Default collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1072175 (https://phabricator.wikimedia.org/T374597) (owner: Santhosh)