[05:47:28] Good morning!
[06:55:10] I plan to continue the work on training the logo detection model with torch, let me know if you think I shouldn't bother
[06:55:33] My plan would be to make it work today otherwise continue with what we have with keras and tensorflow
[08:12:02] Lift-Wing, Machine-Learning-Team, ORES, ChangeProp, and 6 others: Selectively disable changeprop functionality that is no longer used - https://phabricator.wikimedia.org/T361483#9687158 (akosiaris)
[08:25:42] Machine-Learning-Team, Gerrit, git-lfs, Release-Engineering-Team (Radar): ML-Team will soon stop using LFS on Gerrit (for ORES deployment) - https://phabricator.wikimedia.org/T342765#9687219 (hashar)
[08:25:47] ORES, git-lfs, Puppet: Require git-lfs in ORES hosts - https://phabricator.wikimedia.org/T232494#9687224 (hashar)
[08:26:08] ORES, Gerrit, git-lfs: Write a cookbook for the workaround for getting LFS to gerrit - https://phabricator.wikimedia.org/T226055#9687231 (hashar)
[08:26:46] Machine-Learning-Team, Diffusion, Wikimedia-GitHub, git-lfs, Release-Engineering-Team (Seen): LFS objects are not mirroring from Github through Phab to Gerrit consistently - https://phabricator.wikimedia.org/T212818#9687241 (hashar)
[08:27:13] Machine-Learning-Team, git-lfs: Gerrit repo scoring/ores/editquality LFS broken (smudge filter lfs failed) - https://phabricator.wikimedia.org/T212544#9687243 (hashar)
[08:27:29] Machine-Learning-Team, ORES, Scap, SRE, git-lfs: scap support for git-lfs - https://phabricator.wikimedia.org/T181855#9687245 (hashar)
[08:27:45] ORES, Gerrit, SRE, git-lfs, Patch-For-Review: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#9687247 (hashar)
[08:56:54] Machine-Learning-Team, Structured-Data-Backlog (Current Work): Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9687372 (mfossati) The prototype looks good to me, I'm excited to see this effort move to the next level! @kevinbazira, I've especially appreciated the...
[09:02:27] o/
[09:04:44] (PS2) AikoChou: revertrisk: error handling for batch requests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1016341 (https://phabricator.wikimedia.org/T360406)
[09:10:08] (PS3) AikoChou: revertrisk: error handling for batch requests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1016341 (https://phabricator.wikimedia.org/T360406)
[09:20:31] (CR) AikoChou: [C:+2] "Thanks for the review!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1016341 (https://phabricator.wikimedia.org/T360406) (owner: AikoChou)
[09:22:33] o/
[09:29:13] Machine-Learning-Team: Create logo-detection model-server to be hosted on LiftWing - https://phabricator.wikimedia.org/T361803 (kevinbazira) NEW
[09:33:12] Machine-Learning-Team, Structured-Data-Backlog (Current Work): Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9687630 (kevinbazira) Thanks @mfossati! <3 It's great to hear you're excited about moving to the next milestone. Rest assured, in T361803, we'll mainta...
[09:39:06] (Merged) jenkins-bot: revertrisk: error handling for batch requests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1016341 (https://phabricator.wikimedia.org/T360406) (owner: AikoChou)
[10:04:32] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1014545
[10:04:50] --^ I'm going to deploy RRLA KI v0.6 and also the batch model with error handling to staging
[10:11:35] * aiko lunch!
[10:19:08] Ack!
[10:19:15] * isaranto lunch as well
[12:43:27] hello folks!
[12:47:27] hi Luca o/
[12:50:48] o/
[12:51:00] isaranto: o/ if you fix https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1015297 I can review/merge/deploy
[12:51:37] o/
[12:51:47] oh I never saw that. fixing it now, thanks!
[12:53:19] elukey: ready!
[12:55:45] isaranto: if you can run docker-pkg and paste the output so people know that it built fine etc..
[12:55:57] ok
[12:57:31] I'm rebuilding the images now, will take a while but will paste it on the task as soon as it's done
[13:00:19] +!
[13:00:20] +1
[13:03:47] Machine-Learning-Team, Patch-For-Review: Improving error message for Revertrisk models - https://phabricator.wikimedia.org/T351278#9688222 (achou) Open→Resolved This task is complete. Check out these examples of new error messages: ` $ curl "https://inference-staging.svc.codfw.wmnet:30...
[13:09:54] Good morning all
[13:10:01] Lift-Wing, Machine-Learning-Team, ORES, ChangeProp, and 6 others: Selectively disable changeprop functionality that is no longer used - https://phabricator.wikimedia.org/T361483#9688388 (akosiaris) Next up. `mobile-sections`. It's deprecated per T328036 for a long time now. I 'll remove rules upd...
[13:11:35] Lift-Wing, Machine-Learning-Team, ORES, ChangeProp, and 6 others: Selectively disable changeprop functionality that is no longer used - https://phabricator.wikimedia.org/T361483#9688445 (akosiaris) >>! In T361483#9680093, @elukey wrote: >>>! In T361483#9680024, @akosiaris wrote: >>>>! In T361483#...
[13:11:44] o/ Chris
[13:14:38] lol I got an error `No space left on device: ` . Deleting some stuff and rebuilding
[13:14:49] disk space finally caught up with me (again)
[14:06:15] elukey: sry I was in a meeting. it is taking a bit longer as I had to delete many images from my disk so now it is redownloading stuff and building etc
[14:06:29] sure sure!
[14:13:53] done!
[14:14:33] elukey: I updated the patch with a comment -> https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1015297. thanks!!
[14:19:07] building the new images
[14:21:55] \o/
[14:43:01] Lift-Wing, Machine-Learning-Team, ORES, ChangeProp, and 5 others: Selectively disable changeprop functionality that is no longer used - https://phabricator.wikimedia.org/T361483#9688741 (SLopes-WMF)
[14:43:34] isaranto: still publishing, it takes a lot
[15:24:27] isaranto: done!
[15:27:19] wow, thanks!
[15:29:29] finally :D
[15:30:53] (PS26) Ilias Sarantopoulos: huggingface: add huggingface image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986)
[15:31:22] all good things take time!
[15:31:49] (PS27) Ilias Sarantopoulos: huggingface: add huggingface image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986)
[15:37:55] I'm updating the README.md and the commit-msg and the above will be ready as well --^
[15:48:40] isaranto: just to be sure, let's verify the final size of the image etc..
[15:49:25] yes, I'll rebuild and paste a comment on the patch
[15:55:00] Machine-Learning-Team, Patch-For-Review: Create a Pytorch base image - https://phabricator.wikimedia.org/T360638#9689230 (elukey) We have created two base images, one for Pytorch 2.2.x and one for 2.1.x, they will be tested and used with Revert Risk ML and Hugging face's model server.
[15:56:13] Machine-Learning-Team: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images - https://phabricator.wikimedia.org/T359067#9689238 (elukey) All subtasks completed, wrapping up the task, thanks to all for feedback/help/support! <3
[15:56:45] Machine-Learning-Team: Add Dragonfly to the ML k8s clusters - https://phabricator.wikimedia.org/T359416#9689243 (elukey) Rolled out Dragonfly to all ml clusters!
[16:10:30] bye folks! See you tomorrow :)
[16:10:33] o/
[16:11:41] night elukey!
[16:12:09] ciao Luca!
[16:12:25] all well with the image it is 11.4GB!
[16:16:36] (PS28) Ilias Sarantopoulos: huggingface: add huggingface image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986)
[16:21:48] (CR) Ilias Sarantopoulos: "This results in a docker image of 11.4GB." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos)
[16:24:39] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9689427 (isarantopoulos) After the released new pytorch image I have reverified the docker image size (11.4GB as described above), and the layers being the following (same as...
[16:30:06] is that too big?
[16:31:43] (CR) CI reject: [V:-1] huggingface: add huggingface image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos)
[16:37:33] No it should be ok! With Luca's base image now we just have ~1GB of extra stuff we put in our production images
[16:38:30] I'm logging off as well folks, have a nice evening!
[16:40:03] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos)
[18:29:36] Machine-Learning-Team, Observability-Metrics: SLO dashboards for Lift Wing showing unexpected values - https://phabricator.wikimedia.org/T359879#9690074 (herron) >>! In T359879#9669840, @elukey wrote: > @herron something really strange: https://w.wiki/9bMW > > I compared the recording rule with the actu...
[18:35:43] Machine-Learning-Team, Observability-Metrics, SRE Observability (FY2023/2024-Q4): Gap in metrics rendered from Thanos Rules - https://phabricator.wikimedia.org/T352756#9690118 (herron)
[18:58:56] Machine-Learning-Team, Patch-For-Review: Error handling in Batch Predictions for RevertRisk Models - https://phabricator.wikimedia.org/T360406#9690212 (achou)
[19:15:38] Machine-Learning-Team, Patch-For-Review: Error handling in Batch Predictions for RevertRisk Models - https://phabricator.wikimedia.org/T360406#9690275 (achou) Open→Resolved This task is complete. Check out these examples: * Batch request where all requests fail: return a 422 (Unproces...
[19:25:34] Machine-Learning-Team: Deploy RR-language-agnostic batch version to prod - https://phabricator.wikimedia.org/T358744#9690308 (achou) I repost [[ https://phabricator.wikimedia.org/T360406#9685087 | what I previously wrote ]] here as the issue is more related to deployment. >>! In T360406#9685087, @achou wrot...
[19:25:55] Machine-Learning-Team, Patch-For-Review: Deploy RevertRisk language-agnostic with knowledge integrity v0.6.0 - https://phabricator.wikimedia.org/T360423#9690315 (achou)
[19:45:18] Machine-Learning-Team, Patch-For-Review: Deploy RevertRisk language-agnostic with knowledge integrity v0.6.0 - https://phabricator.wikimedia.org/T360423#9690484 (achou) The new RRLA model server featuring KI v.0.6 has been deployed to ML-staging. I used `wrk` to conduct load testing and compare the perfo...
[19:50:16] Machine-Learning-Team, Patch-For-Review: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742#9690511 (achou) FYI @MunizaA :) >>! In T360423#9690484, @achou wrote: > The new RRLA model server featuring KI v.0.6 has been deployed to...
[19:52:03] Machine-Learning-Team: Prep work for (re)training workflow sprint - https://phabricator.wikimedia.org/T358748#9690532 (achou) Open→Resolved
[20:08:48] Machine-Learning-Team: Investigate the inconsistent load test results (locust) for revertrisk - https://phabricator.wikimedia.org/T361881 (achou) NEW
[20:13:13] Machine-Learning-Team, Patch-For-Review: Fix locust load testing for Revert Risk models - https://phabricator.wikimedia.org/T361234#9690640 (achou) Open→Resolved a:achou This task is complete. I've created T361881 to follow up on the above test results issue.