[06:45:57] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Host an OpenVINO model in LiftWing - https://phabricator.wikimedia.org/T395012#10866589 (santhosh) I tried to integrate the OpenVINO Model Server into LiftWing. Learnings from the first iteration (see the above WIP patch): The OpenVINO Model Server's latest...
[06:50:58] good morning!
[07:01:54] Machine-Learning-Team: Article Summary Generation and Evaluation Pipeline using vLLM image - https://phabricator.wikimedia.org/T395246#10866605 (kevinbazira) I have decoupled the monolithic pipeline we ran in T395246#10858281 into two KServe custom model-servers: =====1. Summary Generation Server A KServe...
[07:03:15] morning isaranto o/
[07:03:16] I decoupled the monolithic pipeline we ran as `generate_simple_summaries.py` into two KServe custom model-servers (`summary_generation_server.py` and `summary_evaluation_server.py`): https://phabricator.wikimedia.org/T395246#10866605
[07:03:16] Now going to work on a KServe custom transformer to orchestrate the two model-servers.
[07:04:42] o/ kevinbazira I was just writing my thoughts on the task
[07:05:17] I think we can go with something much simpler for now, since this is not a production service for the time being
[07:05:28] writing in the task..
[07:05:52] also good morning <3
[07:22:01] Machine-Learning-Team: Article Summary Generation and Evaluation Pipeline using vLLM image - https://phabricator.wikimedia.org/T395246#10866609 (isarantopoulos) First of all, the above is great work, Kevin! However, since this is not yet a production-level service (nor has it been requested as such), I’d sug...
[07:24:22] kevinbazira: I'm available if you want to further chat about this. I think you've done great work but we'd like to limit the scope of this until we have an official request to host this as a service. For the time being we just want to support an experiment
[07:27:51] ok, thanks for the suggestion. I'll limit this to the notebook(s) you suggested.
[07:35:28] I just mentioned that we could do this in a notebook, not that we have to. I think we can find a middle ground that would set us up for success when and if it is decided to move this to prod
[07:35:47] if you have time we can jump on a call today and chat about this
[07:44:07] okok I've set up a quick call and invited you: https://meet.google.com/nbg-cevs-who
[07:45:52] Machine-Learning-Team, MediaWiki-extensions-ORES, Wikipedia-Android-App-Backlog, FY2024-25 WE4.2, Moderator-Tools-Team (Kanban): Ensure all ORES i18n messages are available for wikis to add revert risk language agnostic filters to - https://phabricator.wikimedia.org/T395481#10866635 (isarantop...
[07:46:05] thanks, be there in a sec!
[09:32:14] isaranto: o/
[09:32:31] \o
[09:32:43] I have some time now, I wanted to schedule a proper window for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1151600 but if nobody is doing anything atm I could take an hour and deploy to staging and some codfw
[09:33:01] just to verify, the rest can be done at a slower pace
[09:33:27] today is a good day, a lot of ppl are afk due to holidays in some countries
[09:33:46] me and kevinbazira are the only ones around
[09:34:21] Kevin is it ok for Luca to try?
[09:36:15] yes please, elukey please proceed
[09:37:31] ah nice okok!
[10:04:40] the article-quality pod (under article-models in staging) is crash looping for
[10:04:43] _catboost.CatBoostError: catboost/libs/model/model_import_interface.h:19: Model file doesn't exist: /mnt/models/catboost_model.cbm
[10:05:52] ah yes Bartosz tried to deploy yesterday but there was an issue that we had missed
[10:06:03] lemme try to fix it in a sec
[10:06:12] yes yes np I am just reporting :)
[10:06:19] ty!
[10:06:32] I had the impression we had rolled back but I was wrong
[10:09:06] elukey: when you have a moment. https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1152027
[10:09:16] I can deploy this right away
[10:10:09] I can take care of it don't worry :)
[10:11:17] I'm on it already!
[10:11:18] :D
[10:11:37] just to verify that my assumption works
[10:12:56] elukey: all good!
[10:13:39] thanks!
[10:13:57] I am finishing up the staging deployments and so far I don't see issues with the cleanup
[10:14:05] niice!
[10:15:34] the only very weird thing is that I get the following intermittently
[10:15:35] Error: template: kserve-inference/templates/serviceaccount.yaml:4:20: executing "kserve-inference/templates/serviceaccount.yaml" at <.Values.inference.predictor.config.serviceAccountName>: nil pointer evaluating interface {}.serviceAccountName
[10:15:42] and I haven't touched it
[10:16:10] it seems as if the control plane is under pressure
[10:16:20] maybe due to the deployments
[10:17:15] I get it for reverted and outlink
[10:17:53] ahhhh okok wait
[10:18:07] in these two probably .Values.inference.predictor.config. is not there
[10:18:14] and it errors out
[10:18:44] nope it is there https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1151600/5/helmfile.d/ml-services/articletopic-outlink/values-ml-staging-codfw.yaml
[10:19:02] sigh
[10:19:10] I left config: without elements
[10:19:15] fixing
[10:22:26] the same happens for reverted, but it doesn't have the same issue
[10:28:21] ah no same issue sigh
[10:30:01] all right I'll proceed with a namespace in codfw
[10:31:26] err sorry eqiad
[10:33:04] ack
[10:33:10] * isaranto afk lunch!
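Note on the template error discussed between 10:15 and 10:28: it came down to a `config:` key left without elements. In YAML, a key with no value parses to null rather than to an empty mapping, which is consistent with the "nil pointer evaluating interface {}.serviceAccountName" message. A minimal Python sketch of a pre-deploy check for this kind of mistake, assuming the values file has already been parsed into nested dicts; `find_null_keys` and the sample values (including `example-sa`) are hypothetical and not part of deployment-charts:

```python
from typing import Any, Iterator


def find_null_keys(values: dict[str, Any], prefix: str = "") -> Iterator[str]:
    """Yield dotted paths of keys whose value is None (YAML null)."""
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else key
        if value is None:
            yield path
        elif isinstance(value, dict):
            # Recurse into nested mappings, extending the dotted path.
            yield from find_null_keys(value, path)


# Hypothetical parsed values: `config:` with no elements becomes None.
broken = {"inference": {"predictor": {"config": None}}}
# Hypothetical fixed values: `config` is a mapping with the expected key.
fixed = {"inference": {"predictor": {"config": {"serviceAccountName": "example-sa"}}}}

print(list(find_null_keys(broken)))  # ['inference.predictor.config']
print(list(find_null_keys(fixed)))   # []
```

A check like this only flags empty keys; whether a given null is actually a problem depends on how the chart templates use it.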
[10:34:07] article-descriptions doesn't seem to show anything weird, pod is being initialized right now
[10:34:28] all good
[10:46:23] I'll finish later on
[11:04:27] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Host an OpenVINO model in LiftWing - https://phabricator.wikimedia.org/T395012#10867030 (santhosh) Regarding the KServe API and OpenVINO Model Server: The KServe-compatible OVMS REST API is documented at https://docs.openvino.ai/2025/model-server...
[12:21:03] eqiad done!
[12:22:38] for completeness, I'll run the deployments in codfw too (there is a little cleanup)
[12:32:57] aaand done
[12:33:30] ok so we should be at this stage now https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1151604
[12:33:36] namely, turn on all the security features
[12:33:53] ah no ok, we also need to update istio
[12:35:00] all done
[12:36:11] proceeding with the last bit
[12:36:24] awesome!
[12:38:05] tried to kill viwiki reverted and everything went fine
[12:45:30] the only issue that I see is with viwiki, it times out if I try
[12:45:31] httpbb --hosts inference.svc.codfw.wmnet --https_port 30443 /srv/deployment/httpbb-tests/liftwing/production/test_revscoring-editquality-reverted.yaml
[12:45:37] but not in eqiad
[12:46:53] mmm very high latency in https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?orgId=1&from=now-15m&to=now&timezone=utc&var-cluster=D-2kXvZnk&var-namespace=revscoring-editquality-reverted&var-component=$__all&var-model_name=viwiki-reverted
[12:47:50] ah wait something strange, not all viwiki replicas have been cleared in the deployment
[12:47:53] lemme recycle them
[12:49:28] and the reason is
[12:49:29] Error creating: pods "viwiki-reverted-predictor-default-00028-deployment-794765ckkd47" is forbidden: exceeded quota: quota-compute-resources, requested: limits.cpu=6, used: limits.cpu=90, limited: limits.cpu=90
[12:51:07] I manually bumped to 100 and it worked
[12:51:20] maybe viwiki has more traffic and we need to bump resource quotas?
[12:51:26] I see 4 replicas
[12:52:16] now httpbb works
[12:54:43] thanks for reporting. I'll put it in our todo list to check viwiki-reverted.
[13:03:12] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban), Patch-For-Review: [Spike] Investigate why filtering wasn't working on testwiki - https://phabricator.wikimedia.org/T395256#10867452 (isarantopoulos) Thanks for tackling that! I tried testing this locally but it didn't...
[13:23:38] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban), Patch-For-Review: [Spike] Investigate why filtering wasn't working on testwiki - https://phabricator.wikimedia.org/T395256#10867506 (PatchDemoBot) Test wiki **created** on [[ https://patchdemo.wmcloud.org | Patch demo...
[13:36:55] started to recycle all the isvc pods on ml-serve-eqiad, it will take a while (I've set a slow pace)
[14:46:23] (still in progress)
[15:26:13] aaand done
[15:29:09] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10867982 (elukey) @klausman since today it was very quiet for ML, I took the opportunity to apply all the changes stated in T369493#107928...
[15:30:39] the only error that httpbb reports seems to be for ptwiki revscoring articlequality, errors while fetching the features (503 from mw)
[15:32:48] deleted the pod, now all works
[15:37:27] httpbb works on all clusters
[15:43:10] * isaranto afk
[15:51:39] * elukey afk!
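Note on the quota rejection reported at 12:49: the numbers in the error message explain it directly. The namespace already used limits.cpu=90 against a limit of 90, so a new pod requesting limits.cpu=6 would bring the total to 96; after the manual bump to 100 the same request fits. A minimal Python sketch of that arithmetic; `fits_quota` is a hypothetical helper that mirrors the figures in the error message, not the actual Kubernetes admission code:

```python
def fits_quota(used: int, requested: int, hard: int) -> bool:
    """Return True if used + requested stays within the hard limit."""
    return used + requested <= hard


# Figures from the error message: used=90, requested=6, limit=90.
print(fits_quota(used=90, requested=6, hard=90))   # False

# Same request after the manual bump of the limit to 100.
print(fits_quota(used=90, requested=6, hard=100))  # True
```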
[17:04:03] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team (Kanban), Patch-For-Review: Improve ORES extension table backfill script - https://phabricator.wikimedia.org/T395253#10868405 (Kgraessle)
[17:41:42] Machine-Learning-Team: Non-English articles show autogenerated English summaries - https://phabricator.wikimedia.org/T395596 (putnik) NEW
[17:42:06] Machine-Learning-Team: Non-English articles show autogenerated English summaries - https://phabricator.wikimedia.org/T395596#10868571 (putnik)
[17:52:16] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban), Patch-For-Review: [Spike] Investigate why filtering wasn't working on testwiki - https://phabricator.wikimedia.org/T395256#10868613 (PatchDemoBot) Test wiki on [[ https://patchdemo.wmcloud.org | Patch demo ]] by KGra...
[18:02:58] (PS7) Kgraessle: Fix highlighting for revertrisklanguageagnostic model. [extensions/ORES] - https://gerrit.wikimedia.org/r/1151700 (https://phabricator.wikimedia.org/T395256)
[18:04:20] (PS8) Kgraessle: Fix highlighting for revertrisklanguageagnostic model. [extensions/ORES] - https://gerrit.wikimedia.org/r/1151700 (https://phabricator.wikimedia.org/T395256)
[18:09:20] (PS9) Kgraessle: Fix highlighting for revertrisklanguageagnostic model. - Removed unnecessary javascript module [extensions/ORES] - https://gerrit.wikimedia.org/r/1151700 (https://phabricator.wikimedia.org/T395256)
[18:21:58] Machine-Learning-Team, MediaWiki-extensions-ORES, Wikipedia-Android-App-Backlog, FY2024-25 WE4.2, Moderator-Tools-Team (Kanban): Ensure all ORES i18n messages are available for wikis to add revert risk language agnostic filters to - https://phabricator.wikimedia.org/T395481#10868689 (Kgraessle)
[18:27:21] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban), Patch-For-Review: [Spike] Investigate why filtering wasn't working on testwiki - https://phabricator.wikimedia.org/T395256#10868721 (Kgraessle) >>! In T395256#10867452, @isarantopoulos wrote: > Thanks for tackling tha...
[18:33:40] (CR) Kgraessle: [C:+1] ores-extension: Add extra logging [extensions/ORES] - https://gerrit.wikimedia.org/r/1151706 (https://phabricator.wikimedia.org/T395253) (owner: Gkyziridis)
[20:02:54] Machine-Learning-Team, MediaWiki-extensions-ORES, Wikipedia-Android-App-Backlog, FY2024-25 WE4.2, Moderator-Tools-Team (Kanban): Ensure all ORES i18n messages are available for wikis to add revert risk language agnostic filters to - https://phabricator.wikimedia.org/T395481#10869016 (Kgraessle)
[20:08:15] Machine-Learning-Team, MediaWiki-extensions-ORES, Wikipedia-Android-App-Backlog, FY2024-25 WE4.2, Moderator-Tools-Team (Kanban): Ensure all ORES i18n messages are available for wikis to add revert risk language agnostic filters to - https://phabricator.wikimedia.org/T395481#10869030 (Kgraessle)