[04:07:21] Lift-Wing, Machine-Learning-Team, Editing-team (Tracking), ml-model-requests, OKR-Work (WE1 FY2025-26): Increase batch size in edit-check service - https://phabricator.wikimedia.org/T419527#11700608 (ppelberg) >>! In T419527#11697638, @gkyziridis wrote: > === Update === > The `max_batch_size`...
[05:08:42] (PS1) Kevin Bazira: embeddings: update base image to one that supports aiter OOTB [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1250913 (https://phabricator.wikimedia.org/T419650)
[08:39:20] Lift-Wing, Machine-Learning-Team, Editing-team (Tracking), ml-model-requests, OKR-Work (WE1 FY2025-26): Increase batch size in edit-check service - https://phabricator.wikimedia.org/T419527#11701106 (gkyziridis) >>! In T419527#11700608, @ppelberg wrote: > > Building on the above, what (if an...
[08:50:32] Machine-Learning-Team, Product Safety and Integrity, Patch-For-Review: Deploy CoPE-A on LiftWing - https://phabricator.wikimedia.org/T418832#11701136 (BWojtowicz-WMF) Small update on the progress. First of all, I have been _very_ incorrect previously about our partitioning for MI300X GPUs and what i...
[11:39:15] (PS4) Kgraessle: Expose the revert risk language agnostic prediction boolean via the RecentChanges API [extensions/ORES] - https://gerrit.wikimedia.org/r/1248799 (https://phabricator.wikimedia.org/T407552)
[11:40:12] (CR) Kgraessle: "Done" [extensions/ORES] - https://gerrit.wikimedia.org/r/1248799 (https://phabricator.wikimedia.org/T407552) (owner: Kgraessle)
[12:19:28] (CR) Kevin Bazira: "thank you for adding the cope model to the policy-violation isvc." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1249948 (https://phabricator.wikimedia.org/T418832) (owner: Bartosz Wójtowicz)
[13:01:10] Machine-Learning-Team, Goal, OKR-Work: Formatting for html to text - https://phabricator.wikimedia.org/T419840 (OKarakaya-WMF) NEW
[13:03:01] Machine-Learning-Team, Goal, OKR-Work: Formatting for html to text - https://phabricator.wikimedia.org/T419840#11701925 (OKarakaya-WMF) Issues are fixed in a [single file](https://gitlab.wikimedia.org/repos/machine-learning/exploratory-notebook/-/blob/edit_suggestions_experiments/edit_suggestions/edi...
[13:04:03] Machine-Learning-Team, Goal, OKR-Work: Edit Suggestions - Formatting for html to text - https://phabricator.wikimedia.org/T419840#11701927 (OKarakaya-WMF)
[13:04:16] Machine-Learning-Team, Goal, OKR-Work: Edit Suggestions - Formatting for html to text - https://phabricator.wikimedia.org/T419840#11701939 (OKarakaya-WMF)
[13:40:27] Machine-Learning-Team: kserve helm status is broken across ml clusters - https://phabricator.wikimedia.org/T419040#11702112 (DPogorzelski-WMF) yea i did test this: {F72818361} i think i'll re-check this after kserve update, could be pointless trying to fix it if we want to update kserve
[13:44:30] (CR) Ozge: [C:+1] revise-tone-task-generator: always process edits from testwiki [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1250646 (https://phabricator.wikimedia.org/T416904) (owner: AikoChou)
[13:44:52] (PS3) Bartosz Wójtowicz: policy-violation: Add CoPE-A-9B model server alongside gpt-oss-safeguard-20b. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1249948 (https://phabricator.wikimedia.org/T418832)
[13:46:26] (CR) AikoChou: [C:+2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1250646 (https://phabricator.wikimedia.org/T416904) (owner: AikoChou)
[13:49:37] (Merged) jenkins-bot: revise-tone-task-generator: always process edits from testwiki [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1250646 (https://phabricator.wikimedia.org/T416904) (owner: AikoChou)
[14:02:40] aiko, dpogorzelski, klausman - o/ the SRE team is moving away from pyrra for SLOs, towards a grafana-based solution (sloth). We are importing the dashboards in the new system, and I am reviewing the code for Revert Risk. IIRC the SLO has always been a pilot, not an official one like ToneCheck where we went through the official https://wikitech.wikimedia.org/wiki/SLO process. Do you want to keep it? Or possibly drop it for the moment, and then do
[14:02:40] the formal process for it later?
[14:07:11] (PS4) Bartosz Wójtowicz: policy-violation: Add CoPE-A-9B model server alongside gpt-oss-safeguard-20b. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1249948 (https://phabricator.wikimedia.org/T418832)
[14:08:55] elukey: dropping now and doing it the proper way later SGTM
[14:09:54] (CR) Bartosz Wójtowicz: policy-violation: Add CoPE-A-9B model server alongside gpt-oss-safeguard-20b. (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1249948 (https://phabricator.wikimedia.org/T418832) (owner: Bartosz Wójtowicz)
[14:15:27] (PS5) Bartosz Wójtowicz: policy-violation: Add CoPE-A-9B model server alongside gpt-oss-safeguard-20b. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1249948 (https://phabricator.wikimedia.org/T418832)
[14:32:39] (PS2) Kevin Bazira: embeddings: update base image to one that supports aiter OOTB [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1250913 (https://phabricator.wikimedia.org/T419650)
[14:36:41] (CR) Ozge: [C:+1] embeddings: update base image to one that supports aiter OOTB [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1250913 (https://phabricator.wikimedia.org/T419650) (owner: Kevin Bazira)
[14:40:39] (CR) Kevin Bazira: [C:+2] embeddings: update base image to one that supports aiter OOTB [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1250913 (https://phabricator.wikimedia.org/T419650) (owner: Kevin Bazira)
[14:41:15] (Merged) jenkins-bot: embeddings: update base image to one that supports aiter OOTB [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1250913 (https://phabricator.wikimedia.org/T419650) (owner: Kevin Bazira)
[14:47:44] (PS6) Bartosz Wójtowicz: policy-violation: Add CoPE-A-9B model server alongside gpt-oss-safeguard-20b. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1249948 (https://phabricator.wikimedia.org/T418832)
[14:53:50] (CR) Kevin Bazira: [C:+1] policy-violation: Add CoPE-A-9B model server alongside gpt-oss-safeguard-20b. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1249948 (https://phabricator.wikimedia.org/T418832) (owner: Bartosz Wójtowicz)
[15:00:12] (CR) Bartosz Wójtowicz: [C:+2] policy-violation: Add CoPE-A-9B model server alongside gpt-oss-safeguard-20b. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1249948 (https://phabricator.wikimedia.org/T418832) (owner: Bartosz Wójtowicz)
[15:03:39] (Merged) jenkins-bot: policy-violation: Add CoPE-A-9B model server alongside gpt-oss-safeguard-20b. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1249948 (https://phabricator.wikimedia.org/T418832) (owner: Bartosz Wójtowicz)
[15:13:44] FIRING: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=edit-check&var-backend=edit-check-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:23:44] RESOLVED: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=edit-check&var-backend=edit-check-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[17:15:30] Machine-Learning-Team, Semantic Search, OKR-Work: Migrate embeddings inference service from HF Transformers+CK FlashAttention to vLLM+AITER - https://phabricator.wikimedia.org/T418976#11703437 (OKarakaya-WMF) after aiter changes on staging: ` (venv) ozge@stat1010:~/repos/wiki/gerrit/inference-servi...