[08:05:22] Machine-Learning-Team (Q4 FY2025-26), Wikidata, Wikidata-Omega: Threshold analysis for WMDE use cases of Wikidata RR model - https://phabricator.wikimedia.org/T429049#12017361 (Lydia_Pintscher)
[08:50:14] Machine-Learning-Team (Q4 FY2025-26), ServiceOps new, ServiceOps-SharedInfra: Access control for LiftWing LLM services exposed to external clients through REST Gateway - https://phabricator.wikimedia.org/T426749#12017628 (isarantopoulos) @Clement_Goubert one thing I have missed in my notes above is s...
[09:21:54] Machine-Learning-Team (Q4 FY2025-26), ServiceOps new, ServiceOps-SharedInfra: Access control for LiftWing LLM services exposed to external clients through REST Gateway - https://phabricator.wikimedia.org/T426749#12017722 (Clement_Goubert) >>! In T426749#12017627, @isarantopoulos wrote: > @Clement_Gou...
[10:05:36] (PS1) Kevin Bazira: policy-violation: migrate cope-b-a4b model-server from HF transformers to vLLM [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1302112 (https://phabricator.wikimedia.org/T427497)
[10:48:35] (PS1) Gkyziridis: feat(article-country): Expose OpenAPI spec from the model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1302121
[10:56:51] Lift-Wing, Machine-Learning-Team (Q4 FY2025-26), Essential-Work: Update WMF Debian vLLM image to use pre-built wheels from upstream - https://phabricator.wikimedia.org/T428577#12018055 (kevinbazira) Open→Resolved
[10:58:37] Lift-Wing, Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review: Deploy CoPE-B-A4B on LiftWing - https://phabricator.wikimedia.org/T427497#12018060 (kevinbazira)
[11:00:35] Lift-Wing, Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review: Upgrade production vLLM image to use vLLM version >= 0.19 - https://phabricator.wikimedia.org/T426766#12018071 (isarantopoulos) Open→Resolved As mentioned above an image with villm 0.22 is already available in the registry...
[11:07:30] (CR) Nik Gkountas: [C:+2] docs: Update Lift Wing API info [research/recommendation-api] - https://gerrit.wikimedia.org/r/1301411 (owner: Alex Paskulin)
[11:09:13] (Merged) jenkins-bot: docs: Update Lift Wing API info [research/recommendation-api] - https://gerrit.wikimedia.org/r/1301411 (owner: Alex Paskulin)
[11:43:57] (CR) Bartosz Wójtowicz: [C:+1] "LGTM, thank you!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1302112 (https://phabricator.wikimedia.org/T427497) (owner: Kevin Bazira)
[12:03:05] Machine-Learning-Team (Q4 FY2025-26), SRE, SRE-Access-Requests: Requesting access to ml-lab-users for mfossati - https://phabricator.wikimedia.org/T429148#12018288 (isarantopoulos)
[12:04:29] Machine-Learning-Team (Q4 FY2025-26), SRE, SRE-Access-Requests: Requesting access to ml-lab-users for mfossati - https://phabricator.wikimedia.org/T429148#12018298 (isarantopoulos) I approve
[12:32:59] Machine-Learning-Team (Q4 FY2025-26), SRE, SRE-Access-Requests: Requesting access to ml-lab-users for mfossati - https://phabricator.wikimedia.org/T429148#12018419 (Jdrewniak) As @HSwan-WMF's delegate, I approve.
[12:36:35] (PS1) Bartosz Wójtowicz: Add shared MediaWiki API retry and error classification [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1302145
[13:20:06] (PS1) Kevin Bazira: policy-violation: use common_settings.sh specific to cope-b-a4b model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1302156 (https://phabricator.wikimedia.org/T427497)
[13:26:25] (CR) Bartosz Wójtowicz: [C:+1] policy-violation: use common_settings.sh specific to cope-b-a4b model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1302156 (https://phabricator.wikimedia.org/T427497) (owner: Kevin Bazira)
[13:29:07] Machine-Learning-Team (Q4 FY2025-26): Editing Suggestions - Increase/refresh LLM generated suggestions. - https://phabricator.wikimedia.org/T428882#12018830 (OKarakaya-WMF)
[13:29:12] (CR) Kevin Bazira: [C:+2] policy-violation: use common_settings.sh specific to cope-b-a4b model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1302156 (https://phabricator.wikimedia.org/T427497) (owner: Kevin Bazira)
[13:29:23] Machine-Learning-Team (Q4 FY2025-26): Editing Suggestions - Increase/refresh LLM generated suggestions. - https://phabricator.wikimedia.org/T428882#12018831 (OKarakaya-WMF)
[13:29:36] Machine-Learning-Team (Q4 FY2025-26): Editing Suggestions - Increase/refresh LLM generated suggestions. - https://phabricator.wikimedia.org/T428882#12018832 (OKarakaya-WMF)
[13:29:49] (Merged) jenkins-bot: policy-violation: use common_settings.sh specific to cope-b-a4b model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1302156 (https://phabricator.wikimedia.org/T427497) (owner: Kevin Bazira)
[13:56:06] Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#12019071 (isarantopoulos) Open→Resolved Marking as resolved. As a result of this investigation the Linked Artifact Cache was built and artic...
[14:58:28] Machine-Learning-Team (Q4 FY2025-26): Editing Suggestions - Increase/refresh LLM generated suggestions. - https://phabricator.wikimedia.org/T428882#12019792 (OKarakaya-WMF)
[15:46:33] (PS1) Kevin Bazira: policy-violation: use custom builders to install cope-b-a4b deps in Python3.12 env [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1302198 (https://phabricator.wikimedia.org/T427497)
[15:47:29] (CR) Kevin Bazira: [C:+2] policy-violation: use custom builders to install cope-b-a4b deps in Python3.12 env [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1302198 (https://phabricator.wikimedia.org/T427497) (owner: Kevin Bazira)
[15:49:13] (Merged) jenkins-bot: policy-violation: use custom builders to install cope-b-a4b deps in Python3.12 env [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1302198 (https://phabricator.wikimedia.org/T427497) (owner: Kevin Bazira)
[15:53:40] (PS1) Ilias Sarantopoulos: qwen36-27b: migrate model server to vLLM 0.22 / Python 3.12 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1302199 (https://phabricator.wikimedia.org/T425680)
[15:58:01] (CR) CI reject: [V:-1] qwen36-27b: migrate model server to vLLM 0.22 / Python 3.12 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1302199 (https://phabricator.wikimedia.org/T425680) (owner: Ilias Sarantopoulos)
[16:19:40] (PS2) Ilias Sarantopoulos: qwen36-27b: migrate model server to vLLM 0.22 / Python 3.12 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1302199 (https://phabricator.wikimedia.org/T425680)
[16:21:51] Hey y'all you may want to take a look at your calico-kube-controllers on k8s-mlserve-eqiad
[16:22:10] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3DKubernetesContainerOomKilled&q=prometheus%3Dk8s-mlserve
[16:27:38] Machine-Learning-Team (Q4 FY2025-26): monitoring: View GPU usage per LLM deployment/mode - https://phabricator.wikimedia.org/T429236 (isarantopoulos) NEW
[16:28:55] Machine-Learning-Team (Q4 FY2025-26): monitoring: View GPU usage per LLM deployment/model - https://phabricator.wikimedia.org/T429236#12020406 (isarantopoulos)
[16:30:10] Machine-Learning-Team (Q4 FY2025-26): monitoring: Grafana dashboard for LLM serving on MI300X - https://phabricator.wikimedia.org/T429237 (isarantopoulos) NEW
[16:31:13] Machine-Learning-Team (Q4 FY2025-26): monitoring: View GPU usage per LLM deployment/model - https://phabricator.wikimedia.org/T429236#12020441 (isarantopoulos)
[16:31:54] Machine-Learning-Team (Q4 FY2025-26): monitoring: Grafana dashboard for LLM serving on MI300X - https://phabricator.wikimedia.org/T429237#12020453 (isarantopoulos)
[16:31:55] Machine-Learning-Team (Q4 FY2025-26): monitoring: View GPU usage per LLM deployment/model - https://phabricator.wikimedia.org/T429236#12020454 (isarantopoulos)
[16:31:56] Machine-Learning-Team (Q4 FY2025-26), Goal: Setup MI300X nodes for LLM serving - https://phabricator.wikimedia.org/T424322#12020452 (isarantopoulos)
[16:33:19] Machine-Learning-Team (Q4 FY2025-26): monitoring: View GPU usage per LLM deployment/model - https://phabricator.wikimedia.org/T429236#12020470 (isarantopoulos)
[20:12:26] Machine-Learning-Team (Q4 FY2025-26), SRE, SRE-Access-Requests: Requesting access to ml-lab-users for mfossati - https://phabricator.wikimedia.org/T429148#12021240 (BCornwall) p:Triage→Medium
[21:20:29] Machine-Learning-Team (Q4 FY2025-26), SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to ml-lab-users for mfossati - https://phabricator.wikimedia.org/T429148#12021466 (BCornwall) Open→In progress