[05:40:38] Lift-Wing, Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10790749 (kevinbazira) When we added FlashAttention to the `wmf-debian-vllm` image in T385173#10780983, the image size grew to ~58GB: ` $ docker images REPOSITORY T...
[05:51:43] o/ morning morning
[05:53:09] the slimmed down `wmf-debian-vllm` image that has FlashAttention serves `aya-expanse-8b` successfully: https://phabricator.wikimedia.org/T385173#10790749
[05:53:09] `aya-expanse-32b` model loading fails with a `bus error` in the same image: https://phabricator.wikimedia.org/P75745
[05:53:09] investigating this issue ...
[06:54:59] Good Morning. I'll be putting models for MinT with sha512sum as a next step for https://phabricator.wikimedia.org/T391958
[07:02:56] ack
[07:38:58] kart_: o/ remember that it is the ML team that needs to push the data to the bucket, so you'll have to place the model binary into an approved place (statxxxx, gdrive, etc..)
[07:39:12] and the sha512 will need to go in the task
[08:15:37] Yes. Is people.w.o OK?
[08:21:19] it is probably better a stat1XXX node for the ml-team
[08:21:43] more convenient, but it can work as well
[08:21:55] since we have the sha512 on phab
[08:28:08] OK. I'll do that.
[09:36:47] I am going to depool codfw traffic from inference to test https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1140140
[09:44:47] all right I am going to kill all pods in revscoring reverted, test etc..
[09:45:06] basically the new setting tells knative to inject security settings to all new isvc pods
[10:00:35] all right all good
[10:05:57] I am now recycling all the isvc pods, it will take a bit
[10:16:27] thank you! can you let me know when done? There's a bunch of reboots necessary, and I'd rather not do that in parallel :)
[10:18:14] right yes! I think ETA ~1h more or less
[10:19:06] ack!
[10:35:26] PASS: 114 requests sent to inference.svc.codfw.wmnet. All assertions passed.
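(Note on the sha512 exchange above: a minimal sketch of how a model binary could be checked against a checksum posted on a task before it is pushed to the bucket. The function names and the chunk size are hypothetical and not taken from the log; only Python's standard `hashlib` and `hmac` modules are used.)

```python
import hashlib
import hmac


def sha512_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-512 digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat for multi-gigabyte model files.
    The 1 MiB chunk size is an arbitrary choice for this sketch.
    """
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, published_hex):
    """Compare a file's SHA-512 with a hex digest copied from elsewhere.

    `published_hex` stands in for the value recorded on the task; surrounding
    whitespace and letter case are normalised before comparing.
    """
    expected = published_hex.strip().lower()
    return hmac.compare_digest(sha512_of_file(path), expected)
```

The digest produced this way should be the same hex string that `sha512sum` prints in its first column, so either tool could be used on each side of the handoff.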
[10:36:20] klausman: all done and repooled
[10:37:55] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10791637 (elukey) Summary of what I've done: - depooled codfw from inference - deployed https://gerrit.wikimedia.org/r/1140140 - recycled...
[10:39:58] details of what I did in https://phabricator.wikimedia.org/T369493#10791637
[10:40:12] klausman: so now we should be able to move to PSS on ml-serve-codfw
[10:40:23] following https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/PSP_replacement
[10:41:31] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10791655 (elukey) The following worked fine on ml-serve-codfw: ` root@deploy1003:~# kubectl get ns -l pod-security.kubernetes.io/audit=re...
[10:49:42] afk for lunch, ping me if anything in ml-serve-codfw doesn't look ok
[11:12:19] roger! and thanks again!
[11:13:53] * klausman lunch as well
[13:17:15] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10792063 (elukey) Next step: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/PSP_replacement#Enforce_the_restricted_PSS for ml-ser...
[13:52:07] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10792320 (elukey) Stalled→Open
[14:21:13] Machine-Learning-Team, collaboration-services, Discovery-Search (2025.05.02 - 2025.05.23), Wikipedia-iOS-App-Backlog (iOS Release FY2024-25): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10792537 (Gehel)
[14:47:39] ok PSS enforced on ml-serve-codfw, and PSP disabled
[14:47:59] I am doing some checks but it should be as good as on staging now
[14:52:39] Machine-Learning-Team, Goal: Q4 24-25 Goal: Productionize peacock detection model - https://phabricator.wikimedia.org/T391940#10792759 (achou) Update: I managed to run a proof of concept DAG to collect peacock data in the [[ https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/ru...
[14:56:38] I updated the istio config as well, forgot to do it, so now we properly inject the seccomp stuff on the gateway pods etc..
[14:56:44] tried to kill some of them, all good
[14:58:55] excellent
[15:16:07] klausman: last one is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1141928
[15:16:45] for eqiad we can do the same as for containerd+bookworm, you plan/execute and I help if needed, wdyt?
[15:23:17] SGTM
[15:25:07] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10792884 (elukey) High level procedure for eqiad: - Prerequisite: T387854 - Upgrade the knative-serving pods to their latest version. -...
[15:25:39] perfect, I added a high level procedure to the task so we don't forget
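
(Note on the PSS migration discussed above: a rough sketch of how namespace labels could be checked for the Pod Security Standards level they enforce. It relies only on the upstream Kubernetes label convention `pod-security.kubernetes.io/<mode>` with modes `enforce`, `audit`, `warn` and levels `privileged`, `baseline`, `restricted`. The helper names and the example namespaces are hypothetical, and this is not the procedure from the linked wikitech page.)

```python
PSS_LABEL_PREFIX = "pod-security.kubernetes.io/"
PSS_MODES = ("enforce", "audit", "warn")
PSS_LEVELS = ("privileged", "baseline", "restricted")


def pss_level(labels, mode="enforce"):
    """Return the PSS level a namespace's labels set for `mode`, or None."""
    if mode not in PSS_MODES:
        raise ValueError(f"unknown PSS mode: {mode!r}")
    level = labels.get(PSS_LABEL_PREFIX + mode)
    if level is not None and level not in PSS_LEVELS:
        raise ValueError(f"unknown PSS level: {level!r}")
    return level


def namespaces_not_enforcing(namespaces, level="restricted"):
    """List namespaces whose `enforce` label is missing or differs from `level`.

    `namespaces` maps a namespace name to its label dict, for example as
    extracted from the JSON output of `kubectl get ns -o json`.
    """
    return sorted(
        name
        for name, labels in namespaces.items()
        if pss_level(labels, "enforce") != level
    )


# Hypothetical namespaces, for illustration only.
example = {
    "ns-a": {"pod-security.kubernetes.io/enforce": "restricted"},
    "ns-b": {"pod-security.kubernetes.io/audit": "restricted"},
    "ns-c": {"pod-security.kubernetes.io/enforce": "baseline"},
}
print(namespaces_not_enforcing(example))  # ['ns-b', 'ns-c']
```

A namespace that only carries the `audit` label, like `ns-b` here, is reported because auditing alone does not block non-compliant pods.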