[04:23:06] Machine-Learning-Team, LPL Essential (LPL Essential 2025 Apr-Jun: CX), Patch-For-Review: Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10794576 (KartikMistry) All MinT models are available at `stat1008:/home/kartik/models`. Note that we generally preserve the models subd...
[06:30:51] Machine-Learning-Team, Discovery-Search, MediaWiki-Search: Build and enable thesaurus / synonym list for search - https://phabricator.wikimedia.org/T85770#10794641 (Jack_who_built_the_house) Nobody seems to care about this, yet in my belief this is one of the crucial points why people (e.g. me) would...
[06:58:11] o/ good morning folks, I'm back
[07:04:50] good morning folks, welcome back Ilias
[07:08:30] Good morning! welcome back
[07:09:26] Machine-Learning-Team, LPL Essential (LPL Essential 2025 Apr-Jun: CX), Patch-For-Review: Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10794669 (Nikerabbit)
[07:10:13] Machine-Learning-Team, Language and Product Localization, Patch-For-Review: Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10794671 (Nikerabbit)
[07:16:19] Lift-Wing, Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10794682 (kevinbazira) In T385173#10790749, the initial `wmf-debian-vllm:fa-slim` image was built by tracing the serving of the `aya-expanse-8b` model using `docker-slim`. While this slimmed ima...
[07:20:45] o/ morning morning, welcome back!
[07:20:45] finally fixed the `bus error`, the slimmed down `wmf-debian-vllm` image that has FlashAttention now serves both `aya-expanse` 8b and 32b successfully: https://phabricator.wikimedia.org/T385173#10794682
[07:22:29] o/ nice work Kevin
[07:22:43] 10GB less is a huge improvement!
[07:26:40] 🎉
[08:47:35] o/ bartosz , welcome!
[08:48:32] Hello everyone!
[09:06:44] Welcome bartosz
[09:07:55] welcome bartosz!
[09:44:57] georgekyz: shall we merge this ? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1140992
[09:55:13] isaranto: But we merge this one: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1140195
[09:57:17] isaranto: Do you believe that we can keep them together ?
[09:57:34] the api-gw is using the other service though (edit-check-staging) so it needs to have the gpu removed (like it is now)
[09:57:56] the patch I sent is just to keep charts updated with the current status
[09:58:09] the gpu has already been removed by you folks last week as I checked
[10:00:05] exactly... in the edit-check-cpu one .
[10:00:42] no in the other one as well
[10:00:52] But now I am lost.... when we are hitting the https://inference-staging.svc.codfw.wmnet:30443/v1/models/edit-check-staging:predict we are hitting the edit-check-cpu ??
[10:02:22] sorry in a meeting -- will respond later
[10:03:19] ok \
[10:27:59] Welcome, Bartosz!
[11:05:38] I'm back. georgekyz I'm referring to the api gw endpoint https://api.wikimedia.org/service/lw/inference/v1/models/edit-check-staging:predict which points to the edit-check-staging deployment
[11:06:38] yeap I got what you mean, I +1 the patch we can merge it
[11:06:49] ack, sorry for the confusion
[11:07:47] I got confused because probably Aiko had disabled the gpu already from editing the isvc directly and then she pushed the patch for adding the `edit-check-cpu` placeholder.
[12:05:04] the scripts and steps used in the WMF Debian vLLM image porting process have been added to a gitlab repo:
[12:05:04] https://gitlab.wikimedia.org/repos/machine-learning/wmf-debian-vllm
[12:43:30] great! will review.
let's also talk about this either today or tomorrow in our meetings
[12:44:47] I think we'd like to rerun the benchmarks for aya-expanse (at least for 8b) and check the latencies we get there
[13:18:33] Lift-Wing, Machine-Learning-Team, Wikimedia Enterprise - Content Integrity: Load test the language agnostic article-quality model - https://phabricator.wikimedia.org/T388805#10795907 (isarantopoulos) Open→Resolved
[13:19:50] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10795913 (isarantopoulos) In progress→Resolved
[13:42:35] Machine-Learning-Team, Goal: Q4 24-25 Goal: Simple article summaries: Set up the software stack for efficiently serving production LLMs - https://phabricator.wikimedia.org/T391941#10796083 (kevinbazira) We ported the upstream [[ https://hub.docker.com/layers/rocm/vllm/rocm6.3.1_mi300_ubuntu22.04_py3.12_v...
[13:53:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[13:53:49] Deployment outlink-topic-model-predictor-default-00023-deployment in articletopic-outlink at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[13:53:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s-mlserve&var-namespace=articletopic-outlink&var-deployment=outlink-topic-model-predictor-default-00023-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:00:18] Machine-Learning-Team, Goal: Q4 24-25 Goal: Simple article summaries: Set up the software stack for efficiently serving production LLMs - https://phabricator.wikimedia.org/T391941#10796156 (kevinbazira) We added [[ https://github.com/Dao-AILab/flash-attention | CK FlashAttention ]] to the wmf-debian-vllm...
[14:08:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment outlink-topic-model-predictor-default-00023-deployment in articletopic-outlink at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:37:53] Machine-Learning-Team, Goal: Q4 24-25 Goal: Investigate Add a link model training and deployment - https://phabricator.wikimedia.org/T393474 (isarantopoulos) NEW
[14:38:56] Machine-Learning-Team: ML Services causing log spam - https://phabricator.wikimedia.org/T393475 (klausman) NEW
[14:40:37] Machine-Learning-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10796332 (isarantopoulos)
[15:52:01] klausman: I'm looking into this alert https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas and I see that there is no pod in eqiad
[15:52:25] I do a helmfile diff and I see the seccompProfile change.
[15:52:47] is it sth you were working on or is it random?
[15:53:37] ah I saw some discussion related to this yesterday.
[15:59:05] Though AIUI, Luca's restarting should have resolved discrepancies. What diff do you see?
[16:00:21] elukey: should we (still) be seeing PSS diffs in eqiad?
[16:01:40] the addition of the new policy
[16:01:40] ```
[16:01:40] + securityContext:
[16:01:41] +   seccompProfile:
[16:01:41] +     type: RuntimeDefault
[16:01:41] ```
[16:01:49] isaranto: I don't see a diff with `helmfile -e ml-serve-eqiad -i diff --context=3`
[16:01:52] shall I do a sync?
[16:01:57] it is in codfw
[16:02:05] sorry my bad
[16:02:37] It's a bit odd that e.g.
revertrisk in codfw doesn't have the same
[16:02:38] the alert is about codfw and that is where I see the diff but I mentioned eqiad by mistake
[16:03:54] I'd sync it, it's probably fine
[16:04:57] cool, pods are starting up!
[16:05:15] all good now, thanks!
[16:05:20] np!
[16:05:34] I'm heading out now, already late for my b'day party :)
[16:05:41] the alert will probably go away now
[16:05:58] ohhhhh happy Birthdayyyyyyy 🎉 <3
[16:06:04] going afk folks, will check later if anything is needed
[16:06:55] RESOLVED: [2x] KubernetesDeploymentUnavailableReplicas: Deployment outlink-topic-model-predictor-default-00023-deployment in articletopic-outlink at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[16:07:36] isaranto: thanl you!
[16:07:39] thank*
[19:07:59] klausman, isaranto - sorry my bad, it must have slipped from my deploy list, but now I am not sure why httpb worked without reporting issues
[19:12:59] so from https://grafana.wikimedia.org/d/c6GYmqdnz/knative-serving?orgId=1&var-cluster=codfw%20prometheus%2Fk8s-mlserve&var-knative_namespace=knative-serving&var-revisions_namespace=articletopic-outlink&from=1746529585195&to=1746552463990
[19:13:30] it seems that we went from 1 to zero replicas, and then I assume that the pod wasn't coming up due to the missing seccomp policy (so PSS prevented the pod)
[19:13:40] now why it went from 1 to zero I have no idea
[19:13:46] I don't see events etc..
[19:14:29] is there a scale-to-zero policy in place?
[19:15:19] mmm no min-scale is 1
[19:15:39] very weird, I'll recheck tomorrow
[19:15:44] sorry for the alert!
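
Side note on the seccomp diagnosis in the log: a rough sketch of the kind of check that Pod Security Standards admission would apply, assuming the namespace enforces the "restricted" profile (the log does not say which level is enforced). The function names and the `predictor` container name are hypothetical, and the logic is simplified (it ignores ephemeral containers); it is not the actual admission controller code.

```python
# Simplified sketch of the "restricted" PSS seccomp rule; names are hypothetical.
ALLOWED_SECCOMP_TYPES = {"RuntimeDefault", "Localhost"}


def seccomp_type(security_context):
    """Return seccompProfile.type from a securityContext dict, or None if unset."""
    return ((security_context or {}).get("seccompProfile") or {}).get("type")


def seccomp_violations(pod_spec):
    """List the containers whose effective seccomp profile is missing or disallowed."""
    pod_level = seccomp_type(pod_spec.get("securityContext"))
    violations = []
    # An explicitly disallowed pod-level value is itself a violation.
    if pod_level is not None and pod_level not in ALLOWED_SECCOMP_TYPES:
        violations.append("<pod>")
    for key in ("initContainers", "containers"):
        for container in pod_spec.get(key) or []:
            # A container-level setting overrides the pod-level one.
            effective = seccomp_type(container.get("securityContext")) or pod_level
            if effective not in ALLOWED_SECCOMP_TYPES:
                violations.append(container.get("name", "<unnamed>"))
    return violations


# Pod spec before the helmfile sync: no seccompProfile anywhere.
before = {"containers": [{"name": "predictor"}]}
# Pod spec after the sync: pod-level RuntimeDefault, as in the diff from the log.
after = {
    "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
    "containers": [{"name": "predictor"}],
}

print(seccomp_violations(before))
print(seccomp_violations(after))
```

Under these assumptions the pre-sync spec is flagged and the post-sync spec is not, which would match pods failing to start until the chart change was synced.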