[04:23:06] <wikibugs>	 Machine-Learning-Team, LPL Essential (LPL Essential 2025 Apr-Jun: CX), Patch-For-Review: Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10794576 (KartikMistry) All MinT models are available at `stat1008:/home/kartik/models`. Note that we generally preserve the models subd...
[06:30:51] <wikibugs>	 Machine-Learning-Team, Discovery-Search, MediaWiki-Search: Build and enable thesaurus / synonym list for search - https://phabricator.wikimedia.org/T85770#10794641 (Jack_who_built_the_house) Nobody seems to care about this, yet in my belief this is one of the crucial points why people (e.g. me) would...
[06:58:11] <isaranto>	 o/ good morning folks, I'm back
[07:04:50] <georgekyz>	 good morning folks, welcome back Ilias
[07:08:30] <ozge_>	 Good morning! welcome back
[07:09:26] <wikibugs>	 Machine-Learning-Team, LPL Essential (LPL Essential 2025 Apr-Jun: CX), Patch-For-Review: Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10794669 (Nikerabbit)
[07:10:13] <wikibugs>	 Machine-Learning-Team, Language and Product Localization, Patch-For-Review: Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10794671 (Nikerabbit)
[07:16:19] <wikibugs>	 Lift-Wing, Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10794682 (kevinbazira) In T385173#10790749, the initial `wmf-debian-vllm:fa-slim` image was built by tracing the serving of the `aya-expanse-8b` model using `docker-slim`. While this slimmed ima...
[07:20:45] <kevinbazira>	 o/ morning morning, welcome back!
[07:20:45] <kevinbazira>	 finally fixed the `bus error`, the slimmed down `wmf-debian-vllm` image that has FlashAttention now serves both `aya-expanse` 8b and 32b successfully: https://phabricator.wikimedia.org/T385173#10794682
[07:22:29] <isaranto>	 o/ nice work Kevin
[07:22:43] <isaranto>	 10GB less is a huge improvement!
[07:26:40] <kevinbazira>	 🎉
[08:47:35] <isaranto>	 o/ bartosz , welcome!
[08:48:32] <bartosz>	 Hello everyone!
[09:06:44] <georgekyz>	 Welcome bartosz 
[09:07:55] <elukey>	 welcome bartosz!
[09:44:57] <isaranto>	 georgekyz: shall we merge this ? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1140992
[09:55:13] <georgekyz>	 isaranto: But we merged this one: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1140195
[09:57:17] <georgekyz>	 isaranto: Do you believe that we can keep them together ?
[09:57:34] <isaranto>	 the api-gw is using the other service though (edit-check-staging) so it needs to have the gpu removed (like it is now)
[09:57:56] <isaranto>	 the patch I sent is just to keep the charts updated with the current status
[09:58:09] <isaranto>	 the gpu has already been removed by you folks last week as I checked
[10:00:05] <georgekyz>	 exactly... in the edit-check-cpu one .
[10:00:42] <isaranto>	 no in the other one as well
[10:00:52] <georgekyz>	 But now I am lost.... when we are hitting https://inference-staging.svc.codfw.wmnet:30443/v1/models/edit-check-staging:predict we are hitting the edit-check-cpu ??
[10:02:22] <isaranto>	 sorry in a meeting -- will respond later
[10:03:19] <georgekyz>	 ok
[10:27:59] <klausman>	 Welcome, Bartosz!
[11:05:38] <isaranto>	 I'm back. georgekyz I'm referring to the api gw endpoint https://api.wikimedia.org/service/lw/inference/v1/models/edit-check-staging:predict  which points to the edit-check-staging deployment 
[11:06:38] <georgekyz>	 yep, I got what you mean. I +1'd the patch, we can merge it
[11:06:49] <isaranto>	 ack, sorry for the confusion
[11:07:47] <georgekyz>	 I got confused because Aiko had probably already disabled the gpu by editing the isvc directly, and then she pushed the patch adding the `edit-check-cpu` placeholder.
[12:05:04] <kevinbazira>	 the scripts and steps used in the WMF Debian vLLM image porting process have been added to a gitlab repo:
[12:05:04] <kevinbazira>	 https://gitlab.wikimedia.org/repos/machine-learning/wmf-debian-vllm
[12:43:30] <isaranto>	 great! will review. let's also talk about this either today or tomorrow in our meetings
[12:44:47] <isaranto>	 I think we'd like to rerun the benchmarks for aya-expanse (at least for 8b) and check the latencies we get there
[13:18:33] <wikibugs>	 Lift-Wing, Machine-Learning-Team, Wikimedia Enterprise - Content Integrity: Load test the language agnostic article-quality model - https://phabricator.wikimedia.org/T388805#10795907 (isarantopoulos) Open→Resolved
[13:19:50] <wikibugs>	 Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10795913 (isarantopoulos) In progress→Resolved
[13:42:35] <wikibugs>	 Machine-Learning-Team, Goal: Q4 24-25 Goal: Simple article summaries: Set up the software stack for efficiently serving production LLMs - https://phabricator.wikimedia.org/T391941#10796083 (kevinbazira) We ported the upstream [[ https://hub.docker.com/layers/rocm/vllm/rocm6.3.1_mi300_ubuntu22.04_py3.12_v...
[13:53:49] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[13:53:49] <jinxer-wm>	 Deployment outlink-topic-model-predictor-default-00023-deployment in articletopic-outlink at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[13:53:49] <jinxer-wm>	 https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s-mlserve&var-namespace=articletopic-outlink&var-deployment=outlink-topic-model-predictor-default-00023-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:00:18] <wikibugs>	 Machine-Learning-Team, Goal: Q4 24-25 Goal: Simple article summaries: Set up the software stack for efficiently serving production LLMs - https://phabricator.wikimedia.org/T391941#10796156 (kevinbazira) We added [[ https://github.com/Dao-AILab/flash-attention | CK FlashAttention ]] to the wmf-debian-vllm...
[14:08:49] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment outlink-topic-model-predictor-default-00023-deployment in articletopic-outlink at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:37:53] <wikibugs>	 Machine-Learning-Team, Goal: Q4 24-25 Goal: Investigate Add a link model training and deployment - https://phabricator.wikimedia.org/T393474 (isarantopoulos) NEW
[14:38:56] <wikibugs>	 Machine-Learning-Team: ML Services causing log spam - https://phabricator.wikimedia.org/T393475 (klausman) NEW
[14:40:37] <wikibugs>	 Machine-Learning-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10796332 (isarantopoulos)
[15:52:01] <isaranto>	 klausman: I'm looking into this alert https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas and I see that there is no pod in eqiad
[15:52:25] <isaranto>	 I do a helmfile diff and I see the seccompProfile change. 
[15:52:47] <isaranto>	 is it sth you were working on or is it random?
[15:53:37] <isaranto>	 ah I saw some discussion related to this yesterday.
[15:59:05] <klausman>	 Though AIUI, Luca's restarting should have resolved discrepancies. What diff do you see?
[16:00:21] <klausman>	 elukey: should we (still) be seeing PSS diffs in eqiad?
[16:01:40] <isaranto>	 the addition of the new policy 
[16:01:40] <isaranto>	 ```
[16:01:40] <isaranto>	 +     securityContext:
[16:01:41] <isaranto>	 +       seccompProfile:
[16:01:41] <isaranto>	 +         type: RuntimeDefault
[16:01:41] <isaranto>	 ```
[16:01:49] <klausman>	 isaranto: I don't see a diff with `helmfile -e ml-serve-eqiad  -i diff --context=3`
[16:01:52] <isaranto>	 shall I do a sync?
[16:01:57] <isaranto>	 it is in codfw
[16:02:05] <isaranto>	 sorry my bad
[16:02:37] <klausman>	 It's a bit odd that e.g. revertrisk in codfw doesn't have the same
[16:02:38] <isaranto>	 the alert is about codfw and that is where I see the diff, but I mentioned eqiad by mistake
[16:03:54] <klausman>	 I'd sync it, it's probably fine
[16:04:57] <isaranto>	 cool, pods are starting up!
[16:05:15] <isaranto>	 all good now, thanks!
[16:05:20] <klausman>	 np!
[16:05:34] <klausman>	 I'm heading out now, already late for my b'day party :)
[16:05:41] <isaranto>	 the alert will probably go away now
[16:05:58] <isaranto>	 ohhhhh happy Birthdayyyyyyy 🎉 <3
[16:06:04] <isaranto>	 going afk folks, will check later if anything is needed
[16:06:55] <jinxer-wm>	 RESOLVED: [2x] KubernetesDeploymentUnavailableReplicas: Deployment outlink-topic-model-predictor-default-00023-deployment in articletopic-outlink at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[16:07:36] <klausman>	 isaranto: thanl you!
[16:07:39] <klausman>	 thank*
[19:07:59] <elukey>	 klausman, isaranto - sorry my bad, it must have slipped from my deploy list, but now I am not sure why httpbb worked without reporting issues
[19:12:59] <elukey>	 so from https://grafana.wikimedia.org/d/c6GYmqdnz/knative-serving?orgId=1&var-cluster=codfw%20prometheus%2Fk8s-mlserve&var-knative_namespace=knative-serving&var-revisions_namespace=articletopic-outlink&from=1746529585195&to=1746552463990 
[19:13:30] <elukey>	 it seems that we went from 1 to zero replicas, and then I assume that the pod wasn't coming up due to the missing seccomp policy (so PSS prevented the pod)
[19:13:40] <elukey>	 now why it went from 1 to zero I have no idea
[19:13:46] <elukey>	 I don't see events etc..
[19:14:29] <elukey>	 is there a scale-to-zero policy in place?
[19:15:19] <elukey>	 mmm no min-scale is 1
[19:15:39] <elukey>	 very weird, I'll recheck tomorrow
[19:15:44] <elukey>	 sorry for the alert!