[06:33:13] good morning
[06:33:25] good morning
[07:17:11] morning folks o/
[08:19:55] Machine-Learning-Team, Goal, Patch-For-Review: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11137893 (OKarakaya-WMF)
[09:11:38] isaranto: The sidecar container is the agent, right? Where is this actually implemented for experimental? Is it something that we can configure in the deployment-charts? I see exactly the same format for the experimental and edit-check namespaces. In both namespaces I can see the section "These endpoints should be reachable by Istio proxy sidecars."
[09:15:18] yes the agent is the sidecar container for the batcher. afaik the only thing we need to configure is the batcher (as we have it) and the rest is implemented by the chart.
[09:15:28] i'm referring to this bit
[09:15:29] ```
[09:15:29] batcher:
[09:15:29] maxBatchSize: 32
[09:15:29] maxLatency: 20
[09:15:29] ```
[09:20:12] from a first pass I see the namespaces are identical in the manifests, but I'd suggest we also check the deployed namespaces for any differences
[09:21:16] another thing I noticed is that maxLatency is really really low, I'd increase it to something bigger (the default is 500ms). I don't recall why we set it that low atm, but it makes sense to have it at least >100ms
[09:21:45] the maxLatency probably doesn't have anything to do with this, I'm just mentioning it since we're at it
[09:24:08] Yes I agree. Another thing I saw just now is a slight difference in the configuration. If you see this example here: https://kserve.github.io/website/docs/model-serving/predictive-inference/batcher#example the `minReplicas` is set directly under `predictor`, while in our charts it is under `config` (not sure if this is a problem though).
[09:37:53] I am gonna paste some findings in the ticket
[09:38:19] ack!
[09:40:19] the maxReplicas thing is indeed one additional thing to check, but the docs sometimes use different pre-built runtimes (e.g. torch), so the source of truth for us should be the kubernetes API docs + knative + kserve docs for the versions we have and, most importantly, the applied manifest (kubectl describe pod) to see what is actually applied
[09:46:58] Machine-Learning-Team: Kserve batcher doesn't seem to be properly configured for edit-check - https://phabricator.wikimedia.org/T403423#11138114 (gkyziridis) Based on the reference from [[ https://kserve.github.io/website/docs/model-serving/predictive-inference/batcher#example | Kserve-Example ]] I see that...
[10:03:30] I think we made a mistake setting up the batcher from the beginning, since the Kserve `batcher` needs to be under `predictor:config:batcher`. Probably because we had set `config: maxReplicas` and set the `batcher` outside of this scope, it was not recognised and that's why the agent was missing. There is a reference here:
[10:03:30] https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/kserve-inference/.fixtures/inference_no_transformer.yaml#L99-L107
[10:04:12] Patch for tackling that: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1184042
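For context, a minimal sketch of the corrected placement, assuming the kserve-inference chart only renders keys nested under `predictor: config:` into the InferenceService predictor spec (as the fixture linked above suggests). The `inference_services` key, the service name and the replica/latency values are illustrative assumptions, not the actual edit-check configuration:

```
# Hypothetical values excerpt for the kserve-inference chart; only the
# batcher keys and the placement issue come from the discussion above,
# everything else is assumed for illustration.
inference_services:
  - name: edit-check            # assumed service name
    predictor:
      config:
        maxReplicas: 2          # illustrative value
        batcher:                # must sit under predictor.config, not as a sibling of config
          maxBatchSize: 32
          maxLatency: 200       # ms; the old 20ms is very low, KServe's default is 500ms
```

With the batcher placed as a sibling of `config` (the previous layout), the chart apparently never rendered it into the InferenceService, which would explain why the agent sidecar was missing.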
[10:04:59] Machine-Learning-Team, Essential-Work: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs - https://phabricator.wikimedia.org/T398600#11138181 (elukey) Today I used `gbp import-orig -v --merge-mode=replace --pristine-tar ../v1.31.0.8.tar.gz` on build2002 to import the latest upstream version, and I...
[10:08:43] hey folks!
[10:08:43] Some updates related to the new ml-serve1012/1013 hosts (the ones with MI300X GPUs):
[10:08:43] 1) The hosts are configured with Bookworm plus a backported kernel and GPU firmware from debian backports, all configured in Puppet.
[10:08:43] 2) I created a new role called role::ml_k8s::insetup_gpu, which is meant to install what is needed for a GPU host before it becomes a complete/full k8s worker, since some reboots may be needed (for example, to pick up the new firmwares etc..).
[10:08:43] 3) All the ml-admins have access, so more people can experiment.
[10:08:43] I think that the next step is to experiment with GPU partitioning, but we'd need the `amd-smi` tool that of course is only available from Debian Trixie onward :D https://packages.debian.org/trixie/amd-smi. We may need to pull it from AMD repos.
[10:08:47] cc: klausman --^
[10:13:56] good catch George! I added a comment, as we also need to figure out the diff between the edit-check and experimental ns
[10:14:01] Machine-Learning-Team, Patch-For-Review: Kserve batcher doesn't seem to be properly configured for edit-check - https://phabricator.wikimedia.org/T403423#11138224 (isarantopoulos) Great catch! Indeed the batcher is not set in the `edit-check` namespaces. iiuc setting the batcher under both `predictor` or...
[10:14:38] hi Luca! \o/ sounds great, thanks a lot!
[10:33:38] Machine-Learning-Team, Essential-Work: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs - https://phabricator.wikimedia.org/T398600#11138334 (elukey) I tried to test on ml-serve1012 (manual install of the deb) and this is what I gathered: 1) hwloc and libhwloc-dev in build depends should not be...
[10:34:34] isaranto: I also did some experiments with --^ so far it looks good, but we'll need to run a few tests first
[10:34:44] with the full k8s role etc..
[10:54:56] elukey: ack, and ty!
[12:03:44] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11138603 (elukey) Resolved→Open @Jclark-ctr Hi! I noticed that console redir seems not working for ml-serve1013 (but it works for 1012), and the bios settings...
[12:11:15] Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11138654 (isarantopoulos) I'm just pasting here some notes based on our discussions in IRC so that information doesn't get lost. Looks like the higher numbers in Istio compared to KServe are prob...
[12:21:59] Folks I am having an issue deploying on prod... My changes were shown when I executed `helmfile -e ml-serve-eqiad diff`, and I performed the `sync` operation, but I see the same pods deployed on prod, nothing changed... However, on staging everything seems correct.
[12:26:07] Machine-Learning-Team, Growth-Team, Improve-Tone-Structured-Task, Goal, OKR-Work: Analyze samples of articles to see how many structured tasks we might be able to generate - https://phabricator.wikimedia.org/T401968#11138757 (achou) Update: I summarize the article types to extract into four...
[12:26:39] georgekyz: o/ if you check `get events` for the namespace there are some max memory usage errors
[12:29:26] confirmed by `kubectl get limitrange -n edit-check -o yaml`, the max memory for a pod is 10Gi, but with the batcher (I assumed it adds it) we cross that limi
[12:29:28] *limit
[12:29:57] in staging we have 32Gi (that seems a bit too much)
[12:30:26] Machine-Learning-Team: Kserve batcher doesn't seem to be properly configured for edit-check - https://phabricator.wikimedia.org/T403423#11138774 (gkyziridis) ==Update== There is an issue on deploying the latest edit-check changes on prod. ` $ kube_env experimental ml-staging-codfw $ kubectl get pods NAME...
[12:30:53] ah yes, because the limitrange is global in this case
[12:32:29] isaranto: hmmm nice catch! Should we lower the limits? For instance set it to 4Gi?
[12:32:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[12:32:49] Deployment edit-check-predictor-00008-deployment in edit-check at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s-mlserve&var-namespace=edit-check&var-deployment=edit-check-predictor-00008-deployment - ...
[12:32:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[12:32:59] ah... welcome welcome
[12:33:30] sry I'm in meetings, will respond later
[12:33:43] nothing on fire, it is just because the new pods don't come up
[12:33:46] sending a patch in a sec
[12:37:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment edit-check-predictor-00008-deployment in edit-check at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[12:38:54] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1184063 - let's see if the diff is ok or not, otherwise I'll do the complete limit range config.. I manually bumped the limit to 15G, it should hopefully clear in a bit
[12:39:16] georgekyz: yep the new pod is coming up
[12:40:15] and running
[12:42:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment edit-check-predictor-00008-deployment in edit-check at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[12:47:26] elukey: Thank you so much for tackling this so quickly! I see that the pod is deployed correctly on `ml-serve-eqiad` and the 'agent' is available now. But there is an issue on `ml-serve-codfw` where the new pod is not working, so it is still the previous one.
[12:48:13] Machine-Learning-Team, Patch-For-Review: Kserve batcher doesn't seem to be properly configured for edit-check - https://phabricator.wikimedia.org/T403423#11138821 (gkyziridis) Thnx to @elukey who set higher limits in `helmfile.d/admin_ng/values/ml-serve.yaml` the pods are deployed correctly and the batch...
[12:48:35] georgekyz: ack, manually fixed codfw as well
[12:48:48] elukey: You are THE BEST!
[12:51:24] :) it is taking a bit for the controller to realise that it needs to spin up a new pod..
[12:55:03] ack! no worries! Thank you so much for your support
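For reference, a rough sketch of the kind of LimitRange object that `kubectl get limitrange -n edit-check -o yaml` would return; the 10Gi ceiling comes from the discussion above, while the object name and exact shape are assumptions (the real setting lives in `helmfile.d/admin_ng/values/ml-serve.yaml`):

```
# Hypothetical LimitRange illustrating the per-pod memory ceiling that the
# predictor plus the new batcher agent sidecar crossed; names are assumed.
apiVersion: v1
kind: LimitRange
metadata:
  name: edit-check            # assumed object name
  namespace: edit-check
spec:
  limits:
    - type: Pod
      max:
        memory: 10Gi          # bumped to 15G above so the predictor + agent containers fit
```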
[12:56:17] georgekyz: this is the permanent fix https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1184063?tab=checks
[12:57:45] +1
[13:02:44] and the codfw pod is good :)
[13:02:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[13:02:49] Deployment edit-check-predictor-00008-deployment in edit-check at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s-mlserve&var-namespace=edit-check&var-deployment=edit-check-predictor-00008-deployment - ...
[13:02:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[13:03:02] \o/
[13:03:14] Nice work!
[13:07:18] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11138908 (Jclark-ctr) @elukey Confirmed same issue; connected to iDRAC via SSH tunnel, logged in, and reset BMC under Maintenance → BMC Reset → Selected Unit Reset. i...
[13:11:38] Machine-Learning-Team, Patch-For-Review: Kserve batcher doesn't seem to be properly configured for edit-check - https://phabricator.wikimedia.org/T403423#11138925 (isarantopoulos) p:Triage→High a:gkyziridis
[14:02:24] Machine-Learning-Team, Data-Platform-SRE (2025.08.16 - 2025.09.05): Create an analytics service user for the ML team - https://phabricator.wikimedia.org/T400902#11139281 (brouberol) Is this safe to close?
[14:02:38] Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11139285 (gkyziridis) The batcher is fixed and deployed from this ticket: https://phabricator.wikimedia.org/T403423 . Without disabling the autoscaling we can see that the metrics are more aligned...
[14:08:35] Machine-Learning-Team, Goal, Patch-For-Review: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11139387 (OKarakaya-WMF)
[14:21:30] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11139452 (elukey) Open→Resolved @Jclark-ctr confirmed that it works, thanks a lot!
[14:25:03] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Enable volunteer evaluation of Tone Check model in additional languages - https://phabricator.wikimedia.org/T400423#11139464 (SSalgaonkar-WMF) @Trizek-WMF good question! We still //ideally// want to have 5+ evaluators per language, like we did last time, bu...
[14:36:52] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Operational Excellence - LiftWing Platform Updates & Improvements - https://phabricator.wikimedia.org/T398948#11139554 (isarantopoulos)
[14:37:24] Machine-Learning-Team, Data-Platform-SRE (2025.08.16 - 2025.09.05): Create an analytics service user for the ML team - https://phabricator.wikimedia.org/T400902#11139558 (OKarakaya-WMF) Thank you @brouberol , I think we can close this task.
[14:37:57] Machine-Learning-Team, Essential-Work: Upgrade revscoring model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400350#11139567 (isarantopoulos) In progress→Resolved
[14:38:05] Machine-Learning-Team, Essential-Work, Patch-For-Review: Upgrade readability model server from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400352#11139571 (isarantopoulos) Open→Resolved
[14:48:04] Machine-Learning-Team, Data-Platform-SRE (2025.08.16 - 2025.09.05): Create an analytics service user for the ML team - https://phabricator.wikimedia.org/T400902#11139662 (brouberol) Thanks!
[14:49:12] Machine-Learning-Team, Data-Engineering, Data-Persistence, Growth-Team, and 2 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11139666 (Ottomata) a:Ottomata
[14:49:27] Machine-Learning-Team, Data-Persistence, Growth-Team, Improve-Tone-Structured-Task, and 2 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11139669 (Ottomata)
[14:59:04] Machine-Learning-Team, Data-Platform-SRE (2025.08.16 - 2025.09.05): Create an analytics service user for the ML team - https://phabricator.wikimedia.org/T400902#11139718 (brouberol) In progress→Resolved
[16:06:15] (CR) Nik Gkountas: [C:-1] Support difficulty filtering for collection with source_lang=en (2 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1182654 (owner: Sbisson)
[16:43:11] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11140258 (achou) >>! In T392283#11136518, @achou wrote: > For t...
[16:58:32] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11140339 (Michael) Looping in @KStoller-WMF and @Urbanecm_WMF f...
[18:23:11] (PS3) Sbisson: Support difficulty filtering for collection with source_lang=en [research/recommendation-api] - https://gerrit.wikimedia.org/r/1182654
[18:24:10] (CR) Sbisson: Support difficulty filtering for collection with source_lang=en (2 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1182654 (owner: Sbisson)
[18:24:37] (CR) CI reject: [V:-1] Support difficulty filtering for collection with source_lang=en [research/recommendation-api] - https://gerrit.wikimedia.org/r/1182654 (owner: Sbisson)
[18:53:46] Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11140801 (Ottomata) > Since I assume we aren't keeping more than a few model versions, and a few thresholds Since this is a 'query cache', I assumed th...
[19:10:02] (PS4) Sbisson: Support size filtering for collection with source_lang=en [research/recommendation-api] - https://gerrit.wikimedia.org/r/1182654
[21:26:22] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11141651 (Eevans) >>! In T392283#11140256, @achou wrote: >>>! I...
[23:15:10] Machine-Learning-Team, EditCheck, VisualEditor, Editing-team (Planning), Epic: Expand language coverage for Tone Check - https://phabricator.wikimedia.org/T394448#11141969 (ppelberg)
[23:15:44] FIRING: LiftWingServiceErrorRate: ...
[23:15:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[23:25:44] RESOLVED: LiftWingServiceErrorRate: ...
[23:25:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate