[07:00:46] Good morning
[07:06:41] good morning!
[07:29:32] morning folks!
[07:41:21] Morning to you too :)
[07:44:03] 06Machine-Learning-Team, 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: Q4:rack/setup/install ml-serve101[23] - https://phabricator.wikimedia.org/T393948#10815008 (10klausman) a:05klausman→03None
[08:05:42] klausman: when should we expect models to be available for MinT (internal/external)? (ie https://phabricator.wikimedia.org/T391958)
[08:06:07] Would like to start with external to test them as well :)
[08:10:59] 06Machine-Learning-Team: [LLM] ML-lab benchmarking - https://phabricator.wikimedia.org/T382343#10815103 (10isarantopoulos) 05Open→03Resolved
[08:13:56] Hey folks
[08:14:02] isaranto: Can I merge this one: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1144521?
[08:17:19] o/ yes!
[08:20:38] TIL https://github.com/microsoft/pyright
[08:22:41] 06Machine-Learning-Team, 10MediaWiki-extensions-ORES, 06DBA, 10MediaWiki-Recent-changes, and 2 others: DBA Review of Tables that ORES Extension will create - https://phabricator.wikimedia.org/T391103#10815148 (10DMburugu)
[08:23:02] 06Machine-Learning-Team, 10MediaWiki-extensions-ORES, 06DBA, 10MediaWiki-Recent-changes, and 2 others: DBA Review of Tables that ORES Extension will create - https://phabricator.wikimedia.org/T391103#10815149 (10DMburugu)
[08:25:08] isaranto: A few days ago I also learned that Astral (developers of uv and ruff) are working on a new Python type checker: https://github.com/astral-sh/ty
[08:25:32] it’s still very much in alpha, but I have my hopes high :D
[08:26:13] yeah I saw that as well, judging from their other tools I'd expect this to be great as well!
[08:31:38] kart_: we'll start uploading them tomorrow. I expect them to be ready by the end of the week (just to be on the safe side)
[08:35:02] Cool. Thanks!
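A minimal sketch of what static type checkers like pyright (or Astral's ty, mentioned above) catch before runtime. The `greet` function is purely illustrative, not from any project discussed here:

```python
def greet(name: str) -> str:
    return "Hello, " + name

ok = greet("world")  # fine for both the type checker and the runtime

try:
    # pyright/ty would flag this call statically: int is not assignable to str
    greet(42)  # type: ignore[arg-type]
except TypeError as err:
    # without a checker, the mistake only surfaces here, at runtime
    caught = str(err)
```

The value of a checker is that the bad call is reported while editing, instead of failing in production.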
[08:42:06] (03CR) 10Nik Gkountas: [C:04-1] Popular/search recommander: use domain code in lllang parameter (036 comments) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1143605 (https://phabricator.wikimedia.org/T306508) (owner: 10Sbisson)
[08:44:20] 06Machine-Learning-Team, 07Documentation: [Fix]: Documentation for ORES and MediaWiki Docker - https://phabricator.wikimedia.org/T393876#10815282 (10isarantopoulos) a:03gkyziridis
[08:44:41] 06Machine-Learning-Team, 10Editing-team (Tracking), 13Patch-For-Review: Peacock detection model GPU deployment returns inconsistent results - https://phabricator.wikimedia.org/T393154#10815283 (10isarantopoulos) a:03gkyziridis
[08:48:18] 06Machine-Learning-Team, 06Language and Product Localization: Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10815301 (10isarantopoulos) p:05Triage→03Medium
[08:51:53] 06Machine-Learning-Team, 13Patch-For-Review: Update kserve to 0.13.1 - https://phabricator.wikimedia.org/T367048#10815312 (10isarantopoulos) We should probably rename this task to upgrade to 0.15 https://github.com/kserve/kserve/releases/tag/v0.15.0 The only possible issue would be that we still have the chart...
[09:17:08] georgekyz: sorry for jumping in on your patch. I saw it failing, so I rebased to see if the issue goes away
[09:17:50] yes, I am in a meeting right now... I just made a small typo in the commit message and from then on the helm-lint has been failing :P
[09:17:57] I will take a look after my meeting
[09:18:22] ack
[09:30:26] (03PS1) 10Bartosz Wójtowicz: edit-check: Experirmental prod deployment. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1145106 (https://phabricator.wikimedia.org/T000000)
[09:32:18] ^ this is just me testing out gerrit and git review commands :-)
[09:37:14] (03PS2) 10Bartosz Wójtowicz: edit-check: Experirmental prod deployment. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1145106 (https://phabricator.wikimedia.org/T000000)
[09:40:36] (03PS1) 10Bartosz Wójtowicz: Something being done by Bartosz. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1145108
[10:24:35] Folks, does anybody know what the following helm-lint errors mean?
[10:24:35] 1. `rake aborted!`
[10:24:35] 2. `NoMethodError: undefined method `filter!' for true:TrueClass`
[10:24:53] `NoMethodError: undefined method filter!' for true:TrueClass`
[10:27:05] I have no idea, maybe ask in #wikimedia-operations
[10:31:49] kevinbazira, isaranto o/ - what is the plan for https://gitlab.wikimedia.org/repos/machine-learning/wmf-debian-vllm ?
[10:32:19] are you going to use gitlab just as a prep step, and then port the vllm image to production-images or similar?
[10:32:51] because I have more suggestions (for example to use a multi-stage build to separate the build step from the runtime step)
[10:33:00] elukey: o/
[10:33:06] Unsure yet, we were going to ask for your feedback (or someone else from the k8s-sig)
[10:33:24] you read my mind, I am currently working on multi-stage builds
[10:33:29] <3
[10:33:59] the issue with this image is that it is going to be big, so I doubt we will be able to add it to the docker registry the standard way.
[10:34:06] if this is a base image that you'll use in the future (sort of like the pytorch one) I'd use production-images
[10:34:17] georgekyz: my hunch would be that you set CPU and GPU limits/requests as strings, but helm might expect them to be integers
[10:34:49] isaranto: production-images lets you write the dockerfile, which is close to what you are doing, but there is no way other than the "standard" one
[10:35:15] namely, if a layer is more than 4GB compressed it will not be accepted (even if you manually docker push)
[10:35:52] If we just use this image as is (instead of a base image), could we push it from ml-lab instead? this may happen a couple of times per year.
[10:35:56] we could get creative and balance the RUN commands so that we don't exceed the limit in one layer, and we can drop unnecessary build artifacts etc..
[10:35:58] bartosz: Thank you for your time and your idea mate, unfortunately it is not that one, you can check here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/4d1478dffbabb5361bbca8867d3f8c304bbca23d%5E%21/#F0
[10:36:20] ack on the layer limit. I thought this could be bypassed
[10:37:08] isaranto: any specific reason? It should be ok, but as I told Tobias, ML needs to go through a formal process with SRE/K8s-SIG to design the whole thing. From the security point of view we need to limit a lot of things, for example not pulling anything from the internet etc..
[10:37:17] and possibly, only ml-admins access
[10:37:29] (and also a good motivation why it cannot happen on, say, build2001)
[10:38:34] bartosz: It is also very strange, because it seems to be a CI change: all the tests were succeeding, and right after the minor fix of the typo in the commit message they started failing... which means something else is probably going on on the CI side. It is probably somewhere in the Ruby-based logic: a method is being called on a boolean value that doesn't support it. Specifically, the `filter!` method is being invoked on a `TrueClass` object, which is not valid. So something is going on on the operations side
[10:40:19] elukey: the only reason would be if it allowed us to bypass the layer limit. Otherwise we would go the standard route of production-images etc. You and I discussed the option of a manual push if this is something that is going to be updated only 3-4 times a year.
[10:40:55] I think the pip install torch layer exceeds the limit by itself
[10:42:22] georgekyz: I see, thanks! Can we easily check whether the docker image running helm linting in CI has been updated recently?
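The `NoMethodError: undefined method `filter!' for true:TrueClass` above means Ruby code called an Array method (`filter!`) on a bare `true`, typically because a config value parsed as a boolean where a list was expected. A hypothetical Python analogue of the same failure mode (illustrative only, not the linter's actual code):

```python
# e.g. a YAML value `values: true` parsed where the code expected `values: [...]`
config = {"values": True}

def keep_truthy(items):
    # analogue of Ruby's Array#filter!: filter the list in place
    items[:] = [i for i in items if i]
    return items

try:
    keep_truthy(config["values"])
except TypeError as err:
    # Python's counterpart of Ruby's NoMethodError on TrueClass
    failure = type(err).__name__
```

In both languages, the fix lives in whatever produced the boolean, not at the call site where the error surfaces.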
[10:42:40] re: the wmf-debian repo, that is prep work indeed
[10:44:34] bartosz: georgekyz: folks have mentioned on #wikimedia-operations that they are already looking at it, so we can wait a bit
[10:45:02] yes, ack, I am just sharing my thoughts in this channel :p
[10:46:31] ack!
[10:46:33] bartosz: We can see in the docker-registry: https://docker-registry.wikimedia.org/releng/helm-linter/tags/
[10:47:38] bartosz: so the latest image was updated today at 10:09 in the morning, exactly before I made the change :P. Yesterday the tests were succeeding :P
[10:49:06] isaranto: IIRC we discussed the use of ml-lab because of the big CPU/memory/etc.. build requirements, the push would need to be done via something like docker-pkg etc.. (same as on build2001)
[10:49:14] georgekyz: isaranto: I see, thank you both!
[10:49:28] the main problem is that even with a docker push you'll end up hitting nginx, which is the one enforcing the 4GB limit
[10:54:35] ack, thanks for clarifying. This means that it is likely that we will hit that limit with this image :(
[10:54:50] * isaranto sobs
[10:57:13] I was just checking the wheels available on https://download.pytorch.org/
[10:57:51] torch 2.4.0 - rocm6.1: 2.5GB
[10:57:51] torch 2.7.0 - rocm6.3: 4.2GB
[11:07:04] * isaranto afk lunch
[12:24:05] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10816022 (10isarantopoulos) a:03kevinbazira
[13:12:36] 06Machine-Learning-Team, 05Goal: Q4 24-25 Goal: Productionize peacock detection model - https://phabricator.wikimedia.org/T391940#10816263 (10achou) Update: - Working on eval data collection and processing for `en`, `ar`, `es`, `ja`, `pt`, `fr`, `id`, `pl`, `zh`, `cs`, `he`, and `tr` wikis (languages prioriti...
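To make the numbers above concrete: the registry's nginx rejects any compressed layer over 4GB, and the newer torch wheel quoted from download.pytorch.org is already larger than that on its own. A quick sketch using the sizes from the chat:

```python
LAYER_LIMIT_GB = 4.0  # compressed-layer limit enforced by nginx, per the discussion

# wheel sizes quoted from download.pytorch.org above
wheel_sizes_gb = {
    "torch-2.4.0+rocm6.1": 2.5,
    "torch-2.7.0+rocm6.3": 4.2,
}

# a single `pip install torch` layer for 2.7.0+rocm6.3 would exceed the limit,
# which is why splitting/balancing RUN commands cannot help for that wheel
too_big = [name for name, size in wheel_sizes_gb.items() if size > LAYER_LIMIT_GB]
```

This is why the conversation turns to non-standard options (manual push from ml-lab) and to shrinking the image itself.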
[13:52:36] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10816452 (10kevinbazira) As we prepare the wmf-debian-vllm image for the wikimedia docker registry in this [gitlab MR](https://gitlab.wikimedia.org/repos/machine-learning/wmf...
[13:53:10] elukey: isaranto: using multi-stage builds in the `wmf-debian-vllm` image has dropped the image size from ~58GB to ~26.2GB without using `docker-slim`: https://phabricator.wikimedia.org/T385173#10816452
[13:53:21] very nice :)
[13:56:46] 06Machine-Learning-Team, 07Documentation: [Fix]: Documentation for ORES and MediaWiki Docker - https://phabricator.wikimedia.org/T393876#10816467 (10gkyziridis) Deploying media wiki and ORES extension --------------------- I am trying to clear out the steps one by one to reproduce based on the two reference l...
[14:00:22] kevinbazira: in https://gitlab.wikimedia.org/repos/machine-learning/wmf-debian-vllm/-/commit/8dd38fe0 you can probably avoid gcc and libc6-dev; in theory you don't have to build anything, right?
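The size win kevinbazira reports comes from the multi-stage pattern elukey suggested earlier: build tooling and pip caches live in a throwaway stage, and only the installed artifacts are copied into the runtime stage. A hypothetical sketch of that pattern (base image, package list, and paths are illustrative, not the actual wmf-debian-vllm Dockerfile):

```dockerfile
# Build stage: compilers, headers, and pip caches stay here and are discarded.
FROM docker-registry.wikimedia.org/bookworm AS build
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3-venv python3-dev gcc libc6-dev
RUN python3 -m venv /opt/venv && \
    /opt/venv/bin/pip install --no-cache-dir vllm

# Runtime stage: only the virtualenv is copied over. gcc/libc6-dev are
# kept here because, per the pastes linked below, triton JIT-compiles
# helpers such as hip_utils at runtime.
FROM docker-registry.wikimedia.org/bookworm
RUN apt-get update && apt-get install -y --no-install-recommends \
        gcc libc6-dev && \
    rm -rf /var/lib/apt/lists/*
COPY --from=build /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
```

The final image contains none of the build stage's layers, which is where the ~58GB → ~26.2GB drop comes from.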
[14:00:34] in the final image I mean
[14:00:49] the interesting bit could also be to figure out what dominates those 26G
[14:02:45] gcc and libc6-dev are required in the runtime image, as shown in:
[14:02:46] https://phabricator.wikimedia.org/P75997
[14:02:57] and https://phabricator.wikimedia.org/P76004
[14:10:05] mod = compile_module_from_src(src, "hip_utils")
[14:10:09] * elukey cries in a corner
[14:11:59] I am reviewing https://github.com/triton-lang/triton/blob/main/third_party/amd/backend/driver.py#L168 but I can't make any sense of it
[14:12:57] yep, I cried in the corner too :')
[14:14:56] in the original commit, https://github.com/triton-lang/triton/commit/2f88120618a561bc4f940a40dc87ffbfaf667dca#diff-1fcef987547dd43a29d53023db89e9edd9ce9157c054b29aea9ed634cf2babd4R22-R70, it seems that it tries to compile if the .so is missing
[14:15:08] now it is a little bit different
[14:15:20] but it may be that triton doesn't find a certain lib
[14:15:23] and it compiles it
[14:15:53] in this case, hip_utils.so (or similar)
[14:18:12] or possibly hip_utils.so is created by triton
[14:18:26] and it waits to check what runtime it runs on before building it
[14:18:29] that is... sigh
[14:41:09] 06Machine-Learning-Team, 05Goal: Q4 24-25 Goal: Operational Excellence - LiftWing Platform Updates & Improvements - https://phabricator.wikimedia.org/T391943#10816821 (10isarantopoulos) We will need to reimage all the LiftWing workers in eqiad for T369493
[15:43:25] 06Machine-Learning-Team, 10EditCheck, 10VisualEditor, 10Editing-team (Tracking): Compile list of templates, jargon and policies relevant to NPOV - https://phabricator.wikimedia.org/T389445#10817138 (10achou) @Trizek-WMF Thanks! Yes, there is a way. We can parse wikitext and check if the template is in the...
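What elukey describes, triton building hip_utils.so only once it knows which runtime it targets, is a compile-on-first-use pattern: check for the shared object, and build it if missing. A hypothetical Python sketch of that pattern (names are illustrative; triton's actual driver code differs):

```python
import os

def load_or_build(so_path, build):
    """Return so_path, building the shared object on first use if missing.

    Sketch of the lazy-compile pattern discussed above: the build is
    deferred until runtime, when the target (e.g. ROCm) is known.
    """
    if not os.path.exists(so_path):
        # in triton this is roughly compile_module_from_src(src, "hip_utils"),
        # which is why gcc/libc6-dev must be present in the runtime image
        build(so_path)
    return so_path

calls = []
path = load_or_build("/tmp/does-not-exist/hip_utils.so", calls.append)
```

The operational consequence is the one noted in the chat: a compiler has to ship in the runtime image, not just the build stage.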
[16:49:39] 06Machine-Learning-Team, 10LDAP-Access-Requests, 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10817576 (10BCornwall) 0...
[17:47:46] 06Machine-Learning-Team, 06Data-Engineering, 07Essential-Work: Make the revert risk predictions datasets available for analysis - https://phabricator.wikimedia.org/T388453#10817939 (10Ahoelzl) @JAllemandou can you please evaluate the effort?
[18:17:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[18:17:49] Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[18:17:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[19:53:10] (03PS4) 10Sbisson: Popular/search recommander: use domain code in lllang parameter [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1143605 (https://phabricator.wikimedia.org/T306508)
[19:56:53] (03CR) 10Sbisson: Popular/search recommander: use domain code in lllang parameter (034 comments) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1143605 (https://phabricator.wikimedia.org/T306508) (owner: 10Sbisson)
[20:52:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment enwiki-articlequality-predictor-default-00023-deployment in revscoring-articlequality at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[21:02:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment enwiki-articlequality-predictor-default-00023-deployment in revscoring-articlequality at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[22:02:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[22:02:49] Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[22:02:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas