[06:50:56] hello! [07:00:44] good morning [07:22:44] (03CR) 10Ilias Sarantopoulos: [C:03+1] edit-check: Update readme and add pydantic tests. (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1136981 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [07:32:08] morning [07:50:57] (03CR) 10Ilias Sarantopoulos: "Thanks for all the great work Ozge!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (owner: 10Ozge) [07:51:12] isaranto: anything for https://phabricator.wikimedia.org/T391958 ? :) [07:51:43] o/ kart_ sorry for never responding [07:55:22] kart_: do you have a specific timeline? or are you looking for a more long-term solution as the connected task mentions? The reason why I'm asking is that we haven't migrated to APUS yet so it would be a first for us [07:56:24] The main task is: https://phabricator.wikimedia.org/T335491 [07:57:14] But, feel free to update T391958 as well, as it is specific. [08:01:15] before the move to apus we could try to figure out if the mint model would go alongside the "ml-team" ones or not [08:01:34] for example, do we need a different bucket with different credentials etc..? [08:01:39] or do we use the same? [08:02:04] and the follow up question is about how to publish it externally - we already have a solution, etc.. [08:03:21] +1 [08:05:19] 06Machine-Learning-Team, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX): Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10771668 (10KartikMistry) [08:06:21] kart_: atm the workflow is the following, for any model - 1) the model "owner" shares the binary with a member of the ml-team via a "trusted" source (drive, stat boxes, etc..) sharing also the SHA512 of the binary separately 2) the ml-team retrieves the binary, checks the SHA and uploads it to S3 and to the DE UI for external consumption [08:07:06] the idea is to control as much as possible what gets published, especially because we share it to the community [08:07:12] and also to avoid messing up with prod :D [08:07:25] if the workflow works with you, it should be easy to integrate [08:07:40] and it could work right now without apus [08:07:51] but of course you'd need the above steps when releasing a new version [08:08:10] I'd argue that a single process is beneficial for everybody, and more checks are good :) [08:08:24] so we don't start to have multiple ways to upload binaries etc.. [08:08:33] isaranto: --^ [08:09:09] Most of our models won't change frequently or at all. [08:09:24] So, it is probably one time setup as of now. [08:09:50] then I think it should work [08:09:55] Yes [08:10:07] let's see what the ML team thinks about! [08:10:14] Sure! [08:16:56] so MinT service would require read only access to that bucket (not sure how it would happen). I agree then that we can just use swift as is then [08:18:28] Since MinT is going to have read only access and they don't change often I'd suggest that we put them under wmf-ml-models for now. [08:19:18] wmf-ml-models should be fine. [08:19:19] klausman: does the above sound ok? What would MinT need to be able to fetch the models? [08:20:28] Yeah, I think read-only access is probably relatively easy to set up, though I am not super familiar with how that is set up on the Swift server side [08:23:13] (03CR) 10Ozge: "I think it happens when we send an article id that does not exist or fetching html fails for some other reasons. 
Because we return None he" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (owner: 10Ozge) [08:24:37] klausman: o/ we have the mlserve users in hieradata/common/profile/thanos/swift.yaml [08:24:54] and then the password is wired to the kserve pods via puppet private [08:25:25] I have realized though that we may not be using the read only credentials for kserve though [08:25:38] I am checking hieradata/role/common/deployment_server/kubernetes.yaml [08:26:03] no sorry we are [08:26:04] AWS_ACCESS_KEY_ID: mlserve:ro [08:26:32] ah, that _ -> : conversion tripped me up [08:27:26] so it should be easy, data persistence manages the thanos swift cluster, I think there is a procedure to create the new account if needed [08:29:24] this is the prev task https://phabricator.wikimedia.org/T311628 [08:29:48] I don't recall though if there is a specific procedure to generate the password, or if it is in puppet private [08:30:13] yes ok it is in hieradata/common/profile/thanos/swift.yaml (private) [08:30:57] so, to recap, IIUC/IIRC: 1) contact data persistence to get the green light 2) update puppet private 3) update public puppet 4) update the rest in k8s [08:32:13] elukey: possible to log these in the task as well? I keep forgetting IRC ;) [08:32:37] kart_: I think that klausman will follow up opening the tasks etc.. [08:32:50] OK! [08:32:53] Thanks! [08:36:12] yes, will do [08:44:40] (03PS14) 10Ozge: feat: adds articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 [08:46:11] (03CR) 10Ozge: "We should get a more clear error message now for the errors during html fetch:" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (owner: 10Ozge) [09:00:34] (03PS15) 10Ozge: feat: adds articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 [09:02:09] (03PS16) 10Ozge: feat: adds articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (https://phabricator.wikimedia.org/T391679) [09:02:20] (03CR) 10Ozge: "I've added retry as well." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (https://phabricator.wikimedia.org/T391679) (owner: 10Ozge) [09:05:08] (03CR) 10Ozge: feat: adds articlequality_v2 (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (https://phabricator.wikimedia.org/T391679) (owner: 10Ozge) [09:35:59] (03CR) 10Ilias Sarantopoulos: [C:03+1] "Awesome thanks!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (https://phabricator.wikimedia.org/T391679) (owner: 10Ozge) [09:48:14] klausman: did you see https://phabricator.wikimedia.org/T392289? If you are ok I'd do it also for ml-staging, it seems suffering from the same issue [09:49:34] (03CR) 10Gkyziridis: [C:03+2] edit-check: Update readme and add pydantic tests. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1136981 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [09:50:22] elukey: that would be lovely, thank you! [09:52:59] 10Lift-Wing, 06Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10771940 (10kevinbazira) I have been working on porting the ROCm vLLM image (`rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6`) to use WMF's debian bookworm (`docker-registry.wikimedia.org... 
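For context on the read-only Swift credentials discussed above ([08:24]–[08:30]): a minimal sketch of what read-only access to the model bucket could look like from a client such as MinT, going through the S3-compatible API of Thanos Swift. The endpoint URL, the `mint/` prefix and the environment-variable wiring are assumptions for illustration, not the actual deployment configuration.

```python
# Illustrative only: read-only listing of model binaries on Thanos Swift via
# its S3-compatible API. Endpoint, prefix and credential wiring are assumed.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ.get("S3_ENDPOINT", "https://thanos-swift.discovery.wmnet"),  # assumed endpoint
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],        # e.g. the mlserve:ro style user
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# A read-only key can list and fetch objects; any put_object call would be denied.
resp = s3.list_objects_v2(Bucket="wmf-ml-models", Prefix="mint/")  # prefix is hypothetical
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```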
[09:53:38] doing it [09:55:33] o/ I ported the upstream ROCm vLLM image to use WMF's Debian Bookworm instead of Ubuntu: https://phabricator.wikimedia.org/T385173#10771940 [09:55:33] in this image, vLLM serves the `facebook/opt-125m` model successfully as shown in the above phab comment. [09:55:33] next, I'll test it with the `aya-expanse` models. [10:02:48] (03PS17) 10Ozge: feat: adds articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 [10:03:02] (03PS18) 10Ozge: feat: adds articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (https://phabricator.wikimedia.org/T391679) [10:03:33] (03CR) 10Ozge: [V:03+2 C:03+2] feat: adds articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (https://phabricator.wikimedia.org/T391679) (owner: 10Ozge) [10:06:02] kevinbazira: that is outstanding! Very nice work [10:06:18] I think docker-slim might come in handy in the future [10:10:03] klausman: thanks for your help with the proxy :) [10:10:24] sure sure docker-slim is super cool! [10:21:19] 06Machine-Learning-Team: [FIX]: Edit-check peacock detection locust tests - https://phabricator.wikimedia.org/T392460#10771990 (10gkyziridis) 05Open→03Resolved [10:21:27] Hey, the Pipeline bot in my patch created a docker image with the name `machinelearning-liftwing-inference-services-edit-check`, but I need `machinelearning-liftwing-inference-services-articlequality`. Do you know how the pipeline bot decides on it? It would also be great to check the repo where the pipelines are implemented. [10:25:32] https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1138337 [10:39:09] kevinbazira: this looks awesome! great work slimming the image down to 25G! let me know if you need a specific review on that. I will take a look tomorrow! [10:39:14] ozge_: this seems odd! [10:39:53] ozge_: the article quality one failed, there is a pipeline job reported https://integration.wikimedia.org/ci/job/inference-services-pipeline-articlequality-publish/65/console [10:39:56] the ci pipelines are configured in integration/config, lemme find the link [10:40:08] Post-merge build failed. [10:40:08] https://integration.wikimedia.org/ci/job/trigger-inference-services-pipeline-articlequality-publish/22/console : FAILURE in 2m 41s [10:40:12] this one I mean [10:40:19] ah yes there is a failure [10:40:36] and you modified the edit_check's README, which triggered a new build/publish for editcheck [10:41:49] the failure is really weird [10:42:44] if you don't spot anything obvious it may be a transient failure on the CI's front, we can probably force a new build via https://integration.wikimedia.org/ci/job/inference-services-pipeline-articlequality-publish/65/console (need to be logged in) [10:48:06] Cool, thank you both. I’m looking into it. [10:54:33] Re-triggering the pipeline worked! I’ve created a small PR in deployment charts for staging deployment [10:54:34] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1139436 [10:55:49] jenkins failed on purpose so you could figure out what to do in a case like this. 
It is part of onboarding :P [10:56:24] jokes aside this is pretty rare [10:56:54] :) [11:46:39] Article quality staging deployment failed with the following error: ``` File "/srv/articlequality/model_server/model.py", line 14, in [11:46:40] from src.models.articlequality.model_server.config import Settings [11:46:40] ModuleNotFoundError: No module named 'src' ``` I think I need to change this https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/.pipeline/articlequality/blubber.yaml#7 to site packages but I’ll try to reproduce it with docker compose first. [11:57:42] ozge_: you'd have to do the following: [11:58:07] 1. copy the files using the same structure as the repo [11:58:14] ``` copies: [11:58:14] - from: local [11:58:14] source: src/models/articlequality [11:58:14] destination: src/models/articlequality [11:58:14] ``` [11:58:31] in line 41 in the blubber file [11:58:49] and in line 60 add a parameter to the entrypoint command `entrypoint: ["./entrypoint.sh", "src/models/articlequality/model_server/model.py"]` [12:00:28] something I am missing: where does data/feature_values.tsv come from? iirc it should be bundled with the model in s3 but i don't see it in https://analytics.wikimedia.org/published/wmf-ml-models/articlequality/language-agnostic/ [12:03:26] aiko: do you recall what happens there? [12:09:45] the file is in the repo: src/models/articlequality/data/feature_values.tsv [12:13:23] lol I missed that, thanks! [12:15:56] Yep, it’s in the repo. Actually I was doing similar changes. Thank you! docker compose works for me now. I’ll create new patch. [12:16:17] ozge_: then you'd have to also change the dir for the tsv file. Either use max_feature_vals: str = "src/models/articlequality/data/feature_values.tsv" in pydantic or copy the file in blubber under data/ in the image [12:19:43] (03PS1) 10Ozge: feat: updates blubber yaml for articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1139451 [12:21:10] @isaranto: indeed updated in the pedantic. Small patch above is ready. Tested locally. [12:21:23] *pydantic [12:21:28] (03CR) 10Ilias Sarantopoulos: [C:03+1] feat: updates blubber yaml for articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1139451 (owner: 10Ozge) [12:21:40] LGTM, I tested it as well [12:24:26] (03CR) 10Ozge: [V:03+2 C:03+2] feat: updates blubber yaml for articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1139451 (owner: 10Ozge) [12:57:17] I went all well. Both models are running fine on staging. I’ll run performance tests. Which server (or local) do you prefer generally? [12:57:36] *it went all well :) [13:17:56] great! [15:03:46] Hey, quick question: where do we run the locust tests (local, stat1010.eqiad.wmnet etc) and what should be the host (inference-staging.svc.codfw.wmnet:30443, localhost:8080) ? If we run them on servers e.g. stat1010, do we have project set up already in one of the servers? [15:07:05] Just asking to keep the scores consistent in the csv files. [15:08:33] ozge_: You can run the locust on every stat machine. You do not need a localhost because you need to hit the servers [15:08:43] I am running mine on stat10 machine [15:09:16] ozge_: you do not need a project for that, just ssh to one of the stat machines, clone your repo/branch and run the locust [15:18:52] Ok cool! I’m trying 1010. looks like Pip install gets stuck. 
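Looping back to the feature_values.tsv discussion above ([12:00]–[12:16]): a minimal sketch of what the pydantic settings default suggested at 12:16 could look like. The class layout and the fields other than `max_feature_vals` are illustrative, not the repo's actual `config.Settings`, and the import depends on whether the repo uses pydantic v1 or the separate `pydantic-settings` package.

```python
# Sketch only: the path default mirrors the blubber `copies` destination
# (src/models/articlequality), so the file resolves relative to the image workdir.
from pydantic_settings import BaseSettings  # with pydantic v1: `from pydantic import BaseSettings`


class Settings(BaseSettings):
    model_path: str = "/mnt/models/model.pkl"  # assumed default, for illustration only
    max_feature_vals: str = "src/models/articlequality/data/feature_values.tsv"


settings = Settings()  # environment variables (e.g. MAX_FEATURE_VALS) override the defaults
```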
Did you have something similar before? [15:20:42] ozge_: try "set_proxy" first [15:22:04] yes, set_proxy summarizes this https://wikitech.wikimedia.org/wiki/HTTP_proxy [15:27:44] FIRING: LiftWingServiceErrorRate: ... [15:27:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=plwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate [15:28:04] noooooooo [15:49:13] I'm checking the above alert, the container is stuck -- it was cpu throttled [15:50:32] 06Machine-Learning-Team: AI/ML Infrastructure Request: **Accessing topics endpoints at scale** - https://phabricator.wikimedia.org/T392833 (10Seddon) 03NEW [15:52:42] I deployed the change for enabling multiprocessing on ptwiki, so this will be fixed as well [15:55:25] 06Machine-Learning-Team: AI/ML Infrastructure Request: **Accessing topics endpoints at scale** - https://phabricator.wikimedia.org/T392833#10773268 (10Seddon) [15:57:02] deployment done. Alert will resolve in a sec [15:57:44] RESOLVED: LiftWingServiceErrorRate: ... [15:57:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=plwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate [16:00:42] going afk folks, have a nice evening/rest of day! [16:24:22] o/ [17:37:12] 06Machine-Learning-Team: AI/ML Infrastructure Request: **Accessing topics endpoints at scale** - https://phabricator.wikimedia.org/T392833#10773684 (10HNordeenWMF)
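As a companion to the locust discussion above ([15:03]–[15:22]): a minimal locustfile sketch for hitting the staging endpoint from a stat machine. The request path, payload fields and the Host header value are assumptions based on common Lift Wing conventions, not the team's actual load-test configuration; on a stat box you would run set_proxy (or export the proxy variables) before installing locust with pip.

```python
# Illustrative locustfile; path, Host header and payload are assumptions.
from locust import HttpUser, task, between


class ArticleQualityUser(HttpUser):
    # Example invocation from a stat machine:
    #   locust -f locustfile.py --host https://inference-staging.svc.codfw.wmnet:30443
    wait_time = between(1, 2)

    @task
    def predict(self):
        self.client.post(
            "/v1/models/articlequality:predict",          # assumed kserve-style predict path
            json={"rev_id": 12345, "lang": "en"},          # illustrative payload
            headers={"Host": "articlequality.example.wikimedia.org"},  # hypothetical Host header
        )
```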