[06:50:56] <isaranto>	 hello!
[07:00:44] <ozge_>	 good morning
[07:22:44] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] edit-check: Update readme and add pydantic tests. (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1136981 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[07:32:08] <georgekyz>	 morning
[07:50:57] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: "Thanks for all the great work Ozge!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (owner: 10Ozge)
[07:51:12] <kart_>	 isaranto: anything for https://phabricator.wikimedia.org/T391958 ? :)
[07:51:43] <isaranto>	 o/ kart_ sorry for never responding 
[07:55:22] <isaranto>	 kart_: do you have a specific timeline? or are you looking for a more long-term solution as the connected task mentions? The reason why I'm asking is that we haven't migrated to APUS yet so it would be a first for us
[07:56:24] <kart_>	 The main task is: https://phabricator.wikimedia.org/T335491
[07:57:14] <kart_>	 But, feel free to update T391958 as well, as it is specific.
[08:01:15] <elukey>	 before the move to apus we could try to figure out if the mint model would go alongside the "ml-team" ones or not
[08:01:34] <elukey>	 for example, do we need a different bucket with different credentials etc..?
[08:01:39] <elukey>	 or do we use the same?
[08:02:04] <elukey>	 and the follow up question is about how to publish it externally - we already have a solution, etc..
[08:03:21] <kart_>	 +1
[08:05:19] <wikibugs>	 06Machine-Learning-Team, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX): Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10771668 (10KartikMistry)
[08:06:21] <elukey>	 kart_: atm the workflow is the following, for any model - 1) the model "owner" shares the binary with a member of the ml-team via a "trusted" source (drive, stat boxes, etc..) sharing also the SHA512 of the binary separately 2) the ml-team retrieves the binary, checks the SHA and uploads it to S3 and to the DE UI for external consumption
[08:07:06] <elukey>	 the idea is to control as much as possible what gets published, especially because we share it to the community
[08:07:12] <elukey>	 and also to avoid messing up with prod :D
[08:07:25] <elukey>	 if the workflow works with you, it should be easy to integrate
[08:07:40] <elukey>	 and it could work right now without apus
[08:07:51] <elukey>	 but of course you'd need the above steps when releasing a new version
[08:08:10] <elukey>	 I'd argue that a single process is beneficial for everybody, and more checks are good :)
[08:08:24] <elukey>	 so we don't start to have multiple ways to upload binaries etc..
[08:08:33] <elukey>	 isaranto: --^
[08:09:09] <kart_>	 Most of our models won't change frequently or at all.
[08:09:24] <kart_>	 So, it is probably one time setup as of now.
[08:09:50] <elukey>	 then I think it should work
[08:09:55] <kart_>	 Yes
[08:10:07] <elukey>	 let's see what the ML team thinks about!
[08:10:14] <kart_>	 Sure!
[08:16:56] <isaranto>	 so MinT service would require read only access to that bucket (not sure how it would happen). I agree then that we can just use swift as is then 
[08:18:28] <isaranto>	 Since MinT is going to have read only access  and they don't change often I'd suggest that we put them under wmf-ml-models for now.
[08:19:18] <kart_>	 wmf-ml-models should be fine.
[08:19:19] <isaranto>	 klausman: does the above sound ok? What would MinT need to be able to fetch the models?
[08:20:28] <klausman>	 Yeah, I think read-only access is probably relatively easy to set up, though I am not super familiar with how that is set up on the Swift server side
[08:23:13] <wikibugs>	 (03CR) 10Ozge: "I think it happens when we send an article id that does not exist or fetching html fails for some other reasons. Because we return None he" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (owner: 10Ozge)
[08:24:37] <elukey>	 klausman: o/ we have the mlserve users in hieradata/common/profile/thanos/swift.yaml
[08:24:54] <elukey>	 and then the password is wired to the kserve pods via puppet private
[08:25:25] <elukey>	 I have realized though that we may not be using the read only credentials for kserve though
[08:25:38] <elukey>	 I am checking hieradata/role/common/deployment_server/kubernetes.yaml
[08:26:03] <elukey>	 no sorry we are
[08:26:04] <elukey>	 AWS_ACCESS_KEY_ID: mlserve:ro
[08:26:32] <klausman>	 ah, that _ -> : conversion tripped me up
[08:27:26] <elukey>	 so it should be easy, data persistence manages the thanos swift cluster, I think there is a procedure to create the new account if needed
[08:29:24] <elukey>	 this is the prev task https://phabricator.wikimedia.org/T311628
[08:29:48] <elukey>	 I don't recall though if there is a specific procedure to generate the password, or if it is in puppet private
[08:30:13] <elukey>	 yes ok it is in hieradata/common/profile/thanos/swift.yaml (private)
[08:30:57] <elukey>	 so, to recap, IIUC/IIRC: 1) contact data persistence to get the green light 2) update puppet private 3) update public puppet 4) update the rest in k8s
[08:32:13] <kart_>	 elukey: possible to log these in the task as well? I keep forgetting IRC ;)
[08:32:37] <elukey>	 kart_: I think that klausman will follow up opening the tasks etc..
[08:32:50] <kart_>	 OK!
[08:32:53] <kart_>	 Thanks!
[08:36:12] <klausman>	 yes, will do
[08:44:40] <wikibugs>	 (03PS14) 10Ozge: feat: adds articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337
[08:46:11] <wikibugs>	 (03CR) 10Ozge: "We should get a more clear error message now for the errors during html fetch:" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (owner: 10Ozge)
[09:00:34] <wikibugs>	 (03PS15) 10Ozge: feat: adds articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337
[09:02:09] <wikibugs>	 (03PS16) 10Ozge: feat: adds articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (https://phabricator.wikimedia.org/T391679)
[09:02:20] <wikibugs>	 (03CR) 10Ozge: "I've added retry as well." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (https://phabricator.wikimedia.org/T391679) (owner: 10Ozge)
[09:05:08] <wikibugs>	 (03CR) 10Ozge: feat: adds articlequality_v2 (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (https://phabricator.wikimedia.org/T391679) (owner: 10Ozge)
[09:35:59] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] "Awesome thanks!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (https://phabricator.wikimedia.org/T391679) (owner: 10Ozge)
[09:48:14] <elukey>	 klausman: did you see https://phabricator.wikimedia.org/T392289? If you are ok I'd do it also for ml-staging, it seems suffering from the same issue
[09:49:34] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+2] edit-check: Update readme and add pydantic tests. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1136981 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[09:50:22] <klausman>	 elukey: that would be lovely, thank you!
[09:52:59] <wikibugs>	 10Lift-Wing, 06Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10771940 (10kevinbazira) I have been working on porting the ROCm vLLM image (`rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6`) to use WMF's debian bookworm (`docker-registry.wikimedia.org...
[09:53:38] <elukey>	 doing it
[09:55:33] <kevinbazira>	 o/ I ported the upstream ROCm vLLM image to use WMF's Debian Bookworm instead of Ubuntu: https://phabricator.wikimedia.org/T385173#10771940
[09:55:33] <kevinbazira>	 in this image, vLLM serves the `facebook/opt-125m` model successfully as shown in the above phab comment.
[09:55:33] <kevinbazira>	 next, I'll test it with the `aya-expanse` models.
[10:02:48] <wikibugs>	 (03PS17) 10Ozge: feat: adds articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337
[10:03:02] <wikibugs>	 (03PS18) 10Ozge: feat: adds articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (https://phabricator.wikimedia.org/T391679)
[10:03:33] <wikibugs>	 (03CR) 10Ozge: [V:03+2 C:03+2] feat: adds articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1138337 (https://phabricator.wikimedia.org/T391679) (owner: 10Ozge)
[10:06:02] <klausman>	 kevinbazira: that is outstanding! Very nice work
[10:06:18] <klausman>	 I think docker-slim might come in handy in the future
[10:10:03] <kevinbazira>	 klausman: thanks for your help with the proxy :)
[10:10:24] <kevinbazira>	 sure sure docker-slim is super cool!
[10:21:19] <wikibugs>	 06Machine-Learning-Team: [FIX]: Edit-check peacock detection locust tests - https://phabricator.wikimedia.org/T392460#10771990 (10gkyziridis) 05Open→03Resolved
[10:21:27] <ozge_>	 Hey, my Pipeline bot in my patch created a docker image with name `machinelearning-liftwing-inference-services-edit-check`. But I need `machinelearning-liftwing-inference-services-articlequality`. Do you know how pipeline bot decides on it? Also would be great to check the repo where the pipelines are implemented.
[10:25:32] <ozge_>	 https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1138337
[10:39:09] <isaranto>	 kevinbazira: this looks awesome!great work slimming the image down to 25G! let me know if you need a specific review on that. I will take. a look tomorrow!
[10:39:14] <isaranto>	 ozge_: this seems odd!
[10:39:53] <elukey>	 ozge_: the article quality one failed, there is a pipeline job reported https://integration.wikimedia.org/ci/job/inference-services-pipeline-articlequality-publish/65/console
[10:39:56] <isaranto>	 the ci pilelines are configured in integration/config lemme find the lin
[10:40:08] <elukey>	 Post-merge build failed.
[10:40:08] <elukey>	 https://integration.wikimedia.org/ci/job/trigger-inference-services-pipeline-articlequality-publish/22/console : FAILURE in 2m 41s
[10:40:12] <elukey>	 this one I mean
[10:40:19] <isaranto>	 a yes there is a failure
[10:40:36] <elukey>	 and you modified the edit_check's README, that triggered a new build/publish for editcheck
[10:41:49] <elukey>	 the failure is really weird
[10:42:44] <elukey>	 if you don't spot anything obvious it may be a transient failure on the CI's front, we can probably force a new build via https://integration.wikimedia.org/ci/job/inference-services-pipeline-articlequality-publish/65/console (need to be logged in)
[10:48:06] <ozge_>	 Cool, thank you both. I’m looking into it.
[10:54:33] <ozge_>	 Re-triggering the pipeline worked! I’ve created a small PR in deployment charts for staging deployment
[10:54:34] <ozge_>	 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1139436
[10:55:49] <isaranto>	 jenkins failed on purpose so you could figure out what to do in a case like this. It is part of onboarding :P
[10:56:24] <isaranto>	 jokes aside this is pretty rare
[10:56:54] <ozge_>	 :)
[11:46:39] <ozge_>	 Article quality staging deployment failed with the following error: ```    File "/srv/articlequality/model_server/model.py", line 14, in <module>
[11:46:40] <ozge_>	     from src.models.articlequality.model_server.config import Settings
[11:46:40] <ozge_>	 ModuleNotFoundError: No module named 'src' ``` I think I need to change this https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/.pipeline/articlequality/blubber.yaml#7 to site packages but I’ll try to reproduce it with docker compose first.
[11:57:42] <isaranto>	 ozge_: you'd have to do the following:
[11:58:07] <isaranto>	 1. copy the files using the same structure as the repo
[11:58:14] <isaranto>	 ```    copies:
[11:58:14] <isaranto>	       - from: local
[11:58:14] <isaranto>	         source: src/models/articlequality
[11:58:14] <isaranto>	         destination: src/models/articlequality
[11:58:14] <isaranto>	 ```
[11:58:31] <isaranto>	 in line 41 in the blubber file
[11:58:49] <isaranto>	 and in line 60 add a parameter to the entrypoint command `entrypoint: ["./entrypoint.sh", "src/models/articlequality/model_server/model.py"]`
[12:00:28] <isaranto>	 something I am missing: where does data/feature_values.tsv come from? iirc it should be bundled with the model in s3 but i don't see it in https://analytics.wikimedia.org/published/wmf-ml-models/articlequality/language-agnostic/
[12:03:26] <isaranto>	 aiko: do you recall what happens there?
[12:09:45] <aiko>	 the file is in the repo: src/models/articlequality/data/feature_values.tsv
[12:13:23] <isaranto>	 lol I missed that, thanks!
[12:15:56] <ozge_>	 Yep, it’s in the repo. Actually I was doing similar changes. Thank you! docker compose works for me now. I’ll create new patch.
[12:16:17] <isaranto>	 ozge_: then you'd have to also change the dir for the tsv file. Either use max_feature_vals: str = "src/models/articlequality/data/feature_values.tsv" in pydantic or copy the file in blubber under data/ in the image
[12:19:43] <wikibugs>	 (03PS1) 10Ozge: feat: updates blubber yaml for  articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1139451
[12:21:10] <ozge_>	 @isaranto: indeed updated in the pedantic.  Small patch above is ready. Tested locally.
[12:21:23] <ozge_>	 *pydantic
[12:21:28] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] feat: updates blubber yaml for  articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1139451 (owner: 10Ozge)
[12:21:40] <isaranto>	 LGTM, I tested it as well
[12:24:26] <wikibugs>	 (03CR) 10Ozge: [V:03+2 C:03+2] feat: updates blubber yaml for  articlequality_v2 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1139451 (owner: 10Ozge)
[12:57:17] <ozge_>	 I went all well. Both models are running fine on staging. I’ll run performance tests. Which server (or local) do you prefer generally?
[12:57:36] <ozge_>	 *it went all well :)
[13:17:56] <isaranto>	 great!
[15:03:46] <ozge_>	 Hey,  quick question: where do we run the locust tests (local, stat1010.eqiad.wmnet etc) and what should be the host (inference-staging.svc.codfw.wmnet:30443, localhost:8080) ? If we run them on servers e.g. stat1010, do we have project set up already in one of the servers?
[15:07:05] <ozge_>	 Just asking to keep the scores consistent in the csv files.
[15:08:33] <georgekyz>	 ozge_: You can run the locust on every stat machine. You do not need a localhost because you need to hit the servers 
[15:08:43] <georgekyz>	 I am running mine on stat10 machine 
[15:09:16] <georgekyz>	 ozge_: you do not need a project for that, just ssh to one of the stat machines, clone your repo/branch and run the locust
[15:18:52] <ozge_>	 Ok cool! I’m trying 1010. looks like Pip install gets stuck. Did you have have something similar before?
[15:20:42] <aiko>	 ozge_: try "set_proxy" first
[15:22:04] <isaranto>	 yes, set_proxy summarizes this https://wikitech.wikimedia.org/wiki/HTTP_proxy
[15:27:44] <jinxer-wm>	 FIRING: LiftWingServiceErrorRate: ...
[15:27:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=plwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:28:04] <isaranto>	 noooooooo
[15:49:13] <isaranto>	 I'm checking the above alert, the container is stuck -- it was cpu throttled
[15:50:32] <wikibugs>	 06Machine-Learning-Team: AI/ML Infrastructure Request: **Accessing topics endpoints at scale** - https://phabricator.wikimedia.org/T392833 (10Seddon) 03NEW
[15:52:42] <isaranto>	 I deployed the change for enabling multiprocessing on ptwiki, so this will be fixed as well
[15:55:25] <wikibugs>	 06Machine-Learning-Team: AI/ML Infrastructure Request: **Accessing topics endpoints at scale** - https://phabricator.wikimedia.org/T392833#10773268 (10Seddon)
[15:57:02] <isaranto>	 deployment done. Alert will resolve in a sec
[15:57:44] <jinxer-wm>	 RESOLVED: LiftWingServiceErrorRate: ...
[15:57:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=plwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[16:00:42] <isaranto>	 going afk folks, have a nice evening/rest of day!
[16:24:22] <aiko>	 o/
[17:37:12] <wikibugs>	 06Machine-Learning-Team: AI/ML Infrastructure Request: **Accessing topics endpoints at scale** - https://phabricator.wikimedia.org/T392833#10773684 (10HNordeenWMF)