[00:12:56] 10Lift-Wing, 10Machine-Learning-Team: Add prometheus metrics for KFServer-based model HTTP servers - https://phabricator.wikimedia.org/T292398 (10ACraze) I believe our sandbox clusters use the prometheus operator: https://github.com/kserve/kserve/tree/master/docs/samples/metrics-and-monitoring > [...] access...
[06:30:03] 10Lift-Wing, 10Machine-Learning-Team: Add prometheus metrics for KFServer-based model HTTP servers - https://phabricator.wikimedia.org/T292398 (10elukey) 05Open→03Resolved a:03elukey Ah interesting! So the Knative queue-proxy container publishes a lot of metrics, it should be a matter of adding support f...
[06:30:39] 10Lift-Wing, 10Machine-Learning-Team: Add prometheus metrics collection for Istio and Knative - https://phabricator.wikimedia.org/T289841 (10elukey) As pointed out by Andy in https://github.com/kserve/kserve/tree/master/docs/samples/metrics-and-monitoring the Knative Queue proxy container should offer a lot of...
[06:39:28] good morning :)
[06:40:26] so the knative queue-proxy container offers two prometheus endpoints, one for the target pod and one for itself
[06:59:58] in theory adding the right prometheus labels should be fine, I have a code change ready but the main question mark is where to put annotations for the queue proxy
[07:00:12] since it has a special config (not listed as a container spec)
[09:05:40] kevinbazira: o/
[09:06:07] first draft of https://wikitech.wikimedia.org/wiki/User:Elukey/MachineLearning/Deploy ready, lemme know (when you have time) if it is clear or not
[09:26:23] elukey: o/
[09:26:47] Thank you for sharing the draft. Let me go through it.
[10:27:06] np!
[10:36:45] * elukey lunch!
[11:09:31] Thank you for considering hiding the secrets and service accounts from these config files and making them modular and lean.
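[Editor's note: the open question above is where to put Prometheus scrape annotations for the queue-proxy, given that Knative injects it rather than it appearing in the chart's container spec. A minimal sketch of the usual pod-level annotation approach follows; the annotation keys are the common prometheus.io convention and the port number is an assumption, not confirmed by the log.]

```yaml
# Hypothetical pod-template annotations for Prometheus scraping.
# Knative injects the queue-proxy automatically, so these go on the
# Revision/pod template rather than on any single container spec.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"      # queue-proxy's own metrics port (assumed)
    prometheus.io/path: "/metrics"
```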
[11:09:48] In the example we see that the enwiki-goodfaith inference service is deployed using both the helmfile.yaml and values.yaml that are in the deployment-charts/helmfile.d/ml-services/revscoring-editquality directory.
[11:10:11] If one wanted to deploy the enwiki-damaging inference service, which uses the same docker image as enwiki-goodfaith, would they have to edit both the helmfile.yaml and values.yaml files in the deployment-charts/helmfile.d/ml-services/revscoring-editquality directory, or would they have to create these 2 files afresh?
[11:59:33] kevinbazira: here I am, sorry
[12:01:15] so the revscoring-editquality dir represents a collection of InferenceServices, so in theory we can just add another element to the list called "enwiki-damaging"
[12:01:31] I didn't add the use case to the doc, lemme do it
[12:10:20] Great, thanks. As you add the use case you might see some changes to the docs - I fixed a few typos.
[12:12:55] kevinbazira: ack thanks, added the use case
[12:13:38] I am going to also add an entry about what endpoints to check and some debugging commands for kubernetes
[12:22:29] kevinbazira: also added https://wikitech.wikimedia.org/wiki/User:Elukey/MachineLearning/Deploy#Test_your_model_after_deployment
[12:22:53] Thank you for adding the use case where the same helmfile subdir works for a particular group/category of models. It all makes sense now.
[12:25:56] kevinbazira: this is a first idea/draft of how we can deploy things, it can be changed, nothing is set in stone. So please feel free to come up with any new ideas or pain points.. there is still a bit of repetition, for example if we want to deploy 100 models in revscoring-editquality its values.yaml file may become very big
[12:26:06] but it is a start, imho we should refine as we go
[12:26:56] (Un)fortunately, YAML is not good at repeated values
[12:27:07] yep ... we'll definitely have to refine the values.yaml and make it a little more DRY.
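[Editor's note: per the "just add another element to the list" answer above, the values.yaml in revscoring-editquality presumably holds a list of InferenceService entries that all share one docker image. A hedged sketch of what such a fragment might look like; the key names are illustrative, not the chart's actual schema, and the storage paths are deliberately elided.]

```yaml
# Illustrative values.yaml fragment: one list entry per InferenceService,
# all served from the same revscoring-editquality image.
inference_services:
  - name: enwiki-goodfaith
    storage_uri: "s3://wmf-ml-models/goodfaith/enwiki/..."
  - name: enwiki-damaging
    storage_uri: "s3://wmf-ml-models/damaging/enwiki/..."
```

This is also the repetition elukey mentions: with 100 models the list grows linearly, since YAML has no native way to factor out repeated values.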
[12:29:15] it is a compromise between configurability and DRY-ness, and it should be generic enough (in theory) for any kind of model
[12:29:21] (inference service deployment, rather)
[12:29:36] but we can have a special case for revscoring
[12:30:05] sure thing.
[12:30:06] should I go ahead and try to deploy the enwiki-damaging inference service? I am checking so that I don't disrupt anything.
[12:31:25] elukey ^
[12:32:13] kevinbazira: yes yes please, the first step is the code review, then I can assist to get to a deployment (but it should be really easy)
[12:32:18] (famous last words)
[12:32:36] alright, let me start working on it.
[12:32:46] hihi
[12:55:58] elukey: please help review https://gerrit.wikimedia.org/r/731944
[13:20:57] (sorry, in a meeting but will check in a sec!)
[13:22:14] That's fine ... whenever you get a minute.
[13:27:37] kevinbazira: perfect!
[13:27:48] +1ed, do you have +2?
[13:29:50] Thanks. Nope, I don't have +2 rights.
[13:30:03] ah ok, lemme +2, then we can check this bit later
[13:30:16] Alright.
[13:30:34] kevinbazira: ah I forgot to document one bit - in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/731944 you can see a CI link
[13:30:53] if you inspect the content at the bottom of a looong CI output there is the diff that your change will make
[13:31:02] (basically what you'll see from helmfile diff)
[13:32:25] checking ...
[13:35:00] I ran puppet on deploy1002 so your change should be in the repo
[13:35:40] Is this the CI link you're referring to? https://integration.wikimedia.org/ci/job/helm-lint/5856/console
[13:35:55] yes correct
[13:38:34] as you can see, the diff gets expanded to a new InferenceService
[13:38:38] so it looks good
[13:44:06] kevinbazira: we are ready for https://wikitech.wikimedia.org/wiki/User:Elukey/MachineLearning/Deploy#How_to_deploy anytime
[13:45:56] Great ... let me get into deploy1002.eqiad.wmnet.
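[Editor's note: the How_to_deploy flow referenced above presumably boils down to running helmfile from the service directory on the deployment host. A hedged sketch under the paths mentioned in the chat; the checkout location and the exact apply invocation are assumptions, not quoted from the guide.]

```
# On deploy1002.eqiad.wmnet (checkout path assumed)
cd /srv/deployment-charts/helmfile.d/ml-services/revscoring-editquality
helmfile -e ml-serve-eqiad diff    # preview; same diff the helm-lint CI job prints
helmfile -e ml-serve-eqiad apply   # roll the change out to the cluster
```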
[13:50:36] Running "helmfile -e ml-serve-eqiad diff" shows that a cert has also changed and will be part of this deployment. I didn't change any certs, is this expected?
[13:50:50] lemme check
[13:52:06] kevinbazira: ah yes, it is a config that will lead to a no-op, I changed it manually some days ago to test
[13:52:09] you can proceed
[13:52:34] Alright, thank you for the confirmation.
[13:53:56] Deployment completed ... now going to test and see if the inference service is up and running.
[13:54:02] * elukey dances
[13:54:03] niceeee
[13:55:13] I ran this:
[13:55:14] time curl "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-damaging:predict" -X POST -d @input.json -i -H "Host: enwiki-damaging.revscoring-editquality.wikimedia.org" --http1.1
[13:55:19] And got this:
[13:55:41] yeah it failed, I see an issue with the pod
[13:55:43] HTTP/1.1 404 Not Found
[13:55:43] date: Tue, 19 Oct 2021 13:54:50 GMT
[13:55:43] server: istio-envoy
[13:55:43] connection: close
[13:55:43] content-length: 0
[13:55:43] real 0m0.025s
[13:55:45] user 0m0.023s
[13:55:47] sys 0m0.000s
[13:56:30] Jumping into a meeting now ... will be back shortly.
[13:56:32] so there are two ways to start debugging
[13:56:34] ack :)
[14:00:38] ah I see
[14:00:40] [I 211019 13:53:09 storage:85] Successfully copied s3://wmf-ml-models/damaging/enwiki/202105260914/ to /mnt/models
[14:00:47] FileNotFoundError: [Errno 2] No such file or directory: '/mnt/models/model.bin'
[14:00:54] the former is from the storage initializer
[14:00:59] the latter is from the kserve-container
[14:05:13] mmm weird, on s3 I don't see the model.bin file
[14:09:07] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: API Gateway Integration - https://phabricator.wikimedia.org/T288789 (10klausman) Brain dump of a discussion I had with elukey follows. It's meant as a summary of functionality needed from the API Gateway and how it may tie in with confi...
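[Editor's note: the "404, then check the pod" debugging above maps to standard kubectl steps: find the pod, then read the logs of the two containers elukey quotes (storage-initializer and kserve-container). A sketch of those commands; the namespace is assumed to match the helmfile service name, which the log does not confirm.]

```
# On a host with kubectl access to the ml-serve cluster (namespace assumed)
kubectl -n revscoring-editquality get pods
kubectl -n revscoring-editquality describe pod <pod-name>
# The two log sources quoted in the chat:
kubectl -n revscoring-editquality logs <pod-name> -c storage-initializer
kubectl -n revscoring-editquality logs <pod-name> -c kserve-container
```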
[14:09:18] Ok, I got the model's s3 link from: https://github.com/wikimedia/machinelearning-liftwing-inference-services/blob/main/revscoring/editquality/enwiki-damaging/service.yaml#L15
[14:09:18] elukey: as discussed ^^^
[14:13:54] klausman: <3
[14:14:14] It's a bit rambly, but I thought I'd include more rather than less :)
[14:14:26] I'll read it in a bit
[14:14:57] kevinbazira: I'll add another step to the guide to check the models on s3, for example
[14:15:05] elukey@ml-serve1001:~$ s3cmd ls s3://wmf-ml-models/damaging/enwiki/202108302156
[14:15:20] I don't see a model.bin in there, weird
[14:16:48] kevinbazira: we can upload the right model.bin in there, and it should work
[14:17:07] elukey: there is a model in that dir for me
[14:17:14] ml-serve1001 ~ $ s3cmd ls s3://wmf-ml-models/damaging/enwiki/202108302156/
[14:17:15] 2021-08-30 21:57  10344619  s3://wmf-ml-models/damaging/enwiki/202108302156/model.bin
[14:17:45] ah so I forgot the trailing / and it wasn't showing it to me
[14:17:48] * elukey cries
[14:17:59] ah, the joy of almost-POSIX filesystems
[14:18:59] so at this point I have the horrible suspicion that this is a problem with the kserve stack
[14:19:42] even if in theory the other enwiki-goodfaith pod came up fine 4 days ago
[14:21:06] trying to force a pod recreation
[14:22:30] no, ok, at least it is reproducible
[14:33:44] ahhh now I get it
[14:33:53] elukey@ml-serve1001:~$ s3cmd ls s3://wmf-ml-models/damaging/enwiki/
[14:33:53] DIR s3://wmf-ml-models/damaging/enwiki/202108302156/
[14:34:00] vs
[14:34:21] - name: STORAGE_URI
[14:34:22]   value: "s3://wmf-ml-models/damaging/enwiki/202105260914/"
[14:34:31] klausman, kevinbazira --^
[14:35:00] I got confused as well with the last path bit
[14:35:18] we have set an old path in helmfile
[14:35:54] so I don't recall exactly the convention for the S3 bucket path
[14:36:32] but we can either copy model.bin under 202105260914/
[14:36:45] or change helmfile to point to 202108302156/
[14:38:17] ok ...
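[Editor's note: the trailing-slash confusion above follows from S3 having no real directories, only key prefixes: a delimiter-based listing on a prefix without the trailing `/` collapses deeper keys into a single `DIR` summary line, while with the `/` it lists the actual objects. A small Python sketch of that listing logic, with a key made up to mirror the chat:]

```python
def s3_ls(keys, prefix, delimiter="/"):
    """Emulate delimiter-based S3 listing: keys directly under `prefix`
    are returned as-is; deeper keys collapse into one "DIR" entry."""
    out = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # everything below the next delimiter shows up as one "directory"
            entry = "DIR " + prefix + rest.split(delimiter, 1)[0] + delimiter
        else:
            entry = key
        if entry not in out:
            out.append(entry)
    return out

keys = ["damaging/enwiki/202108302156/model.bin"]
# Without the trailing slash you only see the "directory" summary:
print(s3_ls(keys, "damaging/enwiki/202108302156"))
# → ['DIR damaging/enwiki/202108302156/']
# With the trailing slash you see the object itself:
print(s3_ls(keys, "damaging/enwiki/202108302156/"))
# → ['damaging/enwiki/202108302156/model.bin']
```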
let me change the helmfile to point to 202108302156/
[14:38:40] kevinbazira: wait a sec, I copied over the model.bin
[14:38:44] to see if it works
[14:39:04] alright ... standing by ...
[14:39:45] this is a good use case for "what can go wrong" in the guide
[14:42:21] kevinbazira: running!
[14:43:27] we can test if it works
[14:43:49] I have tested with this: time curl "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-damaging:predict" -X POST -d @input.json -i -H "Host: enwiki-damaging.revscoring-editquality.wikimedia.org" --http1.1
[14:43:58] And got this:
[14:44:08] HTTP/1.1 404 Not Found
[14:44:08] date: Tue, 19 Oct 2021 14:43:34 GMT
[14:44:08] server: istio-envoy
[14:44:08] connection: close
[14:44:08] content-length: 0
[14:44:08] real 0m0.024s
[14:44:10] user 0m0.018s
[14:44:12] sys 0m0.004s
[14:47:26] so for some weird reason, this one is addressable via
[14:47:27] enwiki-damaging-predictor-default.revscoring-editquality.wikimedia.org
[14:47:53] interesting
[14:47:59] it may be something new in kserve
[14:48:15] kevinbazira: if you swap the Host: header with the above one it should work
[14:48:32] Yep, it worked!
[14:48:33] time curl "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-damaging:predict" -X POST -d @input.json -i -H "Host: enwiki-damaging-predictor-default.revscoring-editquality.wikimedia.org" --http1.1
[14:48:34] HTTP/1.1 200 OK
[14:48:34] content-length: 113
[14:48:34] content-type: application/json; charset=UTF-8
[14:48:34] date: Tue, 19 Oct 2021 14:48:07 GMT
[14:48:36] server: istio-envoy
[14:48:38] x-envoy-upstream-service-time: 258
[14:48:40] {"predictions": {"prediction": false, "probability": {"false": 0.8914328685316926, "true": 0.10856713146830739}}}
[14:48:43] real 0m0.304s
[14:48:45] user 0m0.013s
[14:48:47] sys 0m0.010s
[14:49:11] I'll investigate why and add more things to the guide, thanks kevinbazira for the patience :)
[14:49:27] but this is the first deploy outside of what I did earlier on for my tests!
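[Editor's note: the Host header that worked above follows the naming convention where each InferenceService is exposed through a `<name>-predictor-default` hostname. A small helper sketching that convention; the suffix and domain are taken from the chat, and whether this is the general rule (elukey only says "it may be something new in kserve") is not confirmed by the log.]

```python
def predictor_host(model, namespace, domain="wikimedia.org"):
    """Build the Host header for a predictor, per the naming convention
    observed in the chat: <name>-predictor-default.<namespace>.<domain>."""
    return f"{model}-predictor-default.{namespace}.{domain}"

host = predictor_host("enwiki-damaging", "revscoring-editquality")
print(host)
# → enwiki-damaging-predictor-default.revscoring-editquality.wikimedia.org
# Used in the working test above:
# curl "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-damaging:predict" \
#      -X POST -d @input.json -i -H "Host: ${host}" --http1.1
```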
[14:50:37] No, thank you for putting the documentation together. It is easy to follow. :)
[15:36:40] kevinbazira: I added my debug steps to https://wikitech.wikimedia.org/wiki/User:Elukey/MachineLearning/Deploy#Test_your_model_after_deployment
[15:57:01] o/
[15:57:41] elukey: Great. Thank you for adding the debug steps in detail. Tomorrow I will try deploying another model class/category.
[15:57:47] accraze: o/
[15:57:52] wow, this is great news! seems like we might be able to move over to ml-serve soon??
[16:01:46] accraze: yep!
[16:02:13] kevinbazira: ack! Another class/category will need a new helmfile config (so something like revscoring-something), we can do it together
[16:02:44] ok great ... I will let you know when I am ready to start.
[16:08:44] * elukey afk!
[16:13:23] 10Machine-Learning-Team, 10artificial-intelligence, 10Wikilabels, 10articlequality-modeling: Build article quality model for Dutch Wikipedia - https://phabricator.wikimedia.org/T223782 (10Ciell) I communicated the first results of the labeling campaign with the Dutch community. https://nl.wikipedia.org/wik...
[16:41:34] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: API Gateway Integration - https://phabricator.wikimedia.org/T288789 (10ACraze) Thanks for the brain dump @klausman! **RE: Authentication** > the next question is how this is handled (HTTP Auth? API Keys?) and how we would add/remove p...