404: Model with name enwiki-damaging does not exist.

[05:58:00] Lift-Wing, Machine-Learning-Team: Sunset MiniKF sandboxes - https://phabricator.wikimedia.org/T293677 (elukey) We are drafting a guide in https://wikitech.wikimedia.org/wiki/User:Elukey/MachineLearning/Deploy, and Kevin was able to deploy enwiki-damaging to Lift Wing successfully. This is a good sign to...
[07:13:48] elukey: o/
[07:15:07] Please let me know whenever you have a minute so we can deploy the enwiki inference service in the revscoring-draftquality category.
[07:23:41] kevinbazira: morning! I have all the time now :)
[07:25:27] Great. Let me work on a patch that I will push to gerrit shortly.
[07:27:18] kevinbazira: this one is more complicated, you are jumping to the most complex use case now, it is not listed in the tutorial yet :D
[07:28:01] oh ok ...
[07:28:12] if you want I can send the code review, and then we can review it together, so you let me know the unclear parts etc..
[07:28:40] (on meet so we can discuss the various bits)
[07:29:13] ok ... that's fine with me.
[07:30:19] ack so gimme 10 mins :)
[07:32:45] kevinbazira: is there a task for it?
[07:33:15] Not yet.
[07:33:54] Let me create it now
[07:34:06] super so we can collect all the steps in there and use it as guideline
[07:36:03] here we go: https://phabricator.wikimedia.org/T293858
[07:37:05] Speaking of steps, these are some of the steps I was going to take;
[07:37:05] 1. $ cd deployment-charts/helmfile.d/ml-services/
[07:37:05] 2. $ mkdir revscoring-draftquality
[07:37:05] 3. create both helmfile.yaml and values.yaml for enwiki-draftquality inference service
[07:37:05] I wonder whether there is more ...
[07:38:51] kevinbazira: there is some work that SREs need to do in puppet first, plus a new kubernetes namespace to create (also something that SREs need to do). After that, you can basically duplicate the example helmfile and values config, and replace the placeholders (uppercase stuff) with the new name
[07:39:18] but lemme file all the changes on my side, I'll explain them to you and then I'll let you proceed with the helmfile change
[07:39:59] oh ok ... looking forward to seeing the other steps :)
[08:02:28] ok kevinbazira all the SRE-side config should be in place, do you have time if we jump on meet or do you prefer later?
[08:02:59] It's fine we can jump onto the call now
[08:03:31] meet.google.com/wwj-ykxt-hdt
[10:31:21] * elukey lunch!
[12:03:58] elukey: I have checked s3cmd ls s3://wmf-ml-models/draftquality/enwiki/ and this model doesn't seem to exist.
[12:04:56] Could you please confirm whether we have this model on swift?
[12:05:20] The model in question: https://github.com/wikimedia/machinelearning-liftwing-inference-services/blob/main/revscoring/draftquality/enwiki-draftquality/service.yaml#L15
[13:10:18] kevinbazira: I am back!
[13:11:32] I confirm that I don't see the draftquality model, we'd need to upload it
[13:14:35] Thank you for the confirmation elukey. Do we have a guide on how we can upload models to swift?
[13:18:56] kevinbazira: everything still in flux, you can use https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/719668/3/utils/model_upload.sh from ml-serve1001
[13:19:10] or maybe scp model.py ml-serve1001.eqiad.wmnet:
[13:19:14] err model.bin
[13:19:24] that get it to your home dir on ml-serve1001
[13:19:27] and then you can use
[13:20:14] s3cmd put model.bin s3://wmf-ml-models/draftquality/enwiki/202107141649
[13:20:29] not sure about the 202107.. bit though
[13:23:05] Thanks. Let me upload it
[14:13:01] elukey: I uploaded the model and pushed a patch for the enwiki-draftquality inference service. Please help review whenever you get a minute: https://gerrit.wikimedia.org/r/732347
[14:21:06] kevinbazira: lgtm!
[14:21:19] Ready to deploy? I can +2 and run puppet
[14:23:08] Great thanks. Jumping to deploy1002.eqiad.wmnet now.
[14:23:36] all ready to deploy :)
[14:24:04] * elukey drum rolls
[14:25:08] in the meantime, I am writing https://wikitech.wikimedia.org/wiki/User:Elukey/MachineLearning/Deploy#How_to_add_a_new_helmfile_config
[14:28:03] the deploy has been completed :)
[14:28:29] yessssss
[14:28:34] currently checking the pods
[14:28:41] yessssssss
[14:30:57] mmm the pods are taking a lot to load
[14:32:14] kevinbazira: do you know more or less how big is the model in MBs?
[14:32:25] I can't get from the s3cmd the size
[14:32:43] it says 1843797 but if those are bytes it seems too little
[14:33:49] Yep, it's about that size. Mostly because it is bz2 compressed: https://github.com/wikimedia/draftquality/blob/master/models/enwiki.draft_quality.gradient_boosting.model.bz2
[14:36:15] ok then there is an issue with the storage initializer
[14:37:10] yep
[14:37:11] botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://s3.amazonaws.com/wmf-ml-models?prefix=draftquality%2Fenwiki%2F202107141649%2F&encoding-type=url"
[14:37:54] this is surely my fault
[14:38:26] yes
[14:39:17] so we have a kubernetes secret called "swift-s3-credentials" in all namespaces
[14:39:27] and it has some annotations, like the endpoint etc..
[14:39:46] basically following https://github.com/kserve/kserve/blob/master/docs/samples/storage/s3/README.md#create-s3-secret-and-attach-to-service-account
[14:40:13] this is the second time that I add the annotations with serving.kserve.org (instead of .io)
[14:40:18] so they are not picked up
[14:42:43] kevinbazira: your new pods are up!
[14:43:01] I am fixing the puppet private config (edited the kubernets secret manually)
[14:45:32] Thank you for fixing the pods. When I run time curl "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-damaging:predict" -X POST -d @input.json -i -H "Host: enwiki-draftquality-predictor-default.revscoring-draftquality.wikimedia.org" --http1.1
[14:45:35] I get:
[14:45:52] HTTP/1.1 404 Not Found
[14:45:53] content-length: 145
[14:45:53] content-type: text/html; charset=UTF-8
[14:45:53] date: Wed, 20 Oct 2021 14:44:06 GMT
[14:45:53] server: istio-envoy
[14:45:53] x-envoy-upstream-service-time: 8
[14:45:55] 404: Model with name enwiki-damaging does not exist.404: Model with name enwiki-damaging does not exist.
[14:45:58] real 0m0.071s
[14:46:00] user 0m0.019s
[14:46:02] sys 0m0.005s
[14:46:34] I wonder where it is picking enwiki-damaging from 🤔
[14:47:33] kevinbazira: it is in the /v1/models/enwiki-damaging:predict URI
[14:47:49] that one needs to be changed according to the model that you deployed
[14:48:57] Whoops ... I think my head is getting tired :)
[14:49:02] It has worked!
[14:49:04] time curl "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-draftquality:predict" -X POST -d @input.json -i -H "Host: enwiki-draftquality-predictor-default.revscoring-draftquality.wikimedia.org" --http1.1
[14:49:04] HTTP/1.1 200 OK
[14:49:05] content-length: 174
[14:49:05] content-type: application/json; charset=UTF-8
[14:49:05] date: Wed, 20 Oct 2021 14:48:18 GMT
[14:49:06] server: istio-envoy
[14:49:08] x-envoy-upstream-service-time: 2355
[14:49:12] {"predictions": {"prediction": "OK", "probability": {"OK": 0.6755321636163237, "attack": 0.048572863418591725, "spam": 0.13317597668241205, "vandalism": 0.1427189962826724}}}
[14:49:15] real 0m2.382s
[14:49:17] user 0m0.016s
[14:49:19] sys 0m0.008s
[14:49:20] \o/ \o/ \o/
[14:49:52] Thank you for the guidance Luca! :)
[14:50:21] this is great Kevin, it was the most difficult (for the moment) deployment type
[14:51:35] super happy :)
[14:51:49] going to take a little break, bbiab!
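The 404 above came from a URI that still named enwiki-damaging while the Host header already pointed at enwiki-draftquality. A minimal sketch of one way to avoid that mismatch is to derive both values from the same model name. The endpoint and the Host pattern are copied from the curl commands in the log; PredictRequest and build_predict_request are hypothetical names used only for illustration, not part of any Lift Wing tooling.

```python
from dataclasses import dataclass

# Assumption: endpoint taken verbatim from the curl commands in the log.
ENDPOINT = "https://inference.svc.eqiad.wmnet:30443"


@dataclass(frozen=True)
class PredictRequest:
    """URL and Host header for one predict call (hypothetical helper type)."""
    url: str
    host_header: str


def build_predict_request(model_name: str, namespace: str) -> PredictRequest:
    """Build the URI and the Host header from a single model name.

    Using one variable for both places means the URI cannot keep an old
    model name while the Host header already targets the new service.
    """
    url = f"{ENDPOINT}/v1/models/{model_name}:predict"
    # Assumption: Host pattern as seen in the log,
    # <model>-predictor-default.<namespace>.wikimedia.org
    host = f"{model_name}-predictor-default.{namespace}.wikimedia.org"
    return PredictRequest(url=url, host_header=host)


if __name__ == "__main__":
    req = build_predict_request("enwiki-draftquality", "revscoring-draftquality")
    print(req.url)
    print("Host:", req.host_header)
```

The two printed values can then be pasted into the curl command in place of the hand-typed URL and Host header.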
[14:52:19] you're the MVP on this one 👏👏👏
[14:52:26] also taking a break now ...
[15:22:27] nice job Kevin!
[16:08:03] kevinbazira: I added https://wikitech.wikimedia.org/wiki/User:Elukey/MachineLearning/Deploy#How_to_add_a_new_helmfile_config to the docs, it doesn't contain all the info that we discussed during our chat but it should be a good start
[16:09:01] o/
[16:09:33] woooow looks like draftquality can run on lift wing now, nice one kevinbazira and elukey!
[16:10:21] that's super exciting
[16:11:56] yesssss
[16:12:06] accraze: I also added a lot of docs for you :D
[16:12:48] Kevin already started to read it, and he doesn't hate me so far
[16:13:15] (he is always very polite, not sure if he really likes the docs or not)
[16:13:55] hahaha
[16:14:26] yeah i def want to try deploying another model/class soon
[16:14:53] what all do we have now, editquality & draftquality?
[16:15:19] editquality with enwiki-{goodfaith,damaging} and enwiki-draftquality
[16:15:51] ahhh nice, wasn't the mvp goal to have 4 enwiki models deployed?
[16:16:30] yes something like that IIRC
[16:16:56] one thing that I want to discover now is where are the models stored in kubernetes
[16:17:02] I mean the /mnt/blabla stuff
[16:17:16] ahhh yeah the pvc stuff
[16:17:43] totally ignorant about where we store it, and since we are going to have the models stored for each pod, I'd like to triple check that we have enough disk space
[16:18:08] otherwise it may be a very big problem when we scale up the pods
[16:18:14] good call, i didn't think about that
[16:18:23] me too up to today :(
[16:19:01] also there is another weird thing, the kserve config seems to add an extra suffix to the Host values of the pods
[16:19:04] for example
[16:19:21] enwiki-draftquality-predictor-default.revscoring-draftquality.wikimedia.org
[16:19:29] vs
[16:20:16] enwiki-goodfaith.revscoring-editquality.wikimedia.org
[16:20:44] weird! it's like it the isvc pod name now?
[16:21:09] we currently have a mixture of istio routes, the enwiki-goodfaith has been created before the move to kserve 0.7
[16:21:35] yes it seems the new naming, but I want to find the commit etc..
[16:21:38] or ask to upstream
[16:23:42] https://github.com/kserve/kserve/issues/1397
[16:23:47] so it seems that they want to remove it
[16:24:00] but why did we not get it for enwiki-goodfaith?
[16:25:35] ah no we have it
[16:25:36] enwiki-goodfaith-predictor-default.revscoring-editquality.wikimedia.org
[16:25:38] and it works
[16:27:10] no idea, will investigate (I'll probably ask to slack)
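
Going back to the 14:40 exchange about the "swift-s3-credentials" secret: the annotations were written under serving.kserve.org instead of serving.kserve.io, so the storage initializer ignored them and fell back to s3.amazonaws.com. A rough sketch of a pre-deploy check that flags that kind of typo is below. The function name misnamed_annotations and the sample annotation values are hypothetical; only the serving.kserve.io prefix comes from the log and the linked KServe README.

```python
from typing import List, Mapping

# The prefix the chat says is the one that actually gets picked up.
EXPECTED_PREFIX = "serving.kserve.io/"


def misnamed_annotations(annotations: Mapping[str, str]) -> List[str]:
    """Return annotation keys that look like KServe keys but use another domain.

    A key such as "serving.kserve.org/s3-endpoint" is valid Kubernetes
    metadata, so nothing rejects it; it is simply never read.
    """
    suspicious = []
    for key in annotations:
        domain = key.partition("/")[0]
        if domain.startswith("serving.kserve.") and not key.startswith(EXPECTED_PREFIX):
            suspicious.append(key)
    return sorted(suspicious)


if __name__ == "__main__":
    # Hypothetical annotation values, for illustration only.
    sample = {
        "serving.kserve.org/s3-endpoint": "swift.example.invalid",
        "serving.kserve.io/s3-usehttps": "1",
    }
    for key in misnamed_annotations(sample):
        print("unexpected annotation domain:", key)
```

This only catches a wrong domain on otherwise KServe-looking keys; it does not validate the annotation names or values themselves.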