[07:58:50] hello folks
[07:59:00] so after reviewing https://github.com/boto/botocore/pull/515/files it seems that boto supports AWS_DEFAULT_REGION
[08:30:02] opened https://github.com/kubeflow/kfserving/issues/1765
[09:04:38] and also https://github.com/kubeflow/kfserving/issues/1766
[09:20:26] so added another workaround to the docker image
[09:20:27] aaaan
[09:20:29] aaand
[09:20:30] [I 210812 09:19:34 storage:50] Copying contents of s3://wmf-ml-models/goodfaith/enwiki/202105140814/ to local
[09:20:33] [I 210812 09:19:34 credentials:1102] Found credentials in environment variables.
[09:20:36] [I 210812 09:19:34 storage:83] Successfully copied s3://wmf-ml-models/goodfaith/enwiki/202105140814/ to /mnt/models
[09:20:39] \o/
[09:23:37] ah now I see
[09:23:37] FileNotFoundError: [Errno 2] No such file or directory: '/mnt/models/model.bin'
[09:24:03] that is what Andy suspected, namely that we need to call the model "model.bin" in swift
[09:27:51] I added a new model.bin file (basically the same model fetched and then re-sent with a different name)
[09:46:38] but this leads to...
[09:46:38] File "/opt/lib/python/site-packages/revscoring/scoring/models/model.py", line 102, in load
[09:46:41] model = pickle.load(f.buffer)
[09:46:44] _pickle.UnpicklingError: invalid load key, '\x0a'.
[09:47:06] that is revscoring, so it is looking very good :)
[09:59:08] kevinbazira: o/ around?
[09:59:14] yep
[09:59:30] if you have time I'd need your help decrypting the above :D
[09:59:57] ahhh wait I think I am stupid
[10:00:08] lol
[10:00:51] yes yes lemme try to fix it, the model is garbled
[10:01:01] I realized that s3cmd get didn't work
[10:01:20] This error is possibly because of the way you downloaded the model.
I usually use something like this:
[10:01:21] wget -O enwiki.goodfaith.gradient_boosting.model https://github.com/wikimedia/editquality/blob/master/models/enwiki.goodfaith.gradient_boosting.model?raw=true
[10:03:38] because I used
[10:03:38] elukey@ml-serve1001:~$ s3cmd get s3://wmf-ml-models/goodfaith/enwiki/202105140814/enwiki.goodfaith.gradient_boosting.model
[10:03:41] download: 's3://wmf-ml-models/goodfaith/enwiki/202105140814/enwiki.goodfaith.gradient_boosting.model' -> './enwiki.goodfaith.gradient_boosting.model' [1 of 1] 110612 of 110612 100% in 0s 762.63 KB/s done
[10:03:52] I wanted to rename it and publish it as model.bin
[10:04:03] but
[10:04:09] elukey@ml-serve1001:~$ file enwiki.goodfaith.gradient_boosting.model
[10:04:09] enwiki.goodfaith.gradient_boosting.model: HTML document, UTF-8 Unicode text, with very long lines
[10:04:47] Try this:
[10:05:00] wget -O model.bin https://github.com/wikimedia/editquality/blob/master/models/enwiki.goodfaith.gradient_boosting.model?raw=true
[10:05:41] yep yep just tried
[10:06:30] restarting the pod
[10:07:42] wow
[10:07:42] elukey@ml-serve-ctrl1001:~$ kubectl get pods -A
[10:07:42] NAMESPACE NAME READY STATUS RESTARTS AGE
[10:07:46] elukey-test enwiki-goodfaith-predictor-default-knzht-deployment-d985c67bbsq 2/2 Running 0 71s
[10:07:49] \o/
[10:07:59] Woohoo :D
[10:08:01] lemme see if it works
[10:14:39] mmm there seems to be something not working with istio routing, will investigate
[10:14:42] thanks kevinbazira :)
[10:15:04] anytime elukey :)
[10:28:59] ok so I am getting a 503 from envoy (the istio ingress gateway)
[10:29:14] elukey@ml-serve-ctrl1001:~$ echo $SERVICE_HOSTNAME
[10:29:14] enwiki-goodfaith.elukey-test.example.com
[10:29:15] elukey@ml-serve-ctrl1001:~$ curl http://ml-serve1001.eqiad.wmnet:8081/v1/models/enwiki-goodfaith:predict -X POST -d @input.json -i -H "Host: $SERVICE_HOSTNAME"
[10:47:02] * elukey lunch
[12:44:53] found a way to enable debug logs for the ingress gw
[12:44:55] and I see
[12:44:55]
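The wget/file exchange above boils down to one gotcha: GitHub's /blob/ URL serves an HTML page, while ?raw=true serves the actual file bytes, so a mis-downloaded model only fails later in pickle.load. A minimal sketch of a pre-upload sanity check (check_model is a hypothetical helper, not part of any existing tooling):

```shell
# Hypothetical guard: verify a downloaded model is not an HTML page before
# renaming it to model.bin and uploading it to swift.
check_model() {
    # Grab the first few bytes; an HTML page starts with a doctype/tag,
    # while a pickled revscoring model starts with binary pickle opcodes.
    first_bytes=$(head -c 15 "$1")
    case "$first_bytes" in
        "<!DOCTYPE html"*|"<html"*)
            echo "looks like HTML, re-download with ?raw=true" ;;
        *)
            echo "not HTML, plausibly a real model file" ;;
    esac
}
```

Running `file enwiki.goodfaith.gradient_boosting.model` (as above) gives the same signal; the function just makes it scriptable.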
2021-08-12T12:39:03.392868Z debug envoy router [C684908][S2772849604290081203] unknown cluster 'outbound|80||knative-local-gateway.istio-system.svc.cluster.local'
[12:58:04] we have cluster-local-gateway.istio-system.svc.cluster.local
[12:58:22] that is expected in theory, the knative-local-gateway came only after 0.18
[13:03:32] ahh it is in the config map of our config, it may be a new thing of 0.6
[13:04:57] namely https://github.com/kubeflow/kfserving/commit/3b5b1b773d26b11a4676a9417b9e6fe452b64152
[13:05:06] but IIRC the old cluster-local was supported
[13:17:18] tried to ask on slack, but I suspect this is a bug
[13:29:08] all right, found the issue: there is a nit to configure in the config-map, I'll send a code review
[13:29:11] and now
[13:29:14] drum rolls
[13:29:29] elukey@ml-serve-ctrl1001:~$ curl http://ml-serve1001.eqiad.wmnet:8081/v1/models/enwiki-goodfaith:predict -X POST -d @input.json -i -H "Host: $SERVICE_HOSTNAME"
[13:29:32] HTTP/1.1 200 OK
[13:29:35] content-length: 112
[13:29:37] content-type: application/json; charset=UTF-8
[13:29:40] date: Thu, 12 Aug 2021 13:27:13 GMT
[13:29:42] server: istio-envoy
[13:29:45] x-envoy-upstream-service-time: 31933
[13:29:45] YYYYYYYYYYYYYYYYEEEEEEEEEEEEEEEEEEEEESSSSSSSSSSSSS \o/ \o/ \o/
[13:29:48] {"predictions": {"prediction": true, "probability": {"false": 0.03387957196040836, "true": 0.9661204280395916}}}
[13:29:55] * elukey dances
[13:34:58] kevinbazira: --^
[13:36:23] niiiiice ... good job elukey 👏👏👏
[14:08:56] kevinbazira: this is a big piece of work from the whole team, you and Andy made it possible to run ORES models on k8s :)
[14:58:58] Great job!! This is huge
[15:31:19] Outstanding work, Luca!
[15:43:31] <3
[15:52:24] \o/
[15:52:38] elukey: omg this is amazing!!
[15:57:10] two years ago, running ORES models on k8s was basically seen as impossible
[15:58:13] and now we are doing it on all wmf infrastructure too -- so cool!
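For context on the "nit" in the config-map: KFServing 0.6 routes cluster-internal traffic through a gateway named knative-local-gateway (per the linked commit), while this cluster only has cluster-local-gateway, hence envoy's "unknown cluster" 503. A rough sketch of the relevant inferenceservice config-map fragment, with the local gateway pointed at the service that actually exists in the cluster. The key names here are from memory of the 0.6 release, so treat this as illustrative, not the actual code review:

```yaml
# Illustrative fragment of the kfserving inferenceservice config-map.
# localGateway/localGatewayService are pointed at the gateway this cluster
# really has (cluster-local-gateway instead of knative-local-gateway).
ingress: |-
  {
    "ingressGateway": "knative-serving/knative-ingress-gateway",
    "ingressService": "istio-ingressgateway.istio-system.svc.cluster.local",
    "localGateway": "knative-serving/cluster-local-gateway",
    "localGatewayService": "cluster-local-gateway.istio-system.svc.cluster.local"
  }
```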
[16:03:35] All the way up from the bare metal
[16:06:23] Whhheeeeeeew
[16:10:29] It is cool that it is on a model originally created on RevScoring
[16:10:39] massive congratulations on reaching the kubeflow & k8s milestone to chrisalbon and team!
[16:11:23] accraze: always challenge what's possible
[16:11:32] This is definitely a huge win for the whole team. Both getting the infrastructure built from scratch AND figuring out how to migrate models over
[16:15:41] accraze: \o/ one thing that I noticed though was that the first predict calls were very slow (like 20s slow)
[16:15:47] but some tuning is probably needed
[16:17:11] ahh yeah im sure we'll need to tune
[16:19:47] elukey: what revid were you using to test the prediction?
[16:23:19] accraze: 1234 IIRC, a random one
[16:33:26] 10Machine-Learning-Team, 10SRE Observability: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (10elukey) To keep archives happy - the errors are down, but the original problem was not fixed. This may be due to Knative's logstash support (no idea where we tune it, it...
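The ~20s first-prediction latency mentioned above is consistent with cold start: Knative scales the pod from zero and the server unpickles the model before answering the first request (note the x-envoy-upstream-service-time of 31933 ms on the first successful call earlier). A small sketch for measuring it, assuming the same SERVICE_HOSTNAME and input.json used in the curl calls above (time_predict is a hypothetical helper):

```shell
# Hypothetical helper: print only the total wall time of one predict call.
# Assumes the SERVICE_HOSTNAME env var and input.json file used above.
time_predict() {
    curl -s -o /dev/null -w '%{time_total}s\n' \
         -X POST -d @input.json \
         -H "Host: $SERVICE_HOSTNAME" \
         http://ml-serve1001.eqiad.wmnet:8081/v1/models/enwiki-goodfaith:predict
}

# The first call pays scale-from-zero plus model load; later calls should
# be much faster:
# for i in 1 2 3; do time_predict; done
```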
[17:02:12] accraze, kevinbazira - I created https://wikitech.wikimedia.org/wiki/User:Elukey/MachineLearning/kfserving#Deploy_a_custom_InferenceService
[17:02:24] if you want to do some tests, it may be a good way
[17:03:19] I'd ask you to use the custom namespace as a playground (with some precautions of course :) and to write in the chan if you modify other values outside your namespace (so we all know what changed)
[17:04:21] in theory it should work
[17:04:31] if not lemme know and I'll amend the guide :)
[17:04:47] going afk, will read later in case some question pops up :)
[17:11:42] cool will take a look later today
[23:41:05] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): API Gateway Integration - https://phabricator.wikimedia.org/T288789 (10ACraze)
[23:55:57] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): API Gateway Integration - https://phabricator.wikimedia.org/T288789 (10ACraze) @elukey was able to get our enwiki-goodfaith model running on the production ml-serve cluster today. I think we should use that inference service as a target for our first API rou...
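For anyone following the guide linked at 17:02, a custom InferenceService under KFServing 0.6's v1beta1 API looks roughly like this. This is a sketch, not the manifest from the wiki page: the image name is a placeholder, and STORAGE_URI is the convention that triggers the storage-initializer to copy the model to /mnt/models (as in the logs at 09:20).

```yaml
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: enwiki-goodfaith
  namespace: elukey-test
spec:
  predictor:
    containers:
      - name: kfserving-container
        # Placeholder: a custom server image that loads /mnt/models/model.bin
        image: example-registry/revscoring-model-server:latest
        env:
          # Tells the storage-initializer what to copy into /mnt/models
          - name: STORAGE_URI
            value: s3://wmf-ml-models/goodfaith/enwiki/202105140814/
```

Apply it with kubectl apply -f in the test namespace, then check readiness with kubectl get inferenceservices -n elukey-test.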