[14:16:37] accraze: o/
[14:16:51] let's sync when you are online, I am not getting the issue that you described above
[14:17:18] for the isvc part, I checked /home/accraze/test.sh and I think it is not right (or maybe it used to work but not in newer versions)
[14:17:21] root@ml-sandbox:/home/elukey# kubectl get isvc enwiki-goodfaith -n kserve-test
[14:17:24] NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
[14:17:27] enwiki-goodfaith http://enwiki-goodfaith.kserve-test.example.com True 100 enwiki-goodfaith-predictor-default-2bhvc 6h42m
[14:17:42] service hostname becomes root@ml-sandbox:/home/elukey# kubectl get isvc enwiki-goodfaith -n kserve-test
[14:17:45] NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
[14:17:48] enwiki-goodfaith http://enwiki-goodfaith.kserve-test.example.com True 100 enwiki-goodfaith-predictor-default-2bhvc 6h42m
[14:17:51] oof sorry
[14:17:52] ---
[14:18:03] SERVICE_HOSTNAME becomes http://enwiki-blabla
[14:18:08] and then it is used as Host header
[14:19:54] but you also mentioned articlequality and I don't see pods related to it on minikube
[14:20:00] (I am surely missing something)
[14:57:33] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Add an envoy proxy sidecar to Kserve inference pods - https://phabricator.wikimedia.org/T294414 (10elukey) I was able to use the following config to allow pods to call via HTTP the egress gateway and force it to use https to connect to the MW api: ` apiV...
[14:57:47] I was able to use the istio egress gateway pod to:
[14:58:15] - get an http call for its service with Host: header set to the target wiki
[14:58:28] - make egress to call the mw api via https
[14:58:51] we can of course use https from pods to egress as well, but this one was simpler to test :)
[14:59:32] https://istio.io/latest/docs/tasks/policy-enforcement/rate-limit/ is also a very nice read
[15:04:42] the above mentions the use of Redis to keep track of the limits, need to double check how it works (but I suspect it needs a backend to store temp data, IIUC api-gateway does the same)
[15:05:01] on the egress gateway note, we could proceed in two ways
[15:05:17] 1) have one egress per namespace, configuring it with things to call
[15:05:25] 2) have a single egress for all namespaces
[15:05:30] ---
[15:05:55] 2) is more flexible, but all namespaces will be able to access the endpoints that we configure. I think it is fine for our use case, but we need to double check
[15:06:06] 1) is more granular but I think overkill initially
[15:10:39] in theory now, from the pods, we could call istio-egressgateway.istio-system.svc.cluster.local:80 with the right host header
[15:10:57] (still hacky and not helm-fied, all manual)
[15:21:20] o/
[15:30:59] accraze: o/
[15:32:24] elukey: nice work on istio egress gw pod
[15:33:06] i agree -- having a single egress per namespace would give us a ton of granularity but might add more complexity to the mvp
[15:33:44] yeah plus we don't really need to shield the mw api or other things from our pods, except of course circuit breaking
[15:33:48] how difficult would it be to switch to that approach later and do a single egress for all namespaces initially?
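(The egress config quoted from the Phabricator task above is truncated; as a rough reconstruction, the setup being described follows the standard Istio "egress gateway with TLS origination" pattern, sketched below. Resource names and the target host en.wikipedia.org are illustrative, not the actual manifests from the task.)

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: mw-api
spec:
  hosts:
  - en.wikipedia.org            # illustrative target wiki
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 443
    name: https
    protocol: HTTPS
---
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: mw-api-egress
  namespace: istio-system
spec:
  selector:
    istio: egressgateway        # default label on the istio-egressgateway pods
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - en.wikipedia.org
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mw-api-via-egress
  namespace: istio-system
spec:
  hosts:
  - en.wikipedia.org
  gateways:
  - mw-api-egress
  http:
  - match:
    - port: 80
    route:
    - destination:
        host: en.wikipedia.org
        port:
          number: 443
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: mw-api-tls-origination
  namespace: istio-system
spec:
  host: en.wikipedia.org
  trafficPolicy:
    portLevelSettings:
    - port:
        number: 443
      tls:
        mode: SIMPLE            # the egress gateway originates TLS towards the MW api

(With something like this in place, a pod only needs to call istio-egressgateway.istio-system.svc.cluster.local:80 over plain HTTP with the Host header set to the target wiki, as described in the chat, and the gateway handles the HTTPS hop.)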
[15:33:57] I think it shouldn't be hard
[15:34:14] in theory (still under testing :P) the config for inference service will need to change like
[15:34:29] - name: WIKI_URL
[15:34:29] value: http://istio-egressgateway.istio-system.svc.cluster.local
[15:34:48] so if we wanted to add more gateways in the future we'll just need to add them and change --^
[15:35:10] and the good bit about endpoints like above is that we can have multiple pods load balanced behind it
[15:35:10] ahhh i see, that's pretty straight-forward
[15:35:17] so we can scale egress easily
[15:37:47] moreover if we wanted to switch to mtls we could, since we would keep the mesh architecture (ingress -> pods -> egress)
[15:43:17] niiiice, this is sounding really good
[15:44:29] morning all!
[15:46:15] o/
[15:46:25] o/
[15:51:50] elukey: re: the ml-sandbox, i just re-deployed the updated enwiki-articlequality isvc (w/ transformer) and still getting the same issue
[15:52:10] can you write in here how to repro?
[15:52:13] so I can check
[15:52:14] i'm using the test-aq.sh script
[15:52:24] ah nice, checking
[15:52:58] I can hit the predictor and transformer separately
[15:53:24] but if i use SERVICE_HOSTNAME="enwiki-articlequality.kserve-test.example.com"
[15:53:39] i get 503 Service Unavailable
[15:54:06] when it should route the `request->transformer->predictor->result`
[15:55:25] mmm i thought that the transformer would have been in the same pod
[15:55:28] as the predictor
[15:56:09] ahh no i think they are actually separate pods
[16:00:19] i'm going through the kserve debug guide trying to trace down where it might break in the request flow
[16:00:21] https://github.com/kserve/website/blob/main/docs/developer/debug.md#debug-kserve-request-flow
[16:01:56] I never used the svc hostname like the above
[16:02:21] but I see
[16:02:22] enwiki-articlequality http://enwiki-articlequality.kserve-test.example.com True 100 enwiki-articlequality-predictor-default-hh4rn 39m
[16:02:31] that points to the predictor
[16:03:53] oh interesting, hmmm the latest revision only points to the predictor
[16:05:13] I tried to force the SERVICE_HOSTNAME to the transformer
[16:05:17] and I get a 500
[16:05:20] in the logs
[16:05:21] Max retries exceeded with url: /v1/models/enwiki-articlequality:predict (Caused by NewConnectionError(': Failed to establish a new connection: [Errno -5] No address associated with hostname'))
[16:05:47] ah sorry
[16:05:48] urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='enwiki-articlequality-predictor-default.kserve-test', port=80): Max retries exceeded with url: /v1/models/enwiki-articlequality:predict (Caused by NewConnectionError(': Failed to establish a new connection: [Errno -5] No address associated with hostname'))
[16:07:48] ohhh ok that makes sense because the transformer needs the predictor hostname
[16:07:50] https://github.com/wikimedia/machinelearning-liftwing-inference-services/blob/main/revscoring/articlequality/transformer/articlequality_transformer/__main__.py#L14
[16:08:02] that's supposed to be supplied by the isvc
[16:12:37] how should it pass it?
[16:21:01] not 100% sure but i believe the top-level virtual service handles the routing
[16:21:06] (still reading docs lol)
[16:24:34] I am testing the egress solution that seems to be working, but I don't find traces of it in tcpdump/logs
[16:24:37] that is weird
[16:25:26] no I am stupid, I found it
[16:25:45] works nicely :)
[16:25:48] \o/
[16:30:18] nice one!
[16:31:43] this is the single egress for all namespaces?
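(For reference, a minimal sketch of what the transformer+predictor InferenceService with the WIKI_URL change discussed above could look like. Image names and the exact container layout are illustrative, not the real Lift Wing manifests; KServe's controller normally injects --model_name/--predictor_host arguments into the transformer container, which is how the transformer is supposed to learn the predictor hostname.)

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: enwiki-articlequality
  namespace: kserve-test
spec:
  transformer:
    containers:
    - name: kserve-container
      image: articlequality-transformer:latest    # illustrative image name
      # kserve normally appends --model_name/--predictor_host args here,
      # pointing the transformer at the predictor service
      env:
      - name: WIKI_URL
        value: "http://istio-egressgateway.istio-system.svc.cluster.local"
  predictor:
    containers:
    - name: kserve-container
      image: articlequality-predictor:latest      # illustrative image name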
[16:42:19] exactly yes
[17:52:58] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Add an envoy proxy sidecar to Kserve inference pods - https://phabricator.wikimedia.org/T294414 (10elukey) I added the following bit to an inference service and it worked! ` - name: WIKI_URL value: "http://istio-egressgateway.istio-system.svc.clus...
[18:00:15] Buuuuddddgets
[18:00:49] Also, Tajh wants us to come up with a name for the full new infrastructure, something that covers Lift Wing and Train Wing
[18:00:56] I'm open to all suggestions
[18:02:25] probably should be bird or flight related
[18:04:05] I'd call it Skynet but we can't
[18:04:34] Tajh suggested some bird of prey from mythology that like murders travelers or something
[18:04:53] seems like that might not work either
[18:52:42] * elukey afk!
[21:14:29] ok been digging a bit further into the articlequality transformer isvc issue
[21:14:50] i can see knative-serving/cluster-local-gateway, however it seems that istio has no knowledge of it
[21:16:48] been digging through the docs and starting to wonder if there needs to be an envoy proxy in istio to route internal calls (like isvc->transformer->predictor)
[21:58:22] but yeah the root issue is that the transformer and predictor can't talk to each other, which is leading me to believe that the cluster-local config is not set up correctly
[21:59:01] will dig in more tomorrow, going to step away from the gateway stuff for a bit and work on upgrading the python servers to use kserve v0.7.0
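(For context on the cluster-local routing mentioned above: in a typical Knative + Istio install, internal traffic such as transformer->predictor goes through a cluster-local Gateway roughly like the sketch below, and its selector has to match the labels on an actual Istio gateway deployment, otherwise Istio has "no knowledge" of it and internal *.svc.cluster.local calls have nothing to route through. Names, namespaces, and selectors vary between Knative/net-istio versions, so treat this as an assumption to verify, not the ml-sandbox's actual config.)

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: cluster-local-gateway
  namespace: knative-serving
spec:
  selector:
    istio: cluster-local-gateway   # must match the labels on the local gateway deployment/service
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"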