[07:01:25] accraze: o/
[07:01:32] your theory is correct
[07:01:33] istio-system istio-ingressgateway-5fb74f8ddf-7fddf 1/1 Running 0 36h
[07:01:36] istio-system istiod-5d6974fdfb-dn7nk 1/1 Running 0 36h
[07:01:42] there should be a cluster-local istio gw pod
[07:02:10] knative <= 0.18 needs it; after that version the knative local gateway doesn't (it relies only on the main ingressgateway)
[07:02:23] we have the config in deployment-chart's custom.d/istio/etc.. directory
[07:03:16] ah yeah I see the istio-minimal-operator.yaml in your home
[07:03:22] lemme update it and run istioctl
[07:08:34] added
[07:14:21] I checked the "inferenceservice" configmap in kserve, it correctly mentions cluster-local-gateway.istio-system.svc.cluster.local as the local gateway, which is now available
[07:15:58] ok so in the transformer's logs I see
[07:15:59] [W 211202 07:05:50 web:2243] 404 POST /v1/models/enwiki-articlequality:predict (127.0.0.1) 192.60ms
[07:17:19] right, this is the URI in test-aq.sh
[08:14:11] * elukey running errand for a couple of hours, bbl!
[10:13:05] back
[10:13:22] so I deleted the pods (transformer + predictor) and started from scratch
[10:13:30] I don't see any log on the transformer side now
[10:13:33] but I see the 404
[10:14:43] and it seems to be coming from istio-proxy, so the ingress gateway
[10:14:46] sudo istioctl-1.9.5 proxy-config route istio-ingressgateway-5fb74f8ddf-7fddf.istio-system
[10:14:53] the above is useful to see the routes
[10:17:18] of course my test-aq.sh was out of date, syncing with Andy's :)
[10:18:29] so
[10:18:31] ---
[10:18:31] elukey@ml-sandbox:~$ sudo istioctl-1.9.5 proxy-config route istio-ingressgateway-5fb74f8ddf-7fddf.istio-system | grep enwiki-articlequality.kserve-test.example.com
[10:18:35] http.80 enwiki-articlequality.kserve-test.example.com /* enwiki-articlequality.kserve-test
[10:18:39] then
[10:19:26] elukey@ml-sandbox:~$ sudo kubectl get vs -A | grep enwiki-articlequality.kserve-test
[10:19:29] kserve-test enwiki-articlequality [knative-serving/cluster-local-gateway knative-serving/knative-ingress-gateway] [enwiki-articlequality.kserve-test.svc.cluster.local enwiki-articlequality.kserve-test.example.com]
[10:22:52] ---
[10:22:53] elukey@ml-sandbox:~$ sudo istioctl-1.9.5 proxy-config route cluster-local-gateway-c46c9b659-rc8mt.istio-system | grep enwiki-articlequality.kserve-test.example.com
[10:22:56] http.80 enwiki-articlequality.kserve-test.example.com /* enwiki-articlequality.kserve-test
[10:23:01] so far everything seems configured
[10:25:27] mmmm
[10:27:41] ok interesting, if I use
[10:27:58] SERVICE_HOSTNAME="enwiki-articlequality-predictor-default.kserve-test.example.com"
[10:28:01] I see a 500
[10:28:12] and if I check the logs of the transformer
[10:28:23] nothing
[10:28:25] predictor
[10:28:44] File "model-server/model.py", line 20, in predict
[10:28:44] inputs = request["article_text"]
[10:28:44] KeyError: 'article_text'
[10:28:44] [E 211202 10:26:53 web:2243] 500 POST /v1/models/enwiki-articlequality:predict (127.0.0.1) 7.61ms
[10:29:33] ah ok we need to use /home/accraze/aq-input.json not rev-id.json
[10:30:33] yep worked :)
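
The 500 above comes from the predictor indexing a field that rev-id.json does not carry, while aq-input.json does. A minimal sketch of that code path, assuming only what the traceback shows; the guard, the placeholder return value, and the local check are illustrative, not the real model-server/model.py (which does the actual feature extraction and scoring):

import json

def predict(request: dict) -> dict:
    # rev-id.json lacks "article_text", which is what produced the KeyError/500 above
    if "article_text" not in request:
        raise ValueError("expected 'article_text' in the request payload")
    inputs = request["article_text"]
    # the real scoring step is omitted; return a placeholder prediction
    return {"predictions": [{"chars": len(inputs)}]}

# hypothetical local check with the known-good input file
with open("/home/accraze/aq-input.json") as f:
    print(predict(json.load(f)))
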
[10:31:02] something interesting is
[10:31:02] elukey@ml-sandbox:~$ sudo kubectl get svc -n kserve-test
[10:31:03] NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
[10:31:07] enwiki-articlequality ExternalName cluster-local-gateway.istio-system.svc.cluster.local 17h
[10:31:11] enwiki-articlequality-predictor-default ExternalName cluster-local-gateway.istio-system.svc.cluster.local 80/TCP 17h
[10:32:10] so the first name, the one that leads to a 404 for us, is indeed not pointing to any port
[10:32:14] the second one is
[10:40:08] (pointing to a port)
[10:40:36] so, the good news is that the transformer -> predictor machinery works (great work Andy and Kevin)
[10:41:12] the minor annoyance is to figure out why enwiki-articlequality
[10:41:18] vs enwiki-articlequality-predictor-default
[10:41:21] leads to differences
[14:45:33] so I'm currently in the process of migrating the codfw Ganeti cluster to Buster, which involves a lot of shuffling of VMs around the reimages. the ml-etcd2* nodes are currently not using DRBD but "plain" storage, i.e. they are non-redundant and only kept on a single node to minimise I/O
[14:46:18] for the migration I need to temporarily migrate them to use DRBD for a time frame of ~2 weeks before they are eventually switched back to "plain"
[14:47:16] let me know if that causes any issues, but switching them back to "plain" after every single instance migration would be quite an overhead, since those instances will likely be migrated to new nodes multiple times as the reimages progress
[14:48:25] (and with the current KVM machine type that we use, changes to the machine config tend to change the "PCI slot" of the VM, which causes the NIC interface to change)
[14:54:09] moritzm: green light to go ahead, no issues from our side :)
[15:04:27] ok :-)
[15:05:04] I'll convert them to DRBD probably tomorrow and will ping you when they are reverted to their original state
[15:07:31] perfect
[15:32:47] I have depooled ores1001, currently trying to set debug logging
[15:32:57] my aim is to figure out the source of all those "scores errored"
[15:33:18] I can't find a way to lower the logging level of uwsgi though
[15:34:09] I managed to find a way for celery in the systemd unit
[15:39:37] but not sure if it is the right one
[15:40:24] o/
[15:44:28] accraze: o/
[15:45:26] Morning all!
[15:46:52] elukey: thanks for looking into the articlequality transformer issue :)
[15:48:26] np! It seems to be working!
[15:48:36] the only weird thing is the endpoint config
[15:48:57] I didn't dig deep into it but I suspect it is some weirdness with old-ish versions of knative
[15:49:06] yeah the routes issue is quite strange
[15:49:16] not a blocker tho
[15:53:25] in theory no
[15:53:47] is it just due to not having a port set?
[15:54:01] accraze: one nit to fix when upgrading the images - can we move away from the UA "KFserving test" etc.. ?
[15:54:30] oh yeah! i've been meaning to update that, maybe move to an env var?
[15:54:34] I'd use something more relevant, that indicates us
[15:54:44] could be an option yes
[15:55:14] if we could specify that it is the ML team it would be better, so if people find traces of our UA in the logs it will be clear who to contact
[15:56:08] agreed
[15:57:44] i think specify ml team and maybe model image type?
[15:58:09] yes +!
[15:58:11] +1
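
A minimal sketch of the env-var idea floated above, identifying the ML team and the model image in the User-Agent; the variable name and the default string are assumptions, not an agreed format:

import os

# Hypothetical env-var driven User-Agent for outgoing HTTP calls.
USER_AGENT = os.environ.get(
    "MODEL_SERVER_USER_AGENT",
    "WMF ML team - enwiki-articlequality model-server (kserve)",
)

# outgoing calls would then attach it, e.g. with the requests library:
# requests.get(url, headers={"User-Agent": USER_AGENT})
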
[15:58:26] do we have a team email yet, something like ml@ ?
[15:58:37] chrisalbon: ^^^
[15:58:47] I don't think so, but stating ML team is sufficient
[15:58:50] people will know
[15:58:54] ah okay cool
[15:59:37] accraze: I have a new version of the istio minimal config in my home dir on the ml sandbox, didn't modify yours
[15:59:46] but basically it adds the second cluster-local ingress
[16:00:03] newer kserve docs don't specify it since they rely on the newer knatives
[16:00:14] niiice
[16:00:51] yeah i had to dig deep into older commits in the knative docs repo while debugging yesterday
[16:32:05] accraze: IIUC we are not blocked anymore on the kserve 0.7 + transformer bits testing, right?
[16:32:09] (otherwise I'll keep working on it)
[16:32:37] yup that's correct -- we should be fine to move forward on both!
[16:34:03] \o/
[16:34:25] ok I'll try to concentrate on adding the base egress gw config to deployment-charts
[16:34:50] when you are ready we can also check how to add the transformer config in the charts, if you want
[17:43:15] We do have a team email! let me go find the google group
[17:45:04] oh wait nevermind, we don't. I lied. I thought we made one as part of the Data Science and Engineering teams' group email, but apparently all of our names are just hard-coded into that group. Let me fix that now
[17:47:31] ticket filed!
[18:27:16] * elukey afk!
[21:33:58] elukey: i traced through what you looked at w/ the articlequality transformer, still seems like we are unable to hit it without directly specifying it with the hostname header
[21:34:44] the enwiki-articlequality-predictor-default just goes to the predictor, the transformer does not seem to get called
[21:35:10] i think all the routing happens at the isvc level, for which we still get a 404
[21:35:53] super weird
[22:08:45] oh interesting, working on upgrading the model-server images to kserve v0.7.0, and finding that pip freeze-ing our deps is making things much harder to reason about
[22:10:21] there were a couple of older libraries included from the kfserving deps that are no longer used by the kserve deps
[22:11:44] the only downside to hand-crafting our requirements.txt file is that we offload the dependency compatibility checks to pip, which can take a loooong time in some cases
[22:12:25] (pip's new backtracking feature gets hung up on libraries with a ton of releases....like tornado)
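
One way to sanity-check a hand-crafted requirements.txt against what actually ends up in the image is to diff it against pip freeze. A rough sketch; the file name and the simple "name==version" pin format are assumptions:

import subprocess

# Names declared in the hand-curated requirements file.
declared = set()
with open("requirements.txt") as f:
    for line in f:
        line = line.strip()
        if line and not line.startswith("#"):
            declared.add(line.split("==")[0].lower())

# Everything actually installed in the image/venv.
frozen = subprocess.run(
    ["pip", "freeze"], capture_output=True, text=True, check=True
).stdout.splitlines()

for entry in frozen:
    if entry.split("==")[0].lower() not in declared:
        # candidates for removal, e.g. leftovers pulled in by the old kfserving deps
        print("installed but not declared:", entry)
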