[06:32:55] accraze: o/
[06:33:16] interesting I'll take a look
[06:33:49] for some reason I thought that the predictor's image wasn't able to work without transformer
[06:33:52] with the actual settings
[06:37:39] if I try to hit the transformer directly I get
[06:37:39] mwapi.errors.APIError: badinteger: Invalid value "None" for integer parameter "revids". -- None
[06:37:42] [E 211203 06:36:12 web:2243] 500 POST /v1/models/enwiki-articlequality:predict (127.0.0.1) 184.68ms
[06:39:49] I am using @/home/accraze/aq-input.json in curl, switching to the rev-id one
[06:40:38] ok and now I see the 404
[06:40:41] mmmmm
[06:46:39] okok
[06:46:41] so elukey@ml-sandbox:~$ sudo kubectl get vs enwiki-articlequality -n kserve-test -o yaml
[06:46:54] this shows that the endpoint clearly goes to the transformer
[06:47:07] http:
[06:47:07] - headers:
[06:47:07]     request:
[06:47:07]       set:
[06:47:07]         Host: enwiki-articlequality-transformer-default.kserve-test.svc.cluster.local
[06:49:20] accraze: in my opinion we should stop debugging these issues, it may very well be kfserving==0.3 (I bet on it)
[06:49:47] even if in theory the kserve controller is 0.7
[06:50:25] but testing different versions may lead to subtle bugs
[10:34:41] elukey: scratch what I wrote yesterday, the ml-etcd nodes are already on DRBD, this setting was only ever applied to the kubernetes etcd cluster
[10:36:47] moritzm: ack! Do you think that we should move to drbd as well after you have finished your work?
[10:38:03] all are on DRBD, only the k8s etcd are the exception to use non-replicated local disk storage
[10:38:17] I'd say if the current etcd setup works for you, let's keep it as it is
[10:38:47] for k8s there were some latency issues with replication, but I suppose the data footprint is much larger as well
[10:48:28] yeah but I'd prefer not to see the problem when we load more services :D
[11:34:01] * elukey lunch!
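A note on the `badinteger` error at 06:37:39: it is consistent with the request body carrying no revision ID, so that `None` ends up in the `revids` parameter of the MediaWiki API call. A minimal sketch of a guard that fails early with a clearer message is below. The `rev_id` key name and the `extract_rev_id` helper are assumptions for illustration, not the transformer's actual code.

```python
from typing import Any, Mapping


def extract_rev_id(payload: Mapping[str, Any]) -> int:
    """Return the revision ID from a request payload, or raise ValueError.

    Hypothetical helper: the "rev_id" key is an assumed input field name.
    """
    rev_id = payload.get("rev_id")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rev_id, bool) or not isinstance(rev_id, int):
        raise ValueError(f"expected an integer 'rev_id', got {rev_id!r}")
    if rev_id <= 0:
        raise ValueError(f"'rev_id' must be positive, got {rev_id}")
    return rev_id


if __name__ == "__main__":
    print(extract_rev_id({"rev_id": 12345}))
    try:
        extract_rev_id({"article_text": "..."})
    except ValueError as err:
        print(f"rejected: {err}")
```

Checking the payload before any upstream call would turn the 500 seen in the log into an explicit client-side input error, which is easier to tell apart from routing problems such as the 404.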
[11:44:19] Machine-Learning-Team, artificial-intelligence, Wikilabels, articlequality-modeling: Build article quality model for Dutch Wikipedia - https://phabricator.wikimedia.org/T223782 (Ciell) final label column is now filled by one user, others are checking and leaving comments. I expect we can take thi...
[14:45:11] got this link today
[14:45:12] https://arize.com/resource/ebook-machine-learning-observability-101/?utm_campaign=Q42021:%20ML%20Observability%20Ebook&utm_source=email&utm_content=The%20sequence
[14:45:20] needs registration but it looks interesting
[15:31:54] o/
[15:38:31] good morning :)
[15:38:53] elukey: agreed on prioritizing upgrading images to kserve v0.7.0
[15:39:16] i wouldn't be surprised if that was the main issue w/ transformer lol
[15:40:19] it may be something else but let's start with the same version everywhere, just to be sure
[15:51:05] filed the first CR for the egress gw https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/743438/
[15:51:10] what a pain yaml :D
[15:51:26] haha
[15:51:32] nice!
[15:52:40] I will add a basic layer7 config in the knative chart, we have the ingress one in there too
[15:52:51] and the last bit will be the circuit breaking stuff
[15:53:03] the idea is to go with the single egress for the moment
[15:53:06] as we discussed
[16:11:05] elukey: does this mean we'll need to take the `sidecar.istio.io/inject: "false"` annotation off the isvcs?
[16:12:03] nono
[16:12:52] we will use the egress endpoint dns name in our configs
[16:13:01] ahhh ok gotcha
[16:13:20] I'd love to switch to mTLS someday in the future
[16:13:37] in case we'll need to just switch from that dns name to the ones that we need to access
[16:13:46] and istio's sidecars should take care of the magic
[16:17:49] (brb)
[17:02:49] hah of course the kserve and revscoring dependencies don't play nicely together
[17:04:20] yeah I kinda imagined it :D
[17:07:37] yeah might need to get a bit creative during the multi-stage builds, but it's doable
[17:57:41] ok I tried to add another endpoint to the egress gw and it doesn't work
[17:58:03] I declare defeat for today and log-off for the weekend :)
[17:58:09] have a nice day/weekend folks!
[18:10:24] see ya elukey!
[18:53:18] phew ok i think i figured out the dependency upgrades for articlequality model-server image
[18:55:11] that was hard
[18:56:49] CR: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/734432
[18:58:08] there's a weird edge case with a dependency of revscoring called yamlconf. it's a wrapper around pyyaml but it was pinned to use an older version of pyyaml that kserve and other k8s tools no longer use
[19:00:34] there's a hack in requirements.txt where we install articlequality & yamlconf using `git+https://.git@commit` syntax
[19:01:31] not the best long-term solution, but it works for now
[19:04:00] i suspect we may need to do something similar for the other revscoring images :/
[19:14:18] Machine-Learning-Team, artificial-intelligence, Wikilabels, articlequality-modeling: Build article quality model for Dutch Wikipedia - https://phabricator.wikimedia.org/T223782 (Halfak) Fantastic! I'll work to get something together before Thursday so we might be able to review then.
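A note on the pyyaml clash described at 18:58:08: conflicts of this kind can be spotted before a build by comparing the exact pins that each dependency declares. The sketch below is one way to do that, not what the CR does. The `find_pin_conflicts` helper is hypothetical, and the version numbers in the example are placeholders, not the real pins of yamlconf or kserve.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping

# Matches exact pins such as "PyYAML==5.4.1". Comments, blank lines and
# VCS installs ("git+https://...") do not match and are skipped.
PIN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*([^\s#;]+)")


def normalize(name: str) -> str:
    """Normalize a package name in the style of PEP 503 (case, '-', '_', '.')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def find_pin_conflicts(
    requirement_sets: Mapping[str, Iterable[str]],
) -> Dict[str, Dict[str, List[str]]]:
    """Return {package: {version: [sources]}} for packages pinned inconsistently."""
    pins: Dict[str, Dict[str, List[str]]] = defaultdict(dict)
    for source, lines in requirement_sets.items():
        for line in lines:
            match = PIN.match(line)
            if match:
                name, version = normalize(match.group(1)), match.group(2)
                pins[name].setdefault(version, []).append(source)
    # A package is in conflict when more than one distinct version is pinned.
    return {name: versions for name, versions in pins.items() if len(versions) > 1}


if __name__ == "__main__":
    # Placeholder versions, for illustration only.
    example = {
        "yamlconf": ["pyyaml==1.0"],
        "kserve": ["PyYAML==2.0", "# unrelated comment"],
    }
    print(find_pin_conflicts(example))
```

This only compares `==` pins. Range specifiers and transitive dependencies would need a real resolver, so treat it as a quick pre-build check rather than a replacement for pip's own resolution.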