[07:39:53] good morning :)
[08:10:48] (PS7) AikoChou: outlink: use async HTTP calls to fetch data [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/807135 (https://phabricator.wikimedia.org/T311043)
[08:11:36] aiko: o/
[08:11:52] did you see https://pypi.org/project/mwapi/0.6.0/ ? :)
[08:14:18] elukey: o/ I saw it, but there is a small issue I found https://phabricator.wikimedia.org/T313493#8100567
[08:15:58] ah snap!
[08:19:54] (CR) CI reject: [V: -1] outlink: use async HTTP calls to fetch data [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/807135 (https://phabricator.wikimedia.org/T311043) (owner: AikoChou)
[08:33:52] I am testing all new docker images locally with kserve 0.8 and for the moment all works
[08:51:11] nice all tests went fine!
[08:51:57] (CR) Elukey: [C: +2] "Tested all models locally with Docker, predictions came out fine without any visible regression. I think that we are ready to build the Do" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/815721 (https://phabricator.wikimedia.org/T311982) (owner: Elukey)
[09:00:46] (Merged) jenkins-bot: Update Python model servers and requirements to KServe 0.8 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/815721 (https://phabricator.wikimedia.org/T311982) (owner: Elukey)
[09:01:39] new docker images in progress
[09:20:03] filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/816710/ to test editquality and articlequality on staging
[09:20:12] the control plane is still to be migrated
[09:33:15] elukey: o/ good morning!
[09:33:25] klausman: thanks for the merge.
[09:34:06] I am going to deploy cs and enwiki articletopic on codfw staging.
[09:36:15] o/
[09:36:27] the codfw staging deployment has been completed successfully.
[09:36:28] checking pods now ...
[09:38:37] all new pods are up and running:
[09:38:38] NAME                                                              READY   STATUS    RESTARTS   AGE
[09:38:38] cswiki-articletopic-predictor-default-mvxdz-deployment-7fcrv4px   3/3     Running   0          2m46s
[09:38:38] enwiki-articletopic-predictor-default-mlpfk-deployment-594xxcm9   3/3     Running   0          2m44s
[09:39:07] kevinbazira: I am probably going to upgrade all Docker images to kserve 0.8 this afternoon, and the control plane too. In theory there should be no change to what you are doing now, but I'll ping you beforehand as a warning (I'll work only in staging)
[09:40:29] Oh ok. Thank you for letting me know elukey. My work on staging is done for now. I am going to start preparing for prod deployments.
[09:42:15] ok super
[10:18:34] I think our inference-services repo is a bit complicated for beginners to learn kserve, so I created a kserve-example repo https://github.com/AikoChou/kserve-example
[10:19:33] There are alexnet image classification (modified from the official doc) and outlink topic model examples, with dockerfiles, so they don't need to understand blubber..
[10:20:30] aiko: I saw it, I think it is a great start! Ideally people, at least at the beginning, should leave aside our repo + k8s + etc.. and focus only on Kserve
[10:20:33] I'll add instructions for testing the models on a local docker instance this afternoon
[10:20:33] so +1 to it!
[10:20:43] I'll review everything once done
[10:20:53] That's great Aiko. Thank you for putting the example together.
[10:21:01] maybe we can add all this work to the inference-services repo?
[10:21:08] and on wikitech I mean
[10:21:19] yep ... with time it will be better there.
[10:22:19] yeah I'm thinking to add it on wikitech
[10:23:08] We should probably organize the Kserve docs splitting the examples between "your service is already in inference-services" vs "you have a new service"
[10:24:08] so the instructions will be on both wikitech and the GitHub readme
[10:24:50] elukey +1 that's a good idea
[10:25:47] ack
[10:25:57] all docker images on staging are now on kserve 0.8
[10:26:08] I'll update the control plane after lunch and make some tests
[10:26:19] if all goes fine we should be able to proceed to prod during the next days
[10:26:44] * elukey lunch
[12:17:56] \o
[13:02:28] deploying kserve 0.8's control plane in staging
[13:03:51] of course the pod doesn't come up
[13:04:16] msg":"unable to get deploy config.","error":"configmaps \"inferenceservice-config\" is forbidden: User \"system:serviceaccount:kserve:default\" cannot get resource \"configmaps\" in API group \"\" in the namespace \"kserve\""}
[13:04:39] ahhh I think there is a new service account yes yes
[13:07:02] weird though
[13:15:46] so in the new kserve yaml there is a service account being renamed
[13:16:22] or better, a new service account being created that is then used in various cluster role bindings etc..
[13:18:51] but in the error above, "default" seems still mentioned
[13:18:52] mmmm
[13:26:32] the new service account should be kserve-controller-manager
[13:26:38] but it seems not used by the new controller
[13:28:05] the change should be https://github.com/kserve/kserve/pull/1996
[13:30:55] I think we are missing https://github.com/kserve/kserve/blob/master/charts/kserve/templates/deployment.yaml#L26
[13:31:03] (kserve now has a helm chart!)
[13:32:56] yeah applied manually, it worked :)
[13:34:05] so staging is fully running kserve 0.8!
[13:36:56] try to play with it, deleting pods etc..
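The 13:04 error names `system:serviceaccount:kserve:default`, which is the account Kubernetes assigns when a pod spec does not set `serviceAccountName`. A minimal Python sketch of that check, assuming a Deployment manifest already loaded into a dict; the function name and the toy manifests are hypothetical, and only the `kserve-controller-manager` name comes from the discussion above:

```python
import copy

# Service account name quoted in the chat; everything else here is illustrative.
EXPECTED_SA = "kserve-controller-manager"


def effective_service_account(deployment: dict) -> str:
    """Return the service account a Deployment's pods will run as.

    Kubernetes falls back to the namespace's "default" service account
    when spec.template.spec.serviceAccountName is not set.
    """
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    return pod_spec.get("serviceAccountName") or "default"


# Toy manifest without serviceAccountName (hypothetical, not the real kserve yaml).
without_sa = {
    "kind": "Deployment",
    "metadata": {"name": "controller", "namespace": "kserve"},
    "spec": {"template": {"spec": {"containers": [{"name": "manager"}]}}},
}

# Same manifest with the field set, mirroring the manual fix described above.
with_sa = copy.deepcopy(without_sa)
with_sa["spec"]["template"]["spec"]["serviceAccountName"] = EXPECTED_SA

if __name__ == "__main__":
    for name, manifest in (("without_sa", without_sa), ("with_sa", with_sa)):
        print(name, "->", effective_service_account(manifest))
```

Role bindings that grant access to `kserve-controller-manager` do nothing for a pod that still runs as `default`, which would match the "forbidden" message on the configmap read.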
[13:41:19] the storage initializer is broken
[13:41:20] mmmm
[13:47:11] it seems protocol buffer related
[13:47:27] and of course since I mounted models directly in the kserve-container while testing I haven't seen this locally
[13:47:32] good note for the future
[14:00:09] ah lovely
[14:01:00] https://github.com/kserve/kserve/pull/2346/files
[14:01:24] I inspected the storage init's docker image and it contains protobuf==4.21.2
[14:02:23] I need to see if I can somehow fix this in our storage init's image
[14:37:55] currently building/testing https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/816792
[14:38:22] So the protobuf dep was missing entirely with our setup? or was it a wrong version?
[14:39:54] it is protobuf==4.21.2 in our version
[14:40:00] without any cap
[14:40:35] I see. That's a whole major version newer than what upstream has
[14:40:55] the fix was backported to the kserve 0.8 branch but no new git tag was created
[14:41:08] so in theory the above should work (still locally building the image, it takes ages)
[14:41:31] if not, the alternative is to change the build docker image to pull the kserve 0.8 branch, not the tag
[14:42:08] That's dangerous though, since it can sneak in updates. I dunno the branch policy of kserve
[14:43:10] they backport new things when explicitly needed I think, but it should be safe since to pick up new changes we'd need to bump the changelog's version of the build image
[14:43:37] Roger
[14:43:49] but we could pin the commit if we wanted to be super sure
[14:43:56] I hope that the above fix is sufficient
[14:44:54] (going to step afk for a few while the docker image builds.. )
[14:45:11] Alright, ttyl
[14:46:45] Good morning!
This is me all day today https://docs.google.com/document/d/1nAEZUzt0sKzL5DkrS52T9j5cE_O3CnyftZ5Q4H3PLYQ/edit#
[14:51:18] \o
[14:51:53] Why did I agree to finish this by tomorrow I will never know
[15:28:41] chrisalbon: good luck :)
[15:28:48] ha thanks
[15:46:22] hope that one comment was helpful
[15:46:33] (and yes, I accidentally made it with my private Goo account)
[16:32:30] the new storage initializer image is ready, but puppet is having some issue so deploy1002 is off limits atm
[16:32:37] will deploy it tomorrow :)
[16:42:59] going afk for the evening in a bit, have a nice rest of the day folks
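The protobuf problem discussed between 14:01 and 14:40 comes down to a dependency with no upper cap resolving to a newer major version than the code expects. A rough Python sketch of such a check, assuming a cap at major version 3 (inferred from the "whole major version newer" remark, not confirmed in the log); the helper names and the 3.19.0 sample version are hypothetical:

```python
# Assumed cap: the log only says upstream is one major version behind 4.21.2.
ASSUMED_MAX_MAJOR = 3


def major_version(version: str) -> int:
    """Extract the leading major number from a dotted version string."""
    return int(version.split(".")[0])


def exceeds_cap(installed: str, max_major: int = ASSUMED_MAX_MAJOR) -> bool:
    """True when the installed version is newer than the allowed major."""
    return major_version(installed) > max_major


if __name__ == "__main__":
    # 4.21.2 is the version found in the storage initializer image.
    print("4.21.2 exceeds cap:", exceeds_cap("4.21.2"))
    # 3.19.0 is only a sample of a version under the assumed cap.
    print("3.19.0 exceeds cap:", exceeds_cap("3.19.0"))
```

Running a check like this against the built image, rather than against a local container with the models mounted directly, would cover the storage initializer path that the local tests skipped.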