[06:43:57] Lift-Wing, Machine-Learning-Team (Active Tasks): API Gateway Integration - https://phabricator.wikimedia.org/T288789 (elukey) >>! In T288789#7426831, @ACraze wrote: >> Currently only services with a discovery configuration can be routed to. > @elukey: I just found the above in the API Guidelines: https:...
[07:25:23] https://github.com/kserve/kserve/releases/tag/v0.7.0 - first kserve release out :)
[07:29:01] so I was reviewing the kfserver.py code that we use for our model.py
[07:29:22] and it seems to use Tornado, so the http server should be based on an event loop
[07:29:57] there is the possibility to set the number of workers, and I guess that means creating multiple separate ioloops
[07:30:20] we use the default of 1 worker, which should be enough for our use case
[07:30:31] https://github.com/kserve/kserve/pull/1687/files was merged with kfserving==0.6
[07:30:45] that seems pretty useful, meanwhile we are still using kfserving==0.3.0
[07:31:14] now that the first kserve version is out we should probably migrate everything to it
[07:44:33] but maybe we could attempt to upgrade our docker images to kfserving==0.6.0 first
[07:56:19] I am also wondering one thing
[07:57:27] would it be worth creating a revscoring/common dir in the inference-services repo that contains a shared model.py?
[07:58:08] I am checking the code and it looks very similar, we could then symlink it into the various subdirs
[07:58:36] the aim is to avoid code repetition and to share a single version of model.py
[08:00:29] Lift-Wing, Machine-Learning-Team: Review egress rules for ml-serve cluster - https://phabricator.wikimedia.org/T284091 (elukey)
[08:00:38] Lift-Wing, Machine-Learning-Team: Add network policies to the ML k8s clusters - https://phabricator.wikimedia.org/T289834 (elukey)
[08:01:21] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Install Istio on ml-serve cluster - https://phabricator.wikimedia.org/T278192 (elukey) >>! In T278192#7215471, @elukey wrote: > Things to do before closing: > 1) Do we need to add a custom TLS certificate for istiod? If not added then istiod creates...
[08:02:18] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Install KFServing standalone - https://phabricator.wikimedia.org/T272919 (elukey)
[08:02:41] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Install Knative on ml-serve cluster - https://phabricator.wikimedia.org/T278194 (elukey) Open→Resolved, a: elukey. Marking this as completed, metrics will be added in T289841.
[08:19:06] Lift-Wing, Machine-Learning-Team (Active Tasks): Migrate from Kfserving to Kserve - https://phabricator.wikimedia.org/T293331 (elukey)
[08:44:38] need to run an errand for a bit, ttl!
[14:22:22] Machine-Learning-Team, Observability-Logging: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (fgiunchedi) Following today's deploy of knative the errors are back, unclear though what's causing them (they will clear at 12 UTC when indices roll over)
[14:28:59] Machine-Learning-Team, Observability-Logging: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (fgiunchedi) To follow up, the issue is that `knative_dev/key` normally is a text field, but sometimes (unclear yet when) it is a nested field and thus can't be indexe...
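Going back to the kfserver.py observations above (Tornado HTTP server, event loop, configurable worker count): below is a minimal sketch of how a model.py typically plugs into that server. It assumes the kfserving 0.x Python API (a KFModel subclass with load/predict, and KFServer taking a workers argument); the class name and model name are made up for illustration.

```python
# Hypothetical model.py sketch against the kfserving 0.x API; not the
# actual inference-services code.
import kfserving


class ExampleRevscoringModel(kfserving.KFModel):
    def __init__(self, name: str):
        super().__init__(name)
        self.ready = False

    def load(self):
        # Load the revscoring model from disk here.
        self.ready = True

    def predict(self, request: dict) -> dict:
        # Feature extraction and scoring would go here.
        return {"predictions": []}


if __name__ == "__main__":
    model = ExampleRevscoringModel("enwiki-goodfaith")
    model.load()
    # KFServer runs a Tornado HTTP server on an event loop (ioloop); the
    # workers argument forks multiple server processes, each with its own
    # ioloop. The chat above refers to the default of 1 worker.
    kfserving.KFServer(workers=1).start([model])
```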
[14:35:49] wow I am very surprised but I just moved our stack to kserve 0.7
[14:35:51] all works :)
[14:39:14] Machine-Learning-Team, Observability-Logging: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (fgiunchedi) I was wrong re: recovery at midnight, the errors have stopped now: {F34688617}
[14:59:40] Machine-Learning-Team, Observability-Logging: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (elukey) All related to what I was doing, some pods were down and knative was probably complaining about its revisions. It seems that this mess happens when the knati...
[15:21:36] going to take a break!
[16:04:36] o/
[16:05:37] elukey: nice work on kserve!
[16:06:02] i was assuming we would need to upgrade k8s versions first
[16:24:15] accraze: o/
[16:24:50] nono, it works nicely with the current istio + knative + k8s versions, but we'd need to upgrade our model.py to kserve==0.7.0
[16:25:17] it works even with the current docker images, but there are things like https://github.com/kserve/kserve/pull/1687/files that we are missing
[16:25:46] do you think it is feasible to move from kfserving==0.3.0 to kserve==0.7.0?
[16:34:02] yeah i think it should be feasible, my only concern is whether we'll need to upgrade the sandbox clusters to kserve too...
[16:34:58] hmmm, this makes me think we should begin to move off the minikf sandboxes in the nearish future
[16:37:54] accraze: is there anything that we currently do on minikf that cannot be done on ml-serve-eqiad?
[16:38:11] modulo of course the first tests for helm etc..
[16:41:24] lol was just thinking that, the only thing is looking at how metrics are set up (which i will dig into again later today)
[16:41:47] but other than that, i think we could just use ml-serve-eqiad
[16:42:54] exactly yes, minikf may become a burden if we keep it alive
[16:47:04] accraze: another qs, totally unrelated - I was wondering this morning if we could have something like revscoring/common/model.py in the inference-services repo, and then symlink it
[16:47:47] article quality is the only one with a different model.py (there is also the difference between model.bin and model.bz2, but we can add a line of code to generalize model.py)
[16:49:53] yeah i think there is a strong case to clean up the duplicated revscoring model server code
[16:50:52] there was talk about extracting the MW API call to a generic preprocessing transformer as well: https://phabricator.wikimedia.org/T285909
[16:51:35] ah okok
[16:53:22] but now that you mention it, the rest of the `predict` method is very similar (minus the model.bin/model.bz2 difference that you pointed out)
[17:00:52] just to avoid copy/paste across multiple files when not needed
[17:19:34] Machine-Learning-Team, artificial-intelligence, editquality-modeling, Hindi-Sites: Train and test editquality models for Hindi Wikipedia - https://phabricator.wikimedia.org/T252581 (Halfak) Thanks to @Nikhil1194's work, we have an initial pair of editquality models ready for Hindi Wikipedia. Se...
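On the kfserving==0.3.0 → kserve==0.7.0 question above, the model.py side of the migration should mostly be the package rename. A rough sketch, assuming the kserve 0.7 Python package still exports the KFModel/KFServer names (class and model names are illustrative):

```python
# Before, with kfserving==0.3.0:
#   import kfserving
#   class EditQualityModel(kfserving.KFModel): ...
#   kfserving.KFServer(workers=1).start([model])

# After, with kserve==0.7.0 (assuming KFModel/KFServer are still the
# exported names; only the package import changes here):
import kserve


class EditQualityModel(kserve.KFModel):
    def load(self):
        # Model loading stays the same as in the kfserving version;
        # predict()/preprocess() would be unchanged as well.
        self.ready = True


if __name__ == "__main__":
    model = EditQualityModel("enwiki-goodfaith")
    model.load()
    kserve.KFServer(workers=1).start([model])
```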
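And for the revscoring/common/model.py idea discussed in both conversations above, a hypothetical sketch of what a shared, generalized model server could look like, including the one-line model.bin/model.bz2 generalization mentioned at 16:47. The file layout, class name, and helper names are invented for illustration; each service directory would then symlink the shared file (e.g. `ln -s ../common/model.py`).

```python
# Hypothetical revscoring/common/model.py; illustrates the "single shared
# model.py, symlinked into each service subdir" idea, not the real code.
import bz2

import kfserving  # would become `import kserve` after the 0.7 migration
from revscoring import Model


def load_revscoring_model(path: str) -> Model:
    # Generalize over the model.bin vs model.bz2 difference with one
    # branch, as suggested in the chat above.
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path) as f:
        return Model.load(f)


class RevscoringModel(kfserving.KFModel):
    def __init__(self, name: str, model_path: str):
        super().__init__(name)
        self.model_path = model_path
        self.ready = False

    def load(self):
        self.model = load_revscoring_model(self.model_path)
        self.ready = True

    def predict(self, request: dict) -> dict:
        # The per-model feature extraction (the MW API call) and
        # self.model.score(...) would go here; T285909 discusses moving
        # the MW API call into a generic preprocessing transformer.
        raise NotImplementedError
```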
[17:35:57] I started https://wikitech.wikimedia.org/wiki/User:Elukey/MachineLearning/Deploy, I'll try to get to its first draft tomorrow :)
[17:36:17] (we'll find a more stable/canonical place when we finalize it)
[17:37:30] I don't want to repeat too much of what is listed in https://wikitech.wikimedia.org/wiki/Deployment_pipeline
[17:45:03] niiiiice
[19:32:23] starting to think through our `MODEL_NAME => host` mappings -- i think we can just do if/else using helm templating
[19:35:46] it could get a bit verbose though...
[22:31:10] Lift-Wing, Machine-Learning-Team (Active Tasks): API Gateway Integration - https://phabricator.wikimedia.org/T288789 (ACraze)
[22:31:12] Lift-Wing, Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (ACraze)
[23:43:59] Lift-Wing, Machine-Learning-Team (Active Tasks): API Gateway Integration - https://phabricator.wikimedia.org/T288789 (ACraze) > Since we are not production ready I'd love to avoid this until the last moment, but we could push the trigger anytime in theory. Ah yeah good call on this, let's avoid pinging...