[06:10:47] 10Machine-Learning-Team: revscoring model should be able to query an internal mw endpoint - https://phabricator.wikimedia.org/T289778 (10kevinbazira) Thanks for the update @elukey, it's great that the changes worked :) Just a small note - when env variable changes are added to `model.py` we usually also add the... [06:18:29] 10Machine-Learning-Team: revscoring model should be able to query an internal mw endpoint - https://phabricator.wikimedia.org/T289778 (10elukey) >>! In T289778#7317625, @kevinbazira wrote: > Thanks for the update @elukey, it's great that the changes worked :) > > Just a small note - when env variable changes ar... [06:20:38] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Add inference-services CI pipelines to the Zuul gate-and-submit - https://phabricator.wikimedia.org/T289562 (10kevinbazira) [06:20:42] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10kevinbazira) [06:21:15] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10kevinbazira) [06:21:19] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Add inference-services CI pipelines to the Zuul gate-and-submit - https://phabricator.wikimedia.org/T289562 (10kevinbazira) 05Open→03Resolved Thanks @mmodell marking this as resolved. [06:56:52] 10Machine-Learning-Team, 10Patch-For-Review: revscoring model should be able to query an internal mw endpoint - https://phabricator.wikimedia.org/T289778 (10kevinbazira) Thank you for the swift changes @elukey, I have +2'd. > Is it meant to be for us or for the community? This is a pattern we got from the KF... [09:34:42] I added the basic metrics to the Istio dashboard: https://grafana-rw.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-backend=All [09:38:25] Nice! 
[09:39:07] Would it be useful to have a recording rule that strips the random suffix? So e.g. drop -x8dj9 from enwiki-goodfaith-predictor-default-x8dj9 [09:39:36] OTOH, sometimes it's good to know which instance of a model had errors/traffic spikes. [09:43:04] yes yes this is a raw dashboard, plenty of things to trim/drop/etc.., feel free to change it anytime :) [09:45:55] I am not sure if long term we'll need to have separate metrics for the random suffixes (say new versions of InferenceService etc..) [09:46:11] or better, to visualize them on the dashboards cleanly [09:53:52] One option we have is to create a trimmed name at metric-gathering time, and then make separate graphs or even dashboards with/without the random suffix. [09:54:31] I doubt we'll have enough metrics for that to be a metrics/label cross-product explosion that would concern Prometheus itself [09:55:36] 10Lift-Wing, 10Machine-Learning-Team: Add prometheus metrics collection for Istio and Knative - https://phabricator.wikimedia.org/T289841 (10elukey) Added basic metrics to the istio dashboard (rps, bytes, latency) broken down by backend. The dashboard is still very raw but it is returning good info, let's proc... [09:57:04] klausman: yeah but how can we make the connection between revisions for a model and its metrics with the trimmed names? [09:57:51] for example when rolling out new versions etc.. [09:58:01] (hopefully via canarying or similar) [10:18:10] We may need more labels, then. I'll have a look [10:37:17] * elukey lunch! [13:31:49] 10Lift-Wing, 10Machine-Learning-Team: Add network policies to the ML k8s clusters - https://phabricator.wikimedia.org/T289834 (10elukey) It seems that the `GlobalNetworkPolicies` are split into two parts: - global ones (per cluster) that include things like allowing egress between each pod, allow DNS traffic... 
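[Editor's note: the recording rule klausman suggests could use `label_replace` to strip the pod-hash suffix while leaving the original suffixed series untouched, which also answers the revision-tracking concern. A rough sketch; the metric name `istio_requests_total` and the `destination_workload` label are standard Istio telemetry, but the rule name and regex here are assumptions:]

```yaml
groups:
  - name: istio_model_aggregates
    rules:
      # Collapse e.g. enwiki-goodfaith-predictor-default-x8dj9
      # into a "model" label without the random suffix. The raw
      # suffixed series are kept, so per-revision dashboards
      # (canary vs. stable) remain possible.
      - record: model:istio_requests:rate5m
        expr: |
          sum by (model) (
            label_replace(
              rate(istio_requests_total[5m]),
              "model", "$1",
              "destination_workload", "(.+)-[a-z0-9]+$"
            )
          )
```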
[15:29:58] mmm the gate and submit job in https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/715440/ seems stuck [15:30:17] I don't see it in https://integration.wikimedia.org/ci/job/trigger-inference-services-pipeline-editquality/ [16:05:11] -- [16:05:35] I am wondering what is best for helm when defining what a "service" is in that context [16:05:54] for example, in the deployment-charts repo there are: [16:06:21] 1) charts, in which people define the templates/base-values/etc.. of a service (like a debian package if we wanted to make a parallel) [16:06:58] 2) helmfile.d/services, where there is a definition of a chart to be deployed and with what values (secrets from the private repo, prod values for templates, etc..) [16:07:34] in our case, the chart is called "kubeflow-kfserving-inference", and (even if it is missing something) it allows us to define multiple InferenceService specs [16:08:12] this would allow us to say have a "revscoring-editquality" namespace in kubernetes, running various InferenceServices (enwiki-goodfaith, etc..) [16:08:36] all the InferenceService instances will have a pod, and will use the same secret to access swift, etc.. [16:09:22] seems convenient, but I recall that SRE told us to think about a good compromise between having too many things in the same service/namespace and having less hassle when deploying [16:09:29] use cases: [16:10:31] 1) if we change the kubeflow-kfserving-inference chart (say that a parameter in InferenceService is needed, etc..) a new version is created. Deploying that version to a service will mean recycling all pods (that may be cheap, and done via canary, etc.. but still a lot of things changing) [16:11:30] 2) a namespace with a lot of pods may be more difficult to limit (memory/cpu/etc..) in case needed. We should have per-pod limits so it could be nothing for us, but better to keep it in mind. 
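[Editor's note: the "several InferenceServices in one namespace, sharing one swift secret" idea above might translate into chart values along these lines. This is a hypothetical sketch, not the actual kubeflow-kfserving-inference schema; the secret name and storage paths are made up for illustration:]

```yaml
# Illustrative values for a kubeflow-kfserving-inference release:
# one namespace (revscoring-editquality) running several
# InferenceService instances that share the same swift credentials.
inference:
  namespace: revscoring-editquality
  s3SecretName: swift-s3-credentials   # shared by all pods (assumed name)
  services:
    - name: enwiki-goodfaith
      storageUri: s3://wmf-ml-models/goodfaith/enwiki/   # hypothetical path
    - name: enwiki-damaging
      storageUri: s3://wmf-ml-models/damaging/enwiki/    # hypothetical path
```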
[16:12:11] on the other side, adding more granularity may lead to a lot of boilerplate actions when deploying [16:22:23] o/ [16:23:11] good morning :) [16:23:35] elukey: i'm still in favor of the helmfile.d/services approach i think [16:24:07] also i see the gate-and-submit issue....I ran into that last week after doing a rebase, will look into it a bit more [16:25:40] accraze: yes yes what I wrote above is about what directories to create under helmfile.d/services (for us it will be ml-services but it is the same) [16:27:57] yeah i think splitting by model type (i.e. revscoring-editquality, outlinks-topic, etc.) could be helpful since they all will work slightly differently [16:29:35] also some pods will have different architecture, like outlinks has an InferenceService and a Transformer for pre/post-processing [16:30:05] ok so I'll try to proceed with splitting everything by model type [16:35:13] cool that sounds good, if the granularity becomes too much hassle, we could always group all ores/revscoring models together [16:36:08] but i'm in favor of keeping them split by type for now since some of the types (topic models mainly) will be superseded in the near future [16:43:34] yes I think that the granularity will be model-type as a starter, and if we see that it doesn't fit we can add more (not reducing it since we'll end up with a giant namespace) [16:44:18] the details about psp/rbac/deploy-users/etc.. are still very fuzzy to me, so it might take a bit to iron out all details [16:45:19] accraze: in the meantime, can I create more InferenceServices in my test namespace? I mean more goodfaith ones, or is enwiki the only one that we support? [16:45:24] (any language is good) [16:45:40] ah right it depends on what we have on swift of course [16:51:27] soooo is there a way to get another model in there? 
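[Editor's note: the per-model-type split agreed on above could look roughly like the following under helmfile.d; directory and release names here are illustrative guesses, not the final layout:]

```yaml
# Hypothetical layout:
#   helmfile.d/ml-services/
#     revscoring-editquality/
#       helmfile.yaml
#       values.yaml
#     outlinks-topic/
#       helmfile.yaml
#       values.yaml
#
# revscoring-editquality/helmfile.yaml, sketched:
releases:
  - name: revscoring-editquality
    namespace: revscoring-editquality
    chart: wmf-stable/kubeflow-kfserving-inference   # repo alias assumed
    values:
      - values.yaml
```

One upside of this shape: a chart version bump (use case 1 above) only recycles the pods of that one model type, not every model in production.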
:D :D :D [16:51:42] so I'll test istio proxying [16:54:01] ahhh actually i don't think we have anything else in swift yet :( [16:54:59] but if you are feeling good about things, maybe we just move the other 4-5 english model binaries into thanos swift? [16:58:43] accraze: I would like to get 1/2 more goodfaith models, to deploy together with enwiki in the same namespace if possible [16:58:54] (if we have them handy, otherwise no problem) [16:59:04] other english model binaries are good as well [17:33:22] ahh just remembering that i can't ssh into ml-serve still [17:34:24] accraze: IIRC you have the new fingerprints right? If they are matching what ssh returns it may be just a matter of cleaning up old ssh entries and accepting the new ones [17:35:22] your s3 creds should be on ml-serve1001 [17:35:34] aha! got it, just had to remove the offending ECDSA key [17:35:38] (that is not the ideal place but let's just work with it for the moment) [17:35:42] perfect :) [17:36:45] ok so i will upload another couple of editquality model binaries (goodfaith and/or badfaith) today and share the URIs [17:37:24] i'll try a non-enwiki one too in order to make sure our dictionary packages are working [17:41:36] perfect [17:41:41] thanks a lot :) [17:42:01] in the kfserving slack channel somebody asked how new versions of models would be deployed [17:42:40] and upstream suggested to simply change the storage uri path of InferenceService [17:43:02] I've literally had this pizza before https://retailalliance.com/wp-content/uploads/2014/10/CPKsaladpizzaphoto.jpg [17:43:11] so having the timestamp or a unique id of a model in the storage path is great [17:43:34] chrisalbon: it is a stretch to call that thing a pizza :D [17:43:42] right?! [17:45:11] :) [17:45:20] going to log off, have a good day/evening folks! [17:46:24] see ya elukey! [17:48:05] thanks for everything today elukey! 
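[Editor's note: the upstream suggestion above, rolling out a new model version by pointing the InferenceService at a new timestamped storage path, would look roughly like this. A sketch only: the image placeholder and bucket layout are invented, but `STORAGE_URI` is the env var KFServing's storage initializer honours for custom predictors:]

```yaml
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: enwiki-goodfaith
spec:
  predictor:
    containers:
      - name: kfserving-container
        image: <editquality-image>   # placeholder, not a real image name
        env:
          # A timestamped path makes a model version bump a
          # one-line change, and rollback is just the old path.
          # (Bucket layout is hypothetical.)
          - name: STORAGE_URI
            value: s3://wmf-ml-models/goodfaith/enwiki/202108302015/
```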
[22:18:12] wow, was just able to upload a couple of editquality model binaries to swift and it was a really nice experience, super seamless [22:18:27] much better than battling git lfs [22:19:35] either way we have three more editquality models in the bucket now: enwiki-damaging, itwiki-goodfaith and itwiki-damaging [23:22:08] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Fix editquality production pipeline - https://phabricator.wikimedia.org/T289886 (10ACraze) [23:22:11] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10ACraze) [23:22:19] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10ACraze) [23:22:21] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Fix editquality production pipeline - https://phabricator.wikimedia.org/T289886 (10ACraze) 05Open→03Resolved Pipeline is back in working order and publishing images. https://integration.wikimedia.org/ci/job/inference-services-pipeline-editquality/ Marki...