[06:10:47] 10Machine-Learning-Team: revscoring model should be able to query an internal mw endpoint - https://phabricator.wikimedia.org/T289778 (10kevinbazira) Thanks for the update @elukey, it's great that the changes worked :) Just a small note - when env variable changes are added to `model.py` we usually also add the... [06:18:29] 10Machine-Learning-Team: revscoring model should be able to query an internal mw endpoint - https://phabricator.wikimedia.org/T289778 (10elukey) >>! In T289778#7317625, @kevinbazira wrote: > Thanks for the update @elukey, it's great that the changes worked :) > > Just a small note - when env variable changes ar... [06:20:38] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Add inference-services CI pipelines to the Zuul gate-and-submit - https://phabricator.wikimedia.org/T289562 (10kevinbazira) [06:20:42] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10kevinbazira) [06:21:15] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10kevinbazira) [06:21:19] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Add inference-services CI pipelines to the Zuul gate-and-submit - https://phabricator.wikimedia.org/T289562 (10kevinbazira) 05Open→03Resolved Thanks @mmodell marking this as resolved. [06:56:52] 10Machine-Learning-Team, 10Patch-For-Review: revscoring model should be able to query an internal mw endpoint - https://phabricator.wikimedia.org/T289778 (10kevinbazira) Thank you for the swift changes @elukey, I have +2'd. > Is it meant to be for us or for the community? This is a pattern we got from the KF... [09:34:42] I added the basic metrics to the Istio dashboard: https://grafana-rw.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-backend=All [09:38:25] Nice! 
[09:39:07] Would it be useful to have a recording rule that strips the random suffix? So e.g. drop -x8dj9 from enwiki-goodfaith-predictor-default-x8dj9 [09:39:36] OTOH, sometimes it's good to know which instance of a model had errors/traffic spikes. [09:43:04] yes yes this is a raw dashboard, plenty of things to trim/drop/etc.., feel free to change it anytime :) [09:45:55] I am not sure if long term we'll need to have separate metrics for the random suffixes (say new versions of InferenceService etc..) [09:46:11] or better, to visualize them on the dashboards cleanly [09:53:52] One option we have is to create a trimmed name at metric-gathering time, and then make separate graphs or even dashboards with/without the random suffix. [09:54:31] I doubt we'll have enough metrics for that to be a metrics/label cross-product explosion that would concern Prometheus itself [09:55:36] 10Lift-Wing, 10Machine-Learning-Team: Add prometheus metrics collection for Istio and Knative - https://phabricator.wikimedia.org/T289841 (10elukey) Added basic metrics to the istio dashboard (rps, bytes, latency) broken down by backend. The dashboard is still very raw but it is returning good info, let's proc... [09:57:04] klausman: yeah but how can we make the connection between revisions for a model and its metrics with the trimmed names? [09:57:51] for example when rolling out new versions etc.. [09:58:01] (hopefully via canarying or similar) [10:18:10] We may need more labels, then. I'll have a look [10:37:17] * elukey lunch! [13:31:49] 10Lift-Wing, 10Machine-Learning-Team: Add network policies to the ML k8s clusters - https://phabricator.wikimedia.org/T289834 (10elukey) It seems that the `GlobalNetworkPolicies` are split into two parts: - global ones (per cluster) that include things like allowing egress between each pod, allow DNS traffic... 
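[Editor's note: the recording rule klausman suggests could use `label_replace` to strip the pod-hash suffix while leaving the original suffixed series untouched, which also answers the revision-tracking concern. A rough sketch; the metric name `istio_requests_total` and the `destination_workload` label are standard Istio telemetry, but the rule name and regex here are assumptions:]

```yaml
groups:
  - name: istio_model_aggregates
    rules:
      # Collapse e.g. enwiki-goodfaith-predictor-default-x8dj9
      # into a "model" label without the random suffix. The raw
      # suffixed series are kept, so per-revision dashboards
      # (canary vs. stable) remain possible.
      - record: model:istio_requests:rate5m
        expr: |
          sum by (model) (
            label_replace(
              rate(istio_requests_total[5m]),
              "model", "$1",
              "destination_workload", "(.+)-[a-z0-9]+$"
            )
          )
```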
[15:29:58] mmm the gate and submit job in https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/715440/ seems stuck [15:30:17] I don't see it in https://integration.wikimedia.org/ci/job/trigger-inference-services-pipeline-editquality/ [16:05:11] -- [16:05:35] I am wondering what is best for helm when defining what a "service" is in that context [16:05:54] for example, in the deployment-charts repo there are: [16:06:21] 1) charts, in which people define the templates/base-values/etc.. of a service (like a debian package if we wanted to make a parallel) [16:06:58] 2) helmfile.d/services, where there is a definition of a chart to be deployed and with what values (secrets from the private repo, prod values for templates, etc..) [16:07:34] in our case, the chart is called "kubeflow-kfserving-inference", and (even if it is missing something) it allows us to define multiple InferenceService specs [16:08:12] this would allow us to say have a "revscoring-editquality" namespace in kubernetes, running various InferenceServices (enwiki-goodfaith, etc..) [16:08:36] all the InferenceService instances will have a pod, and will use the same secret to access swift, etc.. [16:09:22] seems convenient, but I recall that SRE told us to think about a good compromise between having too many things in the same service/namespace and having less hassle when deploying [16:09:29] use cases: [16:10:31] 1) if we change the kubeflow-kfserving-inference chart (say that a parameter in InferenceService is needed, etc..) a new version is created. Deploying that version to a service will mean recycling all pods (that may be cheap, and done via canary, etc.. but still a lot of things changing) [16:11:30] 2) a namespace with a lot of pods may be more difficult to limit (memory/cpu/etc..) in case needed. We should have per-pod limits so it could be nothing for us, but better to keep it in mind. 
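[Editor's note: the "several InferenceServices in one namespace, sharing one swift secret" idea above might translate into chart values along these lines. This is a hypothetical sketch, not the actual kubeflow-kfserving-inference schema; the secret name and storage paths are made up for illustration:]

```yaml
# Illustrative values for a kubeflow-kfserving-inference release:
# one namespace (revscoring-editquality) running several
# InferenceService instances that share the same swift credentials.
inference:
  namespace: revscoring-editquality
  s3SecretName: swift-s3-credentials   # shared by all pods (assumed name)
  services:
    - name: enwiki-goodfaith
      storageUri: s3://wmf-ml-models/goodfaith/enwiki/   # hypothetical path
    - name: enwiki-damaging
      storageUri: s3://wmf-ml-models/damaging/enwiki/    # hypothetical path
```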
[16:12:11] on the other side, adding more granularity may lead to a lot of boilerplate actions when deploying [16:22:23] o/ [16:23:11] good morning :) [16:23:35] elukey: i'm still in favor of the helmfile.d/services approach i think [16:24:07] also i see the gate-and-submit issue....I ran into that last week after doing a rebase, will look into it a bit more [16:25:40] accraze: yes yes what I wrote above is about what directories to create under helmfile.d/services (for us it will be ml-services but it is the same) [16:27:57] yeah i think splitting by model type (i.e. revscoring-editquality, outlinks-topic, etc.) could be helpful since they all will work slightly differently [16:29:35] also some pods will have different architecture, like outlinks has an InferenceService and a Transformer for pre/post-processing [16:30:05] ok so I'll try to proceed with splitting everything by model type [16:35:13] cool that sounds good, if the granularity becomes too much hassle, we could always group all ores/revscoring models together [16:36:08] but i'm in favor of keeping them split by type for now since some of the types (topic models mainly) will be superseded in the near future [16:43:34] yes I think that the granularity will be model-type as a starter, and if we see that it doesn't fit we can add more (not reducing it since we'll end up with a giant namespace) [16:44:18] the details about psp/rbac/deploy-users/etc.. are still very fuzzy to me, so it might take a bit to iron out all details [16:45:19] accraze: in the meantime, can I create more InferenceServices in my test namespace? I mean more goodfaith ones, or is enwiki the only one that we support? [16:45:24] (any language is good) [16:45:40] ah right it depends on what we have on swift of course [16:51:27] soooo is there a way to get another model in there? 
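[Editor's note: the per-model-type split agreed on above could look roughly like the following under helmfile.d; directory and release names here are illustrative guesses, not the final layout:]

```yaml
# Hypothetical layout:
#   helmfile.d/ml-services/
#     revscoring-editquality/
#       helmfile.yaml
#       values.yaml
#     outlinks-topic/
#       helmfile.yaml
#       values.yaml
#
# revscoring-editquality/helmfile.yaml, sketched:
releases:
  - name: revscoring-editquality
    namespace: revscoring-editquality
    chart: wmf-stable/kubeflow-kfserving-inference   # repo alias assumed
    values:
      - values.yaml
```

One upside of this shape: a chart version bump (use case 1 above) only recycles the pods of that one model type, not every model in production.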
:D :D :D [16:51:42] so I'll test istio proxying [16:54:01] ahhh actually i don't think we have anything else in swift yet :( [16:54:59] but if you are feeling good about things, maybe we just move the other 4-5 english model binaries into thanos swift? [16:58:43] accraze: I would like to get 1/2 more goodfaith models, to deploy together with enwiki in the same namespace if possible [16:58:54] (if we have them handy, otherwise no problem) [16:59:04] other english model binaries are good as well [17:33:22] ahh just remembering that i can't ssh into ml-serve still [17:34:24] accraze: IIRC you have the new fingerprints right? If they are matching what ssh returns it may be just a matter of cleaning up old ssh entries and accepting the new ones [17:35:22] your s3 creds should be on ml-serve1001 [17:35:34] aha! got it, just had to remove the offending ECDSA key [17:35:38] (that is not the ideal place but let's just work with it for the moment) [17:35:42] perfect :) [17:36:45] ok so i will upload another couple of editquality model binaries (goodfaith and/or badfaith) today and share the URIs [17:37:24] i'll try a non-enwiki one too in order to make sure our dictionary packages are working [17:41:36] perfect [17:41:41] thanks a lot :) [17:42:01] in the kfserving slack channel somebody asked how new versions of models would be deployed [17:42:40] and upstream suggested to simply change the storage uri path of InferenceService [17:43:02] I've literally had this pizza before https://retailalliance.com/wp-content/uploads/2014/10/CPKsaladpizzaphoto.jpg [17:43:11] so having the timestamp or a unique id of a model in the storage path is great [17:43:34] chrisalbon: it is a stretch to call that thing a pizza :D [17:43:42] right?! [17:45:11] :) [17:45:20] going to log off, have a good day/evening folks! [17:46:24] see ya elukey! [17:48:05] thanks for everything today elukey! 
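[Editor's note: the upstream suggestion above, rolling out a new model version by pointing the InferenceService at a new timestamped storage path, would look roughly like this. A sketch only: the image placeholder and bucket layout are invented, but `STORAGE_URI` is the env var KFServing's storage initializer honours for custom predictors:]

```yaml
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: enwiki-goodfaith
spec:
  predictor:
    containers:
      - name: kfserving-container
        image: <editquality-image>   # placeholder, not a real image name
        env:
          # A timestamped path makes a model version bump a
          # one-line change, and rollback is just the old path.
          # (Bucket layout is hypothetical.)
          - name: STORAGE_URI
            value: s3://wmf-ml-models/goodfaith/enwiki/202108302015/
```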
[22:18:12] wow, was just able to upload a couple of editquality model binaries to swift and it was a really nice experience, super seamless [22:18:27] much better than battling git lfs [22:19:35] either way we have three more editquality models in the bucket now: enwiki-damaging, itwiki-goodfaith and itwiki-damaging [23:22:08] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Fix editquality production pipeline - https://phabricator.wikimedia.org/T289886 (10ACraze) [23:22:11] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10ACraze) [23:22:19] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10ACraze) [23:22:21] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Fix editquality production pipeline - https://phabricator.wikimedia.org/T289886 (10ACraze) 05Open→03Resolved Pipeline is back in working order and publishing images. https://integration.wikimedia.org/ci/job/inference-services-pipeline-editquality/ Marki...