[07:28:34] good morning!
[07:32:06] accraze: in theory it is the controller that needs to fetch data from the registry
[07:32:30] it converts the docker images to digests, so it can reliably have revisions
[07:35:21] kubectl get events -n kserve-test
[07:35:32] Unable to fetch image "docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-editquality:2021-07-28-204847-production": failed to resolve image to digest: Get "https://docker-registry.wikimedia.org/v2/": x509: certificate signed by unknown authority
[07:40:58] I am wondering if it is due to the absence of ca-certificates on the docker image
[07:41:38] we have wmf-certificates
[07:45:37] mmm it seems to have it
[07:49:58] ok I used a temporary hack, namely `kubectl edit configmap config-deployment -n knative-serving` -> added registries skipping tag resolving etc..
[07:50:07] I see the pods coming up
[07:51:31] ok the storage-initializer fails since it doesn't find credentials
[07:53:39] at this point we'd need a fake s3 endpoint
[11:57:36] Lift-Wing, Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (elukey)
[11:57:46] * elukey lunch!
[11:58:08] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Add network policies to the ML k8s clusters - https://phabricator.wikimedia.org/T289834 (elukey) Open→Resolved a: elukey
[14:45:19] Lift-Wing: Workflow to upload models to Swift - https://phabricator.wikimedia.org/T294409 (elukey) This was completed with: https://gerrit.wikimedia.org/r/c/operations/puppet/+/736848/ Tested by multiple people, worked fine, closing task!
[14:46:06] Lift-Wing: Workflow to upload models to Swift - https://phabricator.wikimedia.org/T294409 (elukey) Open→Resolved a: elukey This was completed with: https://gerrit.wikimedia.org/r/c/operations/puppet/+/736848/ Tested by multiple people, worked fine, closing task!
[14:46:08] Lift-Wing, Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (elukey)
[14:51:59] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Naming convention for the model storage structure - https://phabricator.wikimedia.org/T280467 (elukey) Anything left to do @ACraze ?
[14:53:52] Lift-Wing: Bootstrap the ml-serve-codfw cluster - https://phabricator.wikimedia.org/T294412 (elukey) a: klausman
[15:12:34] o/
[15:44:59] Lift-Wing: Workflow to upload models to Swift - https://phabricator.wikimedia.org/T294409 (ACraze)
[15:45:01] Lift-Wing, Machine-Learning-Team (Active Tasks): Find a way to store models for Kubeflow - https://phabricator.wikimedia.org/T280025 (ACraze)
[15:45:23] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Naming convention for the model storage structure - https://phabricator.wikimedia.org/T280467 (ACraze) Open→Resolved a: ACraze @elukey - I think we are all good on this task. Major thanks to @Theofpa for helping us out on t...
[15:47:47] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Install Istio on ml-serve cluster - https://phabricator.wikimedia.org/T278192 (elukey) Open→Resolved a: elukey ` elukey@ml-serve-ctrl1001:~$ kubectl get secrets istio-ca-secret -o jsonpath="{.data.ca-cert\.pem}" -n istio-system | base64...
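
For reference, a minimal sketch of the 07:49 workaround as a non-interactive edit, assuming a Knative release that uses the kebab-case key in the config-deployment ConfigMap (older releases use the camelCase key registriesSkippingTagResolving); the registry value shown is only illustrative:

    # Tell Knative not to resolve tags to digests for this registry
    # (assumed key name; check the Knative version in use).
    kubectl -n knative-serving patch configmap config-deployment \
      --type merge \
      -p '{"data":{"registries-skipping-tag-resolving":"docker-registry.wikimedia.org"}}'
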
[15:47:51] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Install Knative on ml-serve cluster - https://phabricator.wikimedia.org/T278194 (elukey)
[15:47:54] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Install KFServing standalone - https://phabricator.wikimedia.org/T272919 (elukey)
[15:54:40] Machine-Learning-Team: Investigate separating k8s-level users between our k8s and the ServiceOps k8s - https://phabricator.wikimedia.org/T277492 (elukey) This has been done in other tasks, but the infrastructure users are shared between eqiad and codfw. I personally think that it should be fine for our use c...
[15:55:27] Machine-Learning-Team, ORES: Add approvals on Github for all the ORES-related repositories - https://phabricator.wikimedia.org/T281711 (elukey) Open→Resolved a: elukey Closing since I think no more actions are pending, please re-open if I have missed something!
[16:10:15] elukey: thanks for digging into the x509 issue, i plan to dig into the model storage piece today (maybe pvc for local dev?)
[16:12:49] accraze: o/ still not clear why it was complaining about the certs, but for local dev I think that skipping the tag->digest conversion is fine
[16:16:49] I was wondering today about the Redis nodes that will represent our online feature store (scheduled for next quarter)
[16:17:14] feast in my mind should be used when we need to precompute things, and schedule periodic refreshes via airflow etc..
[16:17:37] but then we could also use the Redis nodes as a cache, bypassing the feast API basically, for simpler use cases
[16:17:43] does it make sense?
[16:17:57] for example, fetching a certain rev_id
[16:18:20] the transformer could simply try to fetch from Redis, and if it doesn't find anything, use the MW api and store the value
[16:18:34] ^yup that is how i was thinking about it as well
[16:18:40] ack perfect
[16:18:55] it would be great to avoid 100 calls to the mw api if one requests the same thing 100 times
[16:19:06] LOL agreed
[16:19:31] we'll still need to figure out a more efficient way of handling large blobs of article text
[16:20:33] accraze: one question - in the article revscoring use case, IIUC we need to fetch all revisions/edits of an article since the beginning of time before scoring it?
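
A minimal sketch of the cache-aside lookup discussed above (the transformer tries Redis first, and on a miss calls the MW API and stores the result), assuming redis-py; the key naming, TTL and the fetch_from_mw_api() helper are hypothetical, not the actual Lift Wing transformer code:

    import json

    import redis

    r = redis.Redis(host="localhost", port=6379)
    CACHE_TTL = 3600  # seconds; tune to how stale a cached feature may be

    def fetch_from_mw_api(rev_id):
        # Hypothetical placeholder for the real MediaWiki API call.
        raise NotImplementedError

    def get_features(rev_id):
        key = f"revscoring:feature:{rev_id}"
        cached = r.get(key)
        if cached is not None:
            # Cache hit: no MW API round trip for a rev_id we already saw.
            return json.loads(cached)
        features = fetch_from_mw_api(rev_id)
        # Cache miss: store the value so 100 requests for the same rev_id
        # translate into a single MW API call.
        r.setex(key, CACHE_TTL, json.dumps(features))
        return features

Keeping the lookup-then-fallback logic in the transformer also means the feast API can be bypassed entirely for the simpler use cases, as described above.
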
[16:21:57] anyway, one possible way to handle edits, say for enwiki, could be to listen to the revision changes kafka topic, filter for enwiki and apply the edit to the version of the article in the feature store
[16:22:18] and keep a certain number of articles of course
[16:22:36] hmm i believe this is correct, but i'd need to double check, it might just work at the current revision level
[16:23:07] asking because I am super ignorant about it, didn't check what model.py is doing
[16:23:34] but yeah i see what you are saying, we could always have a fresh score available if we listen for revision changes
[16:24:31] scoring or simply caching the feature
[16:24:46] better - caching the most up-to-date version of the feature
[16:25:24] but it is also fine as a starter to just do very simple use cases first, and use the mw-api for fresh data
[16:26:15] a lot of things to brainstorm, we should start doing some group discussions now that we are reaching a good stage of the stack
[16:26:43] absolutely, i think we are getting to a point where group brainstorms will become really valuable
[16:30:20] i know there was some difficulty with the caching layer for ORES in the past
[16:30:53] using something like redis sentinel may have alleviated some of it, but it was never implemented
[16:38:00] there are also proxies like twemproxy that can act as client proxies and handle redis shards transparently
[16:38:09] but the logic is on the client side
[17:27:24] * elukey afk!
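
A rough sketch of the 16:21 idea (listen to the revision changes topic, filter for enwiki, refresh the cached feature), assuming kafka-python and redis-py; the topic name, broker address, event field names and compute_features() helper are assumptions for illustration, not the production setup:

    import json

    import redis
    from kafka import KafkaConsumer

    r = redis.Redis(host="localhost", port=6379)

    consumer = KafkaConsumer(
        "eqiad.mediawiki.revision-create",        # assumed topic name
        bootstrap_servers=["kafka-broker:9092"],  # placeholder broker
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    def compute_features(event):
        # Hypothetical placeholder for the revscoring feature extraction.
        return {"rev_id": event["rev_id"]}

    for message in consumer:
        event = message.value
        if event.get("database") != "enwiki":
            continue  # only keep a subset of wikis, as discussed above
        # Refresh the cached feature so the transformer finds an up-to-date
        # value instead of having to call the MW API for fresh data.
        features = compute_features(event)
        r.set(f"revscoring:feature:{event['rev_id']}", json.dumps(features))

Since the client-side logic stays this simple, putting Sentinel or a proxy like twemproxy in front of the Redis nodes later would not change the consumer code itself.
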