[11:09:59] accraze: o/
[11:10:07] I have made some changes to the ml-sandbox
[11:10:17] 1) /home dirs now point to /srv/home
[11:10:24] 2) /var/lib/docker now points to /srv/docker
[11:10:37] this should prevent the root partition from filling up, since on /srv we have ~60G
[11:11:11] I stopped minikube and docker before the move, but now I can't bring minikube back up, I don't know how you started it.. When you are online let's look at it (docker works)
[11:42:45] elukey: \o
[11:42:51] elukey: did we ever formulate a plan for using the spinning rust for extra storage that would be visible to (select) Docker containers?
[11:43:11] I don't think we did, and it's probably not urgent
[11:46:39] going afk for lunch!
[11:47:32] but IIRC no we didn't think about it yet
[11:47:38] roger
[14:43:39] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Experiment with the Istio TLS mesh - https://phabricator.wikimedia.org/T297612 (10elukey)
[15:37:27] really interesting video about model monitoring
[15:37:28] https://www.youtube.com/watch?v=_UQ5IV73MW4&ab_channel=MLOps.community
[15:37:50] there are some useful thoughts about what to monitor
[15:38:14] for example, variations in features, variations in the model's predicted values, etc..
[15:39:10] lol I now have a suggested video of Chris Albon explaining what we do at Wikimedia
[15:40:36] elukey, I would be interested in seeing this video from Chris. Can you send its link?
[15:41:24] SimMaig: hi! Sure, https://www.youtube.com/watch?v=TKVR5RYkqnc&ab_channel=MLOps.community
[15:42:38] If you have any other resources from which I could understand how Wikimedia ML works, I'm also interested.
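The monitoring ideas mentioned above (watching for variations in features and in the model's predicted values) can be sketched as a simple drift check between a logged reference sample and a live production sample. This is only an illustration, not anything the team runs: the Population Stability Index shown here is one common choice, and the `reference`/`same`/`shifted` samples are synthetic stand-ins.

```python
import math
import random

def psi(reference, live, bins=10):
    """Population Stability Index between two samples of a numeric feature.

    Commonly cited rule of thumb (not from the chat, just a convention):
    < 0.1 little shift, 0.1-0.25 moderate shift, > 0.25 major shift.
    """
    lo = min(min(reference), min(live))
    hi = max(max(reference), max(live))
    width = (hi - lo) / bins or 1.0  # avoid zero-width bins for constant data

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        # Smooth empty buckets so the log term below stays finite.
        return [(c + 1e-6) / (len(sample) + bins * 1e-6) for c in counts]

    p, q = proportions(reference), proportions(live)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

random.seed(0)
reference = [random.gauss(0.0, 1.0) for _ in range(5000)]  # training-time feature values
same = [random.gauss(0.0, 1.0) for _ in range(5000)]       # production, no drift
shifted = [random.gauss(1.0, 1.0) for _ in range(5000)]    # production, mean shifted by 1 sigma

print(psi(reference, same))     # small: no drift detected
print(psi(reference, shifted))  # large: clear drift
```

The same comparison can be run on predicted values instead of input features, which covers both kinds of variation mentioned in the chat.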
[15:52:19] o/
[15:55:02] elukey: thanks for helping move ml-sandbox to the other partition, I was going to try on Friday but got sidetracked with image upgrade stuff
[15:55:44] last time i started minikube as root with `minikube start --kubernetes-version=v1.16.15 --cpus 4 --memory 8192`
[15:56:08] i'll see if that works, but no worries if not, I have a script that can rebuild the stack easily
[15:59:02] (i should provision more cpu & mem too lol)
[15:59:58] ack lemme know!
[16:00:04] then we can check the routing issue again
[16:02:36] cool cool! i think we can start deploying the upgraded images on ml-serve now too :)
[16:04:56] accraze: I started looking at the istio mesh config, and it is way more complex than I imagined
[16:05:05] hahaha
[16:05:17] uh oh
[16:06:42] so the "easy" way is to allow the pods, via pod security policies, to modify their own networking stack and program iptables directly
[16:07:19] but this opens a security hole, since if a pod gets compromised, it has a broader set of things to change/exploit
[16:07:50] the more secure way is to set up the istio-cni plugin, a binary that runs in its own pod as a DaemonSet
[16:07:54] providing the service to all pods
[16:08:13] so in this way, only the pods running the cni plugin would need the extra security capabilities
[16:08:42] BUT, the istio cni plugin is not very easy to install/deploy/etc.. in our setup
[16:12:44] ahh i see, we would need to port the plugin to our own infra
[16:13:19] yeah
[16:13:37] so for the moment I am inclined to just add a certificate to the istio egress gw
[16:14:00] yeah that's probably fine for now
[16:14:02] from the inference service code we'll need to call the endpoint and set the right header
[16:15:56] do you mean in model.py?
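To make the trade-off above concrete: without the CNI plugin, the istio-init container in every injected pod needs elevated network capabilities, which a PodSecurityPolicy has to grant to all mesh workloads. A hypothetical minimal excerpt (names are illustrative, not from the actual cluster config):

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: istio-sidecar-init
spec:
  # NET_ADMIN/NET_RAW let istio-init rewrite the pod's iptables rules.
  # With the istio-cni plugin, only the CNI DaemonSet pods would need
  # a policy like this, not every workload pod in the mesh.
  allowedCapabilities:
    - NET_ADMIN
    - NET_RAW
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
    - "*"
```

This is why the chat calls the PSP route a security hole: a compromised pod inherits those capabilities.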
[16:17:10] yes, like we do at the moment with api-ro.discovery.wmnet
[16:17:20] we'll use the k8s egress svc endpoint
[16:17:40] that will know, for example, to map *.wikipedia.org to api-ro.discovery.wmnet
[16:17:45] but we'll need to set the right env variables
[16:18:38] ahh ok i see what you are saying now
[16:19:13] i think that's probably fine for now
[16:27:59] 10Machine-Learning-Team, 10Analytics-Radar, 10Patch-For-Review: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (10odimitrijevic)
[16:35:31] rebuilding ml-sandbox cluster real quick, there were some issues moving to the other partition
[16:35:47] ah snap
[16:35:58] lol no worries elukey!
[16:38:14] i've got a lot of practice, each time with more automation involved :D
[16:41:12] the only manual part left is editing the knative configmap to add `registriesSkippingTagResolving`
[16:46:42] cool ml-sandbox is back up
[17:05:33] Morning!
[17:05:41] Last week for me, time to power through
[17:23:35] ^ same here - last week of 2021, we got this :)
[17:30:45] accraze: if you want to deploy go ahead :)
[17:40:32] ok gonna try it now
[17:46:23] awesome, new editquality image is deployed and i was able to get a prediction back!
[17:49:05] \o/
[17:49:14] and you are going through the egress gw
[17:49:26] (still via http not great, I'll upgrade to https tomorrow morning)
[17:49:46] great work :)
[17:49:56] doing draftquality now
[17:50:00] :clapclap:
[17:50:05] the migration to kserve 0.7 was painful but really needed
[17:50:07] thanks :)
[17:51:59] accraze: for the moment don't deploy to codfw please, I'd need to deploy the egress settings in there etc..
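The pattern described above (call the egress service endpoint but set the Host header so the gateway can map *.wikipedia.org to api-ro.discovery.wmnet, with both values coming from env variables) might look roughly like this in model.py. The env variable names (`EGRESS_URL`, `MW_HOST`) and default values are hypothetical placeholders, not the real chart settings:

```python
import os
import urllib.parse
import urllib.request

# Hypothetical env vars: the real names/values would be injected by the
# deployment chart. EGRESS_URL points at the k8s egress gateway service;
# MW_HOST is the logical MediaWiki hostname the gateway routes on.
EGRESS_URL = os.environ.get("EGRESS_URL", "http://istio-egressgateway.example:8080/w/api.php")
MW_HOST = os.environ.get("MW_HOST", "en.wikipedia.org")

def build_api_request(params: dict) -> urllib.request.Request:
    """Build a MediaWiki API request routed via the egress gateway.

    The request goes to the gateway's URL, while the Host header carries
    the wiki hostname; the gateway uses that header to pick the backend
    (e.g. api-ro.discovery.wmnet).
    """
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(
        f"{EGRESS_URL}?{query}",
        headers={"Host": MW_HOST, "User-Agent": "inference-service-sketch"},
    )

req = build_api_request({"action": "query", "format": "json"})
print(req.get_header("Host"))
print(req.full_url)
```

Only the request construction is shown; actually sending it (and moving to https once the egress gateway has a certificate) is left out.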
[17:53:01] ah ok noted, i'm just doing `helmfile -e ml-serve-eqiad [...]` for now
[17:53:34] aaand looks like draftquality is good too :)
[17:53:49] perfect :)
[17:55:15] will send a CR later today for articlequality + transformer
[17:55:42] we may need to update the helmfile a bit to support the transformer
[17:59:57] in theory it should be flexible enough that only values.yaml should be changed
[18:00:00] in theory :)
[19:54:11] ok so digging around in the deployment-charts repo, it seems we may need to update the kserve-inference chart to support transformers
[19:54:29] somewhere like here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/kserve-inference/templates/services.yaml#21
[19:55:01] we'll need to add `transformer` to the spec, but it will be fairly similar to the `predictor`
[19:55:49] i'm going to do a CR for just the articlequality predictor image upgrade first though
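For context on the chart change being discussed: in the KServe v1beta1 API, `transformer` is a sibling of `predictor` in the InferenceService spec and takes a very similar container block, which is what the chart template would need to render. A hypothetical sketch of the target output (image names and the service name are made up for illustration):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: articlequality
spec:
  predictor:
    containers:
      - name: kserve-container
        image: docker-registry.example/articlequality-predictor:latest
  # New block the kserve-inference chart would emit when a transformer
  # is configured; requests pass through it before/after the predictor.
  transformer:
    containers:
      - name: kserve-container
        image: docker-registry.example/articlequality-transformer:latest
```

If the chart template already loops over values.yaml entries, adding this block may indeed only require a values change, as hoped in the chat.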