[07:09:11] Machine-Learning-Team: ML Serve controller vms show a slowly increasing resource usage leak over time - https://phabricator.wikimedia.org/T287238 (elukey) >>! In T287238#7240394, @elukey wrote: > Deployed the new iptables to all ML buster clusters, preliminary results look really good. Will wait for a day be...
[07:15:40] good morning! latency/cpu regression gone with the new iptables!
[08:48:02] hello folks
[08:48:09] created a simple cookbook to roll restart ores - https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/708478
[08:48:18] (we need to do it today after some upgrades)
[08:48:30] very simple, nothing fancy
[10:41:28] the knative chart has been deployed! There is an issue with the docker image that I'll fix after lunch, but it looks promising
[10:41:30] * elukey lunch
[14:47:02] Lift-Wing, Machine-Learning-Team (Active Tasks): Inference Clients - https://phabricator.wikimedia.org/T287051 (kevinbazira) a:kevinbazira I tried generating an `authservice_session` cookie from KFServing v1.3 dex auth using `curl` and kept running into a "CSRF check failed" issue. I didn't face this...
[15:16:23] knative-serving pods running!! \o/
[15:16:33] (not entirely sure if all works of course)
[15:24:47] woo!
[15:26:37] just filed a change for the manual fixes that I did on the cluster (to test) - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/708545
[15:27:21] but so nice to see
[15:27:23] knative-serving activator-867d54cc88-vwdpt 1/1 Running 2 20m
[15:27:27] knative-serving autoscaler-cfc4cc49f-zbvmv 1/1 Running 0 13m
[15:27:30] knative-serving controller-784f95f8df-m4djc 1/1 Running 0 12m
[15:27:33] knative-serving istio-webhook-b8854d86f-4lxh2 1/1 Running 0 11m
[15:27:36] knative-serving networking-istio-857f9bbdf6-bdcp5 1/1 Running 0 11m
[15:27:39] knative-serving webhook-5bf64fb48d-qvr8x 1/1 Running 0 11m
[15:35:35] ahhaha of course others had the same issue and resolved with a nice template
[15:35:51] bad Luca added raw variables to the yamls
[15:35:54] :D
[15:40:36] whatever works, my man
[15:41:48] Plus, as Janis pointed out: it only arrived today
[15:53:50] yeah but in a second my patch looked horrible :D
[15:55:53] ah we have to roll restart ores
[15:56:27] I created https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/708478 but I am not 100% sure that the last patch is fully correct
[15:59:01] elukey: wow! that's really exciting!
[16:00:03] accraze: \o/ - the kfserving chart is in a good state (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/700470) but the helm linters don't like it
[16:04:20] that's so cool, we are really close!
[16:04:40] for kfserving there is the question mark about the tls cert for the webhook
[16:04:55] it is mandatory, so we'll have to come up with something
[16:05:38] ohhh right, hmmm
[18:02:32] BOOM, anyone around to +2 this? https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/708352
[18:02:52] this will publish the editquality image
[18:03:47] yep done
[18:04:00] awesome thanks!
[18:07:43] hell yeah it's up
[18:07:53] `docker pull docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services:2021-07-28-175322-production`
[18:08:11] yessssss
[18:08:29] gotta change the name to specify that this the `editquality` image but that's easy
[18:10:54] major props to kevinbazira for figuring out how to containerize everything and making the initial dockerfile
[18:27:37] https://docker-registry.wikimedia.org//wikimedia/machinelearning-liftwing-inference-services/tags/
[20:34:44] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (ACraze) We have successfully ran the editquality pipeline and have published our editquality image: https://docker-registry.wikimedia.org//wikim...
[20:37:56] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (ACraze)