[10:01:23] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (achou) Hi, yesterday I had a meeting with @diego and @MunizaA in the Research Team. We're currently studying ORES models usage from the scores stored...
[12:09:53] \o
[12:10:44] elukey: when you have a moment, I am struggling with the PKI change and PCC
[12:56:29] klausman: sure :)
[12:56:47] https://gerrit.wikimedia.org/r/c/operations/puppet/+/807502 <- So this is the change
[12:56:59] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35998/ <- and this is the PCC run
[12:57:37] It has errors about missing keys (passwords for certs), despite them being present both "for real" on the puppetmaster, as well as in the private repo as dummies
[13:01:37] And I have no idea if that needs fixing or not
[13:01:42] (or what the fix would be)
[13:02:17] ah yeah so puppet wants a sensitive value
[13:02:18] entry 'key' expects a value of type Sensitive[String] or Pattern[/^[a-fA-F0-9]{16}$/]
[13:02:25] we have
[13:02:26] key: xxxxzzzzwwwwwwww
[13:02:38] lemme change it
[13:02:42] But the other stanzas look exactly the same?
[13:03:02] not in the letters used
[13:03:13] check the above Pattern
[13:03:51] Oh, those have to be _hex_
[13:04:11] yep yep
[13:04:12] That is decidedly non-obvious
[13:04:19] https://gerrit.wikimedia.org/r/c/labs/private/+/807544
[13:05:25] ok so pcc now should lead to a better result
[13:05:28] klausman: can you recheck?
[13:05:33] on it
[13:06:25] yes, that works better
[13:06:30] merging the original change
[13:14:06] elukey: https://phabricator.wikimedia.org/P29965 \o/
[13:14:23] nice :)
[13:14:40] I added the last code reviews to the k8s docs
[13:14:45] so others will know better
[13:15:08] thx!
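(Editor's note: the Puppet type error quoted in the log above rejects the dummy key because the Pattern branch only accepts 16 hexadecimal characters. The sketch below reproduces just that Pattern check in Python. The function name `is_valid_dummy_key` and the sample value `0123456789abcdef` are hypothetical and not taken from the actual repositories. The `Sensitive[String]` alternative is not modelled here.)

```python
import re

# Character class and length taken from the Pattern in the Puppet error:
# /^[a-fA-F0-9]{16}$/ (exactly 16 hexadecimal characters).
HEX_KEY = re.compile(r"[a-fA-F0-9]{16}")


def is_valid_dummy_key(value: str) -> bool:
    """Hypothetical helper: True if value is exactly 16 hex characters."""
    # fullmatch anchors at both ends, so no explicit ^ or $ is needed.
    return HEX_KEY.fullmatch(value) is not None


# The dummy from the log has the right length but uses non-hex letters.
print(is_valid_dummy_key("xxxxzzzzwwwwwwww"))
# A made-up value restricted to hex digits.
print(is_valid_dummy_key("0123456789abcdef"))
```

(The dummy `xxxxzzzzwwwwwwww` is 16 characters long, so the length is fine; it fails only because x, z and w are not hex digits, which matches the "not in the letters used" remark.)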
[13:15:26] yep I see the knative-serving-tls-certificate cert successfully generated on k8s
[13:15:29] super
[13:15:39] so I think that the remaining step is to deploy kserve
[13:15:50] It still needs Prometheus setup, I'll do that today/tomorrow
[13:15:53] and then after it, figure out what/how to deploy pods to staging
[13:16:15] it should be done in theory (prometheus)
[13:16:30] Including the volumes and all?
[13:18:09] I'll run the kserve sync in a moment, unless you're already on that :)
[13:18:27] yeah the volumes are missing, but IIRC we had metrics
[13:18:28] mmm
[13:18:56] ah no nevermind
[13:19:02] please go ahead with kserve
[13:19:54] NAME READY STATUS RESTARTS AGE
[13:19:56] kserve-controller-manager-0 1/1 Running 0 39s
[13:19:59] ack so I'll let you do Prometheus tomorrow/anytime
[13:20:01] super
[13:20:21] klausman: can you update the task with missing steps etc.. ?
[13:20:25] so we know what is missing
[13:20:27] kserve log looks clean
[13:20:41] which steps do you mean?
[13:21:28] that we deployed PKI settings, knative (I did it two days ago IIRC) and kserve, and that prometheus setup is missing
[13:21:37] just to keep a note on the task
[13:21:45] As a comment or in the main desc?
[13:22:33] no no I meant a new comment in the task
[13:24:10] ok :)
[13:25:27] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Create the ml-serve-staging k8s cluster - https://phabricator.wikimedia.org/T302195 (klausman) Add'l things done: - PKI setup so the new cluster can be its own CA - Deployed knative - Deployed kserve Still needs to be done:...
[14:26:36] (CR) Klausman: [C: +1] "Mostly looks okay to me with one nit. Also, I am not sure about the implications of that gplnamespace comment. Is that a showstopper? Does" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/807135 (https://phabricator.wikimedia.org/T311043) (owner: AikoChou)