[09:12:50] * elukey errand for a bit!
[11:38:13] * elukey lunch!
[14:48:38] After a chat with Janis today I have more questions about istio in mind, especially mTLS for intra-mesh traffic
[14:49:28] it has been bugging me that the current gateway PoC I am running works only via http (querying endpoints like the MW api via https though), and to turn it into https I'd need to manually generate the cert (using the puppet CA), upload it, etc.
[14:49:32] https://istio.io/latest/docs/tasks/security/cert-management/plugin-ca-cert/
[14:49:51] up to today I was convinced that adding the cert-manager dependency was the only way to provision certs
[14:50:17] but I was wrong, istiod's self-signed CA (that we are using now) can issue certificates as well
[14:50:38] for intra-mesh traffic (pods -> gateways, etc.) it may be an easy solution
[14:51:25] the article talks about having an intermediate certificate from the main root PKI, but in our case the self-signed one that istiod provisions should be fine as well
[14:52:20] the serviceops team is testing cert-manager to provision certs from our PKI, but not for mTLS/intra-mesh traffic, only for the gateway endpoint ones (in our case inference.discovery.wmnet, they have way more)
[14:53:02] I am not sure how difficult it is to enable the istio mTLS service mesh
[14:53:07] but we could test it
[15:04:57] cert-manager can handle these certs - https://www.jetstack.io/blog/cert-manager-istio-integration/
[15:05:16] but since we'll have only one main gateway endpoint (inference.discovery.wmnet) it seems overkill
[16:09:00] o/
[16:12:54] elukey: i like this idea of using istiod to issue certs
[16:13:01] morning :)
[16:13:31] what happens, IIUC, is that a new pod issues a certificate signing request using its token (shared with istiod I think)
[16:13:51] then istiod issues the cert with its self-signed root CA
[16:14:17] what we could do, as an alternative, is to have a PKI intermediate CA deployed to istiod
[16:14:24] issued by the WMF root PKI
[16:14:42] but not sure how much security we gain, our cluster is already self-contained
[16:14:56] hmmm yeah that's a good point
[16:15:20] im ok with not adding cert-manager if we can avoid it, not sure how much that would benefit us at this point
[16:16:02] also mTLS might help us with transformers, explainers etc
[16:18:31] the network policy rules would need to be tweaked, and the sidecars will need to talk with istiod, but it should be doable
[16:18:42] without horrors in yaml
[16:18:50] LOL
[16:19:20] the main benefit is that, IIUC, something like the egress gw will be more transparent
[16:19:44] yeah it would be more organized and easy to reason about
[16:19:45] for example, we could tell istio how to proxy *.wikipedia.org, like the mw api
[16:20:00] and it would be handled by the sidecar, which will know to contact the egress gw
[16:20:51] whereas now we need to specify the k8s service endpoint for the egress gw, with the right host header
[16:21:22] it would also be testable everywhere, with istiod's self-signed ca
[16:21:35] I would proceed in this way
[16:22:03] 1) add the sidecar pods + network rules, using istiod's ca. See how painful it is, and how it works
[16:22:26] 2) review the architecture with serviceops, and in case either switch to cert-manager or to a dedicated PKI intermediate
[16:22:32] does it sound ok?
[16:22:45] a little more work, but 1) is a cleaner solution to egress
[16:23:09] that sounds reasonable
[16:23:19] would this change our dev setup?
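For reference, the mesh-wide mTLS from step 1 above can be as small as a single PeerAuthentication object in istio's root namespace. A minimal sketch, assuming the default istio-system root namespace and that the sidecars are already injected:

```
# Mesh-wide strict mTLS: applied in istio's root namespace, the policy
# covers every sidecar-injected workload in the mesh.
kubectl apply -f - <<'EOF'
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # the root namespace makes this mesh-wide
spec:
  mtls:
    mode: STRICT            # pod-to-pod traffic must use istiod-issued workload certs
EOF
```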
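And if step 2 ends with a dedicated PKI intermediate, the plugin-ca-cert doc linked above boils down to populating a `cacerts` secret in istio-system. A sketch, assuming the intermediate cert/key and the WMF root cert already exist on disk under these (illustrative) file names:

```
# ca-cert.pem / ca-key.pem: the PKI-issued intermediate cert and its key
# root-cert.pem: the WMF root CA; cert-chain.pem: intermediate + root chain
kubectl create secret generic cacerts -n istio-system \
  --from-file=ca-cert.pem \
  --from-file=ca-key.pem \
  --from-file=root-cert.pem \
  --from-file=cert-chain.pem
# istiod only reads cacerts at startup, so restart istiod after creating it
```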
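The "*.wikipedia.org handled by the sidecar" idea above would start from a ServiceEntry like the sketch below; note that actually routing the traffic through the egress gw also needs a Gateway + VirtualService pair, per istio's egress-gateway docs. Names and ports here are illustrative:

```
kubectl apply -f - <<'EOF'
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: wikipedia-org        # illustrative name
spec:
  hosts:
  - "*.wikipedia.org"        # e.g. the mw api
  location: MESH_EXTERNAL
  resolution: NONE           # wildcard hosts cannot be DNS-resolved
  ports:
  - number: 443
    name: tls
    protocol: TLS
EOF
```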
[16:24:19] yeah probably, but in theory you should only see more pods, nothing else
[16:25:24] cool, that shouldn't be a problem then
[16:26:19] famous last words :D
[16:26:29] haha yeah i was literally thinking that
[16:28:47] i will say that /dev/sda1 on ml-sandbox keeps filling up and requiring me to prune all the leftover images/containers
[16:28:58] mostly due to the large size of some of the revscoring images
[16:29:48] can we use the other partition?
[16:30:00] yeah that's what i was thinking
[16:30:31] I created it to avoid this problem, placed it under /srv but we can mount it anywhere
[16:30:34] it has 60G
[16:30:48] ahhh i see that now, niiiiice
[16:30:49] so 3x the space, without any extra os stuff
[16:31:49] hah 1% use, vs 90% use
[16:43:21] elukey: just saw your note on the topic image kserve upgrade CR
[16:43:25] basically i had to solve a path issue where revscoring had some hardcoded directories for our word2vec assets
[16:44:05] :(
[16:44:14] i had to replicate that structure in the blubberfile
[16:44:43] not too concerning as those models will be superseded by the outlink topic model soon tho
[16:48:23] so much tech debt to carry over into k8s
[16:55:54] :D
[16:56:21] at least it's isolated in containers now lol
[17:28:26] re: pip issues -- i read about a new "modern" python package manager yesterday called PDM
[17:28:28] https://pdm.fming.dev/
[17:29:34] not convinced it would solve our issues, pip seems to be the most "flexible"
[17:29:48] but it's nice to know others are thinking about this
[17:30:41] i think for now our best bet is to just create a "lockfile" via `pip freeze > requirements.txt` once we get a working env
[17:32:41] although in the future it might be nice to follow something like PEP 621 / pyproject.toml once we get our deps under control
[17:42:36] ml@wikimedia.org is now live!
[17:42:54] I'll add it to the MW page too
[17:53:50] nice!
[18:07:06] going afk, have a nice weekend folks :)
[18:10:25] have a great weekend Luca!
[18:14:23] see ya elukey
[18:43:30] alright, got the outlink predictor and transformer images upgraded to kserve, just pushed up the final CRs
[18:44:17] verified that they both work independently, but they're still running into that cluster-local-gateway issue on ml-sandbox
[18:44:54] either way, i believe all the images are running kserve now :)
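On the ml-sandbox disk topic above: assuming the box runs plain docker and the 60G partition is mounted at /srv, pointing docker's data-root at the big partition would stop /dev/sda1 from filling up with image layers. A sketch (paths are the stock docker defaults):

```
# Move container storage onto the larger /srv partition.
sudo systemctl stop docker
sudo mkdir -p /srv/docker
sudo rsync -a /var/lib/docker/ /srv/docker/
# If /etc/docker/daemon.json already exists, merge this key in instead of overwriting.
echo '{ "data-root": "/srv/docker" }' | sudo tee /etc/docker/daemon.json
sudo systemctl start docker
# Once everything is verified working, reclaim the old space:
#   sudo rm -rf /var/lib/docker
```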
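And the `pip freeze` "lockfile" idea from earlier, as a workflow sketch (requirements.in is a hypothetical file listing only the top-level deps):

```
# Resolve once in a clean venv, pin the result, reinstall from the pins elsewhere.
python3 -m venv venv
. venv/bin/activate
pip install -r requirements.in     # top-level deps only (hypothetical file)
pip freeze > requirements.txt      # pinned "lockfile" of the fully resolved set
pip install -r requirements.txt    # reproducible installs in other envs/images
```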