[06:38:46] hello folks
[06:39:14] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/700470/17 for kubeflow (PS 17 :P) seems to be ready for a review
[06:40:20] will try to get some reviews today, I'll also try to create the webhook's TLS cert
[08:01:22] I'll have a look at that charts patch today
[08:04:19] super thanks
[08:06:45] 17051 lines of YAML. oof.
[08:08:25] the weird thing is that most of it is for the api conversion webhook IIUC
[08:08:28] from alpha to beta
[08:08:36] (and we decided to go for beta directly)
[08:08:54] so there may be a chance to drop say 12/13k lines of code but I have no idea how
[08:41:09] Yeah, I think a file like that would be better served by JSON. Easier to verify nesting etc
[08:41:35] Maybe with indentation-based folding, I could make sense of it. I'll take a look at that (but I won't hold up review with it)
[08:50:42] It will probably be trimmed by upstream when they deprecate the alpha api
[09:19:00] about the TLS certificate for the kfserving webhook
[09:19:21] the initial idea was - inference-kfserving-webhook.wikimedia.org, but I think that we may need to have two
[09:19:34] .svc.{eqiad,codfw}.wmnet
[09:52:48] so like my-awesome-inferenc.svc.[dc].wmnet?
[09:52:55] inference*
[10:10:14] I meant inference-kfserving-webhook.svc.{eqiad,codfw}.wmnet
[10:10:40] we could even use the same cert in codfw/eqiad, and call it .wikimedia.org
[10:11:11] the cert needs to be exposed by the webhook so that the k8s api can call it
[10:11:40] (also we'll need to instruct the k8s api to trust the puppet ca crt, there is a bit added to the chart for it)
[10:12:16] there is some puppet private config to deploy the certs to some helmfile on deploy1002
[10:12:21] to be used as values
[10:12:49] but for the moment used only for .discovery.wmnet records (the ones used by other services)
[10:13:34] ah no nevermind, all services have a discovery record + codfw + eqiad
[10:13:42] no .wikimedia.org, makes sense
[10:14:18] probably a .discovery record is better for this use case too
[10:15:18] okok so I'd say
[10:15:23] inference-kfserving-webhook
[10:15:28] .discovery.wmnet
[10:15:46] using the kube_services.certs.yaml
[10:17:48] klausman: is it ok as name? If so I'll start creating it after lunch
[10:18:33] ah no waiiiiittttt
[10:18:49] of course discovery is not right, we'll not have a nodeport
[10:18:55] it will be a cluster ip
[10:19:07] so no gdns support, uffffff
[10:21:18] we could in theory just use it, since the webhook's endpoint is configurable
[10:21:39] but it may be confusing at first, since there will be no corresponding .discovery.wmnet record in our gdns setup
[10:21:48] (we'd basically only use its TLS cert)
[10:22:10] mmmmmmmmmmmm
[10:24:04] I added a comment, .discovery may be just ok for the time being
[10:24:52] we manually generate the cert using what serviceops is already providing (https://wikitech.wikimedia.org/wiki/Enable_TLS_for_Kubernetes_deployments)
[10:25:09] and then we make sure that helm is able to see it and deploy it
[10:26:01] for the external endpoint, inference.discovery.wmnet, we'll do the whole thing (including LVS etc..)
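A minimal sketch of the naming and trust setup discussed above, in Python for illustration only. It expands the per-datacenter names from `inference-kfserving-webhook.svc.{eqiad,codfw}.wmnet` and builds a webhook `clientConfig` fragment whose `caBundle` is the base64-encoded PEM CA certificate, which is how a Kubernetes webhook configuration tells the API server which CA to trust. The function names, the `/validate` path, and the placeholder PEM bytes are assumptions and are not taken from the chart under review.

```python
import base64

# The two datacenters named in the discussion.
DATACENTERS = ("eqiad", "codfw")


def webhook_hostnames(service="inference-kfserving-webhook"):
    """Expand <service>.svc.{eqiad,codfw}.wmnet into one name per datacenter."""
    return [f"{service}.svc.{dc}.wmnet" for dc in DATACENTERS]


def client_config(hostname, ca_pem, path="/validate"):
    """Build a webhook clientConfig fragment (hypothetical helper).

    caBundle is the base64-encoded PEM CA certificate the Kubernetes API
    server uses to verify the webhook's TLS cert. The default path is a
    placeholder, not the real kfserving webhook path.
    """
    return {
        "url": f"https://{hostname}{path}",
        "caBundle": base64.b64encode(ca_pem).decode("ascii"),
    }


if __name__ == "__main__":
    # Placeholder bytes standing in for the real CA certificate.
    fake_ca = b"-----BEGIN CERTIFICATE-----\nAAAA\n-----END CERTIFICATE-----\n"
    for name in webhook_hostnames():
        print(name)
    print(client_config("inference-kfserving-webhook.discovery.wmnet", fake_ca))
```

Because the endpoint is configurable, the same helper works whether the cert is issued for the `.svc.<dc>.wmnet` names or for the `.discovery.wmnet` name that the chat settles on; only the hostname argument changes.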
[10:26:18] but it is not needed at the moment, we can just use plain HTTP for istio/knative
[10:28:18] Sorry, I was cooking :) Reading backlog now
[10:28:30] I have to go in a bit no hurry :)
[10:46:09] * elukey lunch!
[12:55:11] +1'd.
[13:21:22] back!
[15:11:13] ok inference-kfserving-webhook.discovery.wmnet TLS cert created, and deployed on deploy1002
[15:11:23] now https://gerrit.wikimedia.org/r/700470
[15:11:32] err after --^ we'll need a helmfile config
[15:11:37] for the "inference" service
[15:40:55] afternoon all!
[15:48:29] \o
[15:48:34] o/
[15:48:49] I ate way too much pizza this weekend
[15:50:30] there is no such thing as having too much pizza :D
[15:50:51] my weight disagrees
[15:55:50] If you eat enough pizza, you become as round as the pizza.
[15:56:31] that is about right, but there is really great pizza around me so it is a trade off
[15:58:55] so for the base kfserving layer I thought about another admin_ng setup: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/709494
[15:59:13] this in theory should spin up the kfserving-manager pod + configure webhooks and CRDs
[15:59:20] reading...
[15:59:46] and under "services" we'll have only a helmfile to configure the InferenceService instances
[16:00:06] That looks ace, +1ing
[16:07:23] elukey: that looks great!
[16:15:46] accraze: o/ we'll see! I am going to wait to see if anybody from serviceops agrees with the solution (since I am touching admin_ng), and we'll then hopefully deploy soon-ish
[16:16:15] if this is ok, the next step would be to create something like "inference" under the "services" dir, with a basic helmfile
[16:16:24] so people will work on it in the future
[16:20:24] * elukey is looking forward to see a score created on kfserving :)
[16:23:10] me too!
[16:25:08] I imagine us gathering our newly-made magic eightball, Luca shaking it, and Andy going "It says.... 503 service unavailable"
[16:28:08] ^ haha
[16:30:13] I will admit I adapted this from an anecdote about the shooting of the Parks'n'Rec TV show. Chris Pratt one day ad-libbed the line "I googled your symptoms and it says you may have Internet connection problems." The writers hated him for coming up with such a great line as an ad-lib
[16:30:29] ha
[16:32:35] Machine-Learning-Team, CommRel-Specialists-Support: Outreach campaign to raise awareness of Scoring Platform - https://phabricator.wikimedia.org/T217232 (Keegan) a:Keegan→None
[16:32:56] Machine-Learning-Team, CommRel-Specialists-Support: Outreach campaign to raise awareness of Scoring Platform - https://phabricator.wikimedia.org/T217232 (Keegan) I'm pretty sure this task can be closed, but I haven't owned it in a long time.
[16:34:01] Machine-Learning-Team, CommRel-Specialists-Support: Outreach campaign to raise awareness of Scoring Platform - https://phabricator.wikimedia.org/T217232 (calbon) Stalled→Declined
[16:34:03] Machine-Learning-Team, CommRel-Specialists-Support: Outreach campaign to raise awareness of Scoring Platform - https://phabricator.wikimedia.org/T217232 (calbon) Yep I'll close it. Thanks for marking.
[16:46:58] klausman: after shaking and getting a 503 you can probably imagine the stream of italian swear words :D
[16:47:45] Well, the plot twist is, that the question was "What is the most common error message if you misconfigure an nginx reverse proxy?"
[16:51:12] but how can you tell if the response was an error one or the computed one? :D
[16:51:25] A classic conundrum
[16:52:21] There was a town in Wales (where all signs have to be both English and Welsh) that wanted a sign translation. What they got back from the translator and put on the sign was "I'm out of the office, send any translation requests to my colleague"
[16:54:07] And of course there's that photo from a Chinese factory, where the banner in the background says: "Translation server error"
[16:59:15] Machine-Learning-Team, CommRel-Specialists-Support: Outreach campaign to raise awareness of Scoring Platform - https://phabricator.wikimedia.org/T217232 (Tgr) The underlying goal of disseminating knowledge of ORES among the on-wiki technical community is still valid and much needed, though (assuming ORES...
[17:28:51] * elukey afk!
[17:30:11] Machine-Learning-Team, CommRel-Specialists-Support: Outreach campaign to raise awareness of Scoring Platform - https://phabricator.wikimedia.org/T217232 (calbon) Yeah totally agree. Even more important as we deprecate ORES (very long term) to have clear messaging and discussion.
[17:34:48] klausman: that's what happens when people can't ask questions
[17:35:20] ha! true
[17:38:35] chrisalbon: which part?
[17:42:05] when people can't ask questions
[17:42:31] chrisalbon: that's why I need a code reviewer
[17:42:42] You'd be surprised what I write at midnight
[17:43:11] I've been using Github Copilot and sometimes I feel like I am its code reviewer
[17:43:13] * RhinosF1 is on one team where he's the only Python dev so just gets blind merges
[17:43:20] Oh how's that going
[17:44:39] It is interesting, for simple boilerplate stuff like "open every pdf file in a directory" it is okay but for more complex stuff it just gets confused and starts recommending random stuff
[17:50:16] Heh
[17:50:19] I'm not shocked
[17:50:43] I bet half the code on github for complex stuff isn't clean enough to make sense
[17:52:28] I thought it might be the stackoverflow killer, but now I don't think so
[17:53:01] Most SO questions are a real problem with a very specific nuance or twist, and CoPilot won't do well in those situations
[18:27:41] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Inference Clients - https://phabricator.wikimedia.org/T287051 (ACraze) Great work on finding a solution for both sandbox environments. It seems stable on KFv1.3 for enwiki-goodfaith.
[18:32:57] Lift-Wing, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks): Create blubberfile for articlequality model server - https://phabricator.wikimedia.org/T287781 (ACraze) a:ACraze
[18:38:31] chrisalbon: oh it definitely won't with twists
[18:57:51] Yeah it kinda looks like it either ignores the twist or misinterprets it. Not really copilot's fault since sometimes the nuances are really specific, but it is clear that software engineers aren't becoming outdated anytime soon