[06:54:09] (CR) Elukey: "Left a couple of comments but looks good!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/764915 (https://phabricator.wikimedia.org/T300270) (owner: Accraze)
[07:11:42] hello folks
[07:15:20] I think that we should probably create three new namespaces
[07:15:27] - revscoring-editquality-goodfaith
[07:15:35] - revscoring-editquality-reverted
[07:15:55] - revscoring-editquality-damaging
[07:16:21] so pods will spread over those and it will be easier to manage them
[07:16:33] I am going to create the namespaces/users right now
[07:16:47] and then we will be able to start moving isvcs around
[08:24:36] I am almost done configuring the new namespaces/helmfile config
[08:24:56] after that I'll try to delete the revscoring-editquality isvcs
[08:25:00] and ramp up the others
[09:53:29] all right, first patch ready
[09:53:44] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/765228
[09:53:56] it moves the reverted models to the new ns
[10:13:57] nice, deleting isvcs via helm works nicely
[10:14:40] ok, deploying the reverted models in the new ns
[10:14:42] let's see if it works
[10:17:07] yep, working!
[10:17:08] \o/
[10:17:12] Very nice
[10:17:14] will keep going with damaging
[10:20:17] and https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/765235/ should be enough to do the final moves
[10:24:53] LGTM
[10:25:57] Though something in my gut dislikes the mild repetitiveness of the list of models+versions. Like it should be a one-per-line thing. I guess that's just formatting.
[10:40:25] now that we have everything more compartmentalized, we'd need at least two variables
[10:40:29] wiki and version
[10:40:45] we could add an option to set the default model, unless specified
[10:43:11] you can try to modify the kserve-inference chart if you want, to reduce the verbosity
[10:43:38] atm we have "inference_service", which is the more general and customizable config
[10:43:57] and then "revscoring_inference_services", which is heavily tuned for the revscoring use case
[10:44:12] both define inference service (isvc) resources
[10:44:33] so there is some overlap, but better to have two than a single one with 100 conditionals in my opinion
[10:44:47] at least for the moment, we'll probably have a better version in the future
[10:44:56] Yeah, I don't think there's an easy way to make that YAML more parameterized, so we'd have only a flat list of things.
[10:45:25] but a flat list of things is probably not possible, we'd need at least two pieces of info for each model
[10:46:15] The mapping of model to version has to be somewhere, *and* we need to be able to set the host in some spots. I figure the list is as compact as it can be in YAML. Or at least that there's no way to make it more compact without also making it very unwieldy for some cases.
[10:47:58] there is indeed some intrinsic YAML barrier
[10:48:08] as yaml engineers we have to embrace it
[10:48:10] :)
[10:48:46] As the US Marines say: "Embrace the suck"
[10:49:04] ml-serve-eqiad works with the three new namespaces, going to update codfw as well
[10:49:06] It builds character etc.
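[Editor's note: as a rough sketch of the per-namespace values layout discussed above, assuming the chart takes a list of wiki/version entries plus an optional default model. Field names, versions and the host key are illustrative only, not the real kserve-inference chart schema.]

```yaml
# Hypothetical values snippet for the revscoring-editquality-goodfaith namespace.
# All field names and version numbers are illustrative, not the actual chart schema.
revscoring_inference_services:
  default_model: goodfaith          # assumed default, matching the namespace
  services:
    - wiki: enwiki
      version: "0.5.1"              # the model-to-version mapping has to live somewhere
    - wiki: dewiki
      version: "0.5.0"
    - wiki: ptwiki
      version: "0.5.0"
      host: "some-host.example"     # hypothetical per-entry host override
```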
[10:49:48] SGTM
[11:13:09] codfw done as well
[11:13:20] I am going to clean up the old namespace, after that we should be done
[11:20:45] revscoring-editquality cleanup https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/765242
[11:22:32] https://gerrit.wikimedia.org/r/c/labs/private/+/765243
[11:24:09] and finally https://gerrit.wikimedia.org/r/c/operations/puppet/+/765244
[11:24:22] (there is also some clean up to do in the puppet private repo)
[11:26:39] all LGTM'd, now going out to hunt some pizza
[11:28:27] kevinbazira: o/
[11:31:17] elukey: o/
[11:33:29] kevinbazira: hello :) so the revscoring-editquality ns has been split into three (reverted, goodfaith, damaging)
[11:33:36] same kind of config, but more granular
[11:35:02] Lift-Wing, artificial-intelligence, editquality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: Add editquality isvc configurations to ml-services helmfile - https://phabricator.wikimedia.org/T301415 (elukey) >>! In T301415#7730437, @ACraze wrote: > After talking with @eluke...
[11:35:37] lemme know if you see anything weird
[11:35:43] going afk for lunch, ttl!
[12:29:27] thank you for splitting the revscoring-editquality namespace, elukey.
[12:29:43] i was wondering whether the model name could also be derived from the namespace.
[12:29:56] i.e. model: "reverted" could be derived from "revscoring-editquality-reverted".
[12:30:12] i've pushed a patch using the new configs: https://gerrit.wikimedia.org/r/765254
[12:45:58] kevinbazira: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/765260/
[12:46:01] :)
[12:47:23] it is a no-op, if you like it I'll deploy but you'll have to change your patch
[12:50:42] that was fast. thank you for working on it. i've +1'd. :)
[12:51:09] please deploy ... i'll change the patch when you're done.
[13:04:18] weird, CI showed no diff, but helmfile shows diffs
[13:04:19] mmm
[13:08:57] of course I forgot to bump the chart, sigh
[13:10:21] basically https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/765264
[13:10:51] kevinbazira: you can change your patch in the meantime
[13:11:13] has yours merged yet?
[13:12:17] ok, i've checked and it's merged. let me change the patch now.
[13:12:17] yeah but I need a follow up that shouldn't matter for you
[13:17:12] yep all good now
[13:26:13] great. I've pushed patch set #2
[13:31:31] kevinbazira: yep looks good, checked all models on swift, merging in a sec
[13:33:25] kevinbazira: all good! you can deploy
[13:33:39] I deployed the new chart only in eqiad goodfaith
[13:33:40] ok, let me deploy now.
[13:33:58] if you deploy you'll also pick up my changes, pods will be refreshed but no-ops
[13:34:04] (CR) Klausman: [C: +1] "LGTM, with the same nits Luca mentioned." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/764915 (https://phabricator.wikimedia.org/T300270) (owner: Accraze)
[13:34:09] (so if you see pods terminating/creating it is fine)
[13:35:08] elukey: is there a rolling deployment kinda functionality built into kserve?
[13:35:43] As in, start new version, give it a few queries and see if it crashes, then move over traffic and remove the old ones.
[13:37:39] Or Helm, really.
Damn, too many names and systems :)
[13:38:44] klausman: we need to experiment with it, but knative-serving should give us the ability to do it
[13:39:24] anytime you deploy, knative creates a so-called revision, and you can instruct knative to split traffic between old/new revisions based on some percentage
[13:40:32] it should also be possible to have a 100/0 split (old/new), but our version of knative may not support similar things (or have bugs)
[13:40:54] there are some examples in the fixtures of kserve-inference IIRC
[13:41:56] helm is able to understand, in some cases, if new pods are failing etc..
[13:42:00] and roll back in that case
[13:42:29] (even if there are errors from webhooks while deploying etc..)
[13:44:03] in our case I think that knative/istio takes care of not routing traffic to a new revision/pod until it reaches a "ready" state
[13:44:44] so if the kserve-container passes health checks etc..
[13:45:12] in theory if a pod comes up completely broken, no user traffic will be routed to it
[13:46:27] * elukey bbiab
[13:47:06] elukey: the deploy steps have increased. one has to go back and forth between more namespace directories. trade off :D
[13:47:14] anyway, both eqiad and codfw deployments have been completed successfully.
[13:47:26] checking pods now ...
[13:53:06] all 5 new pods are up and running \o/
[13:59:43] kevinbazira: at steady statte
[13:59:45] jff
[13:59:53] yep
[13:59:56] can't type anymore, retring :D
[14:00:11] at steady state we shouldn't, in theory, deploy to all namespaces etc..
[14:00:21] elukey: that sounds good, I should do some reading on the rollout/rollback schemes we can use.
[14:00:25] so yeah more steps now, but better compartments :)
[14:03:01] yep
[14:04:19] so we are at 42 editquality models, ~30 more to go?
[14:08:38] we can probably start adding more at the same time
[14:16:17] I've also had a look at the k8s graphs in Grafana. Memory/CPU usage looks the same before/after the split. I did not _expect_ any changes, but you know how it can be
[14:16:58] The control plane pods seem to have a very mild memory bump, but nothing worrisome
[14:31:13] elukey: also, I know I am late to the party, but change 765260 is very neat
[14:41:04] thanks :)
[14:50:41] (going out to get groceries, bbl)
[15:05:53] Morning!
[15:37:39] morning!
[15:40:39] I answered so many emails yesterday, so many
[15:40:49] is this programming?
[15:42:28] you can start scripting your answers :D
[15:43:21] "Sorry for the late reply {company}, my family had COVID. No I don't want your ML SaaS product. Thank you for contacting us"
[15:49:33] chrisalbon: I had an email today from a spammer asking how much profit a non profit made
[15:49:54] they wanted to value miraheze before making an offer to buy it
[15:50:07] RhinosF1: Reply: This much: > < (not to scale)
[15:50:43] klausman: i deleted it
[15:50:50] along with 10 emails about sunglasses
[15:50:51] Oh well :)
[15:51:05] Replying only encourages them anyway
[15:53:29] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (elukey) Recap of the next steps (as I unders...
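[Editor's note: as a rough illustration of the revision/traffic-split mechanism described above, this is approximately how Knative Serving expresses a percentage split between an old revision and the latest one. Names, image and percentages are made up; in practice KServe manages the underlying Knative Service for an isvc, so this is a sketch of the mechanism rather than something one would edit by hand here.]

```yaml
# Sketch of a Knative Serving traffic split (illustrative names and numbers).
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: enwiki-goodfaith-predictor-default            # hypothetical predictor service
spec:
  template:
    spec:
      containers:
        - image: registry.example/goodfaith-model:new # placeholder image
  traffic:
    - revisionName: enwiki-goodfaith-predictor-default-00007  # old, known-good revision
      percent: 90
    - latestRevision: true                             # newly deployed revision
      percent: 10
```

[A 100/0 split, i.e. all traffic pinned to the old revision, would simply be percent: 100 and percent: 0 on the two entries, which matches the rollback-style behaviour discussed above.]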
[16:43:01] I am testing ores in beta and it is a real nightmare
[16:44:09] I am rebooting the instance to see if anything good happens
[16:55:19] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (Halfak) Hey folks! Just got back from some...
[16:55:20] for example, `curl -i localhost:8081/v3/scores/enwiki/1` leads to a 5s timeout in beta
[16:55:27] but it works nicely in prod
[16:57:30] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (elukey) @Halfak already fixed sorry, I was r...
[16:59:45] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (Halfak) Aha! What are the chances! I won...
[17:03:39] o/
[17:06:15] heya andy
[17:17:31] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (Majavah) >>! In T300195#7732670, @Halfak wro...
[17:50:25] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (elukey) >>! In T300195#7732748, @Majavah wro...
[18:02:00] /7
[18:04:14] klausman: ping :) (team meeting if you are joining)
[18:08:05] oops!