[05:26:29] (PS1) Kevin Bazira: editquality: remove transformer blubberfile [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/761788 (https://phabricator.wikimedia.org/T301412)
[06:50:10] (CR) Elukey: [C: +2] Reverts back to python 3.5 wheels and includes mwparserfromhell 0.6.3 [research/ores/wheels] - https://gerrit.wikimedia.org/r/761015 (https://phabricator.wikimedia.org/T300195) (owner: Halfak)
[07:23:33] hello folks, there is a weird icinga error for "ores workers" since 12h ago
[07:23:45] I quickly checked and ores is fine, I think it is a monitoring weirdness
[07:24:00] I haven't found the reason, but I'd need to run some errands now
[07:24:04] will check later
[09:13:42] kevinbazira: o/ you can deploy your change if you want! I am curious to see if kserve reverts back to its predictor-only behavior transparently
[09:19:30] ok thanks for the review elukey. let me deploy the change now.
[09:31:31] elukey: the deployments on both eqiad and codfw have completed successfully.
[09:31:45] Things look ok on my end. Hope they're stable on your end too?
[09:32:34] (in a meeting will check in a sec!)
[09:33:00] ok ... thanks.
[11:29:21] kevinbazira: o/ sorry just finished the meetings, and now I have to run a little errand, but from a quick check I see old pods in both clusters
[11:31:34] yep so the InferenceService resource in kubernetes still has the transformer
[11:31:37] mmmm
[11:31:39] weird
[11:40:02] it is surely something weird happening in the kserve-inference chart, I'll check after lunch/errands!
[14:14:08] kevinbazira: curiosity - when you ran helmfile, did you diff?
[14:14:16] I am wondering if you saw any diff at all
[14:14:22] Yes, I did diff
[14:14:39] and what was the output?
[14:14:40] and there was a diff
[14:14:44] (CR) Klausman: [C: +1] editquality: remove transformer blubberfile [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/761788 (https://phabricator.wikimedia.org/T301412) (owner: Kevin Bazira)
[14:14:51] super weird
[14:15:05] do you have it somewhere by any chance?
[14:15:20] yes, I do
[14:15:24] \o/
[14:15:31] let me try to find it
[14:16:38] not sure if it is in the docs but you can check the status of the pods in theory, from deploy1002
[14:16:48] kube_env revscoring-editquality ml-serve-eqiad
[14:16:51] and then
[14:17:00] kubectl get pods -n revscoring-editquality
[14:17:24] afaics nothing changed, transformers are already there.. and the inferenceservice resource didn't change
[14:18:27] I've pasted it in your slack dm because it handles code formatting well
[14:19:31] thanks a lot!
[14:19:36] exactly as it should have been
[14:21:35] ahh ok maybe I get what's happening
[14:22:08] the resource also has a "status:" config that I believe is auto-filled
[14:22:18] and it still lists the transformer
[14:24:25] trying to manually delete the pod
[14:26:13] that of course comes up again due to replication policies
[14:29:14] I think that the kserve controller has either a bug or a missing functionality
[14:32:03] it seems something like https://github.com/kserve/kserve/issues/460
[14:38:25] Unha ... do we want to manually delete the transformer?
[14:38:27] or better yet delete the entire inference service using the config that had both transformer + predictor settings then redeploy it afresh with a config that has only the predictor settings?
[14:47:05] not sure what's best, maybe it is a knative-serving issue?
[15:42:15] Morning all
[15:42:59] Minor panic attack that ores was down but sounded like a false alarm
[15:43:15] yes yes :)
[16:12:49] kevinbazira: so with `kubectl get ksvc -n revscoring-editquality` I got a list of knative services, and I just deleted the transformer ones
[16:12:54] it seems that now the pods are going down
[16:20:25] it is weird, I hoped that it was taken care of by kserve
[16:20:34] anyway, the predictors seem to work :)
[16:23:01] (PS1) Elukey: Update wheels submodules with latest changes [services/ores/deploy] - https://gerrit.wikimedia.org/r/761946 (https://phabricator.wikimedia.org/T300195)
[16:23:40] this is the submodule bump for the new wheels --^
[16:23:51] once reviewed i'll merge and deploy to beta, to see if all works
[16:25:02] ok now I'd need to figure out what's happening with that ORES alert
[16:41:15] chrisalbon: I had a very interesting chat with Miriam today, and she told me about a feature-store related use case that is very interesting. For some model training, they would ideally need (in the future) as many scores of edits as possible (for example, say edit quality etc..). She was worried about having a specific training job calling liftwing over and over to score revisions, and I realized
[16:41:21] that having the current kafka topic populated by scores is not a bad idea
[16:41:37] we could have airflow+spark to periodically pull from kafka and create feature datasets that anybody could use
[16:42:05] it would of course require keeping the current architecture of constant cache warmup
[16:42:25] I never realized before today that scores could have been useful as features for training jobs
[16:42:29] does it make sense?
[16:42:47] (this is kind of unknown territory for me so if I said something completely wrong let me know :)
[16:43:45] perhaps ores should be a good event driven application and propagate any state changes to an event stream! :D :D :D
[16:44:07] then you could use that stream to populate your caches anywhere too!
[16:44:45] score every revision with every model as the revision event happens, produce the scores to streams. then ores only gets called once to do a score per revision ever
[16:46:27] ottomata: we are not going to add that functionality to ORES for sure, Lift Wing is the future :)
[16:47:05] lift wing is serving yes?
[16:47:21] ottomata: with knative-eventing we could, in theory, pull from a kafka topic containing new revision events and score them automatically (but it would require a more up to date version of k8s/knative/etc.. so not the immediate future)
[16:47:27] oh lift wing is serving requests to a model
[16:47:37] pushing the score elsewhere may be a little more difficult
[16:47:57] yeah it is what will replace ORES in production basically
[16:48:01] i guess; the revision scores are very very useful outside of a request
[16:48:05] they'd be useful in dumps, for example
[16:48:57] the revision scores are a data product in themselves, and hopefully one day will be managed as such, so that they are curated and consumable and useable by things outside of lift wing
[16:49:12] ottomata: definitely, I didn't have a lot of use cases in mind, one of the things that we are planning to do is to figure out how ORES is really used and what could be a good way forward. Now it makes a lot of sense what you proposed yesterday
[16:49:28] yes yes, for example feature datasets
[16:49:41] we could easily have airflow+spark to build those datasets periodically
[16:49:53] I mean curated by either your team or my team (or a collaboration)
[16:50:30] elukey: i give you shared-data platform doc (still draft, hopefully out of draft by end of this month) https://docs.google.com/document/u/1/d/15QqLTsKIrUCfhGPHIkl6OKeh2S1NZeEe4h0O9yTm7Fo/edit#heading=h.k12p3bpi70y5 :)
[16:51:02] I'll read it :)
[16:51:05] ty!
[16:51:10] comments very very welcome pleeeezzzzz
[16:51:38] sure!
[16:52:38] That is interesting, I'd love to hear from miriam what the specifics are. It isn't so much about selecting the final features of Lift Wing (since there is no "final" steady state) but selecting the features we need for MVP, getting those done, then picking some more in the post-MVP "sprint"
[16:52:53] yep yep
[16:53:28] it makes a lot more sense to me now to score as many revisions as possible by default
[16:54:53] My ideal is that we launch the MVP, have people use it and say "Hey, this sucks, it doesn't do X, I want it to do X" then we build X. Then Y, then Z, then K. etc. etc. Miriam's use might be that so it's good you both chatted!
[16:59:39] chrisalbon: yep definitely, I was reasoning more about what is needed to move the ORES use cases to lift wing
[16:59:42] (after the MVP)
[17:03:49] Yeah good point
[17:18:17] just changed the alerts of 'team-scoring' to this channel, they were pointing to #wikimedia-ai
[17:34:47] thank you elukey for bringing this up! chrisalbon: let's chat about this further in our meeting next week :)
[19:08:52] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (Halfak) Aha! I caught something rebuilding...
[19:16:29] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (Dzahn) @halfak @elukey For some reason aspel...
[19:19:17] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (Halfak) Thanks @Dzahn! Can you also run pup...
[19:45:39] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (elukey) @thanks a lot @Dzahn! @Halfak pack...
[19:51:23] (CR) Halfak: [C: +1] "Looks good to me. Thanks." [services/ores/deploy] - https://gerrit.wikimedia.org/r/761946 (https://phabricator.wikimedia.org/T300195) (owner: Elukey)
[19:52:19] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (Halfak) @elukey thanks for asking. It would...
[19:54:14] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (Halfak) Also, confirmed that I now have the...