[08:41:47] isaranto: o/
[08:42:03] I think that the bullseye change for revscoring is ready, thanks a lot for the effort!
[08:42:46] we can also try kserve 0.9 afterwards if you are still up to it, otherwise I can take care of it (if it is only the numpy change it should be quick in theory)
[08:46:22] ο/ Luca! thanks will take a look. I will wait if kevin has something to add and merge it later today
[08:46:32] aa
[08:46:47] I will need at least one approval on github otherwise I can't merge
[08:47:29] elukey: yes let's try the kserve upgrade first as the python one is much bigger. I wouldn't want it to block the kserve one
[08:48:56] isaranto: right approved :)
[08:49:38] the kserve upgrade can wait, I am not blocked with the control plane..
[08:50:06] I fear that with all the work that you put in the dep resolution, if we start another journey with numpy etc.. we throw away all your work
[08:50:13] we can work on top of it
[08:50:40] we test both things in staging and when we are ready we roll out revscoring model servers in prod
[08:50:44] how does it sound?
[08:52:27] sounds good
[08:52:44] let me put some effort to try to close this today
[08:53:19] as long as things stay on the happy path at least :D
[08:54:01] nah it is not urgent, you can finish next week as well
[08:54:30] it is a big jump, so we can take a normal pace, no rush
[09:00:08] * elukey groceries, back in a bit
[09:56:53] isaranto: one qs - are we going to use MP for any of the model servers? Or better to stick with SP for all?
[10:00:22] SP is my suggestion as in the tests we ran the improvement didn't justify the change. I will write the summary on the ticket
[10:00:36] however we tested only for revscoring. We can revisit sometime
[10:00:42] wdyt?
[10:03:27] totally ok, from my initial tests it seemed that editquality etc..
may have got a benefit when being hit by a continuous stream of scores, but a lot of changes in the cluster have improved so if it is not good enough +1 for SP
[10:03:39] we can always revisit when ChangeProp will start hitting the pods
[10:03:55] for other model servers I think that we can see case-by-case
[10:04:38] we could leave it on for editquality indeed
[10:04:59] nah if it is not clear from your tests that it is needed we can skip it
[10:05:36] my main worry is that cpu-bound code will cause lags and connections pile up every now and then, but the overhead with serialization/deserialization of MP is too big atm
[10:05:46] I don't think it improves with Ray workers but I could be wrong
[10:05:53] (another thing to revisit in case)
[10:12:06] elukey: I'll need one more review/approval https://github.com/wikimedia/revscoring/pull/531 . I set a lower bound on some requirements (>= instead of ==) so that the revscoring package doesn't restrict us that much
[10:14:03] I also merged the dependabot PR. now dependabot submits some PRs by itself. I'll review them at some point
[10:14:43] ack
[10:15:01] numpy >= 1.19.5, < 1.21 is safe right? 1.20 doesn't require any change in revscoring?
[10:15:53] (approved btw)
[10:17:11] since revscoring package says < 1.21 it depends on what dependency we apply in our inf-services image. Since we set 1.19.5 we are ok.
This will stop us from trying to upgrade numpy in inf-services above 1.20
[10:17:46] I put it like this since I found there is a breaking change in numpy so we will update it afterwards
[10:18:23] yes yes I recall
[10:18:27] ack all looks good then
[10:27:18] very weird, I am hitting this issue https://github.com/istio/istio/issues/35829 with the istio-proxy in staging
[10:30:00] and it seems only happening on ml-staging2001, mmmm
[10:31:28] restarted the kubelet, all gone
[10:33:08] I think that this is related to the weirdnesses happening during the codfw switch failure
[10:35:20] in the middle of it there are also kserve 0.9 issues, I see two pods not coming up (in the goodfaith ns) due to
[10:35:31] "Failed to update InferenceService status","InferenceService":"eswikiquotewiki-damaging","error":"Operation cannot be fulfilled on inferenceservices.serving.kserve.io \"eswikiquotewiki-damaging\": the object has been modified; please apply your changes to the latest version and try again"
[10:36:05] and similar
[10:36:06] lovely
[10:36:33] the funny thing is that the new isvc that I have deployed worked nicely
[10:48:26] I want to test a node drain to see if the issue re-appears on 2002
[10:49:06] hmm. I remembered now that we have MP in some staging images. I will issue a patch to revert these. ok?
[10:49:35] yep yep I saw it, but the issue happens also for zhwiki that doesn't have customizations
[10:49:45] aha
[10:50:41] I also opened a draft PR on revscoring with the required numpy upgrade for kserve 0.9.0 so I don't forget https://github.com/wikimedia/revscoring/pull/538 . let's leave it there for now
[10:52:52] yeah on 2002 all pods work fine
[10:53:35] nice!
[10:53:42] (for the PR)
[10:55:56] the issue seems to me not related to kserve 0.9
[10:56:44] Machine-Learning-Team, Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (isarantopoulos) By the set of load tests we run with wrk and benthos there seem to be mixed results. https://phabricator.wikimedia.org/T323624#8468248 The editquality-damaging...
[10:56:52] ok now all works fine afaics
[10:57:29] I put a final comment/summary for the load tests
[10:57:42] I think these plots summarize our findings for now https://phabricator.wikimedia.org/T323624#8468248
[11:02:54] will review it thanks!
[11:04:26] I am inclined to upgrade kserve to ml-serve-codfw later on, to do a broader set of tests
[11:21:14] Machine-Learning-Team: Upgrade ml clusters to kserve 0.9 - https://phabricator.wikimedia.org/T325528 (elukey) The test on staging was a little weird since the nodes were in a weird state due to a codfw switch failure that caused some trouble with Kubelets. I tested some deployments: * Change docker image to...
[11:21:26] Machine-Learning-Team: Upgrade ml clusters to kserve 0.9 - https://phabricator.wikimedia.org/T325528 (elukey)
[11:25:15] Machine-Learning-Team, Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (elukey) Thanks a lot for the in depth testing! I think that we could enable MP where needed after we start seeing some traffic (for example, when we'll enable ChangeProp's stre...
[11:37:36] * elukey lunch!
[12:59:42] I sent a cleanup patch for load testing/MP https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/881870
[14:18:13] isaranto: thanks! Let's remove the "predictor" config and/or comment it, it may lead to confusion in my opinion otherwise
[14:21:35] elukey: sure!
initially I left it there so that it would be easy to enable it (keep it like a reference) but you're right I removed it all
[14:21:45] super
[14:22:01] going to roll out the kserve 0.9 control plane on ml-serve-eqiad
[14:28:55] all right 0.9 rolled out to all clusters
[14:31:45] (PS25) Ilias Sarantopoulos: Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[14:33:59] super!
[14:34:13] I am also ready to take python 3.9 + bullseye repo to staging
[14:34:17] the patch is ready for review
[14:34:51] I have tested locally with many models and all seem to work. Disclaimer: I haven't tested with all 100+ models though
[14:35:07] going to review it now!
[14:39:10] (CR) Ilias Sarantopoulos: Upgrade the revscoring model server to Python 3.9 (3 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[14:43:02] (CR) Elukey: "Everything looks good, I only have a doubt about the pyenchant file added, that is shipped with https://github.com/pyenchant/pyenchant/blo" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[14:43:43] (PS26) Ilias Sarantopoulos: Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[14:45:25] I just removed also the uninstall of argparse from the entrypoint.sh. There was some legacy version installed through python-dev package in debian as I understand.
now it plays well
[14:46:09] Machine-Learning-Team: Upgrade ml clusters to kserve 0.9 - https://phabricator.wikimedia.org/T325528 (elukey) The new control plane has been rolled out to all clusters, nothing horrible to report. Next step: move all model server docker images to kserve 0.9
[14:47:59] elukey: regarding the LICENSE what do you suggest? is the header of the file not enough?
[14:48:20] isaranto: the header doesn't mention the GNU package license right?
[14:48:23] we could explicitly mention that it is not our code, where do you suggest we do that
[14:48:43] yeah we could maybe add a special LICENSE file ourselves
[14:48:57] not sure, lemme ask to some expert in SRE :)
[14:49:55] the header mentions ```This library is free software; you can redistribute it and/or
[14:49:55] # modify it under the terms of the GNU Lesser General Public
[14:49:55] # License as published by the Free Software Foundation;```
[14:50:46] ahh okok didn't see it
[14:50:55] I only saw the copyright
[14:50:59] okok then it should be good
[14:51:26] (CR) Elukey: Upgrade the revscoring model server to Python 3.9 (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[14:52:38] isaranto: the only nit remaining is a little bit more comment in https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/870517/26/revscoring/model.py
[14:52:42] maybe a task reference etc..
[14:52:56] my worry is that in 6 months time we don't remember why we did it etc..
[14:53:06] I'm writing a 2-3 line comment
[14:53:10] ah super <3
[14:55:50] (PS27) Ilias Sarantopoulos: Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[14:55:56] done!
[14:56:13] (CR) Ilias Sarantopoulos: Upgrade the revscoring model server to Python 3.9 (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[14:59:17] (CR) Elukey: [C: +1] "Great work!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[15:00:02] shall I merge this to try it out in staging?
[15:00:17] let's do it
[15:03:30] 🤞
[15:03:44] (CR) Ilias Sarantopoulos: [C: +2] Upgrade the revscoring model server to Python 3.9 (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[15:04:49] (Merged) jenkins-bot: Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[15:18:57] elukey: regarding changeprop and hnowlan's comment: how do we add the kserve endpoints to the networkpolicy ? to this file charts/changeprop/default-network-policy-conf.yaml?
[15:24:50] isaranto: in theory it would be better in the helmfile.d's config, but I see that it doesn't seem to be used
[15:25:03] I'll check what's best, but the -conf.yaml file is a possibility yes
[15:25:24] the only downside is that if you add it at the chart level then it is used everywhere (prod/staging/etc..)
[15:26:08] I am trying to DRY the definition of the workflows atm
[15:26:45] ack
[15:32:53] thank you for the review Luca! u rock! 🤘
[15:34:34] *reviews (plural) 😄
[15:35:06] well all the work was yours :)
[15:40:34] it takes a village !
[16:25:46] (PS1) Ilias Sarantopoulos: Deployment script examples: [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/881899
[16:29:03] I found some issues with 2 drafttopic models (ar,cs) and 1 editquality-damaging in staging (wikidata)
[16:29:03] will investigate now but probably more on monday
[16:29:17] shall I revert these for now or is it ok since they are in staging?
[16:31:54] nono all good
[16:37:47] ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 80 from PyObject
[16:38:15] this is from the new model servers..
[16:38:34] /opt/lib/python/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator GradientBoostingClassifier from version 0.22.1 when using version 1.0.2. This might lead to breaking code or invalid results. Use at your own risk.
[16:39:03] IIRC during my tests there were some incompatibilities between sklearn/numpy
[16:39:16] lovely
[16:39:29] anyway, this is not for a Friday, we can restart working on it on Monday isaranto :)
[16:40:38] These warnings are unavoidable. retraining is the only solution. but we can test to see that the models are good
[16:41:57] in the end the errors are only on drafttopic models (ar, cs, en) and related with numpy. I will investigate on monday, but from what I've read numpy upgrade will make these go away..
[16:42:05] logging off folks, cu monday
[16:58:03] o/
[16:58:10] same thing, have a nice weekend folks!
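
For reference, the numpy bound discussed earlier in the log (numpy >= 1.19.5, < 1.21) can be sketched as a small check. This is only an illustration: the helper names `parse_version` and `satisfies_numpy_bound` are hypothetical and not part of revscoring or inference-services, and the parsing assumes plain dotted numeric versions (no rc/dev suffixes).

```python
def parse_version(text, width=3):
    """Turn '1.19.5' into (1, 19, 5), padding short versions with zeros.

    Assumption: versions are purely numeric and dot-separated.
    """
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def satisfies_numpy_bound(version, lower="1.19.5", upper="1.21"):
    """Return True when lower <= version < upper (hypothetical helper)."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


if __name__ == "__main__":
    # 1.19.5 and the 1.20 series fall inside the range, 1.21.0 does not.
    for candidate in ("1.19.5", "1.20.3", "1.21.0"):
        print(candidate, satisfies_numpy_bound(candidate))
```

A real dependency resolver follows the full version specifier rules, so this sketch is only meant to make the range in the conversation concrete.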