[08:41:47] <elukey>	 isaranto: o/
[08:42:03] <elukey>	 I think that the bullseye change for revscoring is ready, thanks a lot for the effort!
[08:42:46] <elukey>	 we can also try kserve 0.9 afterwards if you are still up to it, otherwise I can take care of it (if it is only the numpy change it should be quick in theory)
[08:46:22] <isaranto>	 ο/ Luca! thanks will take a look. I will wait if kevin has something to add and merge it later today
[08:46:32] <isaranto>	 aa
[08:46:47] <isaranto>	 I will need at least one approval on github, otherwise I can't merge
[08:47:29] <isaranto>	 elukey: yes let's try the kserve upgrade first as the python one is much bigger. I wouldn't want it to block the kserve one
[08:48:56] <elukey>	 isaranto: right approved :)
[08:49:38] <elukey>	 the kserve upgrade can wait, I am not blocked with the control plane..
[08:50:06] <elukey>	 I fear that with all the work that you put into the dep resolution, if we start another journey with numpy etc.. we throw away all your work
[08:50:13] <elukey>	 we can work on top of it 
[08:50:40] <elukey>	 we test both things in staging and when we are ready we roll out revscoring model servers in prod
[08:50:44] <elukey>	 how does it sound?
[08:52:27] <isaranto>	 sounds good
[08:52:44] <isaranto>	 let me put some effort to try to close this today 
[08:53:19] <isaranto>	 as long as things stay on the happy path at least :D
[08:54:01] <elukey>	 nah it is not urgent, you can finish next week as well
[08:54:30] <elukey>	 it is a big jump, so we can take a normal pace, no rush
[09:00:08] * elukey groceries, back in a bit
[09:56:53] <elukey>	 isaranto: one qs - are we going to use MP for any of the model servers? Or better to stick with SP for all?
[10:00:22] <isaranto>	 SP is my suggestion, as in the tests we ran the improvement didn't justify the change. I will write the summary on the ticket
[10:00:36] <isaranto>	 however we tested only for revscoring. We can revisit sometime
[10:00:42] <isaranto>	 wdyt?
[10:03:27] <elukey>	 totally ok, from my initial tests it seemed that editquality etc.. may have got a benefit when being hit by a continuous stream of scores, but a lot of changes in the cluster have improved things, so if it is not good enough +1 for SP
[10:03:39] <elukey>	 we can always revisit when ChangeProp will start hitting the pods
[10:03:55] <elukey>	 for other model servers I think that we can see case-by-case
[10:04:38] <isaranto>	 we could leave it on for editquality indeed
[10:04:59] <elukey>	 nah if it is not clear from your tests that it is needed we can skip it
[10:05:36] <elukey>	 my main worry is that cpu-bound code will cause lags and connections pile up every now and then, but the overhead with serialization/deserialization of MP is too big atm
[10:05:46] <elukey>	 I don't think it improves with Ray workers but I could be wrong
[10:05:53] <elukey>	 (another thing to revisit in case)
[10:12:06] <isaranto>	 elukey: I'll need one more review/approval https://github.com/wikimedia/revscoring/pull/531 . I set a lower bound on some requirements (>= instead of ==) so that the revscoring package doesn't restrict us that much
[10:14:03] <isaranto>	 I also merged the dependabot PR. now dependabot submits some PRs by itself. I'll review them at some point
[10:14:43] <elukey>	 ack
[10:15:01] <elukey>	 numpy >= 1.19.5, < 1.21 is safe right? 1.20 doesn't require any change in revscoring?
[10:15:53] <elukey>	 (approved btw)
[10:17:11] <isaranto>	 since the revscoring package says < 1.21 it depends on what dependency we apply in our inf-services image. Since we set 1.19.5 we are ok. This will stop us from trying to upgrade numpy in inf-services above 1.20
[10:17:46] <isaranto>	 I put it like this since I found there is a breaking change in numpy so we will update it afterwards
[10:18:23] <elukey>	 yes yes I recall
[10:18:27] <elukey>	 ack all looks good then
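The range agreed on above (`numpy >= 1.19.5, < 1.21`) can be checked by hand with a small sketch like the one below. It is only an illustration: `parse` and `allowed` are hypothetical helpers, they handle plain `X.Y.Z` versions only, and real tooling such as pip implements the full version-specifier rules.

```python
# Hypothetical helpers, not part of revscoring or inference-services.
# They only understand plain numeric versions such as "1.19.5".

def parse(version: str, width: int = 3) -> tuple:
    """Turn '1.21' into (1, 21, 0) so versions compare numerically."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))

LOWER = parse("1.19.5")  # inclusive lower bound from the discussion
UPPER = parse("1.21")    # exclusive upper bound from the discussion

def allowed(version: str) -> bool:
    """True when the version satisfies >= 1.19.5, < 1.21."""
    return LOWER <= parse(version) < UPPER

for candidate in ("1.19.4", "1.19.5", "1.20.3", "1.21.0"):
    print(candidate, allowed(candidate))
```

With these bounds the 1.19.5 pin mentioned for the inf-services image falls inside the range, and any 1.21.x release falls outside it.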
[10:27:18] <elukey>	 very weird, I am hitting this issue https://github.com/istio/istio/issues/35829 with the istio-proxy in staging
[10:30:00] <elukey>	 and it seems only happening on ml-staging2001, mmmm
[10:31:28] <elukey>	 restarted the kubelet, all gone
[10:33:08] <elukey>	 I think that this is related to the weirdnesses happening during the codfw switch failure
[10:35:20] <elukey>	 in the middle of it there are also kserve 0.9 issues, I see two pods not coming up (in the goodfaith ns) due to
[10:35:31] <elukey>	 "Failed to update InferenceService status","InferenceService":"eswikiquotewiki-damaging","error":"Operation cannot be fulfilled on inferenceservices.serving.kserve.io \"eswikiquotewiki-damaging\": the object has been modified; please apply your changes to the latest version and try again"
[10:36:05] <elukey>	 and similar
[10:36:06] <elukey>	 lovely
[10:36:33] <elukey>	 the funny thing is that the new isvc that I have deployed worked nicely
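The "object has been modified" message quoted above is the usual Kubernetes optimistic-concurrency conflict: a write was based on a copy of the object that is older than the stored one. A minimal sketch of the read, modify, retry pattern follows. It assumes an in-memory stand-in for the API server; `FakeStore`, `Conflict` and `update_status` are hypothetical names, not kserve or Kubernetes client code.

```python
class Conflict(Exception):
    """Stands in for the API server's 'object has been modified' error."""


class FakeStore:
    """Hypothetical in-memory stand-in for an API server holding one object."""

    def __init__(self):
        self.obj = {"resourceVersion": 1, "status": "Unknown"}

    def get(self):
        return dict(self.obj)

    def update(self, new):
        # Reject writes that were built from a stale copy of the object.
        if new["resourceVersion"] != self.obj["resourceVersion"]:
            raise Conflict("the object has been modified")
        self.obj = dict(new, resourceVersion=new["resourceVersion"] + 1)


def update_status(store, status, attempts=5):
    """Re-read the latest object before each write; retry on conflict."""
    for _ in range(attempts):
        current = store.get()
        current["status"] = status
        try:
            store.update(current)
            return True
        except Conflict:
            continue
    return False


store = FakeStore()
stale = store.get()              # copy taken before another writer updates
update_status(store, "Ready")    # bumps the stored resourceVersion
try:
    store.update(stale)          # write based on the old copy
except Conflict as err:
    print("stale write rejected:", err)
```

In this sketch the conflict is harmless as long as the writer retries from a fresh read, which would be consistent with the pods recovering later in the log.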
[10:48:26] <elukey>	 I want to test a node drain to see if the issue re-appears on 2002
[10:49:06] <isaranto>	 hmm. I remembered now that we have MP in some staging images. I will issue a patch to revert these. ok?
[10:49:35] <elukey>	 yep yep I saw it, but the issue happens also for zhwiki that doesn't have customizations
[10:49:45] <isaranto>	 aha
[10:50:41] <isaranto>	 I also opened a draft PR on revscoring with the required numpy upgrade for kserve 0.9.0 so I don't forget https://github.com/wikimedia/revscoring/pull/538 . lets leave it there for now
[10:52:52] <elukey>	 yeah on 2002 all pods work fine
[10:53:35] <elukey>	 nice!
[10:53:42] <elukey>	 (for the PR)
[10:55:56] <elukey>	 the issue seems to me not related to kserve 0.9
[10:56:44] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (isarantopoulos) By the set of load tests we run with wrk and benthos there seem to be mixed results. https://phabricator.wikimedia.org/T323624#8468248 The editquality-damaging...
[10:56:52] <elukey>	 ok now all works fine afaics
[10:57:29] <isaranto>	 I put a final comment/summary for the load tests
[10:57:42] <isaranto>	 I think these plots summarize our findings for now https://phabricator.wikimedia.org/T323624#8468248
[11:02:54] <elukey>	 will review it thanks!
[11:04:26] <elukey>	 I am inclined to upgrade kserve to ml-serve-codfw later on, to do a broader set of tests
[11:21:14] <wikibugs>	 Machine-Learning-Team: Upgrade ml clusters to kserve 0.9 - https://phabricator.wikimedia.org/T325528 (elukey) The test on staging was a little weird since the nodes were in a weird state due to a codfw switch failure that caused some trouble with Kubelets.  I tested some deployments: * Change docker image to...
[11:21:26] <wikibugs>	 Machine-Learning-Team: Upgrade ml clusters to kserve 0.9 - https://phabricator.wikimedia.org/T325528 (elukey)
[11:25:15] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (elukey) Thanks a lot for the in depth testing! I think that we could enable MP where needed after we start seeing some traffic (for example, when we'll enable ChangeProp's stre...
[11:37:36] * elukey lunch!
[12:59:42] <isaranto>	 I sent a cleanup patch for load testing/MP https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/881870
[14:18:13] <elukey>	 isaranto: thanks! Let's remove the "predictor" config and/or comment it, it may lead to confusion in my opinion otherwise
[14:21:35] <isaranto>	 elukey: sure! initially I left it there so that it would be easy to enable it (keep it like a reference) but you're right I removed it all
[14:21:45] <elukey>	 super
[14:22:01] <elukey>	 going to roll out the kserve 0.9 control plane on ml-serve-eqiad
[14:28:55] <elukey>	 all right 0.9 rolled out to all clusters
[14:31:45] <wikibugs>	 (PS25) Ilias Sarantopoulos: Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[14:33:59] <isaranto>	 super!
[14:34:13] <isaranto>	 I am also ready to take python 3.9 + bullseye repo to staging
[14:34:17] <isaranto>	 the patch is ready for review
[14:34:51] <isaranto>	 I have tested locally with many models and all seem to work. Disclaimer: I haven't tested with all 100+ models though
[14:35:07] <elukey>	 going to review it now!
[14:39:10] <wikibugs>	 (CR) Ilias Sarantopoulos: Upgrade the revscoring model server to Python 3.9 (3 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[14:43:02] <wikibugs>	 (CR) Elukey: "Everything looks good, I only have a doubt about the pyenchant file added, that is shipped with https://github.com/pyenchant/pyenchant/blo" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[14:43:43] <wikibugs>	 (PS26) Ilias Sarantopoulos: Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[14:45:25] <isaranto>	 I just removed also the uninstall of argparse from the entrypoint.sh. There was some legacy version installed through python-dev package in debian as I understand. now it plays well
[14:46:09] <wikibugs>	 Machine-Learning-Team: Upgrade ml clusters to kserve 0.9 - https://phabricator.wikimedia.org/T325528 (elukey) The new control plane has been rolled out to all clusters, nothing horrible to report.  Next step: move all model server docker images to kserve 0.9
[14:47:59] <isaranto>	 elukey: regarding the LICENSE what do you suggest? is the header of the file not enough? 
[14:48:20] <elukey>	 isaranto: the header doesn't mention the GNU package license right?
[14:48:23] <isaranto>	 we could explicitly mention that it is not our code, where do you suggest we do that
[14:48:43] <elukey>	 yeah we could maybe add a special LICENSE file ourselves
[14:48:57] <elukey>	 not sure, lemme ask to some expert in SRE :)
[14:49:55] <isaranto>	 the header mentions ```This library is free software; you can redistribute it and/or
[14:49:55] <isaranto>	 # modify it under the terms of the GNU Lesser General Public
[14:49:55] <isaranto>	 # License as published by the Free Software Foundation;```
[14:50:46] <elukey>	 ahh okok didn't see it
[14:50:55] <elukey>	 I only saw the copyright 
[14:50:59] <elukey>	 okok then it should be good
[14:51:26] <wikibugs>	 (CR) Elukey: Upgrade the revscoring model server to Python 3.9 (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[14:52:38] <elukey>	 isaranto: the only nit remaining is a little bit more comment in https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/870517/26/revscoring/model.py
[14:52:42] <elukey>	 maybe a task reference etc..
[14:52:56] <elukey>	 my worry is that in 6 months time we don't remember why we did it etc..
[14:53:06] <isaranto>	 I'm writing a 2-3 line comment
[14:53:10] <elukey>	 ah super <3
[14:55:50] <wikibugs>	 (PS27) Ilias Sarantopoulos: Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[14:55:56] <isaranto>	 done!
[14:56:13] <wikibugs>	 (CR) Ilias Sarantopoulos: Upgrade the revscoring model server to Python 3.9 (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[14:59:17] <wikibugs>	 (CR) Elukey: [C: +1] "Great work!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[15:00:02] <isaranto>	 shall I merge this to try it out in staging?
[15:00:17] <elukey>	 let's do it
[15:03:30] <isaranto>	 🤞
[15:03:44] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +2] Upgrade the revscoring model server to Python 3.9 (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[15:04:49] <wikibugs>	 (Merged) jenkins-bot: Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[15:18:57] <isaranto>	 elukey: regarding changeprop and hnowlan's comment: how do we add the kserve endpoints to the networkpolicy ? to this file charts/changeprop/default-network-policy-conf.yaml?
[15:24:50] <elukey>	 isaranto: in theory it would be better in the helmfile.d's config, but I see that it doesn't seem to be used
[15:25:03] <elukey>	 I'll check what's best, but the -conf.yaml file is a possibility yes
[15:25:24] <elukey>	 the only downside is that if you add it at the chart level then it is used everywhere (prod/staging/etc..)
[15:26:08] <elukey>	 I am trying to DRY the definition of the workflows atm
[15:26:45] <isaranto>	 ack
[15:32:53] <isaranto>	 thank you for the review Luca! u rock! 🤘
[15:34:34] <isaranto>	 *reviews (plural) 😄
[15:35:06] <elukey>	 well all the work was yours :)
[15:40:34] <isaranto>	 it takes a village !
[16:25:46] <wikibugs>	 (PS1) Ilias Sarantopoulos: Deployment script examples: [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/881899
[16:29:03] <isaranto>	 I found some issues with 2 drafttopic models (ar,cs) and 1 editquality-damaging in staging (wikidata)
[16:29:03] <isaranto>	 will investigate now but probably more on monday
[16:29:17] <isaranto>	 shall I revert these for now or is it ok since they are in staging?
[16:31:54] <elukey>	 nono all good
[16:37:47] <elukey>	 ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 80 from PyObject
[16:38:15] <elukey>	 this is from the new model servers..
[16:38:34] <elukey>	 /opt/lib/python/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator GradientBoostingClassifier from version 0.22.1 when using version 1.0.2. This might lead to breaking code or invalid results. Use at your own risk.
[16:39:03] <elukey>	 IIRC during my tests there were some incompatibilities between sklearn/numpy
[16:39:16] <elukey>	 lovely
[16:39:29] <elukey>	 anyway, this is not for a Friday, we can restart working on it on Monday isaranto :)
[16:40:38] <isaranto>	 These warnings are unavoidable. retraining is the only solution. but we can test to see that the models are good
[16:41:57] <isaranto>	 in the end the errors are only on drafttopic models (ar, cs, en) and related to numpy. I will investigate on monday, but from what I've read a numpy upgrade will make these go away.. 
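The `numpy.ndarray size changed` error above typically points at a compiled extension built against a different numpy than the one imported at runtime, so a reasonable first step is to list what is actually installed in the image. A minimal sketch using only the standard library follows. The helper name `installed_versions` and the default package list are assumptions for illustration, not something taken from the inference-services repo.

```python
from importlib import metadata


def installed_versions(packages=("numpy", "scikit-learn", "revscoring")):
    """Map each distribution name to its installed version, or None if absent."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Running this inside the model server container and comparing the output with the versions named in the warnings (scikit-learn 0.22.1 at pickling time versus 1.0.2 at runtime) would show which side of the mismatch the image is on.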
[16:42:05] <isaranto>	 logging off folks, cu monday
[16:58:03] <elukey>	 o/
[16:58:10] <elukey>	 same thing, have a nice weekend folks!