[07:41:43] Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742 (kevinbazira) Thank you for the suggestion @isarantopoulos, I tried `fastapi==0.109.0` and ran into the error below. It looks like kserve 0.11.2 doesn't suppor...
[08:05:29] Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742 (isarantopoulos) I see that kserve does not yet support pydantic v2 and there is [[ https://github.com/kserve/kserve/pull/3273 | work in that direction ]]. T...
[08:32:25] (CR) AikoChou: "I tried to run it locally but got the error [Errno 2] No such file or directory: 'inputs/revertrisk.input'. After manually adding it, the " [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394) (owner: Ilias Sarantopoulos)
[09:07:19] (PS15) Ilias Sarantopoulos: locust: first example [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394)
[09:15:09] aiko: 🤦 --^ something has gotten into me. my apologies.
[09:19:15] isaranto: no worries :D that happens!
[09:19:38] <3
[09:31:45] Morning!
[09:32:15] isaranto: when you have some time, can you confirm helmfile diff works again for you on staging? No rush.
[09:32:31] hey Tobias! I checked and it works
[09:32:54] thanks once again! so what was it?
[09:33:37] if the explanation is too long we can discuss when you have time :)
[09:38:04] If you look at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/991309, note the two bottom files (ml-serve.yaml and values.yaml)
[09:38:27] So before, values.yaml would inherit everything from ml-serve.yaml.
[09:39:34] But the way YAML works, with the change, the now-added deployExtraClusterRoles will "blot out", i.e. replace the inherited one, which means the `kserve` entry will be gone. Since the two clusters (well, three) should all be the same, I just deleted the addition to `values.yaml` and everything worked again (since it now also got `kserve`)
[09:40:05] I had mistakenly thought the `kserve` extra role stuff was not needed for staging, but obviously it is.
[09:51:44] Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742 (kevinbazira) I pinged Muniza about the possibility of loosening the knowledge-integrity constraint to allow for pydantic < 2.0.0 and here is her response: >>!...
[10:19:11] (CR) AikoChou: "I was running it in the wrong directory. Now it works and produces numbers!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394) (owner: Ilias Sarantopoulos)
[10:22:44] (CR) AikoChou: [C: +1] locust: first example [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394) (owner: Ilias Sarantopoulos)
[11:04:29] Machine-Learning-Team: Implement batch prediction for revertrisk-multilingual - https://phabricator.wikimedia.org/T355656 (achou)
[11:20:57] * klausman lunch
[11:24:43] (CR) Kevin Bazira: locust: first example (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394) (owner: Ilias Sarantopoulos)
[11:26:41] kevinbazira: I'm not sure if we want to have a separate file for each model server or one for all of them
[11:26:52] I'm talking about the results_stats.csv
[11:27:58] initially I was thinking to have them all together, but now I think that could cause issues if, let's say, you want to update only the results for 1 model server
[11:29:36] perhaps we could structure it so that we can:
[11:29:36] - run and compare results only for the model server(s) specified
[11:29:37] - run and compare results for all model servers
[11:29:39] wdyt?
[11:30:47] running and comparing results for only the model server(s) specified would be great
[11:36:10] yes, I mean being able to do both, since we are likely to use both scenarios (run 1 test or run all tests)
[11:40:13] sure, that would be great!
[11:56:41] * isaranto lunch!
[14:09:38] Morning all
[14:30:09] hey Chris!
[14:33:27] (PS16) Ilias Sarantopoulos: locust: first example [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394)
[14:34:11] (PS17) Ilias Sarantopoulos: locust: first example [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394)
[14:35:41] (PS18) Ilias Sarantopoulos: locust: first example [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394)
[14:39:14] (PS19) Ilias Sarantopoulos: locust: first example [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394)
[14:48:45] (CR) Ilias Sarantopoulos: "Done" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394) (owner: Ilias Sarantopoulos)
[14:50:46] I updated the locust patch! kevinbazira: my suggestion is to add the functionality we discussed earlier (to break down result files per model) in a follow-up patch. otherwise it'll get too big and we'll never close it
[14:51:01] however I did add the functionality to run a specific model server
[14:51:42] `MODEL=revertrisk locust` will run only revertrisk models and one can still run all model servers with the `locust` command
[14:59:16] Draining ml-serve2002 for network ... work
[15:04:20] ack!
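The helmfile values issue discussed above (09:38–09:40) comes down to replace-not-merge semantics: when a later values file re-declares a key, its whole value blots out the inherited one instead of being merged entry by entry. A minimal Python sketch of that behaviour, with illustrative file contents (the actual role lists in ml-serve.yaml differ):

```python
# Sketch (not the actual helmfile code) of top-level values layering:
# a key re-declared in the overriding file replaces the inherited value
# wholesale, so entries missing from the override silently disappear.

def layer_values(base, override):
    """Naive top-level merge: keys in `override` replace keys in `base`."""
    merged = dict(base)
    merged.update(override)
    return merged

# ml-serve.yaml (inherited defaults), shown as a Python dict for illustration:
ml_serve = {
    "deployExtraClusterRoles": ["kserve", "something-else"],
}

# values.yaml re-declaring the key "blots out" the inherited list:
values = {
    "deployExtraClusterRoles": ["something-else"],  # no "kserve"!
}

merged = layer_values(ml_serve, values)
print(merged["deployExtraClusterRoles"])  # -> ['something-else']
```

Deleting the `deployExtraClusterRoles` addition from `values.yaml` (as described in the log) restores inheritance, so the merged result keeps the `kserve` entry again.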
[15:04:34] isaranto: ok, let me check ...
[15:05:57] Machine-Learning-Team, Patch-For-Review: Investigate way of comparing load test results - https://phabricator.wikimedia.org/T355394 (isarantopoulos) The first example has the following functionality: - Run either a single load test or the whole test suite for all model servers - There is the abilit...
[15:20:17] (CR) Kevin Bazira: [C: +1] "Thank you for working on a first example!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394) (owner: Ilias Sarantopoulos)
[15:20:35] isaranto: LGTM! I've +1'd.
[15:39:36] thanks!
[15:45:46] Machine-Learning-Team: Drain & shutdown ml-serve2005.codfw.wmnet for physical move - https://phabricator.wikimedia.org/T355757 (klausman) Move complete, machine undrained.
[16:43:36] I'm still trying to figure out a way to better work and experiment with GPUs on ml-staging. I'll create a separate task and we can discuss it there, as my initial thought doesn't seem to work out of the box
[16:45:14] I was thinking to update the code on the pod, but then we need to rerun the model server. There is a debugging mode in FastAPI which does exactly that (the `reload` argument) and if I can get that to work it would be great
[16:45:33] I'm logging off for the evening, cu again tomorrow o/
[16:50:42] \o
[16:58:34] 2002 is uncordoned and serving again. Heading out as well
[16:59:26] Machine-Learning-Team: Drain and silence ml-serve2002.codfw.wmnet - https://phabricator.wikimedia.org/T355759 (klausman) Downtime done and machine is back in service.
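The load-test selection described in the log (`MODEL=revertrisk locust` runs only the revertrisk tests, plain `locust` runs the whole suite) could be implemented along these lines. This is a sketch, not the actual patch: the function and model-server names are assumptions, and the prefix match (so one `MODEL` value covers related models) is a guess based on "will run only revertrisk models":

```python
# Hypothetical helper a locustfile might use to decide which model servers
# to load-test, honouring a MODEL environment variable.
import os

ALL_MODEL_SERVERS = ["revertrisk", "revertrisk-multilingual", "articlequality"]

def selected_model_servers(env=None):
    """Return the model servers to test: those matching MODEL, or all of them."""
    env = os.environ if env is None else env
    chosen = env.get("MODEL")
    if chosen:
        # Prefix match, so MODEL=revertrisk also picks revertrisk-multilingual.
        return [m for m in ALL_MODEL_SERVERS if m.startswith(chosen)]
    return ALL_MODEL_SERVERS

print(selected_model_servers({"MODEL": "revertrisk"}))
# -> ['revertrisk', 'revertrisk-multilingual']
print(selected_model_servers({}))  # no MODEL set: run everything
```

Keeping this as a small pure function (rather than reading `os.environ` inline in the task classes) makes the single-model / all-models behaviour easy to unit-test without starting Locust.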