[07:41:43] Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742 (kevinbazira) Thank you for the suggestion @isarantopoulos, I tried `fastapi==0.109.0` and ran into the error below. It looks like kserve 0.11.2 doesn't suppor...
[08:05:29] Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742 (isarantopoulos) I see that kserve does not yet support pydantic v2 and there is [[ https://github.com/kserve/kserve/pull/3273 | work in that direction ]]. T...
[08:32:25] (CR) AikoChou: "I tried to run it locally but got the error [Errno 2] No such file or directory: 'inputs/revertrisk.input'. After manually adding it, the " [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394) (owner: Ilias Sarantopoulos)
[09:07:19] (PS15) Ilias Sarantopoulos: locust: first example [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394)
[09:15:09] aiko: 🤦 --^ something has gotten into me. my apologies.
[09:19:15] isaranto: no worries :D that happens!
[09:19:38] <3
[09:31:45] Morning!
[09:32:15] isaranto: when you have some time, can you confirm helmfile diff works again for you on staging? No rush.
[09:32:31] hey Tobias! I checked and it works
[09:32:54] thanks once again! so what was it?
[09:33:37] if the explanation is too long we can discuss when you have time :)
[09:38:04] If you look at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/991309, note the two bottom files (ml-serve.yaml and values.yaml)
[09:38:27] So before, values.yaml would inherit everything from ml-serve.yaml.
[09:39:34] But the way YAML works, with the change, the now-added deployExtraClusterRoles will "blot out", i.e. replace the inherited one, which means the `kserve` entry will be gone. Since the two clusters (well, three) should all be the same, I just deleted the addition to `values.yaml` and everything worked again (since it now also got `kserve`)
[09:40:05] I had mistakenly thought the `kserve` extra role stuff was not needed for staging, but obviously it is.
[09:51:44] Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742 (kevinbazira) I pinged Muniza about the possibility of loosening the knowledge-integrity constraint to allow for pydantic < 2.0.0 and here is her response: >>!...
[10:19:11] (CR) AikoChou: "I was running it in the wrong directory. Now it works and produces numbers!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394) (owner: Ilias Sarantopoulos)
[10:22:44] (CR) AikoChou: [C: +1] locust: first example [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394) (owner: Ilias Sarantopoulos)
[11:04:29] Machine-Learning-Team: Implement batch prediction for revertrisk-multilingual - https://phabricator.wikimedia.org/T355656 (achou)
[11:20:57] * klausman lunch
[11:24:43] (CR) Kevin Bazira: locust: first example (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394) (owner: Ilias Sarantopoulos)
[11:26:41] kevinbazira: I'm not sure if we want to have a separate file for each model server or one for all of them
[11:26:52] I'm talking about the results_stats.csv
[11:27:58] initially I was thinking to have them all together, but now I think that could cause issues if, let's say, you want to update only the results for 1 model server
[11:29:36] perhaps we could structure it so that we can:
[11:29:36] - run and compare results only for the model server(s) specified
[11:29:37] - run and compare results for all model servers
[11:29:39] wdyt?
[11:30:47] running and comparing results for only the model server(s) specified would be great
[11:36:10] yes, I mean being able to do both, since we are likely to use both scenarios (run 1 test or run all tests)
[11:40:13] sure, that would be great!
[11:56:41] * isaranto lunch!
[14:09:38] Morning all
[14:30:09] hey Chris!
[14:33:27] (PS16) Ilias Sarantopoulos: locust: first example [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394)
[14:34:11] (PS17) Ilias Sarantopoulos: locust: first example [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394)
[14:35:41] (PS18) Ilias Sarantopoulos: locust: first example [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394)
[14:39:14] (PS19) Ilias Sarantopoulos: locust: first example [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394)
[14:48:45] (CR) Ilias Sarantopoulos: "Done" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394) (owner: Ilias Sarantopoulos)
[14:50:46] I updated the locust patch! kevinbazira: my suggestion is to add the functionality we discussed earlier (to break down result files per model) in a follow-up patch. otherwise it'll get too big and we'll never close it
[14:51:01] however I did add the functionality to run a specific model server
[14:51:42] `MODEL=revertrisk locust` will run only revertrisk models and one can still run all model servers with the `locust` command
[14:59:16] Draining ml-serve2002 for network ... work
[15:04:20] ack!
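The helmfile values issue discussed above (09:38–09:40) comes down to replace-not-merge semantics: when a later values file re-declares a key, its whole value blots out the inherited one instead of being merged entry by entry. A minimal Python sketch of that behaviour, with illustrative file contents (the actual role lists in ml-serve.yaml differ):

```python
# Sketch (not the actual helmfile code) of top-level values layering:
# a key re-declared in the overriding file replaces the inherited value
# wholesale, so entries missing from the override silently disappear.

def layer_values(base, override):
    """Naive top-level merge: keys in `override` replace keys in `base`."""
    merged = dict(base)
    merged.update(override)
    return merged

# ml-serve.yaml (inherited defaults), shown as a Python dict for illustration:
ml_serve = {
    "deployExtraClusterRoles": ["kserve", "something-else"],
}

# values.yaml re-declaring the key "blots out" the inherited list:
values = {
    "deployExtraClusterRoles": ["something-else"],  # no "kserve"!
}

merged = layer_values(ml_serve, values)
print(merged["deployExtraClusterRoles"])  # -> ['something-else']
```

Deleting the `deployExtraClusterRoles` addition from `values.yaml` (as described in the log) restores inheritance, so the merged result keeps the `kserve` entry again.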
[15:04:34] isaranto: ok, let me check ...
[15:05:57] Machine-Learning-Team, Patch-For-Review: Investigate way of comparing load test results - https://phabricator.wikimedia.org/T355394 (isarantopoulos) The first example has the following functionality: - Run either a single load test or the whole test suite for all model servers - There is the abilit...
[15:20:17] (CR) Kevin Bazira: [C: +1] "Thank you for working on a first example!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394) (owner: Ilias Sarantopoulos)
[15:20:35] isaranto: LGTM! I've +1'd.
[15:39:36] thanks!
[15:45:46] Machine-Learning-Team: Drain & shutdown ml-serve2005.codfw.wmnet for physical move - https://phabricator.wikimedia.org/T355757 (klausman) Move complete, machine undrained.
[16:43:36] I'm still trying to figure out a way to better work and experiment with GPUs on ml-staging. I'll create a separate task and we can discuss it there, as my initial thought doesn't seem to work out of the box
[16:45:14] I was thinking to update the code on the pod, but then we need to rerun the model server. There is a debugging mode in FastAPI which does exactly that (the `reload` argument) and if I can get that to work it would be great
[16:45:33] I'm logging off for the evening, cu again tomorrow o/
[16:50:42] \o
[16:58:34] 2002 is uncordoned and serving again. Heading out as well
[16:59:26] Machine-Learning-Team: Drain and silence ml-serve2002.codfw.wmnet - https://phabricator.wikimedia.org/T355759 (klausman) Downtime done and machine is back in service.
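The load-test selection described in the log (`MODEL=revertrisk locust` runs only the revertrisk tests, plain `locust` runs the whole suite) could be implemented along these lines. This is a sketch, not the actual patch: the function and model-server names are assumptions, and the prefix match (so one `MODEL` value covers related models) is a guess based on "will run only revertrisk models":

```python
# Hypothetical helper a locustfile might use to decide which model servers
# to load-test, honouring a MODEL environment variable.
import os

ALL_MODEL_SERVERS = ["revertrisk", "revertrisk-multilingual", "articlequality"]

def selected_model_servers(env=None):
    """Return the model servers to test: those matching MODEL, or all of them."""
    env = os.environ if env is None else env
    chosen = env.get("MODEL")
    if chosen:
        # Prefix match, so MODEL=revertrisk also picks revertrisk-multilingual.
        return [m for m in ALL_MODEL_SERVERS if m.startswith(chosen)]
    return ALL_MODEL_SERVERS

print(selected_model_servers({"MODEL": "revertrisk"}))
# -> ['revertrisk', 'revertrisk-multilingual']
print(selected_model_servers({}))  # no MODEL set: run everything
```

Keeping this as a small pure function (rather than reading `os.environ` inline in the task classes) makes the single-model / all-models behaviour easy to unit-test without starting Locust.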