[06:18:22] Good morning o/
[07:51:49] (PS12) Ilias Sarantopoulos: locust: first example [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394)
[07:52:09] I opened it up for reviews --^
[07:54:44] keep in mind that the comparison part is an example and we'll need to fine-tune which metrics (median, avg) we want to compare and what thresholds we are going to choose
[07:55:12] We can have a discussion during the team meeting today
[08:07:27] Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742 (kevinbazira)
[08:21:49] isaranto: o/
[08:22:45] good job on the locust patch. I see we now use the locust.conf. :)
[09:01:02] Yep, saw that from you. Thanks!
[10:00:43] klausman_: FYI: https://phabricator.wikimedia.org/T355437
[10:01:14] Yes, am aware, will drain the machine and downtime it beforehand
[10:01:25] (similar for the short downtime tomorrow)
[10:30:30] ack
[10:35:06] Machine-Learning-Team: Drain & shudtown ml-serve2005.codfw.wmnet for physical move - https://phabricator.wikimedia.org/T355757 (klausman)
[10:38:13] Machine-Learning-Team: Drain & shutdown ml-serve2005.codfw.wmnet for physical move - https://phabricator.wikimedia.org/T355757 (klausman)
[10:40:31] Machine-Learning-Team: Drain and silence ml-serve2002.codfw.wmnet - https://phabricator.wikimedia.org/T355759 (klausman)
[11:48:03] * isaranto lunch!
[11:57:47] same
[12:49:42] (CR) Kevin Bazira: "Thank you for working on this Ilias." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394) (owner: Ilias Sarantopoulos)
[13:16:19] (PS13) Ilias Sarantopoulos: locust: first example [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394)
[13:18:59] (CR) Ilias Sarantopoulos: "Apologies but the models directory was never committed because of my local git ignore configuration. I added it again." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394) (owner: Ilias Sarantopoulos)
[13:19:16] (PS14) Ilias Sarantopoulos: locust: first example [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989732 (https://phabricator.wikimedia.org/T355394)
[13:19:50] kevinbazira: the models/ directory was not committed when i moved files around. now it should be ok.
[13:23:00] isaranto: okok, I see it has been added. I'll run the test locally in a bit. let me first finish the RRLA model-server.
[13:23:15] take your time!
[14:02:15] fyi: draining ml-serve2005 and turning it off (with downtime) for DCops to be able to move it
[14:03:25] Machine-Learning-Team: Drain and silence ml-serve2002.codfw.wmnet - https://phabricator.wikimedia.org/T355759 (klausman)
[14:03:43] Machine-Learning-Team: Drain & shutdown ml-serve2005.codfw.wmnet for physical move - https://phabricator.wikimedia.org/T355757 (klausman)
[14:06:58] Morning all
[14:07:02] heyo chris
[14:18:04] Hey Chris!
[14:32:53] klausman_: I am getting some errors when I try to run helmfile diff on experimental https://phabricator.wikimedia.org/P55528
[14:33:15] taking a look
[14:33:32] thanks!
[14:34:10] was this with just kube_env exp staging or with the export KUBE... as well?
[14:35:58] I saw it with both
[14:36:07] perhaps the message changes a bit lemme check again
[14:37:03] ok, it is exactly the same
[14:37:14] brb
[14:42:34] Mhhh. The prod clusters seem to still work, so it's constrained to staging.
[14:44:06] isaranto: can you try doing a diff on a prod cluster with your usual permissions? Just to verify
[14:53:02] Yes it works on prod
[14:54:36] Weird.
[14:54:43] I'll have to dig into this some more
[14:55:44] Did it work at some point after I added the exec etc privs?
[14:59:26] Huh, it works now. I think that may have been transient due to an SSL cert update
[15:01:28] it still doesn't work for me. This is the first time I tried after the changes in rbac
[15:02:03] Can I try as you using sudo? I won't pollute your bash_history
[15:50:11] isaranto: I think the experimental NS extra perms may be at fault, but I am not quite sure yet.
[15:52:17] ack
[15:53:08] just fyi I'm trying to do the following: make a change in code -> upload new code using cp in pod and run the model server again
[15:53:38] yeah, there is something fundamentally broken here, I've poked the k8s sig for help
[15:55:18] thanks for jumping in once again
[15:55:35] The odd thing is that e.g. experimental/ and revertrisk/ NSes are broken, but recommendation-api-ng is not.
[15:56:21] (and neither is ores-legacy)
[16:04:54] I just tried rolling back the exp NS change, but that does not fix matters
[16:35:31] Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742 (kevinbazira) a:kevinbazira I have been working on updating knowledge-integrity in the rrla model-server. Tried running it locally and I am currently getti...
[16:53:09] Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742 (isarantopoulos) I see that this is caused because of the fastapi version installed with kserve. You can try to install a newer version of fastapi e.g. `fastap...
[17:06:47] mlserve2005 is back
[17:07:11] thanks tobias
[17:15:12] going afk folks, have a nice evening/rest of day!
[17:19:36] \o
[17:40:01] night isaranto
[17:51:52] isaranto: the diff failure in staging is now fixed, thanks to Alex helping me find the problem. I can explain in detail tomorrow.
[17:51:56] heading out now \o
[18:08:56] night klausman