[06:27:12] (CR) Hashar: [C: -1] Set up production and test images for the recommendation-api migration (5 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/932810 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[08:50:33] Machine-Learning-Team: Define SLI/SLO for Lift Wing - https://phabricator.wikimedia.org/T327620 (elukey) >>! In T327620#9015842, @klausman wrote: > https://grafana.wikimedia.org/goto/x7S0HpjVk?orgId=1 I've started an SLO dashboard here. It only has one metric (Latency) so far, but it's a start. Please also...
[08:55:10] (PS44) Kevin Bazira: Set up production and test images for the recommendation-api migration [research/recommendation-api] - https://gerrit.wikimedia.org/r/932810 (https://phabricator.wikimedia.org/T339890)
[09:05:50] (CR) Kevin Bazira: Set up production and test images for the recommendation-api migration (5 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/932810 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[09:15:12] Machine-Learning-Team: Define SLI/SLO for Lift Wing - https://phabricator.wikimedia.org/T327620 (klausman) I am working on Grafana/Thanos directly for now because it's a shorter change-try loop for finding the right metrics than doing it with Grizzly. Even with templating, we still need specific metrics...
[09:37:25] o/
[09:41:49] we'll need to either deploy simplewiki or point it to enwiki models (if that is what is being used). Until now I haven't found models for simplewiki anywhere; my assumption/conclusion is that it uses the en models (although I don't think that would make much sense)
[09:48:07] (CR) Hashar: [C: +1] "Excellent Kevin :]" [research/recommendation-api] - https://gerrit.wikimedia.org/r/932810 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[09:55:27] (CR) Kevin Bazira: [C: +2] "Great! Thanks a lot to everyone for the reviews :)" [research/recommendation-api] - https://gerrit.wikimedia.org/r/932810 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[09:58:14] isaranto: +1 to use the enwiki model again
[09:58:26] (Merged) jenkins-bot: Set up production and test images for the recommendation-api migration [research/recommendation-api] - https://gerrit.wikimedia.org/r/932810 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[09:58:42] but do we need it? I mean, is it already used in the mw extension?
[10:07:36] yes, it is enabled in simplewiki https://simple.wikipedia.org/wiki/Special:RecentChanges?hidebots=1&hidecategorization=1&hideWikibase=1&limit=50&days=7&urlversion=2
[10:09:40] Machine-Learning-Team, Patch-For-Review: [ores-legacy] Clienterror is returned in some responses - https://phabricator.wikimedia.org/T341479 (elukey) After some tries I figured out what the issue is:
`
- name: inference
  port: 6031
  service: inference
  timeout: "10s"
`
The ten seconds are clea...
[10:10:12] actually one can see it here https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/ext-ORES.php#L36
[10:12:35] Machine-Learning-Team, Patch-For-Review: [ores-legacy] Clienterror is returned in some responses - https://phabricator.wikimedia.org/T341479 (isarantopoulos) Ouch! Nice catch, I couldn't figure it out. It makes sense because big responses may take even 20s...
[10:12:45] isaranto: ok for the enwiki model then!
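To make the timeout fix concrete: a minimal sketch of the kind of mesh values change being discussed, assuming the fix simply raises the per-service proxy timeout (the actual change is the puppet patch linked just below; the value here is illustrative, not the deployed one):

```
# Hypothetical sketch of the mesh/tls-proxy values block quoted in T341479.
# The structure matches the snippet above; the new timeout value is an
# illustrative assumption, not the one in the real patch.
- name: inference
  port: 6031
  service: inference
  timeout: "60s"  # was "10s", which large batch requests exceeded
```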
[10:12:48] also, https://gerrit.wikimedia.org/r/c/operations/puppet/+/938815 [10:12:52] * elukey cries in a corner
[10:13:03] spent an hour trying to figure out where the 10s timeout was
[10:13:11] and I set it in puppet myself
[10:14:12] isaranto: the tls proxy is the source of those text/plain 50x, we need to take it into account as well
[10:14:24] thanks Luca! saved the day!
[10:15:14] let's see if it fixes things; it is sad that we have to set a higher timeout, but it is also true that we send a lot of concurrent requests at once in this use case
[10:18:38] there is also another piece of the puzzle to figure out
[10:18:50] knative has its own way of load balancing, through the activator pods
[10:19:00] https://knative.dev/docs/serving/autoscaling/concurrency/
[10:19:23] kserve offers a way to set it, via "container_concurrency"
[10:19:25] see https://github.com/kserve/kserve/blob/61c9bd334ae9766ffd1f2bf020764bf453cab54c/python/kserve/docs/V1beta1PredictorSpec.md?plain=1#L12C173-L12C229
[10:19:31] that of course we don't set :D
[10:19:50] and by default it is like 100, so super high
[10:20:48] so IIUC we risk overloading a pod since, by default, knative considers that it can handle 100 concurrent requests
[10:22:22] mm but from https://github.com/kserve/kserve/issues/338 it seems that kserve has other defaults
[10:23:14] https://github.com/kserve/kserve/blob/61c9bd334ae9766ffd1f2bf020764bf453cab54c/docs/samples/v1beta1/torchserve/autoscaling/README.md?plain=1#L22
[10:23:19] some explanations in here
[10:25:57] elukey: How would that interact with the replicas number?
[10:26:30] 1 concurrent req per replica seems to be the default, if I understand the issue/feature request correctly
[10:28:36] not sure, I don't see the autoscaling.knative.dev/target annotation in any pod, it may be old
[10:29:36] knative is responsible for increasing/decreasing the pods with its autoscaler component, which IIUC uses "concurrency" as the metric to judge how to scale up/down
[10:29:59] the activator sits between the istio gateway and the pod, we can disable it as well
[10:30:21] I see. I'm a bit confused about .../metric vs .../target
[10:30:54] Ah, target is the utilization goal
[10:32:24] and there is also https://knative.dev/docs/serving/load-balancing/target-burst-capacity/#setting-the-target-burst-capacity
[10:32:28] Would default annotation values be visible even if we don't set them?
[10:32:34] the activator acts as a load balancer and buffers requests, if needed
[10:33:04] klausman: if you don't set anything the default is applied, in theory
[10:33:42] I just wonder if that value (70) would be visible however you query the cluster/pods
[10:33:59] 70?
[10:34:06] The target value
[10:34:13] where did you see it?
[10:34:20] https://knative.dev/docs/serving/autoscaling/concurrency/#target-utilization
[10:35:01] dang, that's the per-container utilization target
[10:35:12] ctrl-f led me astray once again
[10:35:15] it is another thing, related though
[10:35:41] The target burst capacity of 200 seems high for ORES/revscoring services as well
[10:35:44] ah interesting, autoscaling.knative.dev/target is unlimited by default
[10:36:06] even worse :D
[10:36:38] So many performance-critical knobs to adjust
[10:37:45] I think that we have two in this case (rough sketch below):
[10:38:04] - autoscaling.knative.dev/target (soft limit), that tells the autoscaler when to raise the number of instances etc..
[10:38:19] - container_concurrency (hard limit), that indicates when to buffer requests
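Where those two knobs live on a plain Knative Service, per the autoscaling docs linked above (a minimal sketch; the service name and the numeric values are illustrative, not a recommendation):

```
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-model              # illustrative name
spec:
  template:
    metadata:
      annotations:
        # soft limit: per-replica concurrency the autoscaler aims for
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "5"
    spec:
      # hard limit: requests beyond this are buffered (queue-proxy/activator)
      containerConcurrency: 10
```

On Lift Wing these would be set through the KServe InferenceService spec rather than on a raw Knative Service; see the isvc sketch further down.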
[10:38:41] not super easy to find the best ranges for all our model servers
[10:38:55] but it is something that we'll need to add/test before reaching production
[10:39:06] especially with the still somewhat limited amount of traffic we've seen so far.
[10:40:39] we have some data from Aiko's load tests, but we can start conservative and allow more pods to scale (if we have capacity)
[10:40:42] then we tune
[10:41:07] for RR we could use something like 5/10 in theory
[10:44:33] or maybe 3/5 as a starter
[10:44:47] (well at least for ORES pods)
[10:48:47] * isaranto lunch
[11:09:30] * elukey lunch!
[11:31:51] * klausman lunch (and errand) as well
[12:46:08] (CR) Elukey: [C: +1] ores-legacy: add error response for v1 requests (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/937960 (https://phabricator.wikimedia.org/T341486) (owner: Ilias Sarantopoulos)
[12:51:57] klausman, isaranto - as FYI the tls-proxy settings for lift wing (https://gerrit.wikimedia.org/r/c/operations/puppet/+/938815) are set in puppet since they are used for the mw appservers as well
[12:52:23] ack
[12:52:23] the change propagates to Lift Wing's ores-legacy pods since some helmfile values are deployed on deploy1002
[12:52:32] and we pick them up with a helmfile sync
[12:52:34] (doing it now)
[12:53:34] (please review it when you have time so more people know etc..)
[12:53:53] for clarity, the tls-proxy mentioned above is the one created by the "mesh" module that serviceops offers
[13:05:09] Good morning all!
[13:08:03] morning!
[13:08:20] isaranto: posted https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/938820/3
[13:08:25] hopefully it will improve things a bit
[13:08:38] I still see the client error timeout for ores-legacy
[13:08:57] taking a look
[13:08:59] but it kinda makes sense, the request is really huge
[13:14:11] Machine-Learning-Team, Patch-For-Review: [ores-legacy] Clienterror is returned in some responses - https://phabricator.wikimedia.org/T341479 (elukey) The timeouts improved, but the original request (stated in the task's description) is still huge and leads to timeouts.
[13:36:34] I also filed another change to add more scaling options to the various isvcs
[13:36:42] in theory we should have enough capacity
[13:38:07] * elukey afk for a quick errand
[13:43:17] I added this https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/938856
[13:43:17] The idea is to not copy the same files into different directories in swift and also to explicitly define that the server is using another model (see the sketch below). Alternatively I was thinking that we could redirect requests for testwiki/simplewiki towards enwiki, but I'm not sure how we would deal with host headers in that case
[13:46:50] Machine-Learning-Team: Define SLI/SLO for Lift Wing - https://phabricator.wikimedia.org/T327620 (klausman) I made some progress on the experimental dashboard (https://grafana.wikimedia.org/goto/VSolQfj4k?orgId=1). I now have a better mental model/grasp of request count (and 200 vs non-200). The current setup...
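A heavily hedged sketch of the "explicitly define the model" approach from change 938856. The key names and the storage path below are assumptions for illustration only; the real chart schema and values are in the linked diff. The idea is that the simplewiki server points its storage initializer at the existing enwiki model artifact instead of a copied file:

```
# Hypothetical values sketch; key names and the exact swift/s3 path are
# assumptions, not the real deployment-charts schema.
inference_services:
  - name: simplewiki-damaging
    custom_env:
      - name: STORAGE_URI                       # KServe storage-initializer convention
        value: s3://wmf-ml-models/damaging/enwiki/  # reuse the enwiki model artifact
```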
[13:58:07] (CR) Ilias Sarantopoulos: [C: +2] ores-legacy: add error response for v1 requests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/937960 (https://phabricator.wikimedia.org/T341486) (owner: Ilias Sarantopoulos)
[14:01:03] (Merged) jenkins-bot: ores-legacy: add error response for v1 requests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/937960 (https://phabricator.wikimedia.org/T341486) (owner: Ilias Sarantopoulos)
[14:09:35] Machine-Learning-Team, ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T341648 (Jhancock.wm) Open→Resolved Replaced the iDRAC card and CMOS battery. Updated the iDRAC IP info. The BAT0002 alert has cleared and the server is reachable by ssh
[14:13:20] elukey: I can confirm the alert for 2003 is gone from AM
[14:15:34] klausman: ack, can you repool the node?
[14:19:39] will do
[14:21:44] elukey: does it have to go inactive->no->yes or would going direct work (I am going to go through "no" this time, to be sure, but I wondered)
[14:22:38] you can go to yes directly
[14:22:47] it is a flag in confd basically
[14:24:24] Alright, ack. I know it's a bit more complex when building a new pybal service, but I wasn't sure in this case.
[14:25:03] No alerts firing with pooled=no, so switched to =yes just now
[14:38:01] Machine-Learning-Team: [ores-legacy] Clienterror is returned in some responses - https://phabricator.wikimedia.org/T341479 (isarantopoulos) As I am checking now, the issue has been resolved. However some of the underlying requests are getting errors related to the `mwapi` as stated in ht...
[14:39:15] (PS2) Ilias Sarantopoulos: ores-legacy: fix error due to response content type [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/938266 (https://phabricator.wikimedia.org/T341479)
[14:43:09] (CR) Elukey: ores-legacy: fix error due to response content type (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/938266 (https://phabricator.wikimedia.org/T341479) (owner: Ilias Sarantopoulos)
[14:55:36] elukey: I was able to get a response for the big request for ores-legacy that we have in the task
[14:55:46] (actually tested a couple of times)
[15:03:31] (CR) Ilias Sarantopoulos: ores-legacy: fix error due to response content type (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/938266 (https://phabricator.wikimedia.org/T341479) (owner: Ilias Sarantopoulos)
[15:06:38] isaranto: I tried again as well, now it works, yes!
[15:07:15] (CR) Elukey: [C: +1] ores-legacy: fix error due to response content type (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/938266 (https://phabricator.wikimedia.org/T341479) (owner: Ilias Sarantopoulos)
[15:07:23] it seems that if we get errors from mwapi this will happen again
[15:07:40] I'll deploy the new changes and we can run some more tests (load tests)
[15:08:11] (CR) Ilias Sarantopoulos: [C: +2] ores-legacy: fix error due to response content type [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/938266 (https://phabricator.wikimedia.org/T341479) (owner: Ilias Sarantopoulos)
[15:09:03] (Merged) jenkins-bot: ores-legacy: fix error due to response content type [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/938266 (https://phabricator.wikimedia.org/T341479) (owner: Ilias Sarantopoulos)
[15:11:43] elukey: regarding autoscaling in knative: IIUC from the docs, setting `autoscaling.knative.dev/target` to 3 would mean that an average of 3 concurrent requests per replica would be the target at any given time. correct?
[15:12:20] isaranto: this is my understanding yes, it is a soft limit (mostly for the autoscaler)
[15:12:36] the concurrency setting is a hard limit, after that the activator knative pods start queueing
[15:15:35] yes, however the concurrency setting refers to concurrent requests https://kserve.github.io/website/0.10/sdk_docs/docs/V1beta1ComponentExtensionSpec/ where concurrent requests != rps (?)
[15:15:58] I am actually discussing/thinking out loud to understand better
[15:21:49] yes yes definitely
[15:22:11] I think concurrent requests reflects more how many clients we want at the same time for each model server
[15:22:42] in this case, no more than 5 for each pod, and after that queueing
[15:22:48] same thing for revscoring
[15:23:19] with rps it may be more intuitive; I think that these metrics will need to be included in our tests in staging before moving to prod
[15:23:48] thanks for the reviews, testing the values in staging :)
[15:24:32] isaranto: I am still not 100% sure if this will solve our problems, but the current settings tend to overload a pod in my opinion
[15:28:37] also target utilization seems nice https://knative.dev/docs/serving/autoscaling/concurrency/#target-utilization
[15:30:00] yep yep, a lot of options, we need to start experimenting with those
[15:31:47] starting with
[15:31:47] unknown field "container_concurrency" in io.kserve.serving.v1beta1.InferenceService.spec.predictor
[15:31:50] sigh
[15:33:08] I think it was containerConcurrency
[15:33:29] all I found was this https://kserve.github.io/website/0.10/sdk_docs/docs/V1beta1ComponentExtensionSpec/
[15:33:47] yes yes camelCase
[15:33:53] * elukey cries in a corner
[15:33:57] sorry folks, fixing
[15:35:06] IIUC if you set it through kserve it is with an underscore, but if you set it directly on knative it is camelCase
[15:35:29] or it is just an error in the docs
[15:38:32] I tested it now, we need to use the isvc spec, so camelCase
[15:40:49] ack
[15:41:15] elukey: if you find some time let me know if you agree/disagree with this https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/938856
[15:41:41] isaranto: 2 min and I'll check it!
[15:41:58] can even be tomorrow. I was just thinking if we should redirect requests or replicate model servers (with the downside of using more resources :( )
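To pin down the spelling question above: `container_concurrency` (snake_case) is the attribute name in the Python SDK docs linked, while the isvc YAML uses camelCase, which is the form that was found to work. A minimal hedged sketch of the isvc shape being discussed (service name, values, and scaling bounds are illustrative; model-specific predictor fields omitted):

```
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: enwiki-damaging                      # illustrative name
  annotations:
    autoscaling.knative.dev/target: "3"      # soft limit for the autoscaler
spec:
  predictor:
    containerConcurrency: 5                  # hard limit; must be camelCase in YAML
    minReplicas: 1
    maxReplicas: 4                           # illustrative scaling bounds
```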
[15:43:44] * isaranto feels helpless and sad and angry about forest fires (once again)
[15:43:52] 😄
[15:47:23] :(
[15:47:52] Yeah, we've had forest fire warnings for all of the last 8 weeks, but fortunately no major outbreaks
[15:52:03] things started getting ugly in Greece the last couple of days, and today close to Athens
[15:53:05] https://ores-legacy.wikimedia.org/scores/enwiki
[15:53:05] https://ores-legacy.wikimedia.org/v1/scores/enwiki
[15:53:41] the message works, however I just thought a link to some docs would be even better
[15:53:57] stay safe, Ilias, as much as it is feasible.
[15:54:47] yep --^
[15:54:52] isaranto: nice!
[15:54:56] it is a good start
[15:55:18] buuuut I get another error, from the other patch, which I should have caught
[15:55:43] `UnboundLocalError: local variable 'response_json' referenced before assignment`. On it!
[15:56:50] ah snap
[15:56:54] missed it in the code review
[16:02:42] but still I'm not getting why it fails, as I'm running it fine on statbox
[16:03:01] I mean why the underlying request fails.. 🤔
[16:05:37] (PS1) Ilias Sarantopoulos: ores-legacy: log error message instead of response_json [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/938885 (https://phabricator.wikimedia.org/T341479)
[16:06:48] (CR) Elukey: [C: +1] ores-legacy: log error message instead of response_json [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/938885 (https://phabricator.wikimedia.org/T341479) (owner: Ilias Sarantopoulos)
[16:10:44] (CR) Ilias Sarantopoulos: [C: +2] ores-legacy: log error message instead of response_json [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/938885 (https://phabricator.wikimedia.org/T341479) (owner: Ilias Sarantopoulos)
[16:11:23] merging and deploying; this will not resolve the underlying issue but just enable correct logging again
[16:11:35] (Merged) jenkins-bot: ores-legacy: log error message instead of response_json [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/938885 (https://phabricator.wikimedia.org/T341479) (owner: Ilias Sarantopoulos)
[16:14:43] isaranto: reviewed, looks good, I just asked for a .fixture test so we see a diff etc..
[16:18:53] going afk for today folks!
[16:19:00] have a nice rest of the day
[16:20:14] ack, done!
[16:20:21] ciao Luca, cu tomorrow!
[16:29:36] Machine-Learning-Team, API Platform, Anti-Harassment, Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (TBurmeister)
[16:31:58] ok I fixed ores-legacy, but there are a lot of transient errors related to the mediawiki api coming from lift wing
[16:38:54] Machine-Learning-Team: Define SLI/SLO for Lift Wing - https://phabricator.wikimedia.org/T327620 (klausman) I've now also managed to add some latency bucketing stuff. Not 100% sure yet if it is what we want, but in any case, it's progress.
[16:39:08] Now heading out as well \o
[16:42:58] Machine-Learning-Team: [ores-legacy] Clienterror is returned in some responses - https://phabricator.wikimedia.org/T341479 (isarantopoulos) The issue with the request mentioned in the task description is not always happening, as sometimes we get a response with no errors. However, now that we correctly read r...
[16:44:48] bye Tobias! heading out as well!
o/ [18:32:18] 10Machine-Learning-Team, 10ORES, 10MW-1.41-notes (1.41.0-wmf.11; 2023-05-30), 10Wikimedia-production-error: PHP Notice: Trying to access array offset on value of type null (in SpecialORESModels) - https://phabricator.wikimedia.org/T329304 (10Umherirrender) 05Open→03Resolved [19:23:46] 10Machine-Learning-Team, 10Goal: Goal: Lift Wing announced at MVP to the public - https://phabricator.wikimedia.org/T341703 (10calbon) [19:23:55] 10Machine-Learning-Team, 10Goal: Goal: Zero traffic on bare metal ORES servers - https://phabricator.wikimedia.org/T341696 (10calbon) [19:24:06] 10Machine-Learning-Team, 10Goal: Goal: Defined and measured SLO for every production service - https://phabricator.wikimedia.org/T341693 (10calbon) [19:24:14] 10Machine-Learning-Team, 10Goal: Goal: Support WME migration to Lift Wing - https://phabricator.wikimedia.org/T341698 (10calbon) [19:24:24] 10Machine-Learning-Team, 10Goal: Goal: Content Recommendation API migration completed - https://phabricator.wikimedia.org/T341704 (10calbon) [19:24:30] 10Machine-Learning-Team, 10Goal: Goal: Order 2-4 GPU for Lift Wing and Statbox - https://phabricator.wikimedia.org/T341699 (10calbon) [19:24:41] 10Machine-Learning-Team, 10Goal: Stretch Goal: Swagger UI implemented for every production inference service - https://phabricator.wikimedia.org/T341701 (10calbon) [19:24:51] 10Machine-Learning-Team, 10Goal: Stretch Goal: Inference batching is tested to our satisfaction - https://phabricator.wikimedia.org/T341702 (10calbon) [19:25:02] 10Machine-Learning-Team, 10Goal: Stretch Goal: Hosting a production ready version of an LLM - https://phabricator.wikimedia.org/T341695 (10calbon) [19:27:53] 10Machine-Learning-Team, 10Goal: Lift Wing announced at MVP to the public - https://phabricator.wikimedia.org/T341703 (10calbon) [19:27:59] 10Machine-Learning-Team, 10Goal: Zero traffic on bare metal ORES servers - https://phabricator.wikimedia.org/T341696 (10calbon) [19:28:04] 10Machine-Learning-Team, 10Goal: Defined and measured SLO for every production service - https://phabricator.wikimedia.org/T341693 (10calbon) [19:28:11] 10Machine-Learning-Team, 10Goal: Content Recommendation API migration completed - https://phabricator.wikimedia.org/T341704 (10calbon) [19:28:18] 10Machine-Learning-Team, 10Goal: Support WME migration to Lift Wing - https://phabricator.wikimedia.org/T341698 (10calbon) [19:28:22] 10Machine-Learning-Team, 10Goal: Order 2-4 GPU for Lift Wing and Statbox - https://phabricator.wikimedia.org/T341699 (10calbon) [19:28:29] 10Machine-Learning-Team, 10Goal: Stretch: Swagger UI implemented for every production inference service - https://phabricator.wikimedia.org/T341701 (10calbon) [19:28:35] 10Machine-Learning-Team, 10Goal: Stretch: Inference batching is tested to our satisfaction - https://phabricator.wikimedia.org/T341702 (10calbon) [19:28:50] 10Machine-Learning-Team, 10Goal: Stretch: Hosting a production ready version of an LLM - https://phabricator.wikimedia.org/T341695 (10calbon)