[07:13:55] morning! (from berlin)
[07:23:53] aiko: guten morgen! :)
[07:24:01] how's berlin???
[07:30:48] elukey: guten morgen :D
[07:31:01] the first impression was very nice!!!
[07:35:00] happy for you! Enjoy the city!
[07:52:38] Machine-Learning-Team, Patch-For-Review: Increase Lift Wing rate limit for ImpactVisualizer OAuth2 client - https://phabricator.wikimedia.org/T345394 (elukey) @Ragesoss thanks for the explanation! I filed a patch to create a new tier called `wikieducation` set for 150k requests/hour (max 40 rps), that sho...
[07:54:05] aiko: re load test - I thought I'd do a quick load test on outlink, agnostic and multi-lingual to figure out their best autoscaling thresholds (8 may be too conservative)
[07:54:24] I'd also like to run one for drafttopic, see:
[07:54:25] https://grafana.wikimedia.org/d/c6GYmqdnz/knative-serving?forceLogin&from=now-12h&orgId=1&to=now&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-knative_namespace=knative-serving&var-revisions_namespace=All&viewPanel=27
[07:54:39] (it keeps requesting pods and terminating them)
[07:54:54] if you have time to run some, please do!
[07:55:30] Let's just remember to run them where we don't have autoscaling (like staging), otherwise we'll get false results (namely, if the pods scale up, the numbers at the end of the load test will not be valid)
[07:55:37] does that make sense?
[07:57:47] --
[07:58:06] I am going to restart the ores pool counter VMs (kernel upgrades), this may give some headache to ORES
[07:58:10] hopefully a brief one
[08:09:16] elukey: makes sense! I will start with outlink
[08:13:47] super
[08:13:56] in theory we'd need an SLO for it as well
[08:18:08] aiko: we can use https://phabricator.wikimedia.org/T327620 to track the load tests
[08:19:26] ack!!
[08:37:37] aiko: I don't recall, do we have your load test script for wrk written down somewhere?
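A quick sanity check of the `wikieducation` tier discussed above: the tier pairs an hourly quota (150k requests/hour) with a per-second cap (40 rps). The sketch below is an illustrative calculation only, not the actual Lift Wing rate-limiter code:

```python
# Sanity-check the proposed `wikieducation` rate-limit tier: a 150k
# requests/hour quota with a 40 rps cap. Illustrative math only.

HOURLY_QUOTA = 150_000   # requests per hour for the tier
CAP_RPS = 40             # maximum sustained requests per second

# Average rate implied by the hourly quota (~41.7 rps).
avg_rps = HOURLY_QUOTA / 3600

# A client pinned at the 40 rps cap for a full hour stays under the quota,
# so the per-second cap is the slightly stricter of the two limits.
requests_at_cap = CAP_RPS * 3600  # 144,000 < 150,000

print(f"avg rps implied by quota: {avg_rps:.1f}")
print(f"requests/hour at the 40 rps cap: {requests_at_cap}")
```

The cap sits just below the quota's hourly average, so neither limit makes the other meaningless.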
[08:52:07] isaranto: good morning, let me know when we need to deploy the next batches :P
[08:52:23] aiko: that niceness will wear off really quickly :P
[08:54:21] elukey: yes, there is a test script in the inference-services repo > test > wrk
[08:57:27] Amir1: lol i hope it never goes away!!
[08:57:51] aiko: ack thanks!
[09:30:35] (PS1) Elukey: test: improve {revscoring,revertrisk}.lua to avoid errors on empty lines [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/954604
[09:30:58] aiko: --^
[09:32:49] aiko: also let's use https://phabricator.wikimedia.org/T344058 for the load test, sorry, it seems more appropriate
[09:32:52] (than the SLO one)
[09:35:56] elukey: ok!
[09:37:06] (CR) AikoChou: [C: +1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/954604 (owner: Elukey)
[09:38:07] thanks for the review!
[09:38:27] (CR) Elukey: [V: +2 C: +2] test: improve {revscoring,revertrisk}.lua to avoid errors on empty lines [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/954604 (owner: Elukey)
[09:51:17] drafttopic and quality are stable around 8 rps, the settings should be good
[09:51:38] I'll bump the minimum replicas of drafttopic to 2 so we don't see the pods constantly going up and down
[09:51:48] goodfaith seems to handle more though
[09:55:06] articlequality is not great
[09:55:12] we need to scale it up earlier
[10:07:05] filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/954608
[10:13:58] Amir1: FYI, Ilias is out today
[10:14:16] thanks. I'll wait for now
[10:36:07] (PS1) AikoChou: test: add load test script and input for outlink [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/954613
[10:38:03] (CR) Elukey: [C: +1] "Nice!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/954613 (owner: AikoChou)
[10:41:45] * elukey lunch!
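The `.lua` patch above guards the wrk input loaders against empty lines (e.g. a trailing newline in the input file yields a blank entry that gets sent as a malformed request). A Python analogue of that defensive read loop, for illustration only — the real scripts are Lua wrk scripts under `test/wrk` in inference-services, and the function name here is hypothetical:

```python
# Illustrative analogue of the "skip empty lines" fix in the wrk input
# loaders: read one JSON request body per line, ignoring blank lines so a
# trailing newline doesn't produce a malformed empty request.
import tempfile

def load_request_bodies(path: str) -> list[str]:
    """Read one request body per line, skipping empty lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Demo: a file with a blank line in the middle and a trailing newline.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write('{"lang": "en"}\n\n{"lang": "de"}\n')
    demo_path = f.name

bodies = load_request_bodies(demo_path)
print(bodies)  # only the two non-empty lines survive
```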
[10:47:51] (CR) AikoChou: "I found the load tests all returned 400 - "Unrecognized request format: unexpected character: line 1 column 30 (char 29)" It must be someth" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/954613 (owner: AikoChou)
[11:41:42] * aiko lunch
[12:47:09] (PS2) AikoChou: test: add load test script and input for outlink [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/954613
[12:52:49] (PS3) AikoChou: test: add load test script and input for outlink [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/954613
[13:08:27] (CR) AikoChou: [C: +2] test: add load test script and input for outlink (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/954613 (owner: AikoChou)
[13:40:01] deploying the new autoscaling thresholds to the revscoring isvcs
[13:59:14] aaand done
[14:12:56] elukey: o/ looks like outlink has autoscaling in staging
[14:12:59] https://grafana.wikimedia.org/d/c6GYmqdnz/knative-serving?forceLogin&from=now-12h&orgId=1&to=now&var-cluster=codfw%20prometheus%2Fk8s-mlstaging&var-knative_namespace=knative-serving&var-revisions_namespace=All&viewPanel=27
[14:13:58] it scales up when I load test
[14:15:13] aiko: weird, it has maxReplicas set to 1
[14:15:41] ahhh it is the transformer!
[14:16:08] sending a patch
[14:16:11] elukey: but I saw both the transformer and the predictor scale up
[14:16:34] elukey: to 5
[14:16:53] very weird
[14:17:10] mmmm
[14:19:44] ah I see why
[14:19:47] in values-ml-staging-codfw.yaml, we have inference_services:
[14:19:47] outlink-topic-model: {}
[14:20:10] does it use the settings in values.yaml?
[14:20:50] it does yes
[14:21:11] let me try one thing
[14:22:21] ok!
[14:22:53] aiko: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/954695/
[14:23:12] the CI's diff looks good
[14:24:35] makes sense
[14:24:39] aiko: merged! Do you mind deploying/re-testing?
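The staging surprise above comes from Helm-style value layering: an environment file entry like `outlink-topic-model: {}` overrides nothing, so every chart default from `values.yaml` still applies, including the autoscaling replica limits. A minimal sketch of that merge semantics (illustrative only — not Helm's actual implementation, and the default values shown are hypothetical):

```python
# Sketch of Helm-style value layering: keys in the environment-specific
# file override the chart defaults; an empty mapping ({}) overrides
# nothing, so all defaults apply. Values here are hypothetical.

def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay `overrides` on top of `defaults`."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

chart_defaults = {"predictor": {"min_replicas": 1, "max_replicas": 5}}

# values-ml-staging-codfw.yaml equivalent of `outlink-topic-model: {}`
staging_override = {}

effective = deep_merge(chart_defaults, staging_override)
print(effective)  # the empty override keeps max_replicas at the default
```

With an empty override, a default `max_replicas` of 5 would survive into staging, which matches the pods scaling up to 5 under load.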
[14:25:09] elukey: np!
[14:25:15] thanks :)
[14:26:19] thank you!
[14:53:06] elukey: load test result: https://phabricator.wikimedia.org/P52246 it looks very good, it can sustain up to 81.37 rps
[14:54:40] not sure what target rps we want for outlink?
[14:55:52] aiko: really nice result, maybe we could use a target like 20 rps? It is very conservative but we can refine it later on
[15:00:36] sounds good!
[15:02:19] * elukey errand, bbiab!
[15:33:56] back!
[15:34:23] I will file a patch for outlink and move on to load testing revertrisk
[15:55:13] aiko: +1ed, thanks!
[15:58:12] Machine-Learning-Team, Patch-For-Review: Define SLI/SLO for Lift Wing - https://phabricator.wikimedia.org/T327620 (elukey) Updated the code review and the SLO wikitech page for Lift Wing. The new proposal is the following: * We use the 95% SLO for experimental isvcs, like outlink or revert risk multi-li...
[16:02:50] Machine-Learning-Team, Patch-For-Review: Tune LiftWing autoscaling settings for Knative - https://phabricator.wikimedia.org/T344058 (elukey) Aiko and I are doing basic load tests to update the Knative autoscaling settings. I applied new thresholds for the revscoring ones, and Aiko is working on outlink...
[16:13:40] going afk folks, have a nice rest of the day!
[16:17:47] outlink autoscaling change deployed
[16:18:08] elukey: bye Luca! have a nice evening :)
[16:19:58] nice work aiko!
[16:20:23] :D
[16:23:55] logging off as well! will continue tomorrow
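For the 95% availability SLO proposed above for experimental isvcs like outlink, the error budget over a reporting window can be computed directly. A back-of-the-envelope sketch — the 30-day window and the steady 20 rps traffic assumption are hypothetical, chosen only to illustrate the arithmetic:

```python
# Back-of-the-envelope error budget for a 95% availability SLO.
# The window length and traffic level are illustrative assumptions.

SLO = 0.95
WINDOW_DAYS = 30
REQUESTS_PER_SECOND = 20  # the conservative autoscaling target discussed

total_requests = REQUESTS_PER_SECOND * 86_400 * WINDOW_DAYS
error_budget = (1 - SLO) * total_requests  # 5% of requests may fail

print(f"total requests in window: {total_requests:,}")
print(f"allowed failed requests: {error_budget:,.0f}")
```

At that (assumed) traffic level, a 95% SLO leaves a large budget, which is consistent with using the looser target for experimental services.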