[06:41:27] Moooorning :)
[07:28:32] So I bumped kserve for readability and I'm going to load test it on ml-staging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/984666
[07:46:46] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Investigate increased p99 latencies in ml-serve-eqiad - https://phabricator.wikimedia.org/T352958 (isarantopoulos) On my side I tried to change mwapi code to follow the redirects but the main issue is that it seems too specific for our use case in or...
[08:04:07] * isaranto afk be back in an hour!
[09:10:49] elukey: o/ thanks for the review :)
[09:10:49] If you're still around, I deployed the article-descriptions model-server and the `helmfile sync` completed successfully, but it looks like the pod didn't update; it still shows age 2d17h as shown here: https://phabricator.wikimedia.org/P54504
[09:10:55] gooood morning!
[09:11:08] morning Aiko o/
[09:11:52] hi Kevin :D
[09:26:42] Updated https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Streams, lemme know if anything is unclear :)
[09:29:02] o/
[09:29:18] I'll read it a bit later Aiko, thanks for writing it!
[09:34:05] isaranto: <3
[09:35:10] Machine-Learning-Team: Apply common settings to publish events from Lift Wing staging to EventGate - https://phabricator.wikimedia.org/T349919 (achou) Open→Resolved Updated https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Streams. Let me know if anything is unclear. Going to resolve this...
[09:43:36] Machine-Learning-Team, Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing. - https://phabricator.wikimedia.org/T343123 (isarantopoulos) @Seddon What is the anticipated load/traffic for this model? It is important for us t...
[09:47:34] (PS5) Ilias Sarantopoulos: revertrisk: add config for mw host rewrites [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/981541 (https://phabricator.wikimedia.org/T352958)
[09:48:45] (CR) Ilias Sarantopoulos: revertrisk: add config for mw host rewrites (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/981541 (https://phabricator.wikimedia.org/T352958) (owner: Ilias Sarantopoulos)
[09:50:09] I tested the above again and it is ready for merging. aiko: lemme know if your +1 still holds. I rebased it to pick up the recent revertrisk classes breakdown (base model, batcher, etc.)
[09:51:37] I'd suggest we do the deployments we want for production services today and not touch anything tomorrow
[09:53:03] kevinbazira: let's try to debug the articledesc deployment together
[09:53:32] isaranto: sure sure ...
[09:54:11] lemme grab a cup of coffee and I'll be back in 3-4'
[09:54:27] no problem. please take your time :)
[10:00:19] (CR) AikoChou: revertrisk: add config for mw host rewrites (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/981541 (https://phabricator.wikimedia.org/T352958) (owner: Ilias Sarantopoulos)
[10:06:19] soo
[10:07:02] when we deploy an inference service through `helmfile sync`, an isvc is deployed, which is a kubernetes CRD (custom resource definition)
[10:07:38] which in turn holds a bunch of other resources like revision, deployment, etc.
[10:08:35] when this isvc already exists, what happens is that a new revision is added with an incremental id.
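(Editor's aside: the incremental revision id mentioned above is also embedded in the generated pod names, as the log shows next with revision 21. A minimal sketch of pulling it out, assuming the `<isvc>-predictor-default-<revision>-deployment-<suffix>` naming pattern seen in this log; the helper function is ours, not a real tool.)

```python
import re

def revision_from_pod_name(pod_name: str) -> int:
    """Extract the Knative revision number embedded in a pod name.

    Assumes the naming pattern seen in this log, e.g.
    article-descriptions-predictor-default-00021-deployment-65vngjb
    where 00021 is the zero-padded revision id.
    """
    match = re.search(r"-(\d{5})-deployment-", pod_name)
    if match is None:
        raise ValueError(f"no revision id found in {pod_name!r}")
    return int(match.group(1))

# The pod discussed below maps to revision 21:
print(revision_from_pod_name(
    "article-descriptions-predictor-default-00021-deployment-65vngjb"))  # -> 21
```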
we can tell the revision of a pod by looking at the description of the pod, but also from its name
[10:09:09] So pod `article-descriptions-predictor-default-00021-deployment-65vngjb` reflects revision number 21
[10:11:15] kevinbazira: so in our case we see that revision 21 was successfully deployed 2 days ago
[10:11:52] the next step would be to check whether the newer revision was synced successfully. running `kubectl get revision` would tell us
[10:17:13] in this case we see another revision (22) which is in `Unknown` state with the reason `Deploying`. Go ahead and run a `kubectl describe REVISION_NAME` to see if we can find out more
[10:17:30] let me know if you have any questions regarding the above
[10:20:12] yep, I checked revisions for the article-descriptions model-server deployment and generation 22 shows Unknown ready state: https://phabricator.wikimedia.org/P54505
[10:22:35] did you run `kubectl describe revision REVISION_NAME`? (I was missing `revision` in the command above)
[10:24:49] yep, running it now ...
[10:28:12] (PS6) Ilias Sarantopoulos: revertrisk: add config for mw host rewrites [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/981541 (https://phabricator.wikimedia.org/T352958)
[10:29:45] (CR) Ilias Sarantopoulos: revertrisk: add config for mw host rewrites (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/981541 (https://phabricator.wikimedia.org/T352958) (owner: Ilias Sarantopoulos)
[10:30:45] kevinbazira: o/ sorry I didn't see the ping
[10:31:08] in addition to what Ilias suggested, you can also check `kubectl get events -n experimental`
[10:32:21] isaranto: I ran `kubectl describe revision article-descriptions-predictor-default-00022` and the results show:
```
Status:
Actual Replicas: 0
Conditions:
Message: Requests to the target are being buffered as resources are provisioned.
Reason: Queued
Status: Unknown
```
[10:32:24] does this mean the deployment is still in the process of being rolled out?
[10:34:36] elukey: great, I ran `kubectl get events -n experimental` and the logs show:
```
Error creating: pods "article-descriptions-predictor-default-00022-deployment-69hj4xx" is forbidden: [maximum cpu usage per Container is 8, but limit is 16, maximum cpu usage per Pod is 16, but limit is 19]
```
[10:34:37] this is probably the reason the deployment didn't complete.
[10:36:47] kevinbazira: yes, so the picture is as follows: the new knative revision is not coming up since there are limits in place for the namespace
[10:36:53] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/984807/ should solve it
[10:36:57] they are called limit ranges
[10:37:19] atm we allow a max of 8 cpus for each container (and we are requesting 16 for the kserve one now)
[10:37:39] and a total per pod of 16 (and we are asking for more, a total of 19 in this case)
[10:37:42] does it make sense?
[10:38:41] only SREs can deploy admin_ng though, so I'll quickly merge and deploy
[10:38:58] you can also verify this information through `kubectl describe limitranges`
[10:40:11] elukey: yes it makes sense. thank you for bumping up the limitranges.
[10:43:31] the k8s controllers are set to retry/backoff so the new pod will take a bit before being created
[10:43:38] but the new limit ranges are in place
[10:43:53] going afk, ttl!
[10:46:40] thanks Luca!
[11:38:56] I saw that the pod with the new revision for article-desc started properly now
[11:53:01] (CR) AikoChou: [C: +1] "LGTM!"
[machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/981541 (https://phabricator.wikimedia.org/T352958) (owner: Ilias Sarantopoulos)
[11:53:25] (CR) Ilias Sarantopoulos: [C: +2] revertrisk: add config for mw host rewrites [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/981541 (https://phabricator.wikimedia.org/T352958) (owner: Ilias Sarantopoulos)
[11:54:08] lol I didn't realize you had just approved. I just randomly opened the tab with the patch. great timing!
[12:00:18] hahaha team work!! :D
[12:00:19] (Merged) jenkins-bot: revertrisk: add config for mw host rewrites [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/981541 (https://phabricator.wikimedia.org/T352958) (owner: Ilias Sarantopoulos)
[12:03:13] * isaranto lunch o'clock!
[12:21:11] * aiko lunch 2
[12:25:06] Machine-Learning-Team: Optimize response performance for the article-descriptions model-server - https://phabricator.wikimedia.org/T353127 (kevinbazira) After trying different optimization options (T353127#9416933) to meet the Android team's requirement of 500ms-3s response time, we were able to improve the...
[12:33:19] isaranto: the original article-descriptions request that had ~14s response time has been brought down to ~3s, right within the Android team's range: https://phabricator.wikimedia.org/T353127#9416933
[12:34:37] nice work! we need to see how much load we are expecting. I asked a question https://phabricator.wikimedia.org/T343123#9420718
[12:35:33] great, thank you for following up on the expected load. I am going to prepare to run load tests.
[12:36:16] the 16-CPU test results from today's run are here: https://phabricator.wikimedia.org/T353127#9421055
[13:30:41] I'm updating revertrisk to test on staging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/984840
[13:30:59] if things are fine and redirects are resolved I'll proceed with prod
[13:49:34] aa!
a request for readability takes ~30-40s in staging, so my best guess is that we see throttling because of a huge number of threads
[13:53:01] revertrisk works great on the other hand! I'll deploy to prod and also add some httpbb tests for these edge cases
[14:25:50] Morning all
[14:30:37] Morning Chris!
[14:33:57] o/ hi Chris!
[14:39:40] isaranto: so adding thread_count in catboost models for readability doesn't work?
[14:43:40] aiko: no. I don't have permissions to view the thread count of the python process on the pod, but my guess is that the python process uses too many threads on initialization, so setting them just on predict doesn't help
[14:44:25] I'm not sure though, we need more debugging and perhaps some more logging
[15:59:50] deploying revertrisk to prod!
[16:01:21] o/
[16:01:40] nice!
[16:12:38] I'll be monitoring and if things go well I'll resolve the task tomorrow
[16:12:58] I'm talking about the preprocess p99 graph here -> https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-component=All&var-namespace=revertrisk&var-model_name=revertrisk-language-agnostic&from=now-3h&to=now&refresh=1m
[16:55:58] ack, let's see if the issue goes away
[16:57:57] Going afk folks. Cu tomorrow!
[19:19:14] isaranto: (for tomorrow) - I checked with perf and indeed I see most of the CPU used for a readability request consumed by libgomp, so probably too many OpenMP threads. We removed OMP_NUM_THREADS=1, but I am not sure if OpenMP is used by catboost or by torch (it is listed as a dependency for readability, but I don't see any specific import).
[19:27:29] 👍 going to add it again. I removed it to see if it would make any difference but... nope!
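(Editor's note on the libgomp finding above: `OMP_NUM_THREADS` is read when the OpenMP runtime initializes, i.e. on the first import of whichever library links against libgomp, which would also explain why setting `thread_count` only at predict time didn't help. A minimal sketch, assuming a Python model-server entry point; the commented imports are illustrative, not taken from the actual readability service.)

```python
import os

# libgomp picks up OMP_NUM_THREADS when the OpenMP runtime starts, which
# happens on the first import of a library linked against it. So the cap has
# to be set at the very top of the entry point, before any such import --
# setting thread counts later (e.g. only inside predict()) has no effect on
# the thread pool that was already spawned.
os.environ["OMP_NUM_THREADS"] = "1"

# Hypothetical: import the OpenMP-backed libraries only after the env var is set.
# import torch
# import catboost
```

If torch turns out to be the consumer, `torch.set_num_threads(1)` is another knob, but the env var covers OpenMP regardless of which library pulls it in.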