[06:18:39] Good morning!
[06:23:38] (03PS1) 10Ilias Sarantopoulos: llm: add sentencepiece package [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982956
[06:44:45] (03CR) 10Ilias Sarantopoulos: [C: 03+2] llm: add sentencepiece package [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982956 (owner: 10Ilias Sarantopoulos)
[06:45:32] (03Merged) 10jenkins-bot: llm: add sentencepiece package [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982956 (owner: 10Ilias Sarantopoulos)
[07:14:42] (03PS7) 10Ilias Sarantopoulos: article-descriptions: explictly set torch threads [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982855 (https://phabricator.wikimedia.org/T352750) (owner: 10Elukey)
[07:43:34] 10Machine-Learning-Team: Deploy ctranslate2 version of nllb-200 - https://phabricator.wikimedia.org/T351740 (10isarantopoulos) The model is deployed on eqiad and a request can be made like this ` time curl -s https://inference.svc.eqiad.wmnet:30443/v1/models/nllb-200:predict -X POST -d '{"prompt": "Jazz is a mus...
[07:44:52] I deployed the cpu version of nllb. A request took 9-30s (30s for the whole paragraph).
[07:44:57] * isaranto sighs
[07:45:17] well, increasing CPUs would work
[07:46:49] I'm running some tests with docker locally and will submit a patch afterwards
[08:01:55] Will be afk for the next 30-40 minutes!
[09:11:30] back!
[09:14:02] good morning!
[09:20:55] o/ Aiko!
[09:32:48] hello folks
[09:37:29] so about https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/982855 - it feels like a way less hacky solution than OMP_NUM_THREADS
[09:37:39] not entirely sure if it is the same thing
[09:37:47] it should be, but from the docs it is not 100% clear
[09:43:54] kevinbazira: o/
[09:44:15] I'll try to manually revert the docker image for article-descr in staging so I can see how many threads it runs
[09:44:28] to compare with what we see in the ml-sandbox
[09:45:08] my suspicion is that 20/30 threads are created anyway
[09:46:11] elukey: o/
[09:46:11] sure sure, I was able to see the log entry:
[09:46:11] https://usercontent.irccloud-cdn.com/file/9w0rgNag/OMP_NUM_THREADS%20log.jpg
[09:46:42] (03CR) 10Ilias Sarantopoulos: [C: 03+1] article-descriptions: explictly set torch threads [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982855 (https://phabricator.wikimedia.org/T352750) (owner: 10Elukey)
[09:46:59] ah yes yes thanks!
[09:47:11] seems indeed a better approach, worth trying
[09:47:21] ack, kevinbazira ok if I proceed with the merge?
[09:47:21] the only downside is that it applies only to pytorch
[09:47:46] isaranto: I think that we could use the same approach for catboost etc.., if they don't release
[09:47:52] elukey: yes please go ahead!
[09:48:01] most of those frameworks have a set_threads function
[09:48:11] in the init we set it and we are sure everything works
[09:48:21] IIUC pytorch doesn't use only openmp
[09:48:43] there are other variables, for openblas etc.. and it is not clear when/how something is used :D
[09:48:55] I hope that set_threads() is a catch all for everything
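A minimal sketch of the "set the threads explicitly in the model server init" idea discussed above, assuming a kserve-style Python model class; the class name and the CPU helper are illustrative, not the actual content of change 982855:

```python
import os

import torch


def available_cpus() -> int:
    # CPUs this process may actually run on (Linux); fall back to the host
    # count elsewhere. Note: a cgroups CPU *quota* is not reflected here, so
    # inside a container this can still over-count.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        return os.cpu_count() or 1


class ArticleDescriptionsModel:  # hypothetical class name
    def __init__(self) -> None:
        # Cap torch's intra-op thread pool instead of relying on
        # OMP_NUM_THREADS; other libraries (OpenBLAS, tokenizers, ...) may
        # still need their own settings, as noted in the discussion below.
        torch.set_num_threads(available_cpus())
```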
[09:49:08] (03CR) 10Elukey: [C: 03+2] article-descriptions: explictly set torch threads [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982855 (https://phabricator.wikimedia.org/T352750) (owner: 10Elukey)
[09:49:40] 10Machine-Learning-Team: Apply common settings to publish events from Lift Wing staging to EventGate - https://phabricator.wikimedia.org/T349919 (10achou)
[09:49:52] (03CR) 10Klausman: [C: 03+1] article-descriptions: explictly set torch threads [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982855 (https://phabricator.wikimedia.org/T352750) (owner: 10Elukey)
[09:49:55] (03Merged) 10jenkins-bot: article-descriptions: explictly set torch threads [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982855 (https://phabricator.wikimedia.org/T352750) (owner: 10Elukey)
[09:58:32] kevinbazira: worked!
[09:58:38] I tested the image manually in staging
[09:59:53] super!
[10:00:17] I am testing it on the ml-sandbox with more CPUs and it's working too.
[10:00:25] Going to share the report soon
[10:00:32] testing with 2 cpus in staging as well
[10:00:52] with 1 cpu it takes ~14s, that is what we got with OMP_NUM_THREADS=1 IIRC
[10:01:06] (03PS5) 10Ilias Sarantopoulos: outlink: validate json input in transformer [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982043 (https://phabricator.wikimedia.org/T352834)
[10:01:25] both of us testing on the ml sandbox might give us the wrong results for: ps -eLf | grep [m]odel_server | wc -l
[10:01:48] with two cpus - 9s
[10:02:01] kevinbazira: nono I am testing manually on ml-staging
[10:02:36] I am going to file a code review now
[10:06:46] 4 cpus ~6 seconds
[10:10:16] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/983148
[10:12:03] 10Machine-Learning-Team: Optimize response performance for the article-descriptions model-server - https://phabricator.wikimedia.org/T353127 (10kevinbazira) @isarantopoulos, thank you for the recommendations. I agree with you regarding things internal to torch, transformers and not going down the road of trying...
[10:13:36] 10Machine-Learning-Team: Optimize response performance for the article-descriptions model-server - https://phabricator.wikimedia.org/T353127 (10kevinbazira) @elukey suggested that we try adjusting the OMP_NUM_THREADS environment variable to match the CPUs, here are the performance results after testing this opti...
[10:15:07] elukey: the results you got while testing are similar to those I got on the ml-sandbox. Adjusting the OMP_NUM_THREADS option improves the performance: https://phabricator.wikimedia.org/T353127#9406015
[10:17:24] kevinbazira: nice! In my opinion we could now ask Research what the next steps are. The model clearly scales with CPUs, but 8 vcores to get ~3.x seconds of response time seems a lot
[10:17:48] maybe there is more that we can do on our side (like tuning torch), but I am wondering if the model's logic etc.. needs to be tweaked as well
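Since the underlying problem is that the libraries auto-detect the host's CPUs rather than the container's, here is a hedged sketch of deriving the thread count from the cgroups v2 CPU quota instead; it assumes cgroups v2 is mounted at the usual path and that the pod has a CPU limit set:

```python
import math
import os


def cgroup_cpu_limit(default: int = 1) -> int:
    # cgroups v2 exposes the quota as "<quota> <period>" (or "max") in
    # cpu.max; os.cpu_count() would report the host's CPUs instead.
    try:
        quota, period = open("/sys/fs/cgroup/cpu.max").read().split()
    except (OSError, ValueError):
        return default
    if quota == "max":
        return os.cpu_count() or default
    return max(1, math.ceil(int(quota) / int(period)))
```

With a 4-CPU limit this returns 4, matching the "4 cpus ~6 seconds" data point reported above.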
[10:26:03] klausman: I'm syncing new llm images to ml-staging
[10:26:59] elukey: this is the same test I'd like to do for nllb, various cpu sizes
[10:27:20] I'll try to do it in sandbox like kevin does until we figure out sth with permissions
[10:30:38] isaranto: ack
[10:31:23] someone will need to manually delete the previous nllb-gpu pod otherwise the new one won't be scheduled
[10:31:28] 🙏
[10:32:49] isaranto: nllb uses torch right?
[10:35:45] yes. and the cpu version uses ctranslate on top of torch
[10:58:57] ack :) Going to take an early lunch, ttl!
[10:59:07] * elukey lunch
[10:59:12] 10Machine-Learning-Team, 10ORES: Add deprecation warnings to ORES-related repositories on Github - https://phabricator.wikimedia.org/T349632 (10klausman) The above mentioned GH PRs have all been merged.
[11:10:45] 10Machine-Learning-Team: Create external endpoint for recommendation-api-ng hosted on LiftWing - https://phabricator.wikimedia.org/T347263 (10klausman) This is complete. I'll track the slash-vs-no-slash matter mentioned above in a separate ticket.
[11:11:10] What do we currently do for completed tasks? Resolve and leave in "In progress" for discussion in the meeting? Leave open and move to "Done"?
[11:14:12] resolve and move to done!
[11:14:19] ack.
[11:16:58] at least this is what I do :)
[11:18:18] 10Machine-Learning-Team: Set SLO for the recommendation-api-ng service hosted on LiftWing - https://phabricator.wikimedia.org/T347262 (10klausman) For the availability of the service, I think an SLO similar to our Revscoring and Revertrisk services would be a good fit. At the moment, when querying rec-api-ng fr...
[11:18:32] I just feel that that way, task completion may be "missed" a bit.
[11:22:31] move to done and resolve them in the meeting?
[11:24:33] yeah, I'll do that
[11:33:56] right now there are too many tasks in done so it is difficult to check. but if we move them it'll be easier to spot recently completed tasks
[11:39:50] * aiko lunch!
[11:50:06] 10Machine-Learning-Team: Set SLO for the recommendation-api-ng service hosted on LiftWing - https://phabricator.wikimedia.org/T347262 (10elukey) We decided to use 95% for experimental new services, see https://wikitech.wikimedia.org/wiki/SLO/Lift_Wing#Calculate_the_realistic_targets 99% seems to be a lot for th...
[11:54:36] * isaranto lunch!
[12:17:47] * klausman lunch as well
[12:24:05] 10Machine-Learning-Team: Set SLO for the recommendation-api-ng service hosted on LiftWing - https://phabricator.wikimedia.org/T347262 (10klausman) >>! In T347262#9406216, @elukey wrote: > We decided to use 95% for experimental new services, see https://wikitech.wikimedia.org/wiki/SLO/Lift_Wing#Calculate_the_real...
[12:39:12] (03CR) 10AikoChou: [C: 03+1] outlink: validate json input in transformer [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982043 (https://phabricator.wikimedia.org/T352834) (owner: 10Ilias Sarantopoulos)
[12:39:58] (03CR) 10Ilias Sarantopoulos: [C: 03+2] outlink: validate json input in transformer [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982043 (https://phabricator.wikimedia.org/T352834) (owner: 10Ilias Sarantopoulos)
[12:40:45] (03Merged) 10jenkins-bot: outlink: validate json input in transformer [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982043 (https://phabricator.wikimedia.org/T352834) (owner: 10Ilias Sarantopoulos)
[12:59:28] Hello
[12:59:29] 10Machine-Learning-Team, 10ORES: Add deprecation warnings to ORES-related repositories on Github - https://phabricator.wikimedia.org/T349632 (10elukey) We should also provide a warning in: https://github.com/wikimedia/articlequality There are also other repos listed in https://github.com/wikimedia/?q=ores&ty...
[13:00:05] I think I am finally feeling better
[13:01:44] good morning :)
[13:05:57] Hey! So, I guess it hasn’t happened yet?
[13:06:15] nope, everything is quiet
[13:06:17] :)
[13:08:27] Sending all my love in your direction
[13:08:35] thanksss
[13:11:41] Hey chris!
[13:15:40] * elukey commuting! bbiab
[13:24:29] https://docs.google.com/document/d/1oB4BXTAhEH4QzusNatLb9TazgiG0cFKmyjy6xYIfSlI/edit?usp=sharing I've added more detail to the Caching doc with help from Luca. Please give it a read and comment on anything you think is amiss/unclear (and feel free to ask questions, of course).
[13:29:20] ack, thanks!
[13:38:51] running an errand, bbiab
[13:39:31] this is the last model server left to deploy json validation: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/983189
[13:43:08] isaranto: o/ that contains the kserve upgrade change for outlink, I'd like to do some testing in staging before rolling out to prod
[13:43:33] isaranto: I wrote it here https://phabricator.wikimedia.org/T347549#9403693
[13:46:26] aiko: ack! I put a pause on that. I updated the patch so we deploy the new image only on ml-staging
[13:48:44] isaranto: thanks!!
[13:51:35] aiko: ok for me to deploy staging?
[13:52:19] isaranto: yes o/
[14:02:40] 10Lift-Wing, 10Machine-Learning-Team, 10Patch-For-Review: Enforce json payload in existing kserve model servers - https://phabricator.wikimedia.org/T352834 (10isarantopoulos) The changes have been deployed to article_descriptions, readability and llm model servers. outlink-topic-model is still pending. @aiko...
[14:02:55] aiko: deployed and tested it with httpbb. all yours!
[14:05:39] klausman: it seems that the previous revision still has the GPU in the llm namespace in ml-staging. I think deleting the pod won't do, as the isvc revision is still there so it will just recreate it. Then I guess it is a race between the pods (old and new) for the GPU, unless there is an ordering enforced which I'm unaware of
[14:05:59] I will try and kick it harder
[14:06:02] deleting the old revision will likely do the trick (or even the nllb-200-gpu isvc)
[14:06:52] ok, nllb-200-gpu-predictor-default-00002-deployment-69d8b7fbcfkjcd8 is terminating.
[14:07:22] that sounds right :)
[14:07:28] nllb-200-gpu-predictor-default-00003-deployment-5b54b8dbb7jvlwt 0/3 Init:1/2 0 3h40m
[14:07:41] Looks like it's starting properly now
[14:07:42] so did u delete the revision?
[14:07:51] yes:
[14:07:57] thanks!
[14:07:58] # kubectl -n llm delete revisions.serving.knative.dev nllb-200-gpu-predictor-default-00002
[14:08:30] and the gpu predictor rev 3 is running
[14:09:29] 🎉
[15:29:56] I opened a patch to restore cpu resource (4) for nllb as it affects performance https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/983204
[15:31:03] isaranto: +1ed, one suggestion - let's always specify request/limit values, removing them may lead to weird results
[15:31:19] (either set them to 1 etc.., don't remove them completely)
[15:33:09] elukey: I agree! I think we'd need to update many inference services, revscoring etc (not sure but I can check)
[15:33:42] if they set a default (like 1, as I suspect) it is fien
[15:33:44] *fine
[15:36:25] one last request for the day: could someone also delete the 3rd revision for nllb-gpu in ml-staging?
[15:36:34] `kubectl -n llm delete revisions.serving.knative.dev nllb-200-gpu-predictor-default-00003`
[15:40:59] on it
[15:41:08] 10Machine-Learning-Team, 10Research: Allow to set Catboost's threads in readability-liftwing - https://phabricator.wikimedia.org/T353461 (10elukey)
[15:41:33] isaranto: done
[15:41:54] 10Machine-Learning-Team, 10Patch-For-Review: Increased latencies with Kserve 0.11.1 (cgroups v2) - https://phabricator.wikimedia.org/T349844 (10elukey)
[15:41:56] 10Machine-Learning-Team, 10Research: Allow to set Catboost's threads in readability-liftwing - https://phabricator.wikimedia.org/T353461 (10elukey)
[15:42:01] Danke danke schön!
[15:42:15] isaranto, aiko - opened https://phabricator.wikimedia.org/T353461, when you have a moment lemme know if it makes sense
[15:42:19] Gern geschehen :)
[15:42:31] 10Machine-Learning-Team, 10Patch-For-Review: Increased latencies with Kserve 0.11.1 (cgroups v2) - https://phabricator.wikimedia.org/T349844 (10elukey) Opened T353461 to track the efforts to fix catboost in Readability.
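A hedged sketch of what the per-request thread cap proposed in T353461 could look like for CatBoost in readability-liftwing; the model path and feature vector are purely illustrative:

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier()
model.load_model("/mnt/models/readability.cbm")  # hypothetical path

features = [[0.1, 0.2, 0.3]]  # illustrative feature vector
# thread_count caps the threads used for this call instead of letting
# CatBoost auto-detect the host's CPUs.
scores = model.predict_proba(features, thread_count=1)
```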
[15:43:15] klausman: FYI https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/983191
[15:43:23] Looking
[15:44:43] the serviceops team noticed an increased memory/cpu usage for their control plane, leading to unavailability (partial) of the kube api during deployments
[15:44:48] elukey: do we override those settings, or do they apply to us as well?
[15:44:57] that in turn caused issues with calico's typha and a network outage
[15:45:49] klausman: they have set it only for the main cluster
[15:46:14] (in the diff no changes for ml-serve)
[15:46:27] Roger. I keep missing that just "eqiad" is not also ml-serve-eqiad etc
[15:46:38] I would do it as well for our clusters though
[15:46:46] and review the control plane's usage etc..
[15:46:55] Also agreed. I can make a patch
[15:47:30] you can coordinate with folks in #wikimedia-k8s-sig (please be in that channel, it is very important)
[15:49:59] 10Machine-Learning-Team, 10Research: Allow to set Catboost's threads in readability-liftwing - https://phabricator.wikimedia.org/T353461 (10isarantopoulos) Both options would do the trick. I would prefer to use the new catboost version so that we don't have so many hacks in our code but it doesn't hurt to try...
[15:51:17] isaranto: re hacks in --^ - I think that a lot of libraries will have this problem, they always expose the thread count but the auto-detection of CPUs is not always great
[15:51:27] nllb latency dropped to 4-5 seconds from 30s
[15:51:39] so it may be good to use our code to explicitly set threads etc..
[15:51:44] as a general rule
[15:51:53] (nice for nllb!)
[15:52:11] no need to set threads?
[15:52:15] now I am curious :)
[15:52:37] ack! since we use a handful of frameworks (xgboost, catboost, pytorch, sklearn) having a way for each one of them makes us safe
[15:52:47] then I agree to continue with manually setting it
[15:52:59] elukey: helmfile.d/admin_ng/values/main.yaml would still apply everywhere, no?
[15:53:39] oups, I thought we had OMP_NUM_THREADS set
[15:53:47] but we don't
[15:53:54] klausman: how can we figure it out?
[15:56:47] isaranto: ah wait, we don't use torch there, but ctranslate2
[15:57:22] I'm curious what happens in the gpu version which is torch
[15:57:40] I don't see cgroups mentioned in ctranslate's code though
[15:58:15] torch is still being used in some way that is not clear to me. I need to do some reading
[16:01:36] elukey: given that describe tells me the limits for typha in codfw are 500m/150Mi, and that diff is showing nothing, means that the infinity from the general chart is not applied (there is an override in helmfile.d/admin_ng/values/ml-serve.yaml, line 590).
[16:02:10] My question wasn't about "does this apply to us", but rather, "this is not main cluster specific"
[16:02:13] yep, but you can also check the helmfile.yaml config under admin_ng
[16:02:30] ok yes but please be specific next time :)
[16:03:07] the answer though is something that you can ask folks in k8s-sig, act as if I wasn't here :)
[16:03:26] or take your decision and send the patch with a reason :)
[16:05:47] Currently digging around Grafana to see what a good limit for memory would be for us
[16:08:02] Oddly enough (or not), Typha in staging uses more memory/gets closer to the limit than the two prod clusters' Typha. I suspect it is because we have more churn due to updating in staging.
[16:17:22] isaranto: I think that ctranslate2 limits the number of threads somehow, I didn't get any info yet, maybe Santhosh knows
[16:18:29] in mint's config I see
[16:18:30] CT2_INTER_THREADS: 4 # Match available CPUs
[16:18:30] CT2_INTRA_THREADS: 0 # Set to 0 so that CTranslate2 use a default value
[16:18:33] GUNICORN_WORKERS: 4 # Match available CPUs
[16:19:21] https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/935704/6/translator/base.py
[16:21:00] ohttps://opennmt.net/CTranslate2/parallel.html
[16:21:04] uff sorry
[16:21:09] thanks for the reference, I'll dig in a bit
[16:21:09] https://opennmt.net/CTranslate2/parallel.html
[16:21:30] OMP_NUM_THREADS is mentioned
[16:21:33] ah yes, I was reading that earlier today
[16:22:31] so yeah we can probably match what they currently do
[16:22:45] using the get_cpu_count() etc..
[16:22:49] seems the easiest
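A hedged sketch of mirroring MinT's CT2_INTER_THREADS/CT2_INTRA_THREADS settings directly in code when building the CTranslate2 translator for nllb-200; the model path is illustrative and the sentencepiece tokenization step is omitted:

```python
import ctranslate2

translator = ctranslate2.Translator(
    "/mnt/models/nllb-200-ct2",  # hypothetical path
    device="cpu",
    inter_threads=4,  # parallel translations; match the pod's CPU allocation
    intra_threads=0,  # 0 = let CTranslate2 pick its default per translation
)
```

The parallel.html page above also mentions OMP_NUM_THREADS, so it is worth checking how that variable interacts with intra_threads when both are set.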
[16:27:02] 10Machine-Learning-Team, 10SRE Observability (FY2023/2024-Q2): Gap in metrics rendered from Thanos Rules - https://phabricator.wikimedia.org/T352756 (10elukey) @herron removed the Pyrra configs to figure out if anything changes, but IIRC we had similar gaps even before the Pyrra pilots, so I suspect something...
[16:28:57] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10Growth-Team, 10Wikipedia-Android-App-Backlog, 10MW-1.42-notes (1.42.0-wmf.9; 2023-12-12): Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298 (10Samwalton9-WMF) Can I confirm the status of this patch...
[16:33:55] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10Growth-Team, 10Wikipedia-Android-App-Backlog, 10MW-1.42-notes (1.42.0-wmf.9; 2023-12-12): Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298 (10isarantopoulos) Yes this was only for Beta, but failed...
[16:39:24] I can test it the way Kevin did in a container with limited resources
[17:01:35] yep yep, seems good! I can send a patch if you want help, otherwise I am available for a brainbounce :)
[17:01:46] logging off for today folks!
[17:01:49] have a nice rest of the day
[17:01:56] I'm logging off now, will test it tomorrow!
[17:01:59] night Luca!
[17:02:01] night Ilias!
[17:02:06] have a nice evening!
[17:02:45] and rest of day :)
[17:08:16] logging off as well o/
[17:17:02] night Aiko