[04:20:59] 10Lift-Wing, 10Machine-Learning-Team, 10I18n, 10NewFunctionality-Worktype, 10Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (10santhosh) >>! In T340507#9242994, @isarantopoulos wrote: > @santhosh Thanks for creating the model card! > Is there... [05:56:31] (03CR) 10Elukey: "recheck" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [06:00:13] (03PS3) 10Elukey: readability: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964561 [06:39:22] (03CR) 10Elukey: [C: 03+2] readability: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964561 (owner: 10Elukey) [06:45:05] (03Merged) 10jenkins-bot: readability: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964561 (owner: 10Elukey) [06:46:15] hello folks! [06:46:40] so I think I solved the issue for revscoring-drafttopic in ml-serve-eqiad, I deleted old helm secrets and I don't see the problem anymore [06:46:50] so I guess it was some etcd inconsistent state [06:46:59] a little scary, but I wouldn't spend more time on this [06:47:32] (03PS3) 10Elukey: langid: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 [06:48:00] (03CR) 10CI reject: [V: 04-1] langid: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [06:49:14] 10Machine-Learning-Team: Visualize KServe latency metrics in a dashboard - https://phabricator.wikimedia.org/T348456 (10elukey) Completed the rollout, we should have metrics for all isvcs! 
[06:52:42] uff I merged a change to drop some prometheus labels in the new metrics and I don't see any datapoint anymore :D [06:54:08] ah snap wrong config [06:56:59] fix is https://gerrit.wikimedia.org/r/c/operations/puppet/+/965392/ [07:07:29] elukey: o/ I can work on the langid CI that is failing [07:07:37] If that's ok with you ofc [07:08:15] sure [07:13:48] I didn't get exactly what was wrong with drafttopic, that's why I'm not commenting :) [07:17:24] I am not super sure either [07:18:08] TL;DR I think that helm was failing when fetching secrets from the k8s api, and in the k8s api I noticed some protobuf errors that I assume were related to etcd [07:18:20] so I dropped some old helm secrets for drafttopic, that we don't use anymore [07:18:24] and all good [07:18:41] I think it was some corrupted/empty value [07:18:54] but getting to the exact root cause is probably a long long task [07:19:04] so I'd do it only if we see the issue again when deploying [07:19:09] it is pretty clear, helm fails :) [07:24:59] * elukey brb [07:46:04] ack, thanks for the info! [08:19:42] 10Lift-Wing, 10Machine-Learning-Team, 10I18n, 10NewFunctionality-Worktype, 10Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (10elukey) @santhosh Usually when we expose a service to the outside world (wmf cloud included) we use the API Gateway... [08:23:02] I uploaded the model here https://analytics.wikimedia.org/published/wmf-ml-models/langid/ [08:23:33] and added its hash as well. Will add this to the ticket above for reference [09:02:39] nice! [09:02:41] created https://gerrit.wikimedia.org/r/c/operations/puppet/+/965469/1 [09:02:49] to expand model_upload with the prompt [09:03:06] nice!
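[Editor's note: the model hash mentioned above refers to the sha512 checksum files added in T347838. A minimal sketch of how such a `sha512sum`-compatible checksum file can be generated and verified in Python — the file names are hypothetical, and this is not the actual model_upload script:]

```python
import hashlib
from pathlib import Path

def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the hex SHA-512 digest of a file, reading it in chunks
    so large model binaries don't have to fit in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def write_checksum_file(model_path: Path) -> Path:
    """Write a `sha512sum`-compatible checksum file next to the model,
    e.g. model.bin -> model.bin.sha512."""
    checksum_path = model_path.with_name(model_path.name + ".sha512")
    checksum_path.write_text(f"{sha512_of(model_path)}  {model_path.name}\n")
    return checksum_path
```

A downloader can then verify the published artifact with `sha512sum -c model.bin.sha512` in the directory containing both files.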
[09:06:42] I am also proposing to change the name to model-upload :D [09:11:23] tried to add a directory under wmf-ml-models on stat1004 [09:11:25] let's see how it goes [09:22:33] I'm debugging the langid issues. Trying to figure out what dependency issues arise when installing kserve 0.11.1 [09:24:00] I think I would prefer if we explicitly defined all dependencies via pip install --no-dependencies. It creates deterministic builds, but it has a downside: sometimes it can become chaotic, since you need to manually update all the versions [09:24:18] anyway, just random thoughts for now :) [09:28:47] (03CR) 10Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [09:35:32] can confirm that we can use separate dirs with models in various stat boxes [09:35:35] see https://analytics.wikimedia.org/published/wmf-ml-models/elukey-test/it/20231012091016/ [09:35:38] aiko: --^ [09:35:43] (elukey test is only on stat1004) [09:47:04] also filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/965474 [09:47:24] I realized that we allow deploy-ml-service users to use model-upload, so it should be the same for public models [09:47:39] at the moment deploy-ml-services is the same as ml-admins [09:47:44] but we'll probably expand it in the future [09:50:34] 10Machine-Learning-Team, 10Patch-For-Review: Add sha512 checksum files to all the ML's models in the public dir - https://phabricator.wikimedia.org/T347838 (10elukey) Prompt added to the script, task should be done! [10:02:39] elukey: ack!
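[Editor's note: the "explicitly define all dependencies" idea above can be sketched as below. Note that pip's actual flag is `--no-deps`; the package versions shown are illustrative, not the real Lift Wing pins:]

```python
import sys

# A fully pinned requirements set: every direct AND transitive dependency
# listed explicitly, so pip never resolves anything on its own. Versions
# here are invented for illustration.
PINNED_REQUIREMENTS = """\
kserve==0.11.1
fastapi==0.95.2
uvicorn==0.22.0
"""

def build_install_command(requirements_file: str) -> list[str]:
    """Build a pip invocation that installs exactly the pinned set.

    --no-deps tells pip to skip dependency resolution entirely, which
    makes the build deterministic: only what is listed gets installed.
    The trade-off is that every transitive pin must be bumped by hand.
    """
    return [
        sys.executable, "-m", "pip", "install",
        "--no-deps", "-r", requirements_file,
    ]
```

One would write `PINNED_REQUIREMENTS` to a requirements.txt and run the returned command during the image build; if a pin is missing, the service fails at import time rather than silently pulling a different version.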
[10:14:38] (03PS1) 10Ilias Sarantopoulos: revscoring: update bullseye image [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965477 [10:15:59] (03CR) 10Ilias Sarantopoulos: "I created this patch because there was no new image produced from the previous patches" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965477 (owner: 10Ilias Sarantopoulos) [10:23:45] (03CR) 10Ilias Sarantopoulos: [C: 03+2] revscoring: update bullseye image [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965477 (owner: 10Ilias Sarantopoulos) [10:24:30] (03Merged) 10jenkins-bot: revscoring: update bullseye image [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965477 (owner: 10Ilias Sarantopoulos) [10:35:28] taking a break from trying to solve the dependency issues. I'm going to start deploying revscoring starting from staging [10:36:08] * isaranto lunch time! [10:40:05] elukey: good morning, we should make many ores repos archived or read only, they are everywhere, in github, in gerrit, etc. [10:43:35] 10Machine-Learning-Team, 10Patch-For-Review: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10elukey) [10:44:57] Amir1: o/ yes yes we'll do it as soon as possible, we need to keep revscoring live but the rest can be archived [10:45:03] I'll add a step in the task [10:45:42] 10Machine-Learning-Team, 10Patch-For-Review: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10elukey) [10:45:43] klausman: cleaned up ores from deployment-prep, quick and easy [10:45:54] thank you! [10:46:41] thanks! [10:51:37] 10Machine-Learning-Team: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 (10klausman) 05Open→03Resolved [11:05:20] * elukey lunch! [11:21:45] same [12:00:39] Coffee!
[12:07:15] o/ [12:07:16] last coffee of the day for me [12:07:53] folks I added a second deployment for articlequality in staging that uses multiprocessing so that we can load test side by side [12:21:54] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/965479 [12:26:36] (03PS3) 10Ilias Sarantopoulos: llm: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965192 (owner: 10Elukey) [12:44:59] (03PS1) 10Kevin Bazira: Use envoy proxy to access endpoints external to k8s/LiftWing [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) [12:46:46] (03CR) 10CI reject: [V: 04-1] Use envoy proxy to access endpoints external to k8s/LiftWing [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [12:47:52] Sweet [12:53:42] 10Lift-Wing, 10Machine-Learning-Team, 10I18n, 10NewFunctionality-Worktype, 10Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (10santhosh) @elukey If I understood that documentation correctly, if the service required oauth token, still Anonymou...
[14:44:36] (03PS4) 10Ilias Sarantopoulos: langid: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [14:50:36] (03CR) 10Ilias Sarantopoulos: langid: Upgrade to KServe 0.11.1 (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [14:50:40] (03PS5) 10Ilias Sarantopoulos: langid: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [14:57:07] (03CR) 10Elukey: [C: 03+1] langid: Upgrade to KServe 0.11.1 (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [15:12:40] elukey: o/ the issue can't be repro locally. performance looks all good for old and new image [15:12:51] sigh [15:12:51] here is the result https://phabricator.wikimedia.org/P52917 [15:13:28] aiko: have you tried again in staging? [15:13:34] to make sure it is always slow [15:13:56] elukey: not yet [15:14:07] I'll be submitting a separate patch that enables local runs for langid (almost ready) [15:14:24] aiko: ouch! 
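[Editor's note: the side-by-side comparison discussed above needs a load generator. A minimal standard-library sketch of the kind of latency test being run here — the endpoint URL, payload shape, and request counts are hypothetical, and the team's actual tooling may differ:]

```python
import json
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def time_one_request(url: str, payload: dict) -> float:
    """POST one JSON payload and return the wall-clock latency in seconds."""
    body = json.dumps(payload).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.monotonic() - start

def load_test(url: str, payload: dict, n_requests: int = 50,
              concurrency: int = 5) -> dict:
    """Fire n_requests with bounded concurrency and summarize latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(
            lambda _: time_one_request(url, payload), range(n_requests)
        ))
    latencies.sort()
    return {
        "p50": statistics.median(latencies),
        "p99": latencies[int(0.99 * (len(latencies) - 1))],
        "max": latencies[-1],
    }
```

Running the same `load_test` call against the old-image and new-image inference services gives directly comparable p50/p99 numbers, which is the point of deploying both at once.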
[15:14:32] elukey: ok I'll do it and try a longer period [15:15:24] we'll need to check the other model servers, I hope it doesn't happen that much [15:15:33] so far it seems only revert risk [15:15:42] it gets heavily cpu throttled from a quick check [15:20:05] (03CR) 10Ilias Sarantopoulos: [C: 03+2] langid: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [15:20:52] (03Merged) 10jenkins-bot: langid: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [15:21:23] (03PS4) 10Ilias Sarantopoulos: llm: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965192 (owner: 10Elukey) [15:22:12] yes definitely it is cpu throttled: [15:22:13] https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=revertrisk&var-pod=revertrisk-language-agnostic-predictor-default-00014-deplovnrr7&var-container=All&viewPanel=5 [15:24:22] aiko: just to be sure, can you deploy another isvc for rr-la in staging that uses kserve 0.10 (so the old docker image) ? [15:24:30] so we can test both at the same time [15:24:53] yes! good idea [15:26:36] also if you can, please keep doing the load test [15:27:44] ack [15:33:03] I am not sure if I am doing perf correctly, but I see that 40% of the time is spent in libgomp [15:33:33] that is OpenMP-related? [15:35:38] aiko: another qs - when you ran the test locally, did you limit docker's cpu? [15:37:22] elukey: have no idea about the libgomp..
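[Editor's note: the throttling shown in the Grafana panel above ultimately comes from the CFS statistics the kernel exposes per cgroup. A sketch of reading the throttle ratio from cgroup-v2 `cpu.stat` content — the sample values below are invented, and the path to a container's cgroup varies by runtime:]

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS enforcement periods in which the cgroup was throttled.

    Parses the cgroup-v2 ``cpu.stat`` format, i.e. the contents of
    /sys/fs/cgroup/<container-path>/cpu.stat. A ratio near 1.0 means the
    container hits its CPU limit in almost every scheduling period.
    """
    stats = {}
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods

# Sample cpu.stat content (values invented for illustration):
SAMPLE_CPU_STAT = """\
usage_usec 1234567
user_usec 1000000
system_usec 234567
nr_periods 1000
nr_throttled 400
throttled_usec 9876543
"""
```

This is the same `nr_throttled / nr_periods` ratio that the Kubernetes container dashboards plot from cAdvisor metrics.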
ahh I didn't limit docker's cpu :( [15:37:47] let's try to match lift wing's cpu and memory [15:39:44] ok [15:39:59] another isvc: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/965532 [15:41:24] checked in prod, I don't see the same libgomp [15:41:44] very weird [15:45:25] would it be related to the fix https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/965094? [15:46:52] in theory no [15:47:20] I see that libgomp is from xgboost [15:47:21] ./xgboost.libs/libgomp-a34b3233.so.1.0.0 [15:47:33] that I assume is brought in by KI? [15:48:03] yes it is the lib that the model uses [15:49:16] elukey: how did you check performance and find the lib? (just asking to learn) [15:49:30] isaranto: not sure if it is 100% correct, but I used perf [15:49:39] https://www.brendangregg.com/perf.html [15:49:43] directly on the node [15:49:52] now in theory it should be fine even if it is in a separate pid namespace [15:50:09] and for the lib, I used nsenter [15:50:28] it is "namespace enter" [15:50:30] aha, nice! [15:50:38] you can use it to check specific namespaces etc.. [15:50:44] root@ml-staging2001:/opt/lib/python/site-packages# find -name *gomp* [15:50:44] ./xgboost.libs/libgomp-a34b3233.so.1.0.0 [15:51:12] iiuc we don't have permissions to do that [15:51:38] exactly yes, only root [15:52:26] a ok, now I understood.
you attached to the node and ran it on the specific pid [15:52:35] exactly [15:52:49] tested locally after limiting the cpu to 1 and memory to 2G, the result looks unchanged [15:52:57] okok [15:53:06] so we can try to load test the old endpoint [15:53:19] one thing that I am reading is that xgboost uses openmp behind the scenes [15:53:27] and we should tune its number of threads [15:53:41] but I am not sure why we don't see this elsewhere [15:54:11] I am curious to see in the old pod if there is libgomp [15:54:25] maybe for some weird reason we are importing it with 0.11 [15:54:40] and it does some damage when running in k8s [15:55:01] but I noticed that usually we would get a warning about xgboost when starting the server, I didn't see that for the new docker image [15:55:14] https://www.irccloud.com/pastebin/RpJ7uPwm/ [15:55:59] lemme know when the old pod is ready so I can check its libs [15:56:13] okok [15:57:22] deploying [15:59:49] root@ml-staging2002:/opt/lib/python/site-packages# find -name *gomp* [15:59:49] ./xgboost.libs/libgomp-a34b3233.so.1.0.0 [15:59:57] same lib afaics [16:00:38] and same xgboost [16:00:58] load testing both endpoints [16:02:59] 0.10 doesn't get throttled [16:03:09] gimme a sec to run perf on it [16:04:26] yeah no trace of libgomp [16:06:09] so the problem is that libgomp [16:06:10] ok so this is the number of threads for 0.11 [16:06:11] elukey@ml-staging2001:~$ ps -eLf | grep 2220806 | wc -l [16:06:11] 237 [16:06:18] this is for 0.10 [16:06:18] elukey@ml-staging2002:~$ ps -eLf | grep 1113496 | wc -l [16:06:18] 8 [16:06:20] lol [16:07:08] huge difference.. [16:07:32] so I think that, somehow, I assume due to KI + KServe 0.11, we use a ton of threads for openmp [16:07:35] in xgboost [16:07:48] and this consumes all the cpu time available in no time -> Throttling [16:07:53] sry for not actively participating - working on solving some issues with langid and fasttext.
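[Editor's note: the `ps -eLf | grep <pid> | wc -l` comparison above (237 threads for the 0.11 pod vs 8 for the 0.10 pod) can also be done without root by reading `/proc/<pid>/status`; a small sketch:]

```python
from pathlib import Path

def thread_count(pid: int) -> int:
    """Return the number of threads of a process on Linux.

    Reads the ``Threads:`` field of /proc/<pid>/status, the same value
    that ``ps -o nlwp`` reports; unlike perf or nsenter, this requires
    no root privileges (for processes visible to the caller).
    """
    status = Path(f"/proc/{pid}/status").read_text()
    for line in status.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError(f"no Threads: field found for pid {pid}")
```

Inside a pod, `thread_count(1)` (or the server's own pid via `os.getpid()`) would have exposed the 237-vs-8 discrepancy directly from the application.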
if you need me anything ping me (I am following the discussion though - thanks for the updates) [16:08:03] sure! [16:08:13] I am going afk now, need to go aiko - we can restart tomorrow [16:08:21] isaranto: no worries :) [16:08:29] elukey: ok! [16:08:47] let's restart tmr, have a nice evening! [16:09:42] aiko: from what I can read we can play with the env var OMP_NUM_THREADS=X [16:09:45] to limit the threads [16:09:59] but I have zero ideas about why we have this behavior no [16:10:01] *now [16:10:07] anyway, have a nice rest of the day folks! [16:10:45] I'll look into that, thanks!! [16:10:50] bye luca :) [16:17:34] going afk too! cu tomorrow
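[Editor's note: one subtlety with the OMP_NUM_THREADS idea above is that libgomp reads the variable when it initializes, so the cap must be in place before xgboost is imported. A sketch — the value "1" is an example starting point, not a recommendation from the discussion:]

```python
import os

# Cap OpenMP's thread pool BEFORE importing any library that links
# libgomp; once libgomp has initialized, changing the variable in the
# environment has no effect on the already-created pool.
os.environ["OMP_NUM_THREADS"] = "1"

# Import xgboost only after the cap is set. Guarded so this sketch also
# runs where xgboost is not installed.
try:
    import xgboost  # noqa: F401  (this import pulls in libgomp)
except ImportError:
    pass
```

As an alternative, xgboost's own `nthread` parameter can bound threads per booster, but the environment variable catches every OpenMP user in the process, which matters when the runaway threads come in transitively (here, via the knowledge-integrity package).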