[04:20:59] 10Lift-Wing, 10Machine-Learning-Team, 10I18n, 10NewFunctionality-Worktype, 10Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (10santhosh) >>! In T340507#9242994, @isarantopoulos wrote: > @santhosh Thanks for creating the model card! > Is there... [05:56:31] (03CR) 10Elukey: "recheck" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [06:00:13] (03PS3) 10Elukey: readability: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964561 [06:39:22] (03CR) 10Elukey: [C: 03+2] readability: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964561 (owner: 10Elukey) [06:45:05] (03Merged) 10jenkins-bot: readability: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964561 (owner: 10Elukey) [06:46:15] hello folks! [06:46:40] so I think I solved the issue for revscoring-drafttopic in ml-serve-eqiad, I deleted old helm secrets and I don't see the problem anymore [06:46:50] so I guess it was some etcd inconsistent state [06:46:59] a little scary, but I wouldn't spend more time on this [06:47:32] (03PS3) 10Elukey: langid: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 [06:48:00] (03CR) 10CI reject: [V: 04-1] langid: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [06:49:14] 10Machine-Learning-Team: Visualize KServe latency metrics in a dashboard - https://phabricator.wikimedia.org/T348456 (10elukey) Completed the rollout, we should have metrics for all isvcs! 
[06:52:42] uff I merged a change to drop some prometheus labels in the new metrics and I don't see any datapoint anymore :D [06:54:08] ah snap wrong config [06:56:59] fix is https://gerrit.wikimedia.org/r/c/operations/puppet/+/965392/ [07:07:29] elukey: o/ I can work on the langid CI that is failing [07:07:37] If that's ok with you ofc [07:08:15] sure [07:13:48] I didn't get exactly what was wrong with drafttopic, that's why I'm not commenting :) [07:17:24] I am not super sure either [07:18:08] TL;DR I think that helm was failing when fetching secrets from the k8s api, and in the k8s api I noticed some protobuf errors that I assume were related to etcd [07:18:20] so I dropped some old helm secrets for drafttopic, that we don't use anymore [07:18:24] and all good [07:18:41] I think it was some corrupted/empty value [07:18:54] but getting to the exact root cause is probably a long long task [07:19:04] so I'd do it only if we see the issue again when deploying [07:19:09] it is pretty clear, helm fails :) [07:24:59] * elukey brb [07:46:04] ack, thanks for the info! [08:19:42] 10Lift-Wing, 10Machine-Learning-Team, 10I18n, 10NewFunctionality-Worktype, 10Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (10elukey) @santhosh Usually when we expose a service to the outside world (wmf cloud included) we use the API Gateway... [08:23:02] I uploaded the model here https://analytics.wikimedia.org/published/wmf-ml-models/langid/ [08:23:33] and added its hash as well. Will add this to the ticket above for reference [09:02:39] nice! [09:02:41] created https://gerrit.wikimedia.org/r/c/operations/puppet/+/965469/1 [09:02:49] to expand model_upload with the prompt [09:03:06] nice!
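[Editor's note: the model hash mentioned above refers to the sha512 checksum files added in T347838. A minimal sketch of how such a `sha512sum`-compatible checksum file can be generated and verified in Python — the file names are hypothetical, and this is not the actual model_upload script:]

```python
import hashlib
from pathlib import Path

def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the hex SHA-512 digest of a file, reading it in chunks
    so large model binaries don't have to fit in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def write_checksum_file(model_path: Path) -> Path:
    """Write a `sha512sum`-compatible checksum file next to the model,
    e.g. model.bin -> model.bin.sha512."""
    checksum_path = model_path.with_name(model_path.name + ".sha512")
    checksum_path.write_text(f"{sha512_of(model_path)}  {model_path.name}\n")
    return checksum_path
```

A downloader can then verify the published artifact with `sha512sum -c model.bin.sha512` in the directory containing both files.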
[09:06:42] I am also proposing to change the name to model-upload :D [09:11:23] tried to add a directory under wmf-ml-models on stat1004 [09:11:25] let's see how it goes [09:22:33] I'm debugging the langid issues. Trying to figure out what dependency issues arise when installing kserve 0.11.1 [09:24:00] I think I would prefer if we explicitly defined all dependencies via pip install --no-dependencies. It creates deterministic builds, but it has a downside: sometimes it can become chaotic, since you need to manually update all the versions [09:24:18] anyway, just random thoughts for now :) [09:28:47] (03CR) 10Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [09:35:32] can confirm that we can use separate dirs with models in various stat boxes [09:35:35] see https://analytics.wikimedia.org/published/wmf-ml-models/elukey-test/it/20231012091016/ [09:35:38] aiko: --^ [09:35:43] (elukey test is only on stat1004) [09:47:04] also filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/965474 [09:47:24] I realized that we allow deploy-ml-service users to use model-upload, so it should be the same for public models [09:47:39] at the moment deploy-ml-services is the same as ml-admins [09:47:44] but we'll probably expand it in the future [09:50:34] 10Machine-Learning-Team, 10Patch-For-Review: Add sha512 checksum files to all the ML's models in the public dir - https://phabricator.wikimedia.org/T347838 (10elukey) Prompt added to the script, task should be done! [10:02:39] elukey: ack!
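[Editor's note: the "explicitly define all dependencies" idea above can be sketched as below. Note that pip's actual flag is `--no-deps`; the package versions shown are illustrative, not the real Lift Wing pins:]

```python
import sys

# A fully pinned requirements set: every direct AND transitive dependency
# listed explicitly, so pip never resolves anything on its own. Versions
# here are invented for illustration.
PINNED_REQUIREMENTS = """\
kserve==0.11.1
fastapi==0.95.2
uvicorn==0.22.0
"""

def build_install_command(requirements_file: str) -> list[str]:
    """Build a pip invocation that installs exactly the pinned set.

    --no-deps tells pip to skip dependency resolution entirely, which
    makes the build deterministic: only what is listed gets installed.
    The trade-off is that every transitive pin must be bumped by hand.
    """
    return [
        sys.executable, "-m", "pip", "install",
        "--no-deps", "-r", requirements_file,
    ]
```

One would write `PINNED_REQUIREMENTS` to a requirements.txt and run the returned command during the image build; if a pin is missing, the service fails at import time rather than silently pulling a different version.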
[10:14:38] (03PS1) 10Ilias Sarantopoulos: revscoring: update bullseye image [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965477 [10:15:59] (03CR) 10Ilias Sarantopoulos: "I created this patch because there was no new image produced from the previous patches" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965477 (owner: 10Ilias Sarantopoulos) [10:23:45] (03CR) 10Ilias Sarantopoulos: [C: 03+2] revscoring: update bullseye image [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965477 (owner: 10Ilias Sarantopoulos) [10:24:30] (03Merged) 10jenkins-bot: revscoring: update bullseye image [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965477 (owner: 10Ilias Sarantopoulos) [10:35:28] taking a break from trying to solve the dependency issues. I'm going to start deploying revscoring starting from staging [10:36:08] * isaranto lunch time! [10:40:05] elukey: good morning, we should make many ores repos archived or read only, they are everywhere, in github, in gerrit, etc. [10:43:35] 10Machine-Learning-Team, 10Patch-For-Review: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10elukey) [10:44:57] Amir1: o/ yes yes we'll do it as soon as possible, we need to keep revscoring live but the rest can be archived [10:45:03] I'll add a step in the task [10:45:42] 10Machine-Learning-Team, 10Patch-For-Review: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10elukey) [10:45:43] klausman: cleaned up ores from deployment-prep, quick and easy [10:45:54] thank you! [10:46:41] thanks! [10:51:37] 10Machine-Learning-Team: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 (10klausman) 05Open→03Resolved [11:05:20] * elukey lunch! [11:21:45] same [12:00:39] Coffee!
[12:07:15] o/ [12:07:16] last coffee of the day for me [12:07:53] folks I added a second deployment for articlequality in staging that uses multiprocessing so that we can load test side by side [12:21:54] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/965479 [12:26:36] (03PS3) 10Ilias Sarantopoulos: llm: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965192 (owner: 10Elukey) [12:44:59] (03PS1) 10Kevin Bazira: Use envoy proxy to access endpoints external to k8s/LiftWing [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) [12:46:46] (03CR) 10CI reject: [V: 04-1] Use envoy proxy to access endpoints external to k8s/LiftWing [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [12:47:52] Sweet [12:53:42] 10Lift-Wing, 10Machine-Learning-Team, 10I18n, 10NewFunctionality-Worktype, 10Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (10santhosh) @elukey If I understood that documentation correctly, if the service required oauth token, still Anonymou...
[14:44:36] (03PS4) 10Ilias Sarantopoulos: langid: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [14:50:36] (03CR) 10Ilias Sarantopoulos: langid: Upgrade to KServe 0.11.1 (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [14:50:40] (03PS5) 10Ilias Sarantopoulos: langid: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [14:57:07] (03CR) 10Elukey: [C: 03+1] langid: Upgrade to KServe 0.11.1 (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [15:12:40] elukey: o/ the issue can't be repro locally. performance looks all good for old and new image [15:12:51] sigh [15:12:51] here is the result https://phabricator.wikimedia.org/P52917 [15:13:28] aiko: have you tried again in staging? [15:13:34] to make sure it is always slow [15:13:56] elukey: not yet [15:14:07] I'll be submitting a separate patch that enables local runs for langid (almost ready) [15:14:24] aiko: ouch! 
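[Editor's note: the side-by-side comparison discussed above needs a load generator. A minimal standard-library sketch of the kind of latency test being run here — the endpoint URL, payload shape, and request counts are hypothetical, and the team's actual tooling may differ:]

```python
import json
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def time_one_request(url: str, payload: dict) -> float:
    """POST one JSON payload and return the wall-clock latency in seconds."""
    body = json.dumps(payload).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.monotonic() - start

def load_test(url: str, payload: dict, n_requests: int = 50,
              concurrency: int = 5) -> dict:
    """Fire n_requests with bounded concurrency and summarize latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(
            lambda _: time_one_request(url, payload), range(n_requests)
        ))
    latencies.sort()
    return {
        "p50": statistics.median(latencies),
        "p99": latencies[int(0.99 * (len(latencies) - 1))],
        "max": latencies[-1],
    }
```

Running the same `load_test` call against the old-image and new-image inference services gives directly comparable p50/p99 numbers, which is the point of deploying both at once.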
[15:14:32] elukey: ok I'll do it and try a longer period [15:15:24] we'll need to check the other model servers, I hope it doesn't happen that much [15:15:33] so far it seems only revert risk [15:15:42] it gets heavily cpu throttled from a quick check [15:20:05] (03CR) 10Ilias Sarantopoulos: [C: 03+2] langid: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [15:20:52] (03Merged) 10jenkins-bot: langid: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965190 (owner: 10Elukey) [15:21:23] (03PS4) 10Ilias Sarantopoulos: llm: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965192 (owner: 10Elukey) [15:22:12] yes definitely it is cpu throttled: [15:22:13] https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=revertrisk&var-pod=revertrisk-language-agnostic-predictor-default-00014-deplovnrr7&var-container=All&viewPanel=5 [15:24:22] aiko: just to be sure, can you deploy another isvc for rr-la in staging that uses kserve 0.10 (so the old docker image) ? [15:24:30] so we can test both at the same time [15:24:53] yes! good idea [15:26:36] also if you can, please keep doing the load test [15:27:44] ack [15:33:03] I am not sure if I am doing perf correctly, but I see that 40% of the time is spent in libgomp [15:33:33] that is OpenMP-related? [15:35:38] aiko: another qs - when you ran the test locally, did you limit docker's cpu? [15:37:22] elukey: have no idea about the libgomp..
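[Editor's note: the throttling shown in the Grafana panel above ultimately comes from the CFS statistics the kernel exposes per cgroup. A sketch of reading the throttle ratio from cgroup-v2 `cpu.stat` content — the sample values below are invented, and the path to a container's cgroup varies by runtime:]

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS enforcement periods in which the cgroup was throttled.

    Parses the cgroup-v2 ``cpu.stat`` format, i.e. the contents of
    /sys/fs/cgroup/<container-path>/cpu.stat. A ratio near 1.0 means the
    container hits its CPU limit in almost every scheduling period.
    """
    stats = {}
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods

# Sample cpu.stat content (values invented for illustration):
SAMPLE_CPU_STAT = """\
usage_usec 1234567
user_usec 1000000
system_usec 234567
nr_periods 1000
nr_throttled 400
throttled_usec 9876543
"""
```

This is the same `nr_throttled / nr_periods` ratio that the Kubernetes container dashboards plot from cAdvisor metrics.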
ahh I didn't limit docker's cpu :( [15:37:47] let's try to match lift wing's cpu and memory [15:39:44] ok [15:39:59] another isvc: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/965532 [15:41:24] checked in prod, I don't see the same libgomp [15:41:44] very weird [15:45:25] would it be related to the fix https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/965094? [15:46:52] in theory no [15:47:20] I see that libgomp is from xgboost [15:47:21] ./xgboost.libs/libgomp-a34b3233.so.1.0.0 [15:47:33] that I assume is brought in by KI? [15:48:03] yes it is the lib that the model uses [15:49:16] elukey: how did you check performance and find the lib? (just asking to learn) [15:49:30] isaranto: not sure if it is 100% correct, but I used perf [15:49:39] https://www.brendangregg.com/perf.html [15:49:43] directly on the node [15:49:52] now in theory it should be fine even if it is in a separate pid namespace [15:50:09] and for the lib, I used nsenter [15:50:28] it is "namespace enter" [15:50:30] aha, nice! [15:50:38] you can use it to check specific namespaces etc.. [15:50:44] root@ml-staging2001:/opt/lib/python/site-packages# find -name *gomp* [15:50:44] ./xgboost.libs/libgomp-a34b3233.so.1.0.0 [15:51:12] iiuc we don't have permissions to do that [15:51:38] exactly yes, only root [15:52:26] a ok, now I understood.
you attached to the node and ran it on the specific pid [15:52:35] exactly [15:52:49] tested locally after limiting the cpu to 1 and memory to 2G, the result looks unchanged [15:52:57] okok [15:53:06] so we can try to load test the old endpoint [15:53:19] one thing that I am reading is that xgboost uses openmp behind the scenes [15:53:27] and we should tune its number of threads [15:53:41] but I am not sure why we don't see this elsewhere [15:54:11] I am curious to see in the old pod if there is libgomp [15:54:25] maybe for some weird reason we are importing it with 0.11 [15:54:40] and it does some damage when running in k8s [15:55:01] but I noticed that usually we would get a warning about xgboost when starting the server, I didn't see that for the new docker image [15:55:14] https://www.irccloud.com/pastebin/RpJ7uPwm/ [15:55:59] lemme know when the old pod is ready so I can check its libs [15:56:13] okok [15:57:22] deploying [15:59:49] root@ml-staging2002:/opt/lib/python/site-packages# find -name *gomp* [15:59:49] ./xgboost.libs/libgomp-a34b3233.so.1.0.0 [15:59:57] same lib afaics [16:00:38] and same xgboost [16:00:58] load testing both endpoints [16:02:59] 0.10 doesn't get throttled [16:03:09] gimme a sec to run perf on it [16:04:26] yeah no trace of libgomp [16:06:09] so the problem is that libgomp [16:06:10] ok so this is the number of threads for 0.11 [16:06:11] elukey@ml-staging2001:~$ ps -eLf | grep 2220806 | wc -l [16:06:11] 237 [16:06:18] this is for 0.10 [16:06:18] elukey@ml-staging2002:~$ ps -eLf | grep 1113496 | wc -l [16:06:18] 8 [16:06:20] lol [16:07:08] huge difference.. [16:07:32] so I think that, somehow, I assume due to KI + KServe 0.11, we use a ton of threads for openmp [16:07:35] in xgboost [16:07:48] and this consumes all the cpu time available in no time -> Throttling [16:07:53] sry for not actively participating - working on solving some issues with langid and fasttext.
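[Editor's note: the `ps -eLf | grep <pid> | wc -l` comparison above (237 threads for the 0.11 pod vs 8 for the 0.10 pod) can also be done without root by reading `/proc/<pid>/status`; a small sketch:]

```python
from pathlib import Path

def thread_count(pid: int) -> int:
    """Return the number of threads of a process on Linux.

    Reads the ``Threads:`` field of /proc/<pid>/status, the same value
    that ``ps -o nlwp`` reports; unlike perf or nsenter, this requires
    no root privileges (for processes visible to the caller).
    """
    status = Path(f"/proc/{pid}/status").read_text()
    for line in status.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError(f"no Threads: field found for pid {pid}")
```

Inside a pod, `thread_count(1)` (or the server's own pid via `os.getpid()`) would have exposed the 237-vs-8 discrepancy directly from the application.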
if you need me anything ping me (I am following the discussion though - thanks for the updates) [16:08:03] sure! [16:08:13] I am going afk now, need to go aiko - we can restart tomorrow [16:08:21] isaranto: no worries :) [16:08:29] elukey: ok! [16:08:47] let's restart tmr, have a nice evening! [16:09:42] aiko: from what I can read we can play with the env var OMP_NUM_THREADS=X [16:09:45] to limit the threads [16:09:59] but I have zero ideas about why we have this behavior no [16:10:01] *now [16:10:07] anyway, have a nice rest of the day folks! [16:10:45] I'll look into that, thanks!! [16:10:50] bye luca :) [16:17:34] going afk too! cu tomorrow
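[Editor's note: one subtlety with the OMP_NUM_THREADS idea above is that libgomp reads the variable when it initializes, so the cap must be in place before xgboost is imported. A sketch — the value "1" is an example starting point, not a recommendation from the discussion:]

```python
import os

# Cap OpenMP's thread pool BEFORE importing any library that links
# libgomp; once libgomp has initialized, changing the variable in the
# environment has no effect on the already-created pool.
os.environ["OMP_NUM_THREADS"] = "1"

# Import xgboost only after the cap is set. Guarded so this sketch also
# runs where xgboost is not installed.
try:
    import xgboost  # noqa: F401  (this import pulls in libgomp)
except ImportError:
    pass
```

As an alternative, xgboost's own `nthread` parameter can bound threads per booster, but the environment variable catches every OpenMP user in the process, which matters when the runaway threads come in transitively (here, via the knowledge-integrity package).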