[06:08:24] Good morning o/
[08:07:07] Good morning :D
[08:41:14] o/
[09:26:54] Good morning!
[09:28:29] \o
[09:39:56] o/
[10:42:36] * isaranto lunch!
[10:59:33] ditto
[12:39:32] isaranto: o/
[12:39:40] \o
[12:39:59] I am curious about the mistral crashloop - what is the issue? From the logs it is not super clear that something is failing
[12:41:07] exactly, I couldn't tell either... however the best explanation I can find is this: https://phabricator.wikimedia.org/T365246#9826605
[12:41:23] that MI100 is not supported
[12:42:22] could it be that we need a longer readiness probe time?
[12:42:31] not supported by vllm, I mean. So I'm working on allowing the huggingface server cmd argument that disables vllm to be passed in
[12:42:52] didn't think about it, I can check
[12:44:49] isaranto: we can quickly change it and bump it to say 900
[12:45:28] yes, I'm checking now where I should change that
[12:45:46] just done it :)
[12:46:08] so this is how to find it
[12:46:19] if you describe the pod, you'll see something like
[12:46:26] Readiness probe failed: Get "http://10.194.61.200:15020/app-health/queue-proxy/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[12:46:37] now, that :15020 is an istio port
[12:47:18] and on the isvc we have an annotation like proxy.istio.io/config: '{"readiness.status.sidecar.istio.io/periodSeconds": "600"}'
[12:47:41] so the kubelet calls the istio container, which in turn calls the queue-proxy, which in turn contacts the kserve container
[12:47:47] basically the regular chain of calls
[12:48:10] IIRC it is the istio annotation that the kubelet cares about when deciding whether to crashloop or not
[12:48:21] I've manually bumped it to 15 mins basically
[12:48:39] okk, thanks for the explanation (and for changing it!)
[12:48:43] not sure if it is the issue, I also see that the kserve container returns 500
[12:48:50] but the absence of logs is really puzzling
[12:54:09] yes, and I initially thought it was nice that we at least see some logs
[12:58:18] trying another setting
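For reference, the annotation elukey describes lives on the InferenceService metadata. A minimal sketch, assuming a hypothetical isvc name and image; the 900s value matches the 15-minute bump mentioned above:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mistral-7b  # hypothetical name
  annotations:
    # Configures the istio sidecar readiness probe (the :15020/:15021
    # endpoints the kubelet hits); 900s = the 15-minute bump above.
    proxy.istio.io/config: '{"readiness.status.sidecar.istio.io/periodSeconds": "900"}'
spec:
  predictor:
    containers:
      - name: kserve-container
        image: registry.example.org/huggingfaceserver:latest  # hypothetical image
```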
[13:00:29] Lift-Wing, Machine-Learning-Team, ORES, ChangeProp, and 5 others: Selectively disable changeprop functionality that is no longer used - https://phabricator.wikimedia.org/T361483#9829458 (achou) Hi @dcausse @EBernhardson, I just wanted to sync with you whether it is acceptable to lose some events...
[13:03:51] so it failed again, with no visible explanation
[13:04:25] it is interesting that there are two health probe failures, ports 15021 and 15020
[13:05:49] so it may be that the istio-proxy container is causing the failure
[13:08:35] I see this in the logs: `{"level":"error","time":"2024-05-24T13:07:00.501235Z","msg":"Request to probe app failed: Get \"http://10.194.61.221:8012/\": context deadline exceeded (Client.Timeout exceeded while awaiting headers), original URL path = /app-health/queue-proxy/readyz\napp URL path = /"}`
[13:08:42] of istio-proxy
[13:08:51] something is not right though
[13:08:52] Readiness: http-get http://:15020/app-health/queue-proxy/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
[13:08:55] Readiness: http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
[13:09:17] if you look at the values, even for 15020 they are very quick
[13:09:26] I feel that I have already seen this issue
[13:12:09] okok, so I added the correct annotation config
[13:12:12] now I see
[13:12:12] Readiness: http-get http://:15020/app-health/queue-proxy/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
[13:12:15] Readiness: http-get http://:15021/healthz/ready delay=600s timeout=3s period=900s #success=1 #failure=30
[13:12:37] the 15020 one is still very quick, let's see if the kserve one is enough to delay the crash loop
[13:12:57] (the above values are available via kubectl describe pod etc.)
[13:14:24] ack
[13:17:03] ok, now the readiness probe failures are only for the istio-proxy
[13:17:46] I see this on the queue-proxy: Readiness: http-get http://:15020/app-health/queue-proxy/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
[13:22:11] mmm where?
[13:22:41] when I do kubectl describe pod, on the queue-proxy
[13:22:42] w8
[13:24:29] and now the queue-proxy container has an error `aggressive probe error (failed 202 times): dial tcp 127.0.0.1:8080: connect: connection refused
[13:24:29] timed out waiting for the condition`
[13:25:44] mercelisv: o/
[13:26:15] regarding the newer models, we can first add them to the package and then add validation as a second step.
[13:26:41] I mean later on. lemme create an example first and we can do it afterwards
[13:33:28] isaranto: ah! I found a way!
[13:33:33] you are totally right
[13:33:54] so if we define the readiness probe for the kserve container, then the queue-proxy one (that you pointed out) gets updated
[13:34:15] let's see if it works
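A minimal sketch of that fix on the isvc predictor, with illustrative probe values; per the exchange above, the kserve-container probe settings get propagated to the queue-proxy probe that actually gates readiness, and 8080 is the port the queue-proxy was dialing in the 13:24 error:

```yaml
spec:
  predictor:
    containers:
      - name: kserve-container
        readinessProbe:
          httpGet:
            path: /
            port: 8080            # the port queue-proxy dials (see 13:24 error)
          initialDelaySeconds: 600  # give the model time to load
          periodSeconds: 60
          failureThreshold: 15
```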
[13:37:12] elukey: I need some advice on sth! I want to pass a bunch of cmd arguments to the entrypoint so that I can change the command that we run
[13:37:50] the kserve-container entrypoint?
[13:38:05] Machine-Learning-Team, serviceops, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9829677 (elukey) I checked the dragonfly repo and I have a question about building for bookworm (didn't find it in https://wikitech.wikimedia.org/wiki/Dragonf...
[13:39:33] e.g. `python -m huggingfaceserver --backend=huggingface`. So a way to do that would be to add it as an env var so we can change it directly on the deployment, e.g. `python -m huggingfaceserver --backend="$BACKEND"`. Since there will be 5-6 such values, I found a way to do it with an associative array in bash instead of many if/else statements
[13:39:37] any other hints?
[13:39:48] lemme push my patch to explain what I mean
[13:40:02] yes, in the kserve-container entrypoint
[13:41:09] (PS1) Ilias Sarantopoulos: huggingface: add cmd args as environment variables [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1035781
[13:41:52] chatgpt helped me :) I've already tested this and it works
[13:42:12] ahahahaha
[13:44:12] Machine-Learning-Team, serviceops, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9829691 (MoritzMuehlenhoff) >>! In T365253#9829677, @elukey wrote: > I checked the dragonfly repo and I have a question about building for bookworm (didn't fi...
[13:47:03] (CR) CI reject: [V:-1] huggingface: add cmd args as environment variables [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1035781 (owner: Ilias Sarantopoulos)
[13:48:03] it def needs some more love and comments. the thing is that things get more and more complex this way.
[13:48:32] I was thinking that perhaps a better alternative would be to just override the entrypoint through the values.yaml in the chart. wdyt?
[13:49:55] in theory yes, if you just need to override it
[13:50:04] I mean, if you don't need to use a bash script in values.yaml
[13:50:14] but if it is just a list, seems good
[13:50:18] I've never done it IIRC
[13:52:17] yes, I'm thinking this is a better approach. Otherwise we're just creating a variable to manipulate the command instead of changing the command directly
[13:58:57] so, same stuff, still failing...
[14:03:33] I am trying a different thing, namely max-revision-timeout-seconds in knative
[14:03:45] it always fails after 5 mins, and that value defaults to 300s
[14:03:48] I bumped it to 600
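In upstream Knative those timeouts live in the config-defaults ConfigMap; a sketch assuming the stock layout (the WMF deployment-charts may template this differently):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: knative-serving
data:
  # the failures after exactly 5 minutes line up with the 300s default
  revision-timeout-seconds: "300"
  max-revision-timeout-seconds: "600"  # the value bumped above
```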
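And a rough sketch of the values.yaml override discussed at 13:48, assuming the chart exposes the container command as a plain list; only --backend is quoted above, the other flags (model name, dtype) are illustrative assumptions:

```yaml
# hypothetical values.yaml fragment for the huggingface isvc
predictor:
  command:
    - python
    - -m
    - huggingfaceserver
  args:
    - --model_name=mistral-7b   # hypothetical name
    - --backend=huggingface     # quoted above; avoids the vllm backend
    - --dtype=float16           # assumption: the dtype knob mentioned later for saving memory
```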
[14:10:29] (meeting)
[14:11:47] (me2)
[14:50:04] back!
[14:50:20] in the meantime, the new refactored pytorch images are on the registry
[14:53:34] ok!
[14:53:49] elukey: I removed the gpu from mistral
[14:54:04] I put it on the bert model to check if anything is different
[14:55:24] okok!
[14:55:54] going afk, will be back to check it in 30' before I wrap up for the weekend
[14:56:04] will leave things in their deployment-charts state after that
[16:07:25] ok, it worked with bert!
[16:07:44] so I'm pretty sure it has to do with either model architecture or size.
[16:08:03] * elukey nods
[16:08:23] I didn't have time to check, but something is causing the revision to fail for mistral
[16:08:29] I'll work on changing the entrypoint so that we can experiment with different configs (using a different dtype -> less memory)
[16:08:42] it is not the readiness probe anymore, and we got a longer revision timeout, so it must be something else
[16:08:44] FIRING: LiftWingServiceErrorRate: ...
[16:08:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[16:10:07] we have 4 pods running, sigh
[16:10:22] I think it has to do with vllm and the whole huggingface/kserve upgrade. Since now it is just using the CPU (I use the GPU on bert), it should be working as before
[16:10:26] * isaranto sighs
[16:11:20] I am wondering one thing - could we expand the response json to include a warning/admin message?
[16:11:21] at least Murphy's Law holds. an alert on Friday evening!
[16:11:23] :D
[16:11:39] Like "Please follow up with us to figure out if RR could be used instead etc.."
[16:11:44] maybe pointing to a wikitech page
[16:12:01] we could roll it out to all revscoring-editquality models
[16:12:12] so at least people may get back to us if they see the msg
[16:13:52] so the extra instances are not kicking in, the rps are now probably
[16:14:44] > Like "Please follow up with us to figure out if RR could be used instead etc.."
[16:14:44] good suggestion. I'll follow up on that
[16:15:09] ack :)
[16:15:20] cpu was maxed out on most of the containers, now only on one
[16:18:44] RESOLVED: LiftWingServiceErrorRate: ...
[16:18:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[16:20:16] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9830133 (isarantopoulos) Task {T365253} fixed the issue mentioned above in ml-staging-codfw. After that the bert model works perfectly, while we're having issues with Mistral (more...
[16:23:34] Machine-Learning-Team: Append wikitech link and contact info to revscoring model servers - https://phabricator.wikimedia.org/T365834 (isarantopoulos) NEW
[16:37:22] going afk! o/
[16:45:23] o/
[16:47:06] Machine-Learning-Team, Goal: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU - https://phabricator.wikimedia.org/T362670#9830343 (isarantopoulos)
[16:47:10] Machine-Learning-Team, Patch-For-Review: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) - https://phabricator.wikimedia.org/T365246#9830342 (isarantopoulos)
[16:47:31] Machine-Learning-Team: Update Pytorch base image to 2.3.0 - https://phabricator.wikimedia.org/T365166#9830346 (isarantopoulos)
[16:54:13] Machine-Learning-Team: Allow setting huggingfaceserver cmd args from deployment-charts - https://phabricator.wikimedia.org/T365842 (isarantopoulos) NEW
[16:54:30] going afk as well folks, o/
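A footnote on the warning/admin message idea from 16:11 (now T365834): the response could carry an extra top-level field next to the scores. A sketch only; the score payload below is an illustrative guess at the revscoring response shape, and the exact wording and wikitech link would still need to be decided:

```json
{
  "viwiki": {
    "scores": {
      "1234567": {
        "reverted": {
          "score": {
            "prediction": false,
            "probability": {"false": 0.92, "true": 0.08}
          }
        }
      }
    },
    "warning": "Please follow up with the ML team to check whether Revert Risk could be used instead: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing"
  }
}
```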