[06:08:24] Good morning o/
[08:07:07] Good morning :D
[08:41:14] o/
[09:26:54] Good morning!
[09:28:29] \o
[09:39:56] o/
[10:42:36] * isaranto lunch!
[10:59:33] ditto
[12:39:32] isaranto: o/
[12:39:40] \o
[12:39:59] I am curious about the mistral crashloop - what is the issue? From the logs it is not super clear that something is failing
[12:41:07] exactly, I couldn't tell either... however the best explanation I can find is this: https://phabricator.wikimedia.org/T365246#9826605
[12:41:23] that MI100 is not supported
[12:42:22] could it be that we need a longer readiness probe time?
[12:42:31] not supported by vllm, I mean. So I'm working on allowing the huggingface server cmd argument that disables vllm to be passed in
[12:42:52] didn't think about it, I can check
[12:44:49] isaranto: we can quickly change it and bump it to say 900
[12:45:28] yes, I'm checking now where I should change that
[12:45:46] just done it :)
[12:46:08] so this is how to find it
[12:46:19] if you describe the pod, you'll see something like
[12:46:26] Readiness probe failed: Get "http://10.194.61.200:15020/app-health/queue-proxy/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[12:46:37] now, that :15020 is an istio port
[12:47:18] and on the isvc we have an annotation like proxy.istio.io/config: '{"readiness.status.sidecar.istio.io/periodSeconds": "600"}'
[12:47:41] so the kubelet calls the istio container, which in turn calls the queue-proxy, which in turn contacts the kserve container
[12:47:47] basically the regular chain of calls
[12:48:10] IIRC it is the istio annotation that the kubelet cares about when deciding whether to crashloop or not
[12:48:21] I've manually bumped it to 15 mins basically
[12:48:39] okk, thanks for the explanation (and for changing it!)
[12:48:43] not sure if it is the issue, I also see that the kserve container returns 500
[12:48:50] but the absence of logs is really puzzling
[12:54:09] yes, and I initially thought it was nice that we at least see some logs
[12:58:18] trying another setting
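For reference, the annotation elukey describes lives on the InferenceService metadata. A minimal sketch, assuming a hypothetical isvc name and image; the 900s value matches the 15-minute bump mentioned above:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mistral-7b  # hypothetical name
  annotations:
    # Configures the istio sidecar readiness probe (the :15020/:15021
    # endpoints the kubelet hits); 900s = the 15-minute bump above.
    proxy.istio.io/config: '{"readiness.status.sidecar.istio.io/periodSeconds": "900"}'
spec:
  predictor:
    containers:
      - name: kserve-container
        image: registry.example.org/huggingfaceserver:latest  # hypothetical image
```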
[13:00:29] Lift-Wing, Machine-Learning-Team, ORES, ChangeProp, and 5 others: Selectively disable changeprop functionality that is no longer used - https://phabricator.wikimedia.org/T361483#9829458 (achou) Hi @dcausse @EBernhardson, I just wanted to sync with you whether it is acceptable to lose some events...
[13:03:51] so it failed again, with no visible explanation
[13:04:25] it is interesting that there are two health probe failures, ports 15021 and 15020
[13:05:49] so it may be that the istio-proxy container is causing the failure
[13:08:35] I see this in the logs: `{"level":"error","time":"2024-05-24T13:07:00.501235Z","msg":"Request to probe app failed: Get \"http://10.194.61.221:8012/\": context deadline exceeded (Client.Timeout exceeded while awaiting headers), original URL path = /app-health/queue-proxy/readyz\napp URL path = /"}`
[13:08:42] of istio-proxy
[13:08:51] something is not right though
[13:08:52] Readiness: http-get http://:15020/app-health/queue-proxy/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
[13:08:55] Readiness: http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
[13:09:17] if you look at the values, even for 15020 they are very quick
[13:09:26] I feel that I have already seen this issue
[13:12:09] okok, so I added the correct annotation config
[13:12:12] now I see
[13:12:12] Readiness: http-get http://:15020/app-health/queue-proxy/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
[13:12:15] Readiness: http-get http://:15021/healthz/ready delay=600s timeout=3s period=900s #success=1 #failure=30
[13:12:37] the 15020 one is still very quick, let's see if the kserve one is enough to delay the crash loop
[13:12:57] (the above values are available via kubectl describe pod etc.)
[13:14:24] ack
[13:17:03] ok, now the readiness probe failures are only for the istio-proxy
[13:17:46] I see this on the queue-proxy: Readiness: http-get http://:15020/app-health/queue-proxy/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
[13:22:11] mmm where?
[13:22:41] when I do kubectl describe pod, on the queue-proxy
[13:22:42] w8
[13:24:29] and now the queue-proxy container has an error `aggressive probe error (failed 202 times): dial tcp 127.0.0.1:8080: connect: connection refused
[13:24:29] timed out waiting for the condition`
[13:25:44] mercelisv: o/
[13:26:15] regarding the newer models, we can first add them to the package and then add validation as a second step.
[13:26:41] I mean later on. lemme create an example first and we can do it afterwards
[13:33:28] isaranto: ah! I found a way!
[13:33:33] you are totally right
[13:33:54] so if we define the readiness probe for the kserve container, then the queue-proxy one (that you pointed out) gets updated
[13:34:15] let's see if it works
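A minimal sketch of that fix on the isvc predictor, with illustrative probe values; per the exchange above, the kserve-container probe settings get propagated to the queue-proxy probe that actually gates readiness, and 8080 is the port the queue-proxy was dialing in the 13:24 error:

```yaml
spec:
  predictor:
    containers:
      - name: kserve-container
        readinessProbe:
          httpGet:
            path: /
            port: 8080            # the port queue-proxy dials (see 13:24 error)
          initialDelaySeconds: 600  # give the model time to load
          periodSeconds: 60
          failureThreshold: 15
```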
[13:37:12] elukey: I need some advice on sth! I want to pass a bunch of cmd arguments to the entrypoint so that I can change the command that we run
[13:37:50] the kserve-container entrypoint?
[13:38:05] Machine-Learning-Team, serviceops, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9829677 (elukey) I checked the dragonfly repo and I have a question about building for bookworm (didn't find it in https://wikitech.wikimedia.org/wiki/Dragonf...
[13:39:33] e.g. `python -m huggingfaceserver --backend=huggingface`. So a way to do that would be to add it as an env var so we can change it directly on the deployment, e.g. `python -m huggingfaceserver --backend="$BACKEND"`. Since there will be 5-6 such values, I found a way to do it with an associative array in bash instead of many if/else statements
[13:39:37] any other hints?
[13:39:48] lemme push my patch to explain what I mean
[13:40:02] yes, in the kserve-container entrypoint
[13:41:09] (PS1) Ilias Sarantopoulos: huggingface: add cmd args as environment variables [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1035781
[13:41:52] chatgpt helped me :) I've already tested this and it works
[13:42:12] ahahahaha
[13:44:12] Machine-Learning-Team, serviceops, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9829691 (MoritzMuehlenhoff) >>! In T365253#9829677, @elukey wrote: > I checked the dragonfly repo and I have a question about building for bookworm (didn't fi...
[13:47:03] (CR) CI reject: [V:-1] huggingface: add cmd args as environment variables [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1035781 (owner: Ilias Sarantopoulos)
[13:48:03] it def needs some more love and comments. the thing is that things get more and more complex this way.
[13:48:32] I was thinking that perhaps a better alternative would be to just override the entrypoint through the values.yaml in the chart. wdyt?
[13:49:55] in theory yes, if you just need to override it
[13:50:04] I mean, if you don't need to use a bash script in values.yaml
[13:50:14] but if it is just a list, seems good
[13:50:18] I've never done it IIRC
[13:52:17] yes, I'm thinking this is a better approach. Otherwise we're just creating a variable to manipulate the command instead of changing the command directly
[13:58:57] so, same stuff, still failing...
[14:03:33] I am trying a different thing, namely max-revision-timeout-seconds in knative
[14:03:45] it always fails after 5 mins, and that value defaults to 300s
[14:03:48] I bumped it to 600
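In upstream Knative those timeouts live in the config-defaults ConfigMap; a sketch assuming the stock layout (the WMF deployment-charts may template this differently):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: knative-serving
data:
  # the failures after exactly 5 minutes line up with the 300s default
  revision-timeout-seconds: "300"
  max-revision-timeout-seconds: "600"  # the value bumped above
```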
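And a rough sketch of the values.yaml override discussed at 13:48, assuming the chart exposes the container command as a plain list; only --backend is quoted above, the other flags (model name, dtype) are illustrative assumptions:

```yaml
# hypothetical values.yaml fragment for the huggingface isvc
predictor:
  command:
    - python
    - -m
    - huggingfaceserver
  args:
    - --model_name=mistral-7b   # hypothetical name
    - --backend=huggingface     # quoted above; avoids the vllm backend
    - --dtype=float16           # assumption: the dtype knob mentioned later for saving memory
```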
[14:10:29] (meeting)
[14:11:47] (me2)
[14:50:04] back!
[14:50:20] in the meantime, the new refactored pytorch images are on the registry
[14:53:34] ok!
[14:53:49] elukey: I removed the gpu from mistral
[14:54:04] I put it on the bert model to check if anything is different
[14:55:24] okok!
[14:55:54] going afk, will be back to check it in 30' before I wrap up for the weekend
[14:56:04] will leave things in their deployment-charts state after that
[16:07:25] ok, it worked with bert!
[16:07:44] so I'm pretty sure it has to do with either model architecture or size.
[16:08:03] * elukey nods
[16:08:23] I didn't have time to check, but something is causing the revision to fail for mistral
[16:08:29] I'll work on changing the entrypoint so that we can experiment with different configs (using a different dtype -> less memory)
[16:08:42] it is not the readiness probe anymore, and we got a longer revision timeout, so it must be something else
[16:08:44] FIRING: LiftWingServiceErrorRate: ...
[16:08:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[16:10:07] we have 4 pods running, sigh
[16:10:22] I think it has to do with vllm and the whole huggingface/kserve upgrade. Since now it is just using the CPU (I use the GPU on bert), it should be working as before
[16:10:26] * isaranto sighs
[16:11:20] I am wondering one thing - could we expand the response json to include a warning/admin message?
[16:11:21] at least Murphy's Law holds. an alert on Friday evening!
[16:11:23] :D
[16:11:39] Like "Please follow up with us to figure out if RR could be used instead etc.."
[16:11:44] maybe pointing to a wikitech page
[16:12:01] we could roll it out to all revscoring-editquality models
[16:12:12] so at least people may get back to us if they see the msg
[16:13:52] so the extra instances are not kicking in, the rps are now probably
[16:14:44] > Like "Please follow up with us to figure out if RR could be used instead etc.."
[16:14:44] good suggestion. I'll follow up on that
[16:15:09] ack :)
[16:15:20] cpu was maxed out on most of the containers, now only on one
[16:18:44] RESOLVED: LiftWingServiceErrorRate: ...
[16:18:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[16:20:16] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9830133 (isarantopoulos) Task {T365253} fixed the issue mentioned above in ml-staging-codfw. After that the bert model works perfectly, while we're having issues with Mistral (more...
[16:23:34] Machine-Learning-Team: Append wikitech link and contact info to revscoring model servers - https://phabricator.wikimedia.org/T365834 (isarantopoulos) NEW
[16:37:22] going afk! o/
[16:45:23] o/
[16:47:06] Machine-Learning-Team, Goal: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU - https://phabricator.wikimedia.org/T362670#9830343 (isarantopoulos)
[16:47:10] Machine-Learning-Team, Patch-For-Review: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) - https://phabricator.wikimedia.org/T365246#9830342 (isarantopoulos)
[16:47:31] Machine-Learning-Team: Update Pytorch base image to 2.3.0 - https://phabricator.wikimedia.org/T365166#9830346 (isarantopoulos)
[16:54:13] Machine-Learning-Team: Allow setting huggingfaceserver cmd args from deployment-charts - https://phabricator.wikimedia.org/T365842 (isarantopoulos) NEW
[16:54:30] going afk as well folks, o/
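A footnote on the warning/admin message idea from 16:11 (now T365834): the response could carry an extra top-level field next to the scores. A sketch only; the score payload below is an illustrative guess at the revscoring response shape, and the exact wording and wikitech link would still need to be decided:

```json
{
  "viwiki": {
    "scores": {
      "1234567": {
        "reverted": {
          "score": {
            "prediction": false,
            "probability": {"false": 0.92, "true": 0.08}
          }
        }
      }
    },
    "warning": "Please follow up with the ML team to check whether Revert Risk could be used instead: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing"
  }
}
```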