[06:47:59] (PS1) Kevin Bazira: revert_risk_model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1067216 (https://phabricator.wikimedia.org/T369344)
[10:02:59] FIRING: LiftWingServiceErrorRate: ...
[10:02:59] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=nlwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:11:30] Enable MP on the service
[10:11:34] Enabled*
[10:17:59] RESOLVED: LiftWingServiceErrorRate: ...
[10:17:59] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=nlwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:42:21] klausman: ---^ thanks for taking care of this!
[10:44:21] I wanted to check the kserve logs to see if this is the same issue we had before, but I found the logs were missing during the firing (around 9:30 to 10:10 UTC)
[10:44:35] https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-component=All&var-namespace=revscoring-editquality-damaging&var-model_name=nlwiki-damaging&from=1724739600000&to=1724754000000
[10:44:54] has this happened before?
[10:45:08] I think I've seen nlwiki fail like this before, yes
[10:45:30] let me see if I can find something in Logstash
[10:46:19] thanks!
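The alert above fires when the service returns a high rate of responses whose status code is not 2xx, 3xx, or 400. As a toy illustration of that classification (the function names and the plain-ratio computation are invented here; the real alert is a Prometheus rate over istio metrics):

```python
def is_alertable(status: int) -> bool:
    """True for status codes the LiftWingServiceErrorRate alert counts as
    errors: anything that is not 2xx, 3xx, or 400 (e.g. 5xx, 404, 429)."""
    return not (200 <= status < 400 or status == 400)

def error_rate(statuses: list[int]) -> float:
    """Fraction of alertable responses in a window of observed status codes."""
    if not statuses:
        return 0.0
    return sum(is_alertable(s) for s in statuses) / len(statuses)
```

A 400 is excluded because it usually means a bad client request, not a service fault.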
[10:54:16] https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-webrequest-1-7.0.0-1-2024.08.27?id=TE5Ek5EB-wz6FOsq0ZZl seems to be the first slow request
[10:54:43] Unfortunately, Logstash doesn't have the query string logged, so I have no idea what rev id that was
[11:28:17] * klausman lunch
[11:41:46] Lift-Wing, Machine-Learning-Team: Request to host the Reference Need Model on LiftWing - https://phabricator.wikimedia.org/T371902#10095533 (Aitolkyn)
[12:37:45] Lift-Wing, Machine-Learning-Team: Request to host the Reference Need Model on LiftWing - https://phabricator.wikimedia.org/T371902#10095713 (achou) Hi @Aitolkyn, could you provide the location of the model (e.g. directory on stat100x or a google drive link) and its sha512 checksum? You can generate the s...
[13:30:22] I created a page https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Incident to record incidents. I think it would be good to have a place where we collect information for future reference and retrospectives
[13:30:44] :+1:
[13:31:33] feel free to add information you find useful
[13:39:47] klausman: could you send a patch to deployment-charts recording that you enabled MP for nlwiki-damaging?
[13:39:52] yep
[13:42:06] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1067353
[14:04:09] aiko: my 2c - I'd keep the list private in Phab to avoid exposing too much info
[14:05:12] (o/)
[14:11:11] Machine-Learning-Team, Goal: Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that is fast in production. - https://phabricator.wikimedia.org/T371395#10096186 (calbon) - GPU hosts are racked but not set up yet - Software side slower
[14:15:20] Machine-Learning-Team, Goal: Goal 2: People outside the ML team can ssh into an ml-lab machine, run a Jupyter Notebook, and run PyTorch powered by a GPU. - https://phabricator.wikimedia.org/T371396#10096213 (calbon) Update: - machines are racked but not set up.
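The sha512 checksum requested in the model-hosting task above is typically produced with `sha512sum <file>` on a stat host; an equivalent chunked computation in Python (the file path is only an illustration):

```python
import hashlib

def sha512_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hex sha512 digest of a file without loading it whole."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so multi-GB model files don't exhaust memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage (hypothetical path): sha512_of_file("/home/user/model.bin")
```

Reading in chunks matters because model binaries are often larger than available RAM on shared hosts.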
Will set up one first to figure out dis...
[14:18:06] elukey: speaking of, was there a specific reason the request logging we do did not end up in Logstash?
[14:18:47] Machine-Learning-Team, Goal: Goal 3: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services. - https://phabricator.wikimedia.org/T371397#10096222 (calbon) Update: - Slow revscoring, started logging queries on the pod side, so that is gone when the pod is killed. -...
[14:19:41] klausman: not that I can think of, besides the kserve container not logging for some reason
[14:21:21] (assuming it didn't log anything at all)
[14:22:12] it did log, but only in the pod, and once that is gone, it's gone for good, AIUI (unless --previous works, which it usually doesn't)
[14:24:46] hmm, now that I check it, the pod in question above does not log the request :-/
[14:26:14] I distinctly remember some isvc logging requests, but I don't know if that's behind a feature flag. At any rate, I've never seen query strings in Logstash
[14:27:39] so the logs are here https://logstash.wikimedia.org/goto/b82b34b72340b438e4fc375ec4233460
[14:27:45] but there are gaps
[14:28:12] probably when the kserve container was completely hanging
[14:28:25] but before that, it should be possible to find slow requests
[14:28:44] Do those logs contain the query string? I could only find the request ID
[14:28:53] Machine-Learning-Team, Goal: Goal 4: Support product teams in deploying production models. - https://phabricator.wikimedia.org/T371398#10096294 (calbon) Update: - Recommendation API is live and in production - Recently been [[ https://phabricator.wikimedia.org/T370762#10087554 | supporting structure...
[14:29:29] we don't use query strings, only json payloads
[14:30:22] https://logstash.wikimedia.org/goto/4358d5f76143490bae26639b09a1d42f is what was logged until 9:30 UTC, which I assume was when the kserve container on ml-serve1007 got in trouble
[14:30:34] the rev id is logged afaics
[14:31:36] aaaah, my bad, I had thought it would be a separate field
[14:35:07] So the first request I found that looked slow was eae62036-dbdb-9539-9123-5e3e8d85dec8
[14:35:17] correction: 6cc8fe41-7d54-46be-a6d2-8894cfd6f015
[14:37:42] But I can't find what the request JSON was
[14:39:23] Oh man, I was looking at istio logs instead of kserve container logs *facepalm*
[14:45:06] elukey: o/ that's a good point 🫢
[14:56:49] I moved the info to the Phab task
[15:17:13] ty!
[21:33:46] I added some methods to get the most up-to-date hosts w/ GPUs to https://wikitech.wikimedia.org/wiki/Machine_Learning/AMD_GPU , should help against doc rot a little ;P
[21:35:20] if there's a better way to do it, esp. one that doesn't require NDA access, feel free to update
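As the log notes, these services take a JSON payload rather than a query string, so the rev id has to be recovered from the logged request body. A minimal sketch of that extraction, assuming the revscoring-style `{"rev_id": <int>}` body shape mentioned in the discussion (the helper name is invented):

```python
import json
from typing import Optional

def extract_rev_id(logged_payload: str) -> Optional[int]:
    """Pull rev_id out of a logged request body, if it parses as JSON."""
    try:
        body = json.loads(logged_payload)
    except (json.JSONDecodeError, TypeError):
        return None
    # revscoring-style isvcs take {"rev_id": <int>, ...} as the request body.
    rev_id = body.get("rev_id") if isinstance(body, dict) else None
    return rev_id if isinstance(rev_id, int) else None
```

Returning `None` on malformed input keeps the function safe to run over raw, possibly truncated log lines.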