[06:47:59] (PS1) Kevin Bazira: revert_risk_model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1067216 (https://phabricator.wikimedia.org/T369344)
[10:02:59] FIRING: LiftWingServiceErrorRate: ...
[10:02:59] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=nlwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:11:30] Enable MP on the service
[10:11:34] Enabled*
[10:17:59] RESOLVED: LiftWingServiceErrorRate: ...
[10:17:59] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=nlwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:42:21] klausman: ---^ thanks for taking care of this!
[10:44:21] I wanted to check the kserve logs to see if this is the same issue we had before, but I found the logs were missing during the firing (around 9:30 to 10:10 UTC)
[10:44:35] https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-component=All&var-namespace=revscoring-editquality-damaging&var-model_name=nlwiki-damaging&from=1724739600000&to=1724754000000
[10:44:54] has this happened before?
[10:45:08] I think I've seen nlwiki fail like this before, yes
[10:45:30] let me see if I can find something in Logstash
[10:46:19] thanks!
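The alert above fires when the service returns a high rate of responses whose status code is not 2xx, 3xx, or 400. As a toy illustration of that classification (the function names and the plain-ratio computation are invented here; the real alert is a Prometheus rate over istio metrics):

```python
def is_alertable(status: int) -> bool:
    """True for status codes the LiftWingServiceErrorRate alert counts as
    errors: anything that is not 2xx, 3xx, or 400 (e.g. 5xx, 404, 429)."""
    return not (200 <= status < 400 or status == 400)

def error_rate(statuses: list[int]) -> float:
    """Fraction of alertable responses in a window of observed status codes."""
    if not statuses:
        return 0.0
    return sum(is_alertable(s) for s in statuses) / len(statuses)
```

A 400 is excluded because it usually means a bad client request, not a service fault.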
[10:54:16] https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-webrequest-1-7.0.0-1-2024.08.27?id=TE5Ek5EB-wz6FOsq0ZZl seems to be the first slow request
[10:54:43] Unfortunately, Logstash doesn't have the query string logged, so I have no idea what rev id that was
[11:28:17] * klausman lunch
[11:41:46] Lift-Wing, Machine-Learning-Team: Request to host the Reference Need Model on LiftWing - https://phabricator.wikimedia.org/T371902#10095533 (Aitolkyn)
[12:37:45] Lift-Wing, Machine-Learning-Team: Request to host the Reference Need Model on LiftWing - https://phabricator.wikimedia.org/T371902#10095713 (achou) Hi @Aitolkyn, could you provide the location of the model (e.g. directory on stat100x or a google drive link) and its sha512 checksum? You can generate the s...
[13:30:22] I created a page https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Incident to record incidents. I think it would be good to have a place where we collect information for future reference and retrospectives
[13:30:44] :+1:
[13:31:33] feel free to add information you find useful
[13:39:47] klausman: could you send a patch to deployment-charts recording that you enabled MP for nlwiki-damaging?
[13:39:52] yep
[13:42:06] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1067353
[14:04:09] aiko: my 2c - I'd keep the list private in Phab to avoid exposing too much info
[14:05:12] (o/)
[14:11:11] Machine-Learning-Team, Goal: Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that is fast in production. - https://phabricator.wikimedia.org/T371395#10096186 (calbon) - GPU hosts are racked but not set up yet - Software side slower
[14:15:20] Machine-Learning-Team, Goal: Goal 2: People outside the ML team can ssh into an ml-lab machine, run a Jupyter Notebook, and run PyTorch powered by a GPU. - https://phabricator.wikimedia.org/T371396#10096213 (calbon) Update: - machines are racked but not set up.
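The sha512 checksum requested in the model-hosting task above is typically produced with `sha512sum <file>` on a stat host; an equivalent chunked computation in Python (the file path is only an illustration):

```python
import hashlib

def sha512_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hex sha512 digest of a file without loading it whole."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so multi-GB model files don't exhaust memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage (hypothetical path): sha512_of_file("/home/user/model.bin")
```

Reading in chunks matters because model binaries are often larger than available RAM on shared hosts.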
Will set up one first to figure out dis...
[14:18:06] elukey: speaking of, was there a specific reason the request logging we do did not end up in Logstash?
[14:18:47] Machine-Learning-Team, Goal: Goal 3: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services. - https://phabricator.wikimedia.org/T371397#10096222 (calbon) Update: - Slow revscoring, started logging queries on the pod side, so that is gone when the pod is killed. -...
[14:19:41] klausman: not that I can think of, besides the kserve container not logging for some reason
[14:21:21] (assuming it didn't log anything at all)
[14:22:12] it did log, but only in the pod, and once that is gone, it's gone for good, AIUI (unless --previous works, which it usually doesn't)
[14:24:46] hmm, now that I check it, the pod in question above does not log the request :-/
[14:26:14] I distinctly remember some isvc logging requests, but I don't know if that's behind a feature flag. At any rate, I've never seen query strings in Logstash
[14:27:39] so the logs are here https://logstash.wikimedia.org/goto/b82b34b72340b438e4fc375ec4233460
[14:27:45] but there are gaps
[14:28:12] probably when the kserve container was completely hanging
[14:28:25] but before that, it should be possible to find slow requests
[14:28:44] Do those logs contain the query string? I could only find the request ID
[14:28:53] Machine-Learning-Team, Goal: Goal 4: Support product teams in deploying production models. - https://phabricator.wikimedia.org/T371398#10096294 (calbon) Update: - Recommendation API is live and in production - Recently been [[ https://phabricator.wikimedia.org/T370762#10087554 | supporting structure...
[14:29:29] we don't use query strings, only json payloads
[14:30:22] https://logstash.wikimedia.org/goto/4358d5f76143490bae26639b09a1d42f is what was logged until 9:30 UTC, which I assume was when the kserve container on ml-serve1007 got in trouble
[14:30:34] the rev id is logged afaics
[14:31:36] aaaah, my bad, I had thought it would be a separate field
[14:35:07] So the first request I found that looked slow was eae62036-dbdb-9539-9123-5e3e8d85dec8
[14:35:17] correction: 6cc8fe41-7d54-46be-a6d2-8894cfd6f015
[14:37:42] But I can't find what the request JSON was
[14:39:23] Oh man, I was looking at istio logs instead of kserve container logs *facepalm*
[14:45:06] elukey: o/ that's a good point 🫢
[14:56:49] I moved the info to the Phab task
[15:17:13] ty!
[21:33:46] I added some methods to get the most up-to-date hosts w/ GPUs to https://wikitech.wikimedia.org/wiki/Machine_Learning/AMD_GPU , should help against doc rot a little ;P
[21:35:20] if there's a better way to do it, esp. one that doesn't require NDA access, feel free to update
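As the log notes, these services take a JSON payload rather than a query string, so the rev id has to be recovered from the logged request body. A minimal sketch of that extraction, assuming the revscoring-style `{"rev_id": <int>}` body shape mentioned in the discussion (the helper name is invented):

```python
import json
from typing import Optional

def extract_rev_id(logged_payload: str) -> Optional[int]:
    """Pull rev_id out of a logged request body, if it parses as JSON."""
    try:
        body = json.loads(logged_payload)
    except (json.JSONDecodeError, TypeError):
        return None
    # revscoring-style isvcs take {"rev_id": <int>, ...} as the request body.
    rev_id = body.get("rev_id") if isinstance(body, dict) else None
    return rev_id if isinstance(rev_id, int) else None
```

Returning `None` on malformed input keeps the function safe to run over raw, possibly truncated log lines.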