[08:07:13] killed eswiki-damaging again
[08:36:56] 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) There seems to be a metric to look for, namely [[ https://grafana.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?forceLogin&orgId=1&var-backend=eswiki-damaging-predictor-...
[08:37:04] didn't find a clue yet, but I think we have a good metric to alarm on --^
[09:21:29] Well, at least we have that. I'm not up to date on the kserve-11-for-isvcs effort, how was that going? Last I saw was a dep conflict between revscoring and kserve
[09:55:35] o/ We can start with this metric for testing and then update the alerts once we have more from kube metrics
[10:44:44] re: kserve11 - tomorrow we can release the new revscoring and roll out kserve 0.11 in staging
[10:45:53] in general, a sustained downstream-close metric indicates that clients are likely giving up after retrying, so it's probably good to alarm on anyway
[10:46:07] this case is the worst case, namely a blackhole
[10:52:01] aye, re: alarms
[14:44:40] 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) The last pod that I deleted on ml-serve-eqiad was `eswiki-damaging-predictor-default-00012-deployment-754bf46tdg6p`. Something interesting is that hours before an...
[14:44:51] found some interesting data while checking the dead pods --^
[15:41:13] Sounds like a Query Of Death
[16:04:27] killed eswiki-damaging again
[16:06:30] 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) Killed eswiki-damaging again.
[16:29:55] Shall we check whether increasing memory would help resolve this? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/958052
[16:29:55] ofc I agree with creating alerts for response flags
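
A minimal sketch of what an alert on those downstream-close response flags might look like. It assumes the Istio sidecar exposes the standard `istio_requests_total` metric with a `response_flags` label; the alert name, the flag selection (`DC|UC`), the 5% threshold, and the routing labels are illustrative placeholders, not the rule that was actually deployed:

```yaml
# Illustrative Prometheus alerting rule; metric names, thresholds and labels
# are assumptions, not the production Wikimedia alert.
groups:
  - name: ml-serve-istio-response-flags
    rules:
      - alert: IsvcSustainedDownstreamClose
        # Fires when a sustained share of requests to a predictor service is
        # terminated with downstream/upstream connection-close response flags
        # (clients giving up), which is the blackhole symptom described above.
        expr: |
          sum by (destination_service_name) (
            rate(istio_requests_total{response_flags=~"DC|UC"}[5m])
          )
          /
          sum by (destination_service_name) (
            rate(istio_requests_total[5m])
          )
          > 0.05
        for: 15m
        labels:
          severity: warning
          team: machine-learning
        annotations:
          summary: "{{ $labels.destination_service_name }} is dropping a sustained share of requests (possible blackholed isvc pod)"
```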