[08:07:13] killed eswiki-damaging again
[08:36:56] 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) There seems to be a metric to look for, namely [[ https://grafana.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?forceLogin&orgId=1&var-backend=eswiki-damaging-predictor-...
[08:37:04] didn't find a clue yet, but I think we have a good metric to alarm on --^
[09:21:29] Well, at least we have that. I'm not up to date on the kserve-11-for-isvcs effort, how was that going? Last I saw was a dep conflict between revscoring and kserve
[09:55:35] o/ We can start with this metric for testing and then update the alerts once we have more from kube metrics
[10:44:44] re: kserve11 - tomorrow we can release the new revscoring and roll out kserve 0.11 in staging
[10:45:53] in general, a sustained downstream-close metric indicates that clients are likely giving up after retrying, so it's probably good to alarm on anyway
[10:46:07] this case is the worst case, namely a blackhole
[10:52:01] aye, re: alarms
[14:44:40] 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) The last pod that I deleted on ml-serve-eqiad was `eswiki-damaging-predictor-default-00012-deployment-754bf46tdg6p`. Something interesting is that hours before an...
[14:44:51] found some interesting data while checking the dead pods --^
[15:41:13] Sounds like a Query Of Death
[16:04:27] killed eswiki-damaging again
[16:06:30] 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) Killed eswiki-damaging again.
[16:29:55] Shall we check whether increasing memory would help resolve this? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/958052
[16:29:55] ofc I agree with creating alerts for response flags
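
A minimal sketch of what an alert on those downstream-close response flags might look like. It assumes the Istio sidecar exposes the standard `istio_requests_total` metric with a `response_flags` label; the alert name, the flag selection (`DC|UC`), the 5% threshold, and the routing labels are illustrative placeholders, not the rule that was actually deployed:

```yaml
# Illustrative Prometheus alerting rule; metric names, thresholds and labels
# are assumptions, not the production Wikimedia alert.
groups:
  - name: ml-serve-istio-response-flags
    rules:
      - alert: IsvcSustainedDownstreamClose
        # Fires when a sustained share of requests to a predictor service is
        # terminated with downstream/upstream connection-close response flags
        # (clients giving up), which is the blackhole symptom described above.
        expr: |
          sum by (destination_service_name) (
            rate(istio_requests_total{response_flags=~"DC|UC"}[5m])
          )
          /
          sum by (destination_service_name) (
            rate(istio_requests_total[5m])
          )
          > 0.05
        for: 15m
        labels:
          severity: warning
          team: machine-learning
        annotations:
          summary: "{{ $labels.destination_service_name }} is dropping a sustained share of requests (possible blackholed isvc pod)"
```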