[00:09:31] FIRING: LiftWingServiceErrorRate: ...
[00:09:31] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[04:09:31] FIRING: LiftWingServiceErrorRate: ...
[04:09:31] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:09:31] FIRING: LiftWingServiceErrorRate: ...
[08:09:31] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:35:18] 06Machine-Learning-Team: Error in revscoring-editquality-damaging - itwiki-damaging-predictor-default - https://phabricator.wikimedia.org/T401109#11067410 (10OKarakaya-WMF) Looking into the logs, we also get this error with other wikis. However, the alarm is always about itwiki ` https://logstash.wikimedia.org/...
[08:36:20] 06Machine-Learning-Team: Numpy is not available in revertrisk_multilingual - https://phabricator.wikimedia.org/T401305#11067414 (10OKarakaya-WMF) 05Open→03Resolved a:03OKarakaya-WMF
[08:38:41] ozge_: o/
[08:39:13] one tip - in Logstash's top right corner there is a share functionality, with which you can create permalinks (short versions too)
[08:39:36] most of the time the full Logstash link is not something that leads to consistent results for others
[08:50:36] the itwiki issue seems to be related to some traffic that causes timeouts with the mediawiki api
[08:51:09] IIRC we set 5 seconds or similar, so it seems that somehow the current requests are heavy
[08:53:09] RESOLVED: LiftWingServiceErrorRate: ...
[08:53:09] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:53:36] from https://logstash.wikimedia.org/goto/6caf1b48928bcb3ff43b822a14c08034 it seems that almost all of the traffic is generated by ORES Legacy
[08:57:29] is there anybody looking into it? :)
[09:07:55] the "0" response code returned by the istio gateway indicates that a client gave up before getting the response
[09:24:42] Hey @elukey, this is the error we discussed here: https://phabricator.wikimedia.org/T401109 . Interesting that the alarm fires only for itwiki, and enwiki stats all look good: https://logstash.wikimedia.org/app/dashboards#/view/138271f0-40ce-11ed-bb3e-0bc9ce387d88?_g=h@1d21a65&_a=h@1c1da89
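(Note: the ~5 second MW API timeout and the slow requests discussed above can be checked by hand with a single timed prediction request. This is only a sketch: it assumes the public LiftWing route on the API Gateway rather than whatever internal path the team normally uses, and the rev_id is an arbitrary example, not a value taken from this incident.)

  # Minimal sketch: time one itwiki-damaging prediction to see whether it gets
  # close to the ~5s MW API timeout mentioned above. Public API Gateway route
  # assumed; the rev_id is an arbitrary example value.
  curl -s -X POST \
    -H 'Content-Type: application/json' \
    -d '{"rev_id": 1234567}' \
    -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
    'https://api.wikimedia.org/service/lw/inference/v1/models/itwiki-damaging:predict'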
[09:27:06] ozge_: my impression is that we receive some calls from the ores service, and that traffic somehow causes the latency to go up, leading to timeouts etc.
[09:27:36] is there anything specific about those rev-ids that may explain this? It doesn't seem compute related (it happened in the past, the current fix is to add multiprocessing)
[09:28:11] another view is https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=aWotKxQMz&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default-00028&from=now-12h&to=now&timezone=utc&var-response_code=$__all&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99
[09:28:23] we have both 200s and 0s (client giving up), but latency is horrible
[09:28:43] the endpoint that we use is the mw api, and it is shared across all other pods
[09:30:44] FIRING: LiftWingServiceErrorRate: ...
[09:30:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:43:16] so these alerts are currently firing since "0" responses are registered, namely clients giving up
[09:43:28] that may be noise, but in this case it may indicate a broader issue
[09:52:34] ozge@deploy1003:~$ kubectl get pods | grep itwiki
[09:52:34] itwiki-damaging-predictor-default-00028-deployment-55c7d6776t5s 3/3 Running 0 4m52s
[09:52:34] itwiki-damaging-predictor-default-00028-deployment-55c7d67fn4z9 3/3 Running 0 23d
[09:52:34] I see some pods get restarted in itwiki
[09:53:56] and I see many mwapi.errors.TimeoutError in itwiki. I'll try to check if we have legacy ORES requests only in itwiki
[09:54:27] but the previously failing requests are quite fast now (~0.5 seconds) if I try from local
[10:06:36] I've checked the logs of all the currently running pods. We have TimeoutError only in itwiki, although the other pods also actively get requests.
[10:07:37] @elukey do we have istio conf per wiki or is it more general?
[10:07:58] happy to jump on a call to get some help
[10:08:48] it is the same for all wikis, so I think there is something specific with the traffic that hits the pod
[10:09:03] maybe a revision that takes a long time to fetch, or similar
[10:09:16] I can check more after lunch
[10:22:43] https://www.irccloud.com/pastebin/AaKwZJRI/
[10:22:54] cool, thank you @elukey. I'm not sure if this is relevant, but the old pod does not have any successful requests and the new pod does not have any errors: kubectl logs itwiki-damaging-predictor-default-00028-deployment-55c7d67fn4z9 | grep "POST /v1/models/itwiki-damaging%3Apredict HTTP/1.1"
[10:22:54] kubectl logs itwiki-damaging-predictor-default-00028-deployment-55c7d67fn4z9 | grep TimeoutError
[10:23:19] this is interesting
[10:24:12] maybe we can try to scale down to 0 and back to 2 after lunch
[11:56:10] back!
[11:56:35] ozge_: this is a nice finding! What do "old" and "new" mean in this context? Before/after a deployment or a scale-up?
[11:57:21] sure, there are two itwiki pods. One of them was created 23 days ago, the other one was created today
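(Note: the two kubectl greps above generalise to a quick per-pod comparison, which is how the 23-day-old pod stood out. A minimal sketch along those lines; the grep patterns are copied from the commands pasted above, while the loop and the pod selection by name are illustrative.)

  # Rough sketch: for every itwiki-damaging pod, count predict request log
  # lines and MW API timeout errors, to spot a pod that only times out.
  # Assumes the default log container is the model server, as in the pasted
  # commands above.
  for pod in $(kubectl get pods -o name | grep itwiki-damaging-predictor); do
    logs=$(kubectl logs "$pod")
    requests=$(echo "$logs" | grep -c "POST /v1/models/itwiki-damaging%3Apredict HTTP/1.1")
    timeouts=$(echo "$logs" | grep -c "TimeoutError")
    echo "$pod predict_lines=$requests timeouts=$timeouts"
  done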
[11:57:50] it seems all the errors are from the one which was created 23 days ago
[11:58:49] it would be great to confirm it, and then we can try to rescale
[11:59:32] wow this is really weird
[12:00:22] I see 3 itwiki pods now, one recently created
[12:00:29] kubectl logs itwiki-damaging-predictor-default-00028-deployment-55c7d67fn4z9 | grep "POST /v1/models/itwiki-damaging%3Apredict HTTP/1.1"
[12:00:33] this is the old one
[12:00:45] and it does not have any successful responses
[12:00:58] elukey: drive-by, you can isolate the pod for examination by removing labels, and let the replicaset put a new one back in
[12:01:16] (03PS1) 10Bartosz Wójtowicz: outlink-topic-model: Introduce caching mechanism. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1176448 (https://phabricator.wikimedia.org/T356256)
[12:01:22] (it will not get any traffic then though)
[12:01:24] ozge_: ack!
[12:02:15] claime: thanks! I am currently trying to check the istio logs to see if anything pops up, it seems a clear indication that the sidecar pod got borked somehow
[12:02:25] I don't recall it happening in the past, it is not great
[12:03:55] elukey: yeah, that would seem like it at first glance. Is that the request to the pod actually timing out, or a backend call?
[12:04:34] in this case we don't use the mesh, but the istio "transparent" proxy (envoy based)
[12:04:46] and it handles all traffic, egress and ingress
[12:04:54] in this case it seems to be an issue contacting the mwapi
[12:05:50] 06Machine-Learning-Team, 07Epic, 13Patch-For-Review: Epic: Implement prototype inference service that uses Cassandra for request caching - https://phabricator.wikimedia.org/T356256#11067968 (10BWojtowicz-WMF) I've started working towards the [[ https://phabricator.wikimedia.org/T392833 | goal of making artic...
[12:06:16] saved logs and pod deleted
[12:08:02] the new pod looks healthy from the logs
[12:08:32] cool, I'll monitor for a while
[12:09:24] 06Machine-Learning-Team, 07Essential-Work, 13Patch-For-Review: Upgrade reability model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400352#11067977 (10BWojtowicz-WMF) @OKarakaya-WMF I agree, let's put this in blocked until we update the catboost version in [[ https://gitlab....
[12:10:44] RESOLVED: LiftWingServiceErrorRate: ...
[12:10:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[12:16:43] thank you! All active pods have some successful requests and no timeout errors. I'll keep monitoring for a while.
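(Note: for reference, the "remove labels to isolate the pod" tip above can look roughly like the sketch below. It is only a sketch: it assumes the ReplicaSet selects on the standard pod-template-hash label, and any Service/Knative selector labels would need the same treatment for traffic to actually stop, as claime points out.)

  # Sketch of pod isolation by label removal: orphaning the pod from its
  # ReplicaSet makes the controller start a replacement, while the old pod
  # keeps running so its logs and state can still be inspected.
  POD=itwiki-damaging-predictor-default-00028-deployment-55c7d67fn4z9

  # Deployment-managed ReplicaSets select on pod-template-hash (assumption:
  # standard selector); the trailing dash removes that label from the pod.
  kubectl label pod "$POD" pod-template-hash-

  # Save the logs before eventually deleting the pod, as was done above.
  kubectl logs "$POD" --all-containers=true > "${POD}.log"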
[12:28:13] 06Machine-Learning-Team, 07Essential-Work: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301#11068041 (10BWojtowicz-WMF) @elukey I think the idea was that we'd provide a `Makefile` such that everybody could run just `make model-upload ...`,
[13:01:14] 06Machine-Learning-Team, 07Essential-Work: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301#11068159 (10elukey) @BWojtowicz-WMF you can definitely add another script under say /usr/local/bin/ via puppet, that is called create-model-upload-...
[13:03:02] @elukey is the sidecar `istio-sidecar` used as the proxy for the connections to the mw-api? And if it fails, can we expect to get the errors we saw today? I'm looking into the configuration here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/ml-services/revscoring-editquality-damaging/values.yaml#48
[13:04:50] as it's a sidecar, each pod has its own sidecar instance.
[13:12:06] 06Machine-Learning-Team, 07Essential-Work: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301#11068219 (10BWojtowicz-WMF) @elukey I think it sounds like a good compromise. I'll go this way, thank you!
[13:36:50] I can’t make it to today’s knowledge sharing due to a physio appointment, so I’m wishing you all an amazing long weekend! <3
[13:47:10] 06Machine-Learning-Team: Investigate revertrisk threshold generation for enwiki - https://phabricator.wikimedia.org/T400590#11068338 (10gkyziridis) == Update == I managed to run the threshold analysis for `enwiki` successfully. The obstacle I faced was that I could not run the query for a bigger than one mont...
[15:08:08] 06Machine-Learning-Team: Error in revscoring-editquality-damaging - itwiki-damaging-predictor-default - https://phabricator.wikimedia.org/T401109#11068784 (10OKarakaya-WMF) We found out that all errors are from one pod that was created 23 days ago. Also, it never returned a successful response. It's likely that...
[15:41:08] 06Machine-Learning-Team, 06Data-Persistence, 06Growth-Team, 10Improve-Tone-Suggested-Edit, 07OKR-Work: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11068906 (10Michael)
[16:09:46] have a nice long weekend! o/
[16:33:30] 06Machine-Learning-Team: Error in revscoring-editquality-damaging - itwiki-damaging-predictor-default - https://phabricator.wikimedia.org/T401109#11069161 (10OKarakaya-WMF) 05Open→03Resolved
[16:34:48] hello, I'm closing the revscoring issue with the following note: https://phabricator.wikimedia.org/T401109#11068784 . @elukey please feel free to reopen if we need more investigation into the root cause. Have a nice extended weekend
[16:35:51] ozge_: +1 thanks! Have a good weekend too!
[16:36:32] ml-team: please note that I'll be on holidays for the next couple of weeks, so if you need SRE help please ping Ben (bt*ullis on IRC) or any other SRE in #wikimedia-sre
[16:37:31] have a nice holiday @elukey. I'll be on holidays next week as well.
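(Closing note on the sidecar question above at 13:03: the per-pod Istio sidecar keeps its own logs, separate from the model server, so its view of the outbound MW API calls can be inspected directly. A sketch only; `istio-proxy` is the usual Istio container name and is an assumption here, since the actual name is defined in the deployment-charts values linked above.)

  # Pick one itwiki pod (any of the currently running ones).
  POD="$(kubectl get pods -o name | grep itwiki-damaging-predictor | head -n 1)"

  # List the containers in the pod; one of them should be the Envoy-based
  # Istio sidecar discussed above.
  kubectl get "$POD" -o jsonpath='{.spec.containers[*].name}{"\n"}'

  # Read the sidecar's access logs (container name assumed; adjust it to what
  # the previous command prints). Outbound MW API calls, their response codes
  # and timeout flags show up here.
  kubectl logs "$POD" -c istio-proxy --tail=100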