[00:09:31] FIRING: LiftWingServiceErrorRate: ...
[00:09:31] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[04:09:31] FIRING: LiftWingServiceErrorRate: ...
[04:09:31] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:09:31] FIRING: LiftWingServiceErrorRate: ...
[08:09:31] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:35:18] 06Machine-Learning-Team: Error in revscoring-editquality-damaging - itwiki-damaging-predictor-default - https://phabricator.wikimedia.org/T401109#11067410 (10OKarakaya-WMF) Looking into the logs, we also get this error with other wikis. However, the alarm is always about itwiki ` https://logstash.wikimedia.org/...
[08:36:20] 06Machine-Learning-Team: Numpy is not available in revertrisk_multilingual - https://phabricator.wikimedia.org/T401305#11067414 (10OKarakaya-WMF) 05Open→03Resolved a:03OKarakaya-WMF
[08:38:41] ozge_: o/
[08:39:13] one tip - in Logstash's top right corner there is a share functionality, with which you can create permalinks (short versions too)
[08:39:36] most of the time the full Logstash link is not something that leads to consistent results for others
[08:50:36] the itwiki issue seems to be related to some traffic that causes timeouts with the mediawiki api
[08:51:09] IIRC we set 5 seconds or similar, so it seems that somehow the current requests are heavy
[08:53:09] RESOLVED: LiftWingServiceErrorRate: ...
[08:53:09] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:53:36] from https://logstash.wikimedia.org/goto/6caf1b48928bcb3ff43b822a14c08034 it seems that almost all of the traffic is generated by ORES Legacy
[08:57:29] is there anybody looking into it? :)
[09:07:55] the "0" response code returned by the istio gateway indicates that a client gave up before getting the response
[09:24:42] Hey @elukey, this is the error we discussed here: https://phabricator.wikimedia.org/T401109 . Interesting that the alarm fires only for itwiki, and enwiki stats all look good: https://logstash.wikimedia.org/app/dashboards#/view/138271f0-40ce-11ed-bb3e-0bc9ce387d88?_g=h@1d21a65&_a=h@1c1da89
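(Note: the ~5 second MW API timeout and the slow requests discussed above can be checked by hand with a single timed prediction request. This is only a sketch: it assumes the public LiftWing route on the API Gateway rather than whatever internal path the team normally uses, and the rev_id is an arbitrary example, not a value taken from this incident.)

  # Minimal sketch: time one itwiki-damaging prediction to see whether it gets
  # close to the ~5s MW API timeout mentioned above. Public API Gateway route
  # assumed; the rev_id is an arbitrary example value.
  curl -s -X POST \
    -H 'Content-Type: application/json' \
    -d '{"rev_id": 1234567}' \
    -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
    'https://api.wikimedia.org/service/lw/inference/v1/models/itwiki-damaging:predict'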
[09:27:06] ozge_: my impression is that we receive some calls from the ores service, and that traffic somehow causes the latency to go up, leading to timeouts etc.
[09:27:36] is there anything specific about those rev-ids that may explain this? It doesn't seem compute related (it happened in the past, the current fix is to add multiprocessing)
[09:28:11] another view is https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=aWotKxQMz&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default-00028&from=now-12h&to=now&timezone=utc&var-response_code=$__all&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99
[09:28:23] we have both 200s and 0s (client giving up), but latency is horrible
[09:28:43] the endpoint that we use is the mw api, and it is shared across all other pods
[09:30:44] FIRING: LiftWingServiceErrorRate: ...
[09:30:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:43:16] so these alerts are currently firing since "0" responses are registered, namely clients giving up
[09:43:28] that may be noise, but in this case it may indicate a broader issue
[09:52:34] ozge@deploy1003:~$ kubectl get pods | grep itwiki
[09:52:34] itwiki-damaging-predictor-default-00028-deployment-55c7d6776t5s 3/3 Running 0 4m52s
[09:52:34] itwiki-damaging-predictor-default-00028-deployment-55c7d67fn4z9 3/3 Running 0 23d
[09:52:34] I see some pods get restarted in itwiki
[09:53:56] and I see many mwapi.errors.TimeoutError in itwiki. I'll try to check if we have legacy ORES requests only in itwiki
[09:54:27] but the previously failing requests are quite fast now (~0.5 seconds) if I try from local
[10:06:36] I've checked the logs of all the currently running pods. We have TimeoutError only in itwiki, although the other pods also actively get requests.
[10:07:37] @elukey do we have istio conf per wiki or is it more general?
[10:07:58] happy to jump on a call to get some help
[10:08:48] it is the same for all wikis, so I think there is something specific with the traffic that hits the pod
[10:09:03] maybe a revision that takes a long time to fetch, or similar
[10:09:16] I can check more after lunch
[10:22:43] https://www.irccloud.com/pastebin/AaKwZJRI/
[10:22:54] cool, thank you @elukey. I'm not sure if this is relevant, but the old pod does not have any successful requests and the new pod does not have any errors: kubectl logs itwiki-damaging-predictor-default-00028-deployment-55c7d67fn4z9 | grep "POST /v1/models/itwiki-damaging%3Apredict HTTP/1.1"
[10:22:54] kubectl logs itwiki-damaging-predictor-default-00028-deployment-55c7d67fn4z9 | grep TimeoutError
[10:23:19] this is interesting
[10:24:12] maybe we can try to scale down to 0 and back to 2 after lunch
[11:56:10] back!
[11:56:35] ozge_: this is a nice finding! What do "old" and "new" mean in this context? Before/after a deployment or a scale-up?
[11:57:21] sure, there are two itwiki pods. One of them was created 23 days ago, the other one was created today
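(Note: the two kubectl greps above generalise to a quick per-pod comparison, which is how the 23-day-old pod stood out. A minimal sketch along those lines; the grep patterns are copied from the commands pasted above, while the loop and the pod selection by name are illustrative.)

  # Rough sketch: for every itwiki-damaging pod, count predict request log
  # lines and MW API timeout errors, to spot a pod that only times out.
  # Assumes the default log container is the model server, as in the pasted
  # commands above.
  for pod in $(kubectl get pods -o name | grep itwiki-damaging-predictor); do
    logs=$(kubectl logs "$pod")
    requests=$(echo "$logs" | grep -c "POST /v1/models/itwiki-damaging%3Apredict HTTP/1.1")
    timeouts=$(echo "$logs" | grep -c "TimeoutError")
    echo "$pod predict_lines=$requests timeouts=$timeouts"
  done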
[11:57:50] it seems all the errors are from the one which was created 23 days ago
[11:58:49] it would be great to confirm it, and then we can try to rescale
[11:59:32] wow this is really weird
[12:00:22] I see 3 itwiki pods now, one recently created
[12:00:29] kubectl logs itwiki-damaging-predictor-default-00028-deployment-55c7d67fn4z9 | grep "POST /v1/models/itwiki-damaging%3Apredict HTTP/1.1"
[12:00:33] this is the old one
[12:00:45] and it does not have any successful responses
[12:00:58] elukey: drive-by, you can isolate the pod for examination by removing labels, and let the replicaset put a new one back in
[12:01:16] (03PS1) 10Bartosz Wójtowicz: outlink-topic-model: Introduce caching mechanism. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1176448 (https://phabricator.wikimedia.org/T356256)
[12:01:22] (it will not get any traffic then though)
[12:01:24] ozge_: ack!
[12:02:15] claime: thanks! I am currently trying to check the istio logs to see if anything pops up, it seems a clear indication that the sidecar pod got borked somehow
[12:02:25] I don't recall it happening in the past, it is not great
[12:03:55] elukey: yeah, that would seem like it at first glance. Is that the request to the pod actually timing out, or a backend call?
[12:04:34] in this case we don't use the mesh, but the istio "transparent" proxy (envoy based)
[12:04:46] and it handles all traffic, egress and ingress
[12:04:54] in this case it seems to be an issue contacting the mwapi
[12:05:50] 06Machine-Learning-Team, 07Epic, 13Patch-For-Review: Epic: Implement prototype inference service that uses Cassandra for request caching - https://phabricator.wikimedia.org/T356256#11067968 (10BWojtowicz-WMF) I've started working towards the [[ https://phabricator.wikimedia.org/T392833 | goal of making artic...
[12:06:16] saved logs and pod deleted
[12:08:02] the new pod looks healthy from the logs
[12:08:32] cool, I'll monitor for a while
[12:09:24] 06Machine-Learning-Team, 07Essential-Work, 13Patch-For-Review: Upgrade reability model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400352#11067977 (10BWojtowicz-WMF) @OKarakaya-WMF I agree, let's put this in blocked until we update the catboost version in [[ https://gitlab....
[12:10:44] RESOLVED: LiftWingServiceErrorRate: ...
[12:10:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[12:16:43] thank you! All active pods have some successful requests and no timeout errors. I'll keep monitoring for a while.
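(Note: for reference, the "remove labels to isolate the pod" tip above can look roughly like the sketch below. It is only a sketch: it assumes the ReplicaSet selects on the standard pod-template-hash label, and any Service/Knative selector labels would need the same treatment for traffic to actually stop, as claime points out.)

  # Sketch of pod isolation by label removal: orphaning the pod from its
  # ReplicaSet makes the controller start a replacement, while the old pod
  # keeps running so its logs and state can still be inspected.
  POD=itwiki-damaging-predictor-default-00028-deployment-55c7d67fn4z9

  # Deployment-managed ReplicaSets select on pod-template-hash (assumption:
  # standard selector); the trailing dash removes that label from the pod.
  kubectl label pod "$POD" pod-template-hash-

  # Save the logs before eventually deleting the pod, as was done above.
  kubectl logs "$POD" --all-containers=true > "${POD}.log"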
[12:28:13] 06Machine-Learning-Team, 07Essential-Work: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301#11068041 (10BWojtowicz-WMF) @elukey I think the idea was that we'd provide a `Makefile` such that everybody could run just `make model-upload ...`,
[13:01:14] 06Machine-Learning-Team, 07Essential-Work: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301#11068159 (10elukey) @BWojtowicz-WMF you can definitely add another script under say /usr/local/bin/ via puppet, that is called create-model-upload-...
[13:03:02] @elukey is the sidecar `istio-sidecar` used as the proxy for the connections to the mw-api? And if it fails, can we expect to get the errors we saw today? I'm looking into the configuration here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/ml-services/revscoring-editquality-damaging/values.yaml#48
[13:04:50] as it's a sidecar, each pod has its own sidecar instance.
[13:12:06] 06Machine-Learning-Team, 07Essential-Work: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301#11068219 (10BWojtowicz-WMF) @elukey I think it sounds like a good compromise. I'll go this way, thank you!
[13:36:50] I can’t make it to today’s knowledge sharing due to a physio appointment, so I’m wishing you all an amazing long weekend! <3
[13:47:10] 06Machine-Learning-Team: Investigate revertrisk threshold generation for enwiki - https://phabricator.wikimedia.org/T400590#11068338 (10gkyziridis) == Update == I managed to run the threshold analysis for `enwiki` successfully. The obstacle I faced was that I could not run the query for a bigger than one mont...
[15:08:08] 06Machine-Learning-Team: Error in revscoring-editquality-damaging - itwiki-damaging-predictor-default - https://phabricator.wikimedia.org/T401109#11068784 (10OKarakaya-WMF) We found out that all errors are from one pod that was created 23 days ago. Also, it never returned a successful response. It's likely that...
[15:41:08] 06Machine-Learning-Team, 06Data-Persistence, 06Growth-Team, 10Improve-Tone-Suggested-Edit, 07OKR-Work: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11068906 (10Michael)
[16:09:46] have a nice long weekend! o/
[16:33:30] 06Machine-Learning-Team: Error in revscoring-editquality-damaging - itwiki-damaging-predictor-default - https://phabricator.wikimedia.org/T401109#11069161 (10OKarakaya-WMF) 05Open→03Resolved
[16:34:48] hello, I'm closing the revscoring issue with the following note: https://phabricator.wikimedia.org/T401109#11068784 . @elukey please feel free to reopen if we need more investigation into the root cause. Have a nice extended weekend
[16:35:51] ozge_: +1 thanks! Have a good weekend too!
[16:36:32] ml-team: please note that I'll be on holidays for the next couple of weeks, so if you need SRE help please ping Ben (bt*ullis on IRC) or any other SRE in #wikimedia-sre
[16:37:31] have a nice holiday @elukey. I'll be on holidays next week as well.
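(Closing note on the sidecar question above at 13:03: the per-pod Istio sidecar keeps its own logs, separate from the model server, so its view of the outbound MW API calls can be inspected directly. A sketch only; `istio-proxy` is the usual Istio container name and is an assumption here, since the actual name is defined in the deployment-charts values linked above.)

  # Pick one itwiki pod (any of the currently running ones).
  POD="$(kubectl get pods -o name | grep itwiki-damaging-predictor | head -n 1)"

  # List the containers in the pod; one of them should be the Envoy-based
  # Istio sidecar discussed above.
  kubectl get "$POD" -o jsonpath='{.spec.containers[*].name}{"\n"}'

  # Read the sidecar's access logs (container name assumed; adjust it to what
  # the previous command prints). Outbound MW API calls, their response codes
  # and timeout flags show up here.
  kubectl logs "$POD" -c istio-proxy --tail=100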