[06:28:01] good morning :)
[07:48:08] good morning
[08:19:30] I will deploy the changes from bullseye to bookworm for revertrisk-language-agnostic -> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1172622
[08:23:28] ack
[08:33:16] (PS2) Gkyziridis: langid-model: Update base image from bullseye to the latest bookworm image. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1172597 (https://phabricator.wikimedia.org/T400347)
[08:34:38] Machine-Learning-Team, Essential-Work, Patch-For-Review: Upgrade langid model server from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400347#11045994 (gkyziridis) == Test it on ml-testing == {P80277}
[08:41:19] (CR) Gkyziridis: "I fixed the issue and I built it on ml-testing: https://phabricator.wikimedia.org/T400347#11045994" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1172597 (https://phabricator.wikimedia.org/T400347) (owner: Gkyziridis)
[08:41:41] Folks if you have time please review this one for langid -> https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1172597
[08:42:18] (CR) Gkyziridis: [C:+2] ores-legacy-model: Update base image from bullseye to the latest bookworm image. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1172612 (https://phabricator.wikimedia.org/T400348) (owner: Gkyziridis)
[08:42:53] (Merged) jenkins-bot: ores-legacy-model: Update base image from bullseye to the latest bookworm image. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1172612 (https://phabricator.wikimedia.org/T400348) (owner: Gkyziridis)
[08:48:07] morning!
[08:52:31] good morning Aiko
[08:55:09] (CR) Gkyziridis: [C:+2] articletopic-outlink-model: Update base image from bullseye to the latest bookworm image. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1172620 (https://phabricator.wikimedia.org/T400349) (owner: Gkyziridis)
[08:56:36] (CR) Kevin Bazira: [C:+1] "Thanks for the fix." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1172597 (https://phabricator.wikimedia.org/T400347) (owner: Gkyziridis)
[08:57:41] (CR) Ozge: [C:+1] langid-model: Update base image from bullseye to the latest bookworm image. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1172597 (https://phabricator.wikimedia.org/T400347) (owner: Gkyziridis)
[08:59:05] (Merged) jenkins-bot: articletopic-outlink-model: Update base image from bullseye to the latest bookworm image. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1172620 (https://phabricator.wikimedia.org/T400349) (owner: Gkyziridis)
[09:12:37] (CR) Gkyziridis: [C:+2] langid-model: Update base image from bullseye to the latest bookworm image. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1172597 (https://phabricator.wikimedia.org/T400347) (owner: Gkyziridis)
[09:13:13] Thnx for the reviews folks! I am opening a sequence of patches for deployments
[09:13:32] ack
[09:13:46] (Merged) jenkins-bot: langid-model: Update base image from bullseye to the latest bookworm image. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1172597 (https://phabricator.wikimedia.org/T400347) (owner: Gkyziridis)
[09:25:08] Bookworm deployment patches for liftwing models:
[09:25:08] 1. Ores-legacy -> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1174404
[09:25:08] 2. LangId -> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1174412
[09:25:08] 3. Articletopic-outlink -> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1174410
[09:25:33] revertrisk-language-agnostic is already deployed and synced on staging
[09:28:34] lgtm
[09:32:58] kevinbazira: Thnx for the fast reviews 🙏
[09:33:12] np!
[09:46:15] Deployments done!
[09:47:07] Machine-Learning-Team, Essential-Work: Upgrade articletopic-outlink model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400349#11046126 (gkyziridis) a:gkyziridis
[09:48:27] Machine-Learning-Team, Essential-Work: Upgrade remaining model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400144#11046139 (gkyziridis)
[09:49:30] Machine-Learning-Team, Essential-Work: Upgrade remaining model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400144#11046140 (gkyziridis)
[09:49:50] that's amazing, thank you George <3
[09:50:07] bartosz: I updated the main ticket: https://phabricator.wikimedia.org/T400144
[09:50:16] There are 3 more models that need to be updated
[09:50:39] sweet, I'll be picking those up!
[09:51:22] No no focus on your MLops stuff this week, I can continue work on them these days
[09:52:11] wouldn't those fall under the MLOps week tho?
[09:53:29] yeah indeed, but the main focus is on the alerts and the bugs/incidents. So, if you are focused on them then I can help you by taking something off your plate. Otherwise feel free to pick them up
[09:55:02] I think I'll have the capacity to do the remaining 3 ones this week, I'd rather you have time to focus on your other tasks :D
[09:55:54] alright perfect. I will assign them to you then, but feel free to reach out if you do not have time.
[09:56:18] perfect, thank you!
[09:56:28] Machine-Learning-Team, Essential-Work: Upgrade revscoring model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400350#11046166 (gkyziridis) a:Batorsz
[09:57:32] Machine-Learning-Team, Essential-Work: Upgrade revscoring model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400350#11046185 (gkyziridis) a:Batorsz→BWojtowicz-WMF
[09:57:51] Machine-Learning-Team, Essential-Work: Upgrade article-descriptions model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400351#11046193 (gkyziridis) a:BWojtowicz-WMF
[09:58:08] Machine-Learning-Team, Essential-Work: Upgrade reability model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400352#11046194 (gkyziridis) a:BWojtowicz-WMF
[10:11:35] o/ elukey: I’m investigating a bug report that edit-check is sometimes returning blank responses with 200 status code under increased load. I did some manual load tests on staging and could find a potential issue where we receive a successful response with `'read tcp 127.0.0.1:59484->127.0.0.1:8080: read: connection reset by peer\n'` in the response body
[10:11:58] Do you know where we might investigate what might be causing the server to drop those connections? In my experience it only happened when there are multiple replicas available so potentially it might be on the proxy/load balancer layer of istio?
[10:12:15] There’s more on this issue in this ticket: https://phabricator.wikimedia.org/T400606
[10:47:12] bartosz: o/
[10:47:15] interesting
[10:47:51] one clarification - from the slack thread it seems that the response returned is a blank page, while you mentioned the read tcp etc.. connection reset
[10:48:51] because there are multiple things that could be at fault, but it is very weird that a HTTP 200 is returned
[10:49:06] the picture is the following (for a simple edit check request)
[10:49:47] Yes yes, I'm not really sure if it's connected, but it's the only thing I found so far. I'm thinking that the response body is possibly parsed differently via browsers and not showing the read tcp error..
[10:49:55] But it might also be unrelated
[10:50:40] client -> Api gateway (envoy) -> Load Balancer + Istio Gateway pods (on k8s) -> edit check svc on k8s -> istio sidecar (envoy) -> Knative queue proxy sidecar -> kserve python
[10:51:17] I am not 100% sure about the knative queue proxy though, I'd need to recheck, but it is a sidecar that is meant to provide metrics and buffer requests in some cases, sitting in front of kserve
[10:53:10] the connection reset may be related to too much traffic hitting single pods, and reaching their capacity, but a connection reset is definitely going to lead to a 503
[10:53:25] not sure where you found that log but if I had to guess it seems an istio/envoy sidecar log
[10:53:32] since port 8080 should be kserve's
[10:54:29] what I'd do is something like this: create an ad-hoc simple Python script that hits the staging endpoint reporting if the body is empty and the response code is 200
[10:54:36] so we see if it is reproducible
[10:54:58] if not, we can try to target the codfw load balancer (so prod), that IIRC is not really used that much
[10:55:08] and again, see if we can repro
[10:56:33] what I've always seen so far is 50x errors reported in various layers when a connection breaks etc.., never a HTTP 200 that is not carrying any body
[10:58:46] Thanks for the explanation of the path the request goes through, that's really useful!
[10:59:17] I have a script ready that I’ve used to target staging and I can reproduce receiving the 200 responses with the "read tcp..." error string in the response body, I can’t reproduce receiving empty responses in the body tho
[10:59:36] I can also see what happens if I target prod instead of staging, will run such test and report back
[11:00:37] bartosz: wait a sec, you can repro having a 200 with "connection reset" ?
[11:01:08] Yes
[11:01:37] ok this is bad, it shouldn't really happen
[11:01:53] sorry I didn't get that you were able to repro, let's start with this use case then
[11:02:34] Ahh no wait those are actually 502s
[11:02:42] So they are correct
[11:02:51] perfect that makes sense, I was really worried for a moment :D
[11:03:57] the `success: True` part of the response got me confused
[11:04:20] Alright, I'll run the script against prod to see if I can repro the empty response body there
[11:05:33] the only easy thing to check that I can think of is if, for some reason, the edit check's code in the kserve container returns under certain corner cases an empty response
[11:05:39] If I won't be able to reproduce, I think we'll have to wait till it happens again and ideally we'll have some timestamps to investigate
[11:06:05] I've already checked and we should never return anything other than a dict with a predictions key
[11:07:45] Running the test against prod now
[11:11:12] I asked some extra details to David in the slack thread, because I am not 100% sure how the empty body was checked
[11:12:07] it seems that they were using a chrome debug testing a specific wikipage, I want to know how they were calling liftwing and what endpoint they hit
[11:12:27] because there may be other layers at fault, possibly not liftwing-related
[11:14:26] sweet, thank you! the load test on prod just finished and I can see again some 502 connection resets but no empty responses
[11:14:34] I'll rerun in some time to verify
[11:15:13] the 502 are expected since the load test is probably reaching the max capacity
[11:15:55] anyway, making sure that we cannot easily repro in prod now is already a good result, because it tells us that the bug is probably something related to a weird corner case
[11:16:31] at the time of the incident on the 25th IIUC there was a replica missing, so I'd have expected 50x responses as well
[11:18:26] Do you mean that there were no replicas available or a new replica couldn't scale for some time?
[11:23:41] I think the latter
[11:24:02] it seems consistent with what you described, but I haven't checked
[11:25:21] I'll write a small update under the ticket and let's wait for more details from David
[11:25:35] thanks a lot for digging into it! <3
[11:27:40] anytime :)
[11:29:21] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Investigate `edit-check` returning empty responses - https://phabricator.wikimedia.org/T400606#11046476 (BWojtowicz-WMF) Small update: The connection reset errors shown above are "successful" requests, but they are not returning 200 status co...
[14:13:23] Machine-Learning-Team, Editing-team (Tracking): Build model training pipeline for tone check using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#11046991 (gkyziridis) > #### Idea 2 > We could create another `release` within the [[https://github.com/wikimedia/operations-deployment-ch...
[16:58:20] Lift-Wing, Machine-Learning-Team: Request to host kid-friendly-classifier on Lift Wing - https://phabricator.wikimedia.org/T399872#11047635 (derenrich) Is there something more you need from me here? What is the typical turn-around time for requests like this?
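Editor's sketch, not part of the log: the ad-hoc check suggested at 10:54:29 (hit the endpoint repeatedly and report any HTTP 200 with an empty body) could look roughly like the script below. This is not the script bartosz actually ran. The endpoint URL, the payload, and the names `classify`, `probe` and `run_load_test` are all hypothetical placeholders; only the Python standard library is used.

```python
import http.client
import json
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical placeholders: swap in the real staging endpoint and a valid payload.
ENDPOINT = "https://staging.example.invalid/v1/models/edit-check:predict"
PAYLOAD = {"instances": []}


def classify(status: int, body: bytes) -> str:
    """Bucket one response by status code and body content."""
    if status != 200:
        return f"http_{status}"
    if not body.strip():
        return "empty_200"  # the case described in the bug report
    if b"connection reset by peer" in body:
        return "reset_text_in_200"  # error string carried by a 200
    return "ok"


def probe(url: str, payload: dict, timeout: float = 10.0) -> str:
    """Send one POST request and classify the outcome."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify(response.status, response.read())
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the status and body are still readable.
        return classify(err.code, err.read())
    except (urllib.error.URLError, http.client.HTTPException, OSError) as err:
        return f"transport_error:{type(err).__name__}"


def run_load_test(url: str, payload: dict, total: int = 200, workers: int = 20) -> Counter:
    """Fire `total` requests from `workers` threads and count each outcome."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return Counter(pool.map(lambda _: probe(url, payload), range(total)))
```

Calling `run_load_test(ENDPOINT, PAYLOAD)` returns a `Counter` of outcomes. Any nonzero `empty_200` count would reproduce the reported bug, while `http_502` or `http_503` entries are the expected failure mode under load discussed above. Keeping `classify` separate from the network call means the bucketing rule can be checked without touching staging or prod.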
[22:02:38] Lift-Wing, Machine-Learning-Team, EditCheck, SRE-SLO, Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11048515 (DLynch) @elukey The 24th was when the train reached most wikis containing the change that turned on running...