[07:12:14] bartosz: o/
[07:13:23] elukey: o/
[07:14:25] After reviewing David's comments on Slack and the traffic to edit-check I realized that they are contacting the model via the API gateway (so api.wikimedia.org) rather than hitting our internal endpoints directly (that makes sense, since it is Visual Editor calling the model).
[07:14:34] during your tests, what endpoint did you hit?
[07:15:57] ahh it makes sense, I was hitting `https://inference.svc.eqiad.wmnet:30443/v1/models/edit-check:predict` directly from our VPC
[07:16:49] so hitting via api gateway would be `https://api.wikimedia.org/service/lw/inference/v1/models/edit-check:predict`, right?
[07:17:12] exactly yes
[07:17:49] while reviewing https://logstash.wikimedia.org/goto/2e1b3308bedd80133bb1aca005395aa4 I also noticed that we have a mixture of HTTP/2 and HTTP/1.1, that is new to me
[07:18:11] good morning
[07:20:11] mmm wait I don't see it here https://api.wikimedia.org/wiki/Lift_Wing_API
[07:21:51] but it is configured in deployment-charts
[07:22:02] hmm possibly the docs are outdated? I can hit the api gateway with e.g. `curl -i https://api.wikimedia.org/service/lw/inference/v1/models/edit-check:predict -X POST -d '{"instances": [{"lang": "en", "check_type": "tone", "original_text": "original text example original", "modified_text": "modified text example with hype", "page_title": "test title"}]}'`
[07:25:05] hmm do we have an idea if other services also receive a mix of HTTP/1.1 and HTTP/2 requests?
[07:25:40] not sure, it may be due to the api-gateway - if the client starts with HTTP/2, it may contact us via HTTP/2
[07:26:00] so far I haven't found any trace of HTTP-related issues, but worth keeping in mind
[07:26:08] the logs for the api-gateway are here https://logstash.wikimedia.org/goto/0e20e8e2d8602237656442099beeed9c
[07:26:13] but not that telling
[07:26:21] let's see if we can repro hitting the api-gateway
[07:26:44] started hitting it
[07:33:07] bartosz: for future tests - when you hit these endpoints try to set a unique User Agent, so we can find you in the logs more easily
[07:34:13] (running a quick errand, back in ~1h)
[07:34:31] hmm I received a lot of 429s with a "too many requests" error
[07:34:36] will scale it down a little
[07:34:48] let's chat once you're back, will try setting a unique user agent
[07:41:11] (PS1) Bartosz Wójtowicz: article-descriptions: Update base image to latest bookworm image. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1174611 (https://phabricator.wikimedia.org/T400351)
[07:54:40] (PS1) Bartosz Wójtowicz: readability: Update base image to latest bookworm. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1174648 (https://phabricator.wikimedia.org/T400352)
[08:05:59] morning :)
[08:32:32] elukey: I managed to squeeze in 1000 requests at 10rps before I got rate-limited again, but there were 0 empty responses so far
[08:34:02] elukey, bartosz: we haven't added the editcheck to the docs because it's currently intended only for use in the tone check project rather than being fully public
[08:39:26] ack!
[08:40:47] aiko: I see, it makes total sense, thank you!
[08:46:46] my guess is that the 200-empty-body problem was probably related to the weird state the replicas were in
[08:47:16] IIUC there wasn't enough capacity for a new pod? Or was it something different?
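The test loop discussed above (unique User Agent, roughly 10rps, watching for 429s and for 200s with an empty body) can be sketched as follows. This is a minimal sketch and not the script used in the channel: the URL and payload come from the curl example in the log, while `USER_AGENT`, the `Content-Type` header, the 10-second timeout, and the helper names `build_request`, `classify` and `run` are assumptions made for illustration.

```python
import json
import time
import urllib.error
import urllib.request
from collections import Counter

# URL and payload copied from the curl example in the log.
GATEWAY_URL = "https://api.wikimedia.org/service/lw/inference/v1/models/edit-check:predict"
PAYLOAD = {"instances": [{
    "lang": "en",
    "check_type": "tone",
    "original_text": "original text example original",
    "modified_text": "modified text example with hype",
    "page_title": "test title",
}]}
# Hypothetical value: any unique string that is easy to search for in the logs.
USER_AGENT = "edit-check-loadtest/0.1 (hypothetical example)"


def build_request(url=GATEWAY_URL, payload=PAYLOAD, user_agent=USER_AGENT):
    """Build a POST request carrying the unique User-Agent header."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",  # assumption: not set in the curl example
        "User-Agent": user_agent,
    }
    return urllib.request.Request(url, data=body, method="POST", headers=headers)


def classify(status, body):
    """Bucket one response: ok, empty_200, rate_limited or error."""
    if status == 429:
        return "rate_limited"
    if status == 200:
        return "ok" if body.strip() else "empty_200"
    return "error"


def run(n=100, rps=10.0, opener=urllib.request.urlopen):
    """Send n requests paced at about rps per second and count the outcomes."""
    counts = Counter()
    interval = 1.0 / rps
    for _ in range(n):
        started = time.monotonic()
        try:
            with opener(build_request(), timeout=10) as resp:
                counts[classify(resp.status, resp.read())] += 1
        except urllib.error.HTTPError as err:
            # Non-2xx responses (including 429) are raised as HTTPError.
            counts[classify(err.code, err.read())] += 1
        except OSError:
            # Connection failures and timeouts.
            counts["error"] += 1
        # Sleep off the rest of the interval to keep the pace near rps.
        time.sleep(max(0.0, interval - (time.monotonic() - started)))
    return counts
```

Calling `run(n=1000, rps=10.0)` would mirror the 1000-request run mentioned above; a non-zero `empty_200` count is the symptom under investigation, and a high `rate_limited` count suggests lowering `rps`.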
[08:51:07] I think you're right that there wasn't enough capacity to scale up immediately, but the replicas were unschedulable for ~3mins, which feels like a totally normal thing to me
[08:52:45] I re-ran the load test from 2 different stat-boxes and all the responses came back looking good
[09:06:35] This may be a far-fetched idea, but maybe during the initialization of the pod after scale-up, some traffic reached the container before the kserve endpoint was ready to serve predictions, which resulted in pinging the endpoint and returning just an empty 200?
[09:07:27] I'm not sure how (or if) the readiness probes are configured there
[09:07:42] Machine-Learning-Team, Research, Goal: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11049168 (Miriam) @SSalgaonkar-WMF hi! Is this task in progress...
[09:11:21] Alright, I see we're setting the `kserve.Model.ready` flag in our edit-check inference code once we load up the model, which later translates to the readiness probe, so this should be handled correctly
[09:20:48] bartosz: yeah I think that what you described may be what happened - the http 200 may have come from the istio sidecar, maybe it was up but not fully configured
[09:20:52] Readiness: http-get http://:15020/app-health/queue-proxy/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
[09:20:52] Readiness: http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
[09:20:52] ISTIO_KUBE_APP_PROBERS: {"/app-health/queue-proxy/readyz":{"httpGet":{"path":"/","port":8012,"scheme":"HTTP","httpHeaders":[{"name":"K-Network-Probe","value":"queue"}]},"timeoutSeconds":1}}
[09:22:02] the queue-proxy container may have played a role as well
[09:22:41] anyway, it feels like a bug in a weird/unknown situation that cannot be reproduced in "normal" scenarios, that is reassuring :)
[09:28:41] I agree that it seems like a weird edge-case, however David said that it happened multiple times for him so it can’t be that uncommon. I’ll play around a little more with running requests during scale-ups to try reproducing it and will post some updates under the ticket afterwards
[09:29:13] maybe we could explore setting some readiness delay so that the load balancer/istio waits a little before routing traffic to the new replicas
[09:29:40] thank you for the pair debugging Luca, it’s very fun! :D
[09:29:48] bartosz: My understanding is that David was able to repro only on the day of the replica issue, not at other times
[09:30:34] I think he said that he's definitely seen it since the 25th as well
[09:32:44] What’s not convincing me about the replica hypothesis is that it was a window of ~3mins and it only happened a few times the past week where the new replicas were unschedulable for this time
[09:33:30] But we did scale up way more often where the new replicas were immediately available, so it could be the routing problem on the init
[09:41:50] * bartosz afk lunch time
[10:21:14] Machine-Learning-Team, Research, Goal: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11049311 (SSalgaonkar-WMF) @Miriam hi!! yes this is! I don't th...
[10:52:11] Machine-Learning-Team, Research, Goal: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11049398 (Miriam) Sure! I might remove the #Research tag then i...
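The `Readiness:` lines pasted earlier in the log can be read as rough timing bounds, which helps when weighing the "readiness delay" idea. The sketch below parses the `delay=… timeout=… period=… #success=… #failure=…` fields and computes two approximations: the earliest moment a container can be marked ready, and roughly how long a probe may keep failing before the container is marked unready. The names `Probe`, `parse_probe`, `earliest_ready_s` and `failing_window_s` are made up for illustration, and the arithmetic ignores probe latency and jitter.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    delay_s: int    # initial delay before the first probe
    timeout_s: int  # per-probe timeout
    period_s: int   # interval between probes
    success: int    # consecutive successes needed to be marked ready
    failure: int    # consecutive failures needed to be marked unready


_PROBE_RE = re.compile(
    r"delay=(\d+)s\s+timeout=(\d+)s\s+period=(\d+)s\s+"
    r"#success=(\d+)\s+#failure=(\d+)"
)


def parse_probe(line: str) -> Probe:
    """Extract the probe settings from a line in the format pasted in the log."""
    match = _PROBE_RE.search(line)
    if match is None:
        raise ValueError(f"no probe settings found in: {line!r}")
    return Probe(*(int(group) for group in match.groups()))


def earliest_ready_s(probe: Probe) -> int:
    """Approximate earliest time after start at which the probe can pass."""
    return probe.delay_s + (probe.success - 1) * probe.period_s


def failing_window_s(probe: Probe) -> int:
    """Approximate time a probe may keep failing before being marked unready."""
    return probe.failure * probe.period_s


queue_proxy = parse_probe(
    "Readiness: http-get http://:15020/app-health/queue-proxy/readyz "
    "delay=0s timeout=1s period=10s #success=1 #failure=3"
)
istio = parse_probe(
    "Readiness: http-get http://:15021/healthz/ready "
    "delay=1s timeout=3s period=2s #success=1 #failure=30"
)
print(earliest_ready_s(queue_proxy), failing_window_s(queue_proxy))  # 0 30
print(earliest_ready_s(istio), failing_window_s(istio))              # 1 60
```

With `delay=0s` and `#success=1`, the queue-proxy probe can mark the container ready on its first pass, so under this reading a readiness delay would mean raising that `delay` value (the probe's initial delay).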
[10:52:28] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11049399 (Miriam)
[10:53:32] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11049412 (SSalgaonkar-WMF) @Miriam yes totally, sorry for not doing that soon...
[11:48:25] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11049676 (OKarakaya-WMF) added remaining steps. confirming that the pipeline works end to end 🎉 https://gitlab.wikimedia.org/repos/data-enginee...
[12:28:05] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Investigate `edit-check` returning empty responses - https://phabricator.wikimedia.org/T400606#11049896 (BWojtowicz-WMF) We realized that the original issue happened by querying model via API Gateway at `https://api.wikimedia.org/service/lw/i...
[12:33:04] Hello team, I'm looking for brave reviewers for those 2 small patches updating images from bullseye to bookworm for readability and article-description models 😊 https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1174648 https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1174611
[12:56:18] 👀
[12:57:23] (CR) Ozge: [C:+1] article-descriptions: Update base image to latest bookworm image. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1174611 (https://phabricator.wikimedia.org/T400351) (owner: Bartosz Wójtowicz)
[12:57:31] (CR) Ozge: [C:+1] readability: Update base image to latest bookworm.
[machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1174648 (https://phabricator.wikimedia.org/T400352) (owner: Bartosz Wójtowicz)
[13:06:35] thank you Ozge!
[13:06:39] (CR) Bartosz Wójtowicz: [C:+2] article-descriptions: Update base image to latest bookworm image. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1174611 (https://phabricator.wikimedia.org/T400351) (owner: Bartosz Wójtowicz)
[13:07:46] (Merged) jenkins-bot: article-descriptions: Update base image to latest bookworm image. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1174611 (https://phabricator.wikimedia.org/T400351) (owner: Bartosz Wójtowicz)
[13:11:34] 🙌
[13:15:52] whoops, post-merge image build failed for article-descriptions due to some missing dependencies, will investigate and open a follow-up
[13:16:36] would probably be useful to have alerts for this :D
[13:29:57] Lift-Wing, Machine-Learning-Team: Request to host kid-friendly-classifier on Lift Wing - https://phabricator.wikimedia.org/T399872#11050157 (SSalgaonkar-WMF) Hey @derenrich, sorry for the delayed follow-up. This is on me personally; I'm balancing a few time-sensitive items right now and a bit behind on t...
[13:32:08] Lift-Wing, Machine-Learning-Team: Request to host kid-friendly-classifier on Lift Wing - https://phabricator.wikimedia.org/T399872#11050163 (SSalgaonkar-WMF) This is kind of a unique case for us, since most of our hosting requests come from the Research team, so I'm sorry that we don't have a process bui...
[13:38:38] Hey folks, is it ok if I clean up some of the space taken by images on the ml-testing machine?
[13:39:05] I see some images/containers built 3 months ago.
[13:44:21] (PS1) Bartosz Wójtowicz: article-descriptions: Add missing build dependencies for sentencepiece.
[machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1174730 (https://phabricator.wikimedia.org/T1174611)
[13:44:45] georgekyz: please feel free to remove the unused images on ml-testing
[13:46:13] looking for a small review adding missing build dependencies in article-description models https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1174730. Seems that some dependencies like cmake are not installed by default in bookworm
[13:47:17] Machine-Learning-Team, Data-Platform-SRE: Create an analytics service user for the ML team - https://phabricator.wikimedia.org/T400902#11050219 (kevinbazira)
[13:49:52] ^-- DPE is going to help us create an analytics service user (i.e. `analytics-ml`) that will be used by our Airflow jobs
[13:56:01] Machine-Learning-Team, Data-Platform-SRE: Create an analytics service user for the ML team - https://phabricator.wikimedia.org/T400902#11050265 (kevinbazira)
[14:04:25] (CR) AikoChou: [C:+1] article-descriptions: Add missing build dependencies for sentencepiece. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1174730 (https://phabricator.wikimedia.org/T1174611) (owner: Bartosz Wójtowicz)
[14:05:24] (CR) Bartosz Wójtowicz: [C:+2] article-descriptions: Add missing build dependencies for sentencepiece. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1174730 (https://phabricator.wikimedia.org/T1174611) (owner: Bartosz Wójtowicz)
[14:05:55] (Merged) jenkins-bot: article-descriptions: Add missing build dependencies for sentencepiece.
[machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1174730 (https://phabricator.wikimedia.org/T1174611) (owner: Bartosz Wójtowicz)
[15:20:08] (CR) Bartosz Wójtowicz: [C:-1] "There's one blocking issue to progress with this patch: It depends on the readbility-experiments repository (https://gitlab.wikimedia.org/" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1174648 (https://phabricator.wikimedia.org/T400352) (owner: Bartosz Wójtowicz)
[15:30:07] Lift-Wing, Machine-Learning-Team: Request to host kid-friendly-classifier on Lift Wing - https://phabricator.wikimedia.org/T399872#11050676 (derenrich) ok no worries and no rush. this isn't blocking but I was hoping to have a sense of this direction would work (and how the process works). it sounds like...
[20:09:18] (CR) CI reject: [V:-1] build: Updating eslint-config-wikimedia to 0.31.0 [extensions/ORES] - https://gerrit.wikimedia.org/r/1174809 (owner: Libraryupgrader)
[21:41:52] (PS1) Umherirrender: tests: Provide rc_source on insert to recentchanges [extensions/ORES] - https://gerrit.wikimedia.org/r/1174828
[21:53:23] (PS2) Umherirrender: tests: Provide rc_source on insert to recentchanges [extensions/ORES] - https://gerrit.wikimedia.org/r/1174828
[22:08:04] (CR) Umherirrender: "recheck" [extensions/ORES] - https://gerrit.wikimedia.org/r/1174809 (owner: Libraryupgrader)
[22:25:32] (CR) Zabe: [C:+2] "Thanks!" [extensions/ORES] - https://gerrit.wikimedia.org/r/1174828 (owner: Umherirrender)
[22:42:49] (Merged) jenkins-bot: tests: Provide rc_source on insert to recentchanges [extensions/ORES] - https://gerrit.wikimedia.org/r/1174828 (owner: Umherirrender)