[06:26:08] morning folks
[06:40:22] Morning!
[06:44:13] Machine-Learning-Team, observability: Some Istio recording rules may be missing data - https://phabricator.wikimedia.org/T349072 (elukey) I had a chat with Filippo over IRC, and the suspicion is that the long poll time (~30s) of the Istio recording rules has a toll on Thanos, ending up in slowness and ho...
[06:47:17] isaranto: I added you, Kevin and Aiko to https://phabricator.wikimedia.org/project/members/1586/, you should now be able to see the procurement tasks in the board
[06:47:25] lemme know if it works when you have a moment
[06:54:31] thanks! it works!
[06:58:31] super
[07:21:47] Machine-Learning-Team, observability: Some Istio recording rules may be missing data - https://phabricator.wikimedia.org/T349072 (fgiunchedi) Thank you for the detailed investigation @elukey ! I agree on both counts, definitely reducing the number of labels at ingestion time will help things, and so will...
[07:33:35] Machine-Learning-Team, observability: Some Istio recording rules may be missing data - https://phabricator.wikimedia.org/T349072 (elukey) @fgiunchedi yes definitely, I have some doubts too, I proposed the reduced `le` since this is currently our SLI/SLO defs: ` "Lift Wing Revert Risk MultiLingual se...
[07:44:14] Machine-Learning-Team: Apply multi-processing to preprocess() in isvcs that suffer from high latency - https://phabricator.wikimedia.org/T349274 (elukey)
[07:44:30] as promised I created --^, hope it is not missing anything
[08:14:16] Machine-Learning-Team, observability: Some Istio recording rules may be missing data - https://phabricator.wikimedia.org/T349072 (fgiunchedi) Thank you for the context, which makes sense to me. If we indeed end up reducing the `le` labels and the recording rule used only for SLO/SLI we should consider al...
[08:18:39] Machine-Learning-Team: How impactful would pre-save automoderation be on edit save times? - https://phabricator.wikimedia.org/T299436 (Samwalton9-WMF) Open→Resolved I think @calbon has looked into this far enough that we understand this is a viable approach, but we've decided to go the route of rever...
[08:48:22] Machine-Learning-Team: Configure envoy settings to enable rec-api-ng container to access endpoints external to k8s/LiftWing - https://phabricator.wikimedia.org/T348607 (kevinbazira) We configured envoy settings and the rec-api-ng container hosted on LiftWing is able to access endpoints external to k8s/LiftWi...
[08:48:35] (PS1) Ilias Sarantopoulos: llm: fix No module named 'pkg_resources' - add setuptools [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967141 (https://phabricator.wikimedia.org/T349163)
[08:52:31] The multiprocessing ticket description is great!
[08:53:05] I added a fix/patch for the llm image, similar to the one we encountered with readability/revertrisk after the kserve upgrade
[08:54:04] (PS2) Ilias Sarantopoulos: llm: fix No module named 'pkg_resources' - add setuptools [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967141 (https://phabricator.wikimedia.org/T349163)
[09:09:15] Machine-Learning-Team: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (kevinbazira) We resolved the envoy proxy issues as shown in T348607#9264192. However, the rec-api instance hosted on LiftWing is still not performing as expected. The container logs...
[09:28:33] (CR) Kevin Bazira: [C: +1] llm: fix No module named 'pkg_resources' - add setuptools [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967141 (https://phabricator.wikimedia.org/T349163) (owner: Ilias Sarantopoulos)
[09:55:50] (CR) Ilias Sarantopoulos: [C: +2] llm: fix No module named 'pkg_resources' - add setuptools [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967141 (https://phabricator.wikimedia.org/T349163) (owner: Ilias Sarantopoulos)
[09:56:35] (Merged) jenkins-bot: llm: fix No module named 'pkg_resources' - add setuptools [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967141 (https://phabricator.wikimedia.org/T349163) (owner: Ilias Sarantopoulos)
[10:25:21] * isaranto afk lunch
[10:49:39] * elukey lunch!
[11:52:20] kevinbazira: I agree to test more resources on staging, but I'm thinking that maybe we should explore cutting down the search space, which will limit the amount of resources required
[12:26:10] (PS1) Kevin Bazira: Reduce the number candidates processed by the rec-api-ng [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475)
[12:33:55] (CR) Ilias Sarantopoulos: [C: +1] "LGTM!" [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[12:34:04] sure isaranto, I am not sure how reducing the search space affects the results returned by the api, but I have pushed a patch that reduces the maximum number of candidates to 250 as per your advice. I've tagged you as a reviewer, please review whenever you get a minute.
[12:37:05] kevinbazira: feel free to discuss here before implementing stuff. We're not in a hurry :) We just want to find the best solution in a reasonable amount of time.
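The `pkg_resources` fix merged above follows a common pattern: `pkg_resources` is provided by `setuptools`, and newer Python base images and virtualenvs no longer ship `setuptools` implicitly, so anything importing `pkg_resources` fails until the dependency is declared explicitly. A sketch of what such a requirements change might look like (the pin is illustrative, not the one from the actual patch):

```
# requirements.txt
# pkg_resources lives in setuptools; declare it explicitly instead of
# relying on the base image or virtualenv to provide it.
# (version pin is illustrative; the real patch may differ)
setuptools>=65.0.0
```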
[12:38:15] regarding the suggestion, my understanding is that for each candidate we perform a task, so limiting the number of candidates would increase performance; it is the equivalent of increasing cpus
[12:39:13] then again, I'm not that familiar with the codebase so I might be wrong. In that case discussing here helps
[12:40:42] yes, isaranto, I understand the part where reducing the candidates processed enables the rec-api to use fewer compute resources. The part I don't understand is whether the quality of the rec-api recommendations will be impacted positively or negatively.
[12:40:56] (CR) Kevin Bazira: Reduce the number candidates processed by the rec-api-ng (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[12:44:40] ok, I understand. Of course quality would be reduced (although we don't have evidence that this happens). But even if it does, it is a compromise we are making in order to be able to deploy a scalable service. Research has a say in this, and since the initial suggestion came from Isaac I trust that we can go in this direction
[12:45:15] we can ask if there is any way to evaluate once we expose the service
[12:49:32] Machine-Learning-Team, Item Quality Evaluator, Wikidata, Wikidata Dev Team, and 2 others: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419 (ItamarWMDE)
[12:49:43] (CR) Ilias Sarantopoulos: [C: +1] Reduce the number candidates processed by the rec-api-ng (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[12:50:22] Machine-Learning-Team, Item Quality Evaluator, Wikidata, wmde-wikidata-tech, and 2 others: Update API calls from ORES to Lift Wing - https://phabricator.wikimedia.org/T343731 (ItamarWMDE) Open→Resolved Verified and deployed, thank you for all the hard work!
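The candidate-capping idea discussed above — fewer candidates means less work per request, traded against possible recommendation quality — can be sketched as follows. This is a hypothetical illustration, not the actual rec-api code: the variable name `MAX_CANDIDATES` and the default of 250 mirror the chat, but the real patch may use different names.

```python
import os

# Default from the patch discussed above (reduced from 500 to 250).
DEFAULT_MAX_CANDIDATES = 250


def get_max_candidates() -> int:
    """Read the cap from an environment variable to ease experimenting,
    falling back to the default when it is unset.
    (MAX_CANDIDATES is an assumed name for illustration.)"""
    return int(os.environ.get("MAX_CANDIDATES", DEFAULT_MAX_CANDIDATES))


def limit_candidates(candidates: list) -> list:
    """Truncate the search space before the per-candidate work happens.

    Each candidate triggers downstream processing, so capping the list
    bounds CPU usage per request, at a possible cost in result quality.
    """
    return candidates[: get_max_candidates()]
```

Driving the cap through the environment means it can be tuned per deployment (e.g. in staging) without a new code change, which matches the "ease experimenting" goal mentioned later in the log.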
[12:54:02] 🎉 --^
[13:04:51] (PS2) Kevin Bazira: Reduce the number candidates processed by the rec-api-ng [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475)
[13:06:56] (CR) Kevin Bazira: Reduce the number candidates processed by the rec-api-ng (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[13:08:16] sure isaranto, I have reduced the candidates processed from 500 to 250 and used an OS environment variable to ease experimenting, as per your advice.
[13:13:51] (CR) Ilias Sarantopoulos: [C: +1] Reduce the number candidates processed by the rec-api-ng (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[13:16:47] Morning all
[13:19:12] (PS3) Kevin Bazira: Reduce the number candidates processed by the rec-api-ng [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475)
[13:21:47] (CR) Kevin Bazira: Reduce the number candidates processed by the rec-api-ng (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[13:22:05] hi Chris o/
[13:27:54] Hey !
[13:28:45] (CR) Ilias Sarantopoulos: "Nice!" [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[13:30:37] (CR) Kevin Bazira: [C: +2] "Thanks for the review, Ilias. Going to merge now, then work on the deployment config to test this on staging." [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[13:31:10] (Merged) jenkins-bot: Reduce the number candidates processed by the rec-api-ng [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[13:31:51] ouf, finally got my lua script working! 😓
[13:38:37] good job! :)
[13:39:48] so I think that the weird issues that we see in CI are related to the fact that we pip install some deps via blubber, and use tox deps as well
[13:39:56] the result is a weird mixed setting
[13:40:46] (like the virtualenv issue)
[13:40:55] I am experimenting with having deps only managed by tox
[13:44:05] (so we don't install anything requirements*.txt related via blubber)
[13:49:04] yeah it works :)
[13:58:05] (PS6) Elukey: Add precommit support [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542
[13:58:07] (PS6) Elukey: Fix pre-commit errors and bump version [research/recommendation-api] - https://gerrit.wikimedia.org/r/966543
[13:58:49] getting a -1 in 3..2..1..
[14:00:28] oh..
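The fix described above — letting tox be the only thing that pip-installs test dependencies, instead of mixing blubber-installed `requirements*.txt` packages with tox-managed ones — could look roughly like this. A sketch only: the envlist, file names, and commands are assumptions, not the repo's actual config.

```ini
# tox.ini (sketch)
# Deps are declared here only; the blubber "Test" variant no longer
# installs requirements*.txt itself, avoiding the mixed
# debian-pip / tox virtualenv setting described above.
[tox]
envlist = py39
skipsdist = True

[testenv]
deps =
    -r requirements.txt
    -r requirements-test.txt
commands =
    pytest {posargs}
```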
[14:00:44] (CR) CI reject: [V: -1] Add precommit support [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542 (owner: Elukey)
[14:00:54] I like this countdown approach
[14:01:04] adds suspense
[14:01:15] I can help a bit tomorrow on this front so that we can figure out what is wrong and fix it in the other images as well
[14:01:52] uff
[14:02:13] (PS1) Ilias Sarantopoulos: llm: load test langid model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967220 (https://phabricator.wikimedia.org/T340507)
[14:02:24] ah I got a -1 only in one change
[14:02:42] isaranto: the problem is that tox-managed deps collide with the ones installed via debian-pip
[14:03:02] I found a solution, namely in "Test" allow only tox to pip install
[14:03:04] and it works nicely
[14:03:45] ahhh right, I get a -1 because, of course, ruff wants to change the world
[14:03:52] and I applied the changes to another code review
[14:04:28] yeah makes sense. I don't know how we missed that. Thanks for tackling this!
[14:05:27] isaranto: it took a while sigh, but now it kinda makes sense, even if I hoped for a better integration.. we are talking about bullseye versions vs newer ones, maybe it makes sense
[14:05:31] anywayyy
[14:05:59] I'd be inclined, if you all agree, to bypass the -1 for the first change, and count on the +2 that we get later on when all the code is fixed
[14:06:17] {popcorn emoji}
[14:06:35] (CR) Elukey: "Ruff wants to change the world, and I have committed all the code changes in the next code review. I'd be inclined to skip CI for this one" [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542 (owner: Elukey)
[14:07:20] I can make the effort to make it pass in both, maybe, but it would be very painful
[14:07:38] or maybe I can just flip ruff to avoid fixing
[14:07:43] mmmm
[14:07:49] maybe it works
[14:10:32] (CR) Ilias Sarantopoulos: [C: +1] Add precommit support (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542 (owner: Elukey)
[14:11:23] I'm in, we can skip it
[14:11:39] super thanks <3
[14:12:01] I'll create a new patch after we merge this, to upgrade the pre-commit versions (ruff, black etc.) since we are introducing this to the repo now
[14:12:24] most of these tools are really stable with slow changes, perhaps only ruff is moving faster
[14:13:19] chrisalbon: the ownership of this codebase is really important, it is in need of a big refactor (and I am afraid we'll have to do it in the future)
[14:13:58] Ruff?
[14:14:27] https://github.com/astral-sh/ruff - introduced by Ilias, really nice
[14:14:53] we have the same for isvcs basically
[14:15:42] Wait, what is the codebase that needs refactoring
[14:15:50] recommendation-api
[14:16:01] we are adding basic linting/etc.. to it
[14:16:16] Ah okay, I thought you meant ruff, which seems like a hard sell to me
[14:16:19] so we enforce some quality standard (since we are changing it to host it on Lift Wing)
[14:16:29] Would it be something an intern could do with support?
[14:16:42] Because we are probably getting an intern
[14:17:10] in theory yes
[14:17:32] I mean in theory I could win the Olympics elukey, but I don't think anyone is counting on it
[14:18:09] chrisalbon: I wouldn't tell you "don't" if you wanted to try :D
[14:18:21] yes!
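A pre-commit setup of the kind being added here (and later version-bumped) typically looks like the following. The hook revisions below are placeholders for illustration, not the versions in the actual patch:

```yaml
# .pre-commit-config.yaml (sketch; pin revs to whatever the repo
# actually standardizes on)
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.0
    hooks:
      - id: ruff
  - repo: https://github.com/psf/black
    rev: 23.9.1
    hooks:
      - id: black
```

Running `pre-commit autoupdate` bumps the `rev` fields to the latest tags, which is the sort of follow-up patch mentioned above.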
[14:18:24] lol, let me get my running shoes
[14:18:39] as Ilias said in the past, the codebase is small enough that we can rework it in a reasonable amount of time
[14:18:48] but, do we want to take full ownership of it?
[14:18:52] my yes goes to both the intern question and Chris trying for the Olympics
[14:18:58] (because it would mean yes)
[14:19:46] I believe adding CI doesn't have anything to do with ownership. Whoever owns it can remove it if they want to
[14:19:55] I think I need a project to propose for an intern that isn't meaningless business work, is within the scope of experience for a pre-junior engineer, and doesn't block our main work
[14:20:05] This looks like that to me
[14:20:39] I can help come up with such ideas Chris, if u want
[14:21:05] It's due Oct 24th so if you have ideas I am all ears
[14:21:11] on a similar topic (I'm kidding, it doesn't have anything to do with this 😝) I was looking at https://github.com/locustio/locust
[14:21:29] isaranto: I didn't mean CI, but a full refactor from an intern
[14:22:02] so if we have something more fun to let them do :)
[14:22:36] an idea: load test isvcs with locust. Deploy an application where one can load test an isvc and get a concrete report
[14:24:03] that is definitely better :D
[14:27:22] Sooo nice UI with locust
[14:27:32] That is cool
[14:32:06] Lift-Wing, Machine-Learning-Team, I18n, NewFunctionality-Worktype, Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (isarantopoulos) I ran some initial load tests with [[ https://github.com/wg/wrk | wrk ]] which is available on our...
[14:34:11] load test for langid model --^ p99 lower than 20ms!
[14:34:30] wow
[14:34:31] wow
[14:34:49] Very nice
[14:35:26] that is on CPU?
[14:35:46] ah fasttext, yeah
[14:37:57] tbh I don't see a reason why we would need more pods at the moment, but we could set maxreplicas up to 4 and minreplicas to 1. wdyt?
[14:37:57] cc: klausman elukey
[14:38:58] isaranto: yes exactly, did you find a good value for the rps scaling up?
[14:39:04] if so go ahead
[14:39:52] hmm I have no idea. I would need to run some kind of distributed test, as I get rate limited above 30 rps
[14:40:14] SGTM
[14:40:19] given how quick it is, we can set something like 20/30
[14:40:28] maybe 20, then we modify
[14:41:37] elukey: btw, are there other users of the Golang build Docker image besides us?
[14:41:52] Oh yeah. A whole bunch :-/
[14:41:55] I'm afraid it will scale up for no reason
[14:42:40] yes, but we can refine it in that case; it is just to absorb traffic if it happens
[14:43:28] isaranto: did you run the load test with the internal endpoint?
[14:43:32] cool! let's do that
[14:43:36] if so you shouldn't be rate limited at 30 rps
[14:43:47] yes with internal, but there is some rate limiting
[14:44:31] no, not at 30, it was ok. I saw rate limiting above 50
[14:45:01] ah okok, makes sense, should be 100 rps / IP IIRC
[14:45:06] this is why I was puzzled
[14:45:22] I just ran another test where I increased the random delay in ms between each request and I got 3k requests in a minute, all 200s, at 50 req/s
[14:45:37] so great
[14:45:57] at this point we can have max replicas 2/3, rps: 40/50
[14:47:00] ack!
[14:48:30] ok I just verified that at 99.9 rps it started getting rate limited. Great results, and the pod didn't even sweat. https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s-mlserve&var-namespace=llm&var-pod=langid-predictor-default-00001-deployment-5b69778dc6-btg7v&var-container=All&from=now-1h&to=now
[14:50:02] the 100 rps limit was a generic one, set in place just to have $something
[14:50:11] we can change it, but I doubt we'll ever see that traffic :D
[14:50:18] and if we scale before, we'll be good
[14:50:48] Lift-Wing, Machine-Learning-Team, I18n, NewFunctionality-Worktype, Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (isarantopoulos) By increasing the random delay between responses we are able to get results close to the rate limit...
[14:58:03] Lift-Wing, Machine-Learning-Team, I18n, NewFunctionality-Worktype, Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (isarantopoulos) Based on the above results we conclude that the service can easily withstand a high amount of traff...
[14:58:29] iirc autoscaling kicks in at 75% of target rps
[14:58:56] exactly yes, something like that
[15:00:13] I'll try to dig it up and add it to our docs (if it isn't there already)
[15:00:57] going afk folks, if u need anything ping me and I'll check a bit later
[15:01:30] o/
[15:07:33] elukey: \o. Kserve 0.11.1 needs Golang v1.20 or later, so I sent a change to add a Golang 1.21 build image template. Unfortunately, I had to switch to Bookworm (from Buster), but maybe that's a good thing.
[15:08:02] mmm and what about golang on bullseye?
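The "kicks in at 75% of target rps" remark above is the target-utilization idea in Knative-style autoscaling: the autoscaler tries to keep each pod at a fraction of its configured rps target, so scale-out effectively starts before the nominal target is reached. A small sketch of the arithmetic — the 75% utilization and the 20 rps target come from the chat; treat the exact mechanism as an approximation of what the real autoscaler does, not its implementation:

```python
import math


def desired_replicas(observed_rps: float, target_rps: float,
                     utilization: float = 0.75,
                     min_replicas: int = 1, max_replicas: int = 4) -> int:
    """Approximate replica count for a target-utilization autoscaler.

    Each pod is kept at ~utilization * target_rps, so a pod is
    considered "full" at 75% of its nominal rps target (per the chat).
    """
    per_pod = target_rps * utilization
    if observed_rps <= 0:
        wanted = min_replicas
    else:
        wanted = math.ceil(observed_rps / per_pod)
    # Clamp to the configured min/max replica bounds.
    return max(min_replicas, min(max_replicas, wanted))
```

With a 20 rps target, a single pod absorbs up to 15 rps before a second one is requested; at the ~50 rps seen in the load tests this sketch would ask for 4 pods (capped by maxreplicas).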
[15:08:14] well it is good to use bookworm anyway
[15:08:16] should be fine
[15:09:38] I felt that current stable was the better option, since using oldstable would only buy us a little time, as it were
[15:10:18] Plus, when building Go stuff, only the libc version ends up in the build (unless you use cgo, but then you wouldn't use a generic Go builder image anyway)
[15:12:38] elukey: oh, one question, actually.
[15:13:05] elukey: if the script were named test.sh, it would be run automatically on image build, but I presume not naming it that was deliberate?
[15:16:20] not sure, never checked it to be honest
[15:37:03] (CR) Elukey: [C: +1] llm: load test langid model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967220 (https://phabricator.wikimedia.org/T340507) (owner: Ilias Sarantopoulos)
[15:37:20] going afk folks!
[15:37:24] have a nice rest of the day
[15:38:51] kevinbazira: re: testing with values less than 250 - at this point, since we have the variable set etc.., me or Tobias can help you lower the value manually without going through a series of deployment-charts patches etc.., so we can quickly spin up new pods and see how it goes
[15:38:56] we can do it tomorrow
[15:45:15] great. thank you for the manual test suggestion elukey. please let me know how the tests go and what candidates figure ends up working.
[15:46:07] enjoy your evening o/
[16:38:56] night all