[06:26:08] morning folks
[06:40:22] Morning!
[06:44:13] Machine-Learning-Team, observability: Some Istio recording rules may be missing data - https://phabricator.wikimedia.org/T349072 (elukey) I had a chat with Filippo over IRC, and the suspicion is that the long poll time (~30s) of the Istio recording rules has a toll on Thanos, ending up in slowness and ho...
[06:47:17] isaranto: I added you, Kevin and Aiko to https://phabricator.wikimedia.org/project/members/1586/, you should now be able to see the procurement tasks in the board
[06:47:25] lemme know if it works when you have a moment
[06:54:31] thanks! it works!
[06:58:31] super
[07:21:47] Machine-Learning-Team, observability: Some Istio recording rules may be missing data - https://phabricator.wikimedia.org/T349072 (fgiunchedi) Thank you for the detailed investigation @elukey ! I agree on both counts, definitely reducing the number of labels at ingestion time will help things, and so will...
[07:33:35] Machine-Learning-Team, observability: Some Istio recording rules may be missing data - https://phabricator.wikimedia.org/T349072 (elukey) @fgiunchedi yes definitely, I have some doubts too, I proposed the reduced `le` since this is currently our SLI/SLO defs: ` "Lift Wing Revert Risk MultiLingual se...
[07:44:14] Machine-Learning-Team: Apply multi-processing to preprocess() in isvcs that suffer from high latency - https://phabricator.wikimedia.org/T349274 (elukey)
[07:44:30] as promised I created --^, hope it is not missing anything
[08:14:16] Machine-Learning-Team, observability: Some Istio recording rules may be missing data - https://phabricator.wikimedia.org/T349072 (fgiunchedi) Thank you for the context, which makes sense to me. If we indeed end up reducing the `le` labels and the recording rule used only for SLO/SLI we should consider al...
[08:18:39] Machine-Learning-Team: How impactful would pre-save automoderation be on edit save times? - https://phabricator.wikimedia.org/T299436 (Samwalton9-WMF) Open→Resolved I think @calbon has looked into this far enough that we understand this is a viable approach, but we've decided to go the route of rever...
[08:48:22] Machine-Learning-Team: Configure envoy settings to enable rec-api-ng container to access endpoints external to k8s/LiftWing - https://phabricator.wikimedia.org/T348607 (kevinbazira) We configured envoy settings and the rec-api-ng container hosted on LiftWing is able to access endpoints external to k8s/LiftWi...
[08:48:35] (PS1) Ilias Sarantopoulos: llm: fix No module named 'pkg_resources' - add setuptools [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967141 (https://phabricator.wikimedia.org/T349163)
[08:52:31] The multiprocessing ticket description is great!
[08:53:05] I added a fix/patch for the llm image, similar to the one we encountered with readability/revertrisk after the kserve upgrade
[08:54:04] (PS2) Ilias Sarantopoulos: llm: fix No module named 'pkg_resources' - add setuptools [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967141 (https://phabricator.wikimedia.org/T349163)
[09:09:15] Machine-Learning-Team: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (kevinbazira) We resolved the envoy proxy issues as shown in T348607#9264192. However, the rec-api instance hosted on LiftWing is still not performing as expected. The container logs...
[09:28:33] (CR) Kevin Bazira: [C: +1] llm: fix No module named 'pkg_resources' - add setuptools [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967141 (https://phabricator.wikimedia.org/T349163) (owner: Ilias Sarantopoulos)
[09:55:50] (CR) Ilias Sarantopoulos: [C: +2] llm: fix No module named 'pkg_resources' - add setuptools [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967141 (https://phabricator.wikimedia.org/T349163) (owner: Ilias Sarantopoulos)
[09:56:35] (Merged) jenkins-bot: llm: fix No module named 'pkg_resources' - add setuptools [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967141 (https://phabricator.wikimedia.org/T349163) (owner: Ilias Sarantopoulos)
[10:25:21] * isaranto afk lunch
[10:49:39] * elukey lunch!
[11:52:20] kevinbazira: I agree to test more resources on staging, but I'm thinking that maybe we should explore cutting down the search space, which will limit the amount of resources required
[12:26:10] (PS1) Kevin Bazira: Reduce the number candidates processed by the rec-api-ng [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475)
[12:33:55] (CR) Ilias Sarantopoulos: [C: +1] "LGTM!" [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[12:34:04] sure isaranto, I am not sure how reducing the search space affects the results returned by the api, but I have pushed a patch that reduces the maximum number of candidates to 250 as per your advice. I've tagged you as a reviewer, please review whenever you get a minute.
[12:37:05] kevinbazira: feel free to discuss here before implementing stuff. We're not in a hurry :) We just want to find the best solution in a reasonable amount of time.
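The `pkg_resources` fix merged above follows a common pattern: `pkg_resources` is provided by `setuptools`, and newer Python base images and virtualenvs no longer ship `setuptools` implicitly, so anything importing `pkg_resources` fails until the dependency is declared explicitly. A sketch of what such a requirements change might look like (the pin is illustrative, not the one from the actual patch):

```
# requirements.txt
# pkg_resources lives in setuptools; declare it explicitly instead of
# relying on the base image or virtualenv to provide it.
# (version pin is illustrative; the real patch may differ)
setuptools>=65.0.0
```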
[12:38:15] regarding the suggestion, my understanding is that for each candidate we perform a task, so limiting the number of candidates would increase performance; it is the equivalent of increasing cpus
[12:39:13] then again, I'm not that familiar with the codebase so I might be wrong. In that case discussing here helps
[12:40:42] yes, isaranto, I understand the part where reducing the candidates processed enables the rec-api to use fewer compute resources. The part I don't understand is whether the quality of the rec-api recommendations will be impacted positively or negatively.
[12:40:56] (CR) Kevin Bazira: Reduce the number candidates processed by the rec-api-ng (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[12:44:40] ok, I understand. Of course quality would be reduced (although we don't have evidence that this happens). But even if it does, it is a compromise we are making in order to be able to deploy a scalable service. Research has a say in this, and since the initial suggestion came from Isaac I trust that we can go in this direction
[12:45:15] we can ask if there is any way to evaluate once we expose the service
[12:49:32] Machine-Learning-Team, Item Quality Evaluator, Wikidata, Wikidata Dev Team, and 2 others: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419 (ItamarWMDE)
[12:49:43] (CR) Ilias Sarantopoulos: [C: +1] Reduce the number candidates processed by the rec-api-ng (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[12:50:22] Machine-Learning-Team, Item Quality Evaluator, Wikidata, wmde-wikidata-tech, and 2 others: Update API calls from ORES to Lift Wing - https://phabricator.wikimedia.org/T343731 (ItamarWMDE) Open→Resolved Verified and deployed, thank you for all the hard work!
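The candidate-capping idea discussed above — fewer candidates means less work per request, traded against possible recommendation quality — can be sketched as follows. This is a hypothetical illustration, not the actual rec-api code: the variable name `MAX_CANDIDATES` and the default of 250 mirror the chat, but the real patch may use different names.

```python
import os

# Default from the patch discussed above (reduced from 500 to 250).
DEFAULT_MAX_CANDIDATES = 250


def get_max_candidates() -> int:
    """Read the cap from an environment variable to ease experimenting,
    falling back to the default when it is unset.
    (MAX_CANDIDATES is an assumed name for illustration.)"""
    return int(os.environ.get("MAX_CANDIDATES", DEFAULT_MAX_CANDIDATES))


def limit_candidates(candidates: list) -> list:
    """Truncate the search space before the per-candidate work happens.

    Each candidate triggers downstream processing, so capping the list
    bounds CPU usage per request, at a possible cost in result quality.
    """
    return candidates[: get_max_candidates()]
```

Driving the cap through the environment means it can be tuned per deployment (e.g. in staging) without a new code change, which matches the "ease experimenting" goal mentioned later in the log.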
[12:54:02] 🎉 --^
[13:04:51] (PS2) Kevin Bazira: Reduce the number candidates processed by the rec-api-ng [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475)
[13:06:56] (CR) Kevin Bazira: Reduce the number candidates processed by the rec-api-ng (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[13:08:16] sure isaranto, I have reduced the candidates processed from 500 to 250 and used an OS environment variable to ease experimenting, as per your advice.
[13:13:51] (CR) Ilias Sarantopoulos: [C: +1] Reduce the number candidates processed by the rec-api-ng (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[13:16:47] Morning all
[13:19:12] (PS3) Kevin Bazira: Reduce the number candidates processed by the rec-api-ng [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475)
[13:21:47] (CR) Kevin Bazira: Reduce the number candidates processed by the rec-api-ng (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[13:22:05] hi Chris o/
[13:27:54] Hey !
[13:28:45] (CR) Ilias Sarantopoulos: "Nice!" [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[13:30:37] (CR) Kevin Bazira: [C: +2] "Thanks for the review, Ilias. Going to merge now, then work on the deployment config to test this on staging." [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[13:31:10] (Merged) jenkins-bot: Reduce the number candidates processed by the rec-api-ng [research/recommendation-api] - https://gerrit.wikimedia.org/r/966831 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[13:31:51] ouf, finally got my lua script working! 😓
[13:38:37] good job! :)
[13:39:48] so I think that the weird issues that we see in CI are related to the fact that we pip install some deps via blubber, and use tox deps as well
[13:39:56] the result is a weird mixed setting
[13:40:46] (like the virtualenv issue)
[13:40:55] I am experimenting with having deps only managed by tox
[13:44:05] (so we don't install anything requirements*.txt related via blubber)
[13:49:04] yeah it works :)
[13:58:05] (PS6) Elukey: Add precommit support [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542
[13:58:07] (PS6) Elukey: Fix pre-commit errors and bump version [research/recommendation-api] - https://gerrit.wikimedia.org/r/966543
[13:58:49] getting a -1 in 3..2..1..
[14:00:28] oh..
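The fix described above — letting tox be the only thing that pip-installs test dependencies, instead of mixing blubber-installed `requirements*.txt` packages with tox-managed ones — could look roughly like this. A sketch only: the envlist, file names, and commands are assumptions, not the repo's actual config.

```ini
# tox.ini (sketch)
# Deps are declared here only; the blubber "Test" variant no longer
# installs requirements*.txt itself, avoiding the mixed
# debian-pip / tox virtualenv setting described above.
[tox]
envlist = py39
skipsdist = True

[testenv]
deps =
    -r requirements.txt
    -r requirements-test.txt
commands =
    pytest {posargs}
```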
[14:00:44] (CR) CI reject: [V: -1] Add precommit support [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542 (owner: Elukey)
[14:00:54] I like this countdown approach
[14:01:04] adds suspense
[14:01:15] I can help a bit tomorrow on this front so that we can figure out what is wrong and fix it in the other images as well
[14:01:52] uff
[14:02:13] (PS1) Ilias Sarantopoulos: llm: load test langid model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967220 (https://phabricator.wikimedia.org/T340507)
[14:02:24] ah I got a -1 only in one change
[14:02:42] isaranto: the problem is that tox-managed deps collide with the ones installed via debian-pip
[14:03:02] I found a solution, namely in "Test" allow only tox to pip install
[14:03:04] and it works nicely
[14:03:45] ahhh right, I get a -1 because, of course, ruff wants to change the world
[14:03:52] and I applied the changes to another code review
[14:04:28] yeah makes sense. I don't know how we missed that. Thanks for tackling this!
[14:05:27] isaranto: it took a while sigh, but now it kinda makes sense, even if I hoped for a better integration.. we are talking about bullseye versions vs newer ones, maybe it makes sense
[14:05:31] anywayyy
[14:05:59] I'd be inclined, if you all agree, to bypass the -1 for the first change, and count on the +2 that we get later on when all the code is fixed
[14:06:17] {popcorn emoji}
[14:06:35] (CR) Elukey: "Ruff wants to change the world, and I have committed all the code changes in the next code review. I'd be inclined to skip CI for this one" [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542 (owner: Elukey)
[14:07:20] I can make the effort to make it pass in both, maybe, but it would be very painful
[14:07:38] or maybe I can just flip ruff to avoid fixing
[14:07:43] mmmm
[14:07:49] maybe it works
[14:10:32] (CR) Ilias Sarantopoulos: [C: +1] Add precommit support (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542 (owner: Elukey)
[14:11:23] I'm in, we can skip it
[14:11:39] super thanks <3
[14:12:01] I'll create a new patch after we merge this, to upgrade the pre-commit versions (ruff, black etc.) since we are introducing this to the repo now
[14:12:24] most of these tools are really stable with slow changes, perhaps only ruff is moving faster
[14:13:19] chrisalbon: the ownership of this codebase is really important, it is in need of a big refactor (and I am afraid we'll have to do it in the future)
[14:13:58] Ruff?
[14:14:27] https://github.com/astral-sh/ruff - introduced by Ilias, really nice
[14:14:53] we have the same for isvcs basically
[14:15:42] Wait, what is the codebase that needs refactoring
[14:15:50] recommendation-api
[14:16:01] we are adding basic linting/etc.. to it
[14:16:16] Ah okay, I thought you meant ruff, which seems like a hard sell to me
[14:16:19] so we enforce some quality standard (since we are changing it to host it on Lift Wing)
[14:16:29] Would it be something an intern could do with support?
[14:16:42] Because we are probably getting an intern
[14:17:10] in theory yes
[14:17:32] I mean in theory I could win the Olympics elukey, but I don't think anyone is counting on it
[14:18:09] chrisalbon: I wouldn't tell you "don't" if you wanted to try :D
[14:18:21] yes!
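A pre-commit setup of the kind being added here (and later version-bumped) typically looks like the following. The hook revisions below are placeholders for illustration, not the versions in the actual patch:

```yaml
# .pre-commit-config.yaml (sketch; pin revs to whatever the repo
# actually standardizes on)
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.0
    hooks:
      - id: ruff
  - repo: https://github.com/psf/black
    rev: 23.9.1
    hooks:
      - id: black
```

Running `pre-commit autoupdate` bumps the `rev` fields to the latest tags, which is the sort of follow-up patch mentioned above.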
[14:18:24] lol, let me get my running shoes
[14:18:39] as Ilias said in the past, the codebase is small enough that we can rework it in a reasonable amount of time
[14:18:48] but, do we want to take full ownership of it?
[14:18:52] my yes goes to both the intern question and Chris trying for the Olympics
[14:18:58] (because it would mean yes)
[14:19:46] I believe adding CI doesn't have anything to do with ownership. Whoever owns it can remove it if they want to
[14:19:55] I think I need a project to propose for an intern that isn't meaningless business work, is within the scope of experience for a pre-junior engineer, and doesn't block our main work
[14:20:05] This looks like that to me
[14:20:39] I can help come up with such ideas Chris, if u want
[14:21:05] It's due Oct 24th so if you have ideas I am all ears
[14:21:11] on a similar topic (I'm kidding, it doesn't have anything to do with this 😝) I was looking at https://github.com/locustio/locust
[14:21:29] isaranto: I didn't mean CI, but a full refactor from an intern
[14:22:02] so if we have something more fun to let them do :)
[14:22:36] an idea: load test isvcs with locust. Deploy an application where one can load test an isvc and get a concrete report
[14:24:03] that is definitely better :D
[14:27:22] Sooo nice UI with locust
[14:27:32] That is cool
[14:32:06] Lift-Wing, Machine-Learning-Team, I18n, NewFunctionality-Worktype, Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (isarantopoulos) I ran some initial load tests with [[ https://github.com/wg/wrk | wrk ]] which is available on our...
[14:34:11] load test for langid model --^ p99 lower than 20ms!
[14:34:30] wow
[14:34:31] wow
[14:34:49] Very nice
[14:35:26] that is on CPU?
[14:35:46] ah fasttext, yeah
[14:37:57] tbh I don't see a reason why we would need more pods at the moment, but we could set maxreplicas up to 4 and minreplicas to 1. wdyt?
[14:37:57] cc: klausman elukey
[14:38:58] isaranto: yes exactly, did you find a good value for the rps scaling up?
[14:39:04] if so go ahead
[14:39:52] hmm I have no idea. I would need to run some kind of distributed test, as I get rate limited above 30 rps
[14:40:14] SGTM
[14:40:19] given how quick it is, we can set something like 20/30
[14:40:28] maybe 20, then we modify
[14:41:37] elukey: btw, are there other users of the Golang build Docker image besides us?
[14:41:52] Oh yeah. A whole bunch :-/
[14:41:55] I'm afraid it will scale up for no reason
[14:42:40] yes, but we can refine it in that case; it is just to absorb traffic if it happens
[14:43:28] isaranto: did you run the load test with the internal endpoint?
[14:43:32] cool! let's do that
[14:43:36] if so you shouldn't be rate limited at 30 rps
[14:43:47] yes with internal, but there is some rate limiting
[14:44:31] no, not at 30, it was ok. I saw rate limiting above 50
[14:45:01] ah okok, makes sense, should be 100 rps / IP IIRC
[14:45:06] this is why I was puzzled
[14:45:22] I just ran another test where I increased the random delay in ms between each request and I got 3k requests in a minute, all 200s, at 50 req/s
[14:45:37] so great
[14:45:57] at this point we can have max replicas 2/3, rps: 40/50
[14:47:00] ack!
[14:48:30] ok I just verified that at 99.9 rps it started getting rate limited. Great results, and the pod didn't even sweat. https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s-mlserve&var-namespace=llm&var-pod=langid-predictor-default-00001-deployment-5b69778dc6-btg7v&var-container=All&from=now-1h&to=now
[14:50:02] the 100 rps limit was a generic one, set in place just to have $something
[14:50:11] we can change it, but I doubt we'll ever see that traffic :D
[14:50:18] and if we scale before, we'll be good
[14:50:48] Lift-Wing, Machine-Learning-Team, I18n, NewFunctionality-Worktype, Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (isarantopoulos) By increasing the random delay between responses we are able to get results close to the rate limit...
[14:58:03] Lift-Wing, Machine-Learning-Team, I18n, NewFunctionality-Worktype, Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (isarantopoulos) Based on the above results we conclude that the service can easily withstand a high amount of traff...
[14:58:29] iirc autoscaling kicks in at 75% of target rps
[14:58:56] exactly yes, something like that
[15:00:13] I'll try to dig it up and add it to our docs (if it isn't there already)
[15:00:57] going afk folks, if u need anything ping me and I'll check a bit later
[15:01:30] o/
[15:07:33] elukey: \o. Kserve 0.11.1 needs Golang v1.20 or later, so I sent a change to add a Golang 1.21 build image template. Unfortunately, I had to switch to Bookworm (from Buster), but maybe that's a good thing.
[15:08:02] mmm and what about golang on bullseye?
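The "kicks in at 75% of target rps" remark above is the target-utilization idea in Knative-style autoscaling: the autoscaler tries to keep each pod at a fraction of its configured rps target, so scale-out effectively starts before the nominal target is reached. A small sketch of the arithmetic — the 75% utilization and the 20 rps target come from the chat; treat the exact mechanism as an approximation of what the real autoscaler does, not its implementation:

```python
import math


def desired_replicas(observed_rps: float, target_rps: float,
                     utilization: float = 0.75,
                     min_replicas: int = 1, max_replicas: int = 4) -> int:
    """Approximate replica count for a target-utilization autoscaler.

    Each pod is kept at ~utilization * target_rps, so a pod is
    considered "full" at 75% of its nominal rps target (per the chat).
    """
    per_pod = target_rps * utilization
    if observed_rps <= 0:
        wanted = min_replicas
    else:
        wanted = math.ceil(observed_rps / per_pod)
    # Clamp to the configured min/max replica bounds.
    return max(min_replicas, min(max_replicas, wanted))
```

With a 20 rps target, a single pod absorbs up to 15 rps before a second one is requested; at the ~50 rps seen in the load tests this sketch would ask for 4 pods (capped by maxreplicas).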
[15:08:14] well it is good to use bookworm anyway
[15:08:16] should be fine
[15:09:38] I felt that current stable was the better option, since using oldstable would only buy us a little time, as it were
[15:10:18] Plus, when building Go stuff, only the libc version ends up in the build (unless you use cgo, but then you wouldn't use a generic Go builder image anyway)
[15:12:38] elukey: oh, one question, actually.
[15:13:05] elukey: if the script were named test.sh, it would be run automatically on image build, but I presume not naming it that was deliberate?
[15:16:20] not sure, never checked it to be honest
[15:37:03] (CR) Elukey: [C: +1] llm: load test langid model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967220 (https://phabricator.wikimedia.org/T340507) (owner: Ilias Sarantopoulos)
[15:37:20] going afk folks!
[15:37:24] have a nice rest of the day
[15:38:51] kevinbazira: re: testing with values less than 250 - at this point, since we have the variable set etc.., me or Tobias can help you lower the value manually without going through a series of deployment-charts patches etc.., so we can quickly spin up new pods and see how it goes
[15:38:56] we can do it tomorrow
[15:45:15] great. thank you for the manual test suggestion elukey. please let me know how the tests go and what candidates figure ends up working.
[15:46:07] enjoy your evening o/
[16:38:56] night all