[07:00:24] o/ Good morning!
[09:23:19] Machine-Learning-Team, Structured-Data-Backlog: Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9629389 (kevinbazira) Thank you for providing details about the logo detection project, @mfossati! The ML team is excited to explore hosting it on LiftWing. We have...
[10:33:11] morning!
[10:34:37] o/ aiko
[10:57:06] isaranto: o/ I updated https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1008858
[11:00:27] aiko: nice! I'm wondering whether we'd want to load the model on CPU if the GPU is not available, or whether we'd want it to fail
[11:01:23] I'm thinking of the following scenario: we deploy a new version, the GPU is not detected, and the model runs on CPU. We'll have no idea about it unless we check the logs
[11:01:38] in the meantime we would assume that the GPU is being used
[11:02:24] just laying out my thoughts about it. I think I'd prefer the deployment to fail, so that I know I should look into it
[11:02:33] what do you think about this?
[11:08:31] that's good, +1. I'll update the patch
[11:12:10] (PS4) AikoChou: revertrisk-ml: add a RevertRiskMultilingualGPU object [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045)
[11:22:52] aiko: sry, one last thing!
[11:23:11] what is the error that will be thrown if the GPU doesn't exist?
[11:23:35] we could do a try/except, and in the except block log the error message and raise an exception
[11:24:37] it is good practice to catch an expected exception and manually raise your own error, as it makes error messages more explicit and debugging much easier.
[11:25:21] perhaps the error message in this case is self-explanatory, but there are cases where the stack trace is not that intuitive
[11:29:35] sry for the back and forth, trying to save us from future issues (if they occur!)
[11:33:08] when I tested on stat8, it detected the GPU but encountered "RuntimeError: No HIP GPUs are available" when loading the model on the GPU; not sure why. But it works if I manually load the model in a Jupyter notebook on stat8
[11:34:51] ok, let's go with this; we'll figure it out if we need sth more!
[11:38:07] (CR) Ilias Sarantopoulos: "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[11:38:11] * klausman lunch
[11:41:22] if the GPU doesn't exist, it won't go to the RRMLGPU class; it will use the base model
[11:44:19] right!
[12:00:06] (CR) AikoChou: [C:+2] "thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[12:00:16] * aiko lunch!
[12:06:01] * isaranto lunch as well!
[12:08:23] (Merged) jenkins-bot: revertrisk-ml: add a RevertRiskMultilingualGPU object [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[13:02:38] hello folks!
[13:12:01] o/ Luca!
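Returning to the morning's fail-fast discussion: a minimal sketch of what the GPU check could look like, assuming PyTorch. The function name and error messages are illustrative, not the actual RevertRiskMultilingualGPU code.

```python
# Minimal sketch of the fail-fast idea discussed above, assuming PyTorch.
# Names and messages are illustrative, not the actual LiftWing implementation.
import logging

import torch


def load_model_on_gpu(model: torch.nn.Module) -> torch.nn.Module:
    # Fail the deployment explicitly instead of silently falling back to CPU.
    if not torch.cuda.is_available():
        raise RuntimeError("GPU requested but not detected; refusing to fall back to CPU.")
    try:
        return model.to("cuda")
    except RuntimeError as e:
        # Catch the expected exception and re-raise our own, so the failure is
        # explicit in the logs (e.g. the "No HIP GPUs are available" case).
        logging.error("Failed to load the model on the GPU: %s", e)
        raise RuntimeError(f"Model could not be loaded on the GPU: {e}") from e
```

ROCm builds of PyTorch expose the HIP backend through the same torch.cuda API, so torch.cuda.is_available() should cover the AMD GPUs as well.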
[13:20:47] Machine-Learning-Team: Set automatically libomp's num threads when using Pytorch - https://phabricator.wikimedia.org/T360111 (elukey) NEW
[13:30:31] hi Luca :)
[13:38:41] (PS3) Kevin Bazira: RRLA: upgrade KI from v5 to v6 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742)
[13:42:31] Morning all, I'm back!
[13:43:35] o/
[13:43:42] hey chris!
[13:44:02] isaranto: (if you have a minute) - I don't recall the procedure to run pytest for inference-services
[13:45:49] hey Chris!
[13:46:22] give me a bit, 'cause I'm in meetings :)
[13:46:29] sure sure :)
[13:46:39] doesn't running `pytest` do the trick?
[13:47:14] it requires a lot of deps, and stuff like "from python import etc.." fails since it doesn't find the module
[13:47:26] I recall that it was manual (creation of a venv, etc.)
[13:47:32] but I'm not sure if anything has changed
[13:48:35] ah right
[13:48:35] PYTHONPATH=$(pwd):$PYTHONPATH pytest test/unit/
[13:53:16] * elukey Ilias injects the answer via telepathy while doing meetings
[13:55:22] (CR) Kevin Bazira: RRLA: upgrade KI from v5 to v6 (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742) (owner: Kevin Bazira)
[14:00:49] I created this patch, which was WIP a while ago: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/982396
[14:01:32] super, let's open a task then
[14:01:47] (I'll do it after meetings if you want)
[14:03:44] (CR) AikoChou: "LGTM! I left one suggestion about using version tags." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742) (owner: Kevin Bazira)
[14:06:31] (PS1) Elukey: Fix lint issues highlighted by tox [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011129
[14:06:32] (PS1) Elukey: resource_utils.py: add a function to automatically set OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011130 (https://phabricator.wikimedia.org/T360111)
[14:06:34] (PS1) Elukey: readability: set automatically OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011131 (https://phabricator.wikimedia.org/T360111)
[14:08:28] I couldn't make it work at the time, and then I just left it there
[14:09:03] (PS2) Elukey: resource_utils.py: add a function to automatically set OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011130 (https://phabricator.wikimedia.org/T360111)
[14:09:05] (PS2) Elukey: readability: set automatically OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011131 (https://phabricator.wikimedia.org/T360111)
[14:12:33] (CR) Elukey: "I think that this approach should work, but I am not 100% sure yet. Worth to test it in staging in my opinion, but let me know if you have" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011131 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
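For context on the patch under review, a rough idea of what such a helper might look like. This is a hypothetical sketch: the actual change in resource_utils.py (https://gerrit.wikimedia.org/r/1011130) may differ, and the log wording is only modeled on the message quoted later in the day.

```python
# Hypothetical sketch of an OMP_NUM_THREADS helper; the real resource_utils.py
# patch may differ.
import logging
import os


def set_omp_num_threads() -> None:
    # Respect an explicit value if the operator already set one.
    if os.environ.get("OMP_NUM_THREADS"):
        return
    # sched_getaffinity reflects the CPUs the process may run on (Linux only);
    # note that it does not account for cgroup CPU quotas in Kubernetes.
    num_cpus = len(os.sched_getaffinity(0))
    os.environ["OMP_NUM_THREADS"] = str(num_cpus)
    logging.info("The OMP_NUM_THREADS var has been set to %d", num_cpus)
```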
[14:13:05] (CR) Elukey: "Not sure why these are not highlighted in CI, but I got errors while running tox locally." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011129 (owner: Elukey)
[14:15:28] klausman: o/ https://kserve.github.io/website/0.11/admin/kubernetes_deployment/ is interesting. I never really checked it, but IIUC kserve now supports running on Istio alone, without Knative
[14:16:02] not suggesting that we should proceed, but worth keeping in mind
[14:16:16] it would surely simplify the whole architecture
[14:16:27] Interesting indeed.
[14:17:11] Though as the page mentions, some scaling functionality would go away if we dropped Knative.
[14:18:26] I mean, I am all for fewer moving parts, but I don't think we have really run into Knative limitations (or I don't remember...)
[14:19:58] yes, sure, the autoscaling would not be present anymore, since it is provided by Knative
[14:20:06] it would require us to manually set the number of pods
[14:20:13] (PS4) Kevin Bazira: RRLA: upgrade KI from v5 to v6 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742)
[14:20:46] Knative is nice, but it is a big layer to maintain/upgrade/etc.
[14:21:18] it has been painful in the past to make it work correctly, and I think we should check what the autoscaler does atm, since it needed some tuning IIRC
[14:21:27] (stuff comes to mind now that I think about it)
[14:21:37] Maybe one day we'll discover an alternative that does exactly what we need without extra bits.
[14:22:46] HorizontalPodAutoscaler may be something nice, but it requires a metrics server to fetch stuff from
[14:22:52] IIRC there is a task from serviceops
[14:23:01] but it is not as evolved as Knative
[14:25:15] Machine-Learning-Team: Run unit tests for the inference-services repo in CI - https://phabricator.wikimedia.org/T360120 (elukey) NEW
[14:25:27] isaranto: --^ created
[14:26:43] 🙏 thaaanks
[14:27:11] (CR) AikoChou: [C:+1] RRLA: upgrade KI from v5 to v6 (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742) (owner: Kevin Bazira)
[14:38:07] (CR) Kevin Bazira: [C:+2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742) (owner: Kevin Bazira)
[14:46:57] (Merged) jenkins-bot: RRLA: upgrade KI from v5 to v6 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742) (owner: Kevin Bazira)
[14:57:02] elukey: do you remember if we also need to set the OMP_THREAD_LIMIT var?
[14:57:41] I see we set it for revertrisk, but I'm not sure if we need it
[14:58:37] isaranto: I think it is not needed if we set num threads; lemme check the specs for that var
[15:00:54] yeah, I think both are not needed together, but lemme know if you think otherwise
[15:01:25] are we doing the sync with research?
[15:01:31] no, I just asked because I couldn't figure out if we needed it
[15:01:50] > are we doing the sync with research?
[15:01:50] No, since we have the staff meeting
[15:02:14] I see a lot of people declined, super
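On the Knative-less option discussed above: per the kserve docs page linked at 14:15, the raw Kubernetes deployment mode can be selected per InferenceService via an annotation. A hedged sketch using the kserve Python SDK follows; it is purely illustrative (Lift Wing deploys through helm charts, not the SDK), and the service name, namespace, and storage URI are made up.

```python
# Illustrative only: selecting kserve's RawDeployment mode (no Knative) via
# the serving.kserve.io/deploymentMode annotation. Name/namespace/storage_uri
# are hypothetical placeholders.
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
)
from kubernetes import client

isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=client.V1ObjectMeta(
        name="example-model",
        namespace="example-ns",
        annotations={"serving.kserve.io/deploymentMode": "RawDeployment"},
    ),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            # Without Knative there is no request-based autoscaling, so the
            # replica count has to be managed explicitly (or via an HPA).
            min_replicas=2,
            sklearn=V1beta1SKLearnSpec(storage_uri="s3://example-bucket/model"),
        )
    ),
)
KServeClient().create(isvc)
```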
[15:32:49] (CR) Jsn.sherman: [C:+1] "This looks reasonable to me; I'm going to surface this to my team as part of a discussion about instrumentation for our work. Really my on" [extensions/ORES] - https://gerrit.wikimedia.org/r/994194 (https://phabricator.wikimedia.org/T356158) (owner: Kosta Harlan)
[15:51:17] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9631576 (isarantopoulos) I've managed to make it work with a model available on disk (which means no connection to HF repo). The issues I faced were specific to the example...
[16:40:44] I ended up being unable to build images locally anymore
[16:41:03] `OSError: [Errno 28] No space left on device`
[16:41:33] so after pruning all the images: `Total reclaimed space: 184.2GB` 🔄
[16:42:26] Machine-Learning-Team, Observability-Metrics: SLO dashboards for Lift Wing showing unexpected values - https://phabricator.wikimedia.org/T359879#9631756 (elukey)
[16:42:48] :D
[16:42:53] klausman: fyi --^
[16:43:28] while checking some Thanos metrics, I noticed that the evaluation time for the Istio latency SLI is around 30s every time :(
[16:43:34] it is probably due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/989458
[16:43:46] we are crunching too many metrics on the Thanos siee
[16:43:48] *side
[16:44:12] it shouldn't be the cause of the issue that we are seeing, but I'm not sure it is sustainable long term
[16:44:13] Mh, that's hard to address
[16:44:46] Unless we want to rewrite metrics at scrape time, but that is only moving the problem from Thanos to our Prometheus
[16:44:48] we can keep fewer le buckets, to avoid having all of them
[16:45:11] it was around a few seconds in Dec, IIRC
[16:45:17] Do you think reducing the number of buckets would be enough?
[16:45:29] it will improve things for sure
[16:45:48] every le bucket greatly increases the number of metrics that we process in Thanos
[16:46:49] We just have to make sure not to drop +Inf
[16:47:02] yes, that one for sure
[16:47:19] And then up to what? 30s?
[16:47:26] 5s was definitely too low
[16:48:05] 5s was chosen since we wanted to have SLOs for HTTP calls of up to 5 seconds; we can add 30s as well
[16:48:25] more is probably not needed
[16:48:28] There may be interesting buckets in between, lemme check
[16:52:22] So it's (in ms): 0.5 1 5 10 25 50 100 250 500 1000 2500 5000 10000 30000 60000
[16:52:58] I am not sure we care about <100ms, but I could be convinced to start with 50ms.
[16:53:29] so (50?) 100 250 500 1000 2500 5000 10000 30000?
[16:53:43] And +Inf, of course
[16:54:06] That would get us from 20 buckets to 9 or 10
[16:56:08] any specific requirement when you re-added them all?
[16:56:58] We should aim for 7/8 in my opinion
[16:57:02] I think it was mostly that a) we didn't have +Inf at all, and b) a service I was looking at was just over 5s in a non-trivial number of cases, so it was kinda useless
[16:57:28] Thing is, with +Inf we can drop the RR on like 170~172 in the patch you linked
[16:57:33] line*
[16:57:38] sure, +Inf was missing, but going from 4 to 20 buckets was a big jump; this is why I am asking :)
[16:57:41] I'll make a patch with an explanation
[16:58:28] can you please sync in here first? Then we can decide what to do
[17:01:18] (PS3) Ilias Sarantopoulos: resource_utils.py: add a function to automatically set OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011130 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[17:01:30] I think we should have the discussion on the patch, so it's not lost (IRC is hard to search)
[17:02:17] ...
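To make the +Inf point above concrete: histogram buckets are cumulative, and the +Inf bucket carries the total observation count, so quantile and ratio math breaks without it. Below is a toy Python re-implementation of histogram_quantile-style interpolation over cumulative buckets; the counts are made up, and PromQL's edge-case handling differs slightly.

```python
# Toy illustration of why +Inf must be kept: buckets are cumulative (le_ms,
# count), and the +Inf bucket holds the total count. Data is made up.
buckets = [(100, 40), (250, 70), (500, 90), (1000, 97), (float("inf"), 100)]


def quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    total = buckets[-1][1]  # the +Inf bucket count == total observations
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # quantile falls beyond the last finite bucket
            # linear interpolation inside the bucket, as PromQL does
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le


print(quantile(0.95, buckets))  # ~857.1 ms with this toy data
```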
[17:03:00] I mean, some form of reverting (ish) the patch you linked is necessary, no?
[17:03:58] my point is that we can just change the le filter to something less heavy, add the motivation in the commit msg, and submit to observability
[17:04:13] That is what I meant
[17:04:20] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1011146
[17:06:14] sure, but we didn't discuss any way forward
[17:06:17] I just now realized something: the superfluous rule on lines 170-172 meant that we were recording every bucket twice!
[17:06:45] we cannot remove it until we have it in the dashboards
[17:06:56] even if they are not 100% correct now
[17:07:06] Alright, will put it back in
[17:07:26] the other thing to discuss is
[17:07:27] le=~"(50|100|250|500|1000|2500|5000|10000|30000)"
[17:07:33] in my opinion they are too many
[17:07:54] and also, is +Inf accounted for?
[17:08:21] (CR) Ilias Sarantopoulos: [C:+1] "This is a nice idea!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011130 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[17:08:36] I added +Inf in an updated patchset
[17:09:14] I am also having second thoughts about the second rule. We actually need it.
[17:09:41] The first rule has the cumulative latency of the assorted buckets. To do the math right, we need the counts as well.
[17:11:59] for the buckets, we can drop 50ms and 30s, I think
[17:12:15] That would make eight buckets in all.
[17:12:42] also, afaics the "le" label is not present in _count, so it was probably added by mistake (almost surely by me)
[17:13:48] I can drop that as well
[17:14:13] let's do this - we can keep the current list of labels and see how it impacts the evaluation time in Thanos
[17:14:18] having it in the "sum by" doesn't break it, but it's misleading
[17:14:20] if it reaches a decent performance, we may keep them
[17:14:28] yes yes, I agree, you can remove it
[17:14:31] So including 50ms and 30s?
[17:14:47] seems ok for the moment, we can always trim it down
[17:14:52] ack
[17:15:28] oops, need to fix the commit msg
[17:15:45] (PS4) Elukey: resource_utils.py: add a function to automatically set OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011130 (https://phabricator.wikimedia.org/T360111)
[17:16:06] There. Should be ready.
[17:16:20] isaranto: o/ sorry, I rebased the change on top of the first one in the chain; not sure what happened, but when you rebased it got lost
[17:16:48] (PS3) Elukey: readability: set automatically OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011131 (https://phabricator.wikimedia.org/T360111)
[17:17:44] (PS2) Elukey: Fix lint issues highlighted by tox [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011129
[17:18:08] (PS5) Elukey: resource_utils.py: add a function to automatically set OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011130 (https://phabricator.wikimedia.org/T360111)
[17:18:18] (PS4) Elukey: readability: set automatically OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011131 (https://phabricator.wikimedia.org/T360111)
[17:18:57] sry, I just hit the rebase button
[17:19:44] np!
[17:19:57] this is weird: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1011129/2
[17:20:02] I haven't opened my patch yet, as I'm testing different environment variables so that both local and remote models work
[17:20:08] not sure why my local tox complained
[17:20:51] ah yes, it is because these files are probably not in the test image that CI checks
[17:21:00] ahh okok
[17:21:10] ideally we'd have it check all files in the repo
[17:21:20] (CR) Ilias Sarantopoulos: [C:+1] Fix lint issues highlighted by tox [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011129 (owner: Elukey)
[17:21:58] <#
[17:21:59] <4
[17:22:00] Machine-Learning-Team, observability: Istio recording rules for Pyrra and Grizzly - https://phabricator.wikimedia.org/T351390#9631971 (elukey)
[17:22:02] aaaahhhhh
[17:22:04] <3 :D
[17:22:06] Machine-Learning-Team, Observability-Metrics, Patch-For-Review: SLO dashboards for Lift Wing showing unexpected values - https://phabricator.wikimedia.org/T359879#9631972 (elukey)
[17:22:25] Time to put down the keyboard and enjoy the evening? ;)
[17:22:34] in a bit, yes :)
[17:22:48] I'm going afk, folks; I'll finish up the image work tomorrow. If you look at the patch you'll see there isn't much in it, but I've tried a gazillion things in the process: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1009783
[17:22:51] :)
[17:23:02] (CR) Elukey: [C:+2] Fix lint issues highlighted by tox [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011129 (owner: Elukey)
[17:23:05] enjoy your eveing, cu tomorrow!
[17:23:12] *evening/rest of day
[17:23:48] (Merged) jenkins-bot: Fix lint issues highlighted by tox [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011129 (owner: Elukey)
[17:23:54] enjoy your evening, Ilias
[17:25:20] going afk as well, cu tomorrow folks!
[17:27:16] \o heading out as well
[17:35:41] o/
[17:52:21] (CR) AikoChou: [C:+1] resource_utils.py: add a function to automatically set OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011130 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[18:51:07] (CR) AikoChou: "I tested it on stat1008 locally. It seems that the set_omp_num_threads() here was not executed. I didn't see any logs like "The OMP_NUM_TH" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011131 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[18:52:12] logging off! I'll keep working on the error handling for batch prediction tomorrow :)
[19:08:55] (PS1) Umherirrender: Type hint IReadableDatabase in WatchedItemQueryServiceExtension [extensions/ORES] - https://gerrit.wikimedia.org/r/1011172
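One possible explanation for the set_omp_num_threads() log never appearing in AikoChou's test above (an assumption, not a confirmed diagnosis): libomp reads OMP_NUM_THREADS when it is first initialized, so the variable only takes effect if it is set before torch (or any other OpenMP user) is imported anywhere in the process.

```python
# Assumption-based sketch: OMP_NUM_THREADS only takes effect if exported
# before libomp initializes, i.e. before the torch import.
import os

os.environ["OMP_NUM_THREADS"] = "4"  # must happen before importing torch

import torch  # noqa: E402  (libomp picks up the value during initialization)

print(torch.get_num_threads())  # reflects the intra-op thread pool size
```

If the helper runs after the model-server modules have already imported torch, the env var would be set too late, which would match the behavior seen on stat1008.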