[07:00:24] o/ Good morning!
[09:23:19] Machine-Learning-Team, Structured-Data-Backlog: Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9629389 (kevinbazira) Thank you for providing details about the logo detection project, @mfossati! The ML team is excited to explore hosting it on LiftWing. We have...
[10:33:11] morning!
[10:34:37] o/ aiko
[10:57:06] isaranto: o/ I updated https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1008858
[11:00:27] aiko: nice! I'm wondering whether we'd want to load the model on CPU if the GPU is not available, or whether we'd want it to fail
[11:01:23] I'm thinking of the following scenario: we deploy a new version, the GPU is not detected, and the model runs on CPU. We'll have no idea about it unless we check the logs
[11:01:38] in the meantime we would assume that the GPU is being used
[11:02:24] just laying out my thoughts about it. I think I'd prefer the deployment to fail, so that I know I should look into it
[11:02:33] what do you think about this?
[11:08:31] that's good, +1. I'll update the patch
[11:12:10] (PS4) AikoChou: revertrisk-ml: add a RevertRiskMultilingualGPU object [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045)
[11:22:52] aiko: sry, one last thing!
[11:23:11] what is the error that will be thrown if the GPU doesn't exist?
[11:23:35] we could do a try/except, and in the except block log the error message and raise an exception
[11:24:37] it is good practice to catch an expected exception and manually raise your own error, as it makes error messages more explicit and debugging much easier.
[11:25:21] perhaps the error message in this case is self-explanatory, but there are cases where the stack trace is not that intuitive
[11:29:35] sry for the back and forth, trying to save us from future issues (if they occur!)
[11:33:08] when I tested on stat8, it detected the GPU but encountered "RuntimeError: No HIP GPUs are available" when loading the model on the GPU; not sure why. But it works if I manually load the model in a Jupyter notebook on stat8
[11:34:51] ok, let's go with this; we'll figure it out if we need sth more!
[11:38:07] (CR) Ilias Sarantopoulos: "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[11:38:11] * klausman lunch
[11:41:22] if the GPU doesn't exist, it won't go to the RRMLGPU class; it will use the base model
[11:44:19] right!
[12:00:06] (CR) AikoChou: [C:+2] "thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[12:00:16] * aiko lunch!
[12:06:01] * isaranto lunch as well!
[12:08:23] (Merged) jenkins-bot: revertrisk-ml: add a RevertRiskMultilingualGPU object [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[13:02:38] hello folks!
[13:12:01] o/ Luca!
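Returning to the morning's fail-fast discussion: a minimal sketch of what the GPU check could look like, assuming PyTorch. The function name and error messages are illustrative, not the actual RevertRiskMultilingualGPU code.

```python
# Minimal sketch of the fail-fast idea discussed above, assuming PyTorch.
# Names and messages are illustrative, not the actual LiftWing implementation.
import logging

import torch


def load_model_on_gpu(model: torch.nn.Module) -> torch.nn.Module:
    # Fail the deployment explicitly instead of silently falling back to CPU.
    if not torch.cuda.is_available():
        raise RuntimeError("GPU requested but not detected; refusing to fall back to CPU.")
    try:
        return model.to("cuda")
    except RuntimeError as e:
        # Catch the expected exception and re-raise our own, so the failure is
        # explicit in the logs (e.g. the "No HIP GPUs are available" case).
        logging.error("Failed to load the model on the GPU: %s", e)
        raise RuntimeError(f"Model could not be loaded on the GPU: {e}") from e
```

ROCm builds of PyTorch expose the HIP backend through the same torch.cuda API, so torch.cuda.is_available() should cover the AMD GPUs as well.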
[13:20:47] Machine-Learning-Team: Set automatically libomp's num threads when using Pytorch - https://phabricator.wikimedia.org/T360111 (elukey) NEW
[13:30:31] hi Luca :)
[13:38:41] (PS3) Kevin Bazira: RRLA: upgrade KI from v5 to v6 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742)
[13:42:31] Morning all, I'm back!
[13:43:35] o/
[13:43:42] hey chris!
[13:44:02] isaranto: (if you have a minute) - I don't recall the procedure to run pytest for inference-services
[13:45:49] hey Chris!
[13:46:22] give me a bit, 'cause I'm in meetings :)
[13:46:29] sure sure :)
[13:46:39] doesn't running `pytest` do the trick?
[13:47:14] it requires a lot of deps, and stuff like "from python import etc.." fails since it doesn't find the module
[13:47:26] I recall that it was manual (creation of a venv, etc.)
[13:47:32] but I'm not sure if anything has changed
[13:48:35] ah right
[13:48:35] PYTHONPATH=$(pwd):$PYTHONPATH pytest test/unit/
[13:53:16] * elukey Ilias injects the answer via telepathy while doing meetings
[13:55:22] (CR) Kevin Bazira: RRLA: upgrade KI from v5 to v6 (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742) (owner: Kevin Bazira)
[14:00:49] I created this patch, which was WIP a while ago: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/982396
[14:01:32] super, let's open a task then
[14:01:47] (I'll do it after meetings if you want)
[14:03:44] (CR) AikoChou: "LGTM! I left one suggestion about using version tags." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742) (owner: Kevin Bazira)
[14:06:31] (PS1) Elukey: Fix lint issues highlighted by tox [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011129
[14:06:32] (PS1) Elukey: resource_utils.py: add a function to automatically set OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011130 (https://phabricator.wikimedia.org/T360111)
[14:06:34] (PS1) Elukey: readability: set automatically OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011131 (https://phabricator.wikimedia.org/T360111)
[14:08:28] I couldn't make it work at the time, and then I just left it there
[14:09:03] (PS2) Elukey: resource_utils.py: add a function to automatically set OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011130 (https://phabricator.wikimedia.org/T360111)
[14:09:05] (PS2) Elukey: readability: set automatically OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011131 (https://phabricator.wikimedia.org/T360111)
[14:12:33] (CR) Elukey: "I think that this approach should work, but I am not 100% sure yet. Worth to test it in staging in my opinion, but let me know if you have" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011131 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
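For context on the patch under review, a rough idea of what such a helper might look like. This is a hypothetical sketch: the actual change in resource_utils.py (https://gerrit.wikimedia.org/r/1011130) may differ, and the log wording is only modeled on the message quoted later in the day.

```python
# Hypothetical sketch of an OMP_NUM_THREADS helper; the real resource_utils.py
# patch may differ.
import logging
import os


def set_omp_num_threads() -> None:
    # Respect an explicit value if the operator already set one.
    if os.environ.get("OMP_NUM_THREADS"):
        return
    # sched_getaffinity reflects the CPUs the process may run on (Linux only);
    # note that it does not account for cgroup CPU quotas in Kubernetes.
    num_cpus = len(os.sched_getaffinity(0))
    os.environ["OMP_NUM_THREADS"] = str(num_cpus)
    logging.info("The OMP_NUM_THREADS var has been set to %d", num_cpus)
```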
[14:13:05] (CR) Elukey: "Not sure why these are not highlighted in CI, but I got errors while running tox locally." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011129 (owner: Elukey)
[14:15:28] klausman: o/ https://kserve.github.io/website/0.11/admin/kubernetes_deployment/ is interesting. I never really checked it, but IIUC kserve now supports running on Istio alone, without Knative
[14:16:02] not suggesting that we should proceed, but worth keeping in mind
[14:16:16] it would surely simplify the whole architecture
[14:16:27] Interesting indeed.
[14:17:11] Though as the page mentions, some scaling functionality would go away if we dropped Knative.
[14:18:26] I mean, I am all for fewer moving parts, but I don't think we have really run into Knative limitations (or I don't remember...)
[14:19:58] yes, sure, the autoscaling would not be present anymore, since it is provided by Knative
[14:20:06] it would require us to manually set the number of pods
[14:20:13] (PS4) Kevin Bazira: RRLA: upgrade KI from v5 to v6 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742)
[14:20:46] Knative is nice, but it is a big layer to maintain/upgrade/etc.
[14:21:18] it has been painful in the past to make it work correctly, and I think we should check what the autoscaler does atm, since it needed some tuning IIRC
[14:21:27] (stuff comes to mind now that I think about it)
[14:21:37] Maybe one day we'll discover an alternative that does exactly what we need without extra bits.
[14:22:46] HorizontalPodAutoscaler may be something nice, but it requires a metrics server to fetch stuff from
[14:22:52] IIRC there is a task from serviceops
[14:23:01] but it is not as evolved as Knative
[14:25:15] Machine-Learning-Team: Run unit tests for the inference-services repo in CI - https://phabricator.wikimedia.org/T360120 (elukey) NEW
[14:25:27] isaranto: --^ created
[14:26:43] 🙏 thaaanks
[14:27:11] (CR) AikoChou: [C:+1] RRLA: upgrade KI from v5 to v6 (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742) (owner: Kevin Bazira)
[14:38:07] (CR) Kevin Bazira: [C:+2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742) (owner: Kevin Bazira)
[14:46:57] (Merged) jenkins-bot: RRLA: upgrade KI from v5 to v6 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742) (owner: Kevin Bazira)
[14:57:02] elukey: do you remember if we also need to set the OMP_THREAD_LIMIT var?
[14:57:41] I see we set it for revertrisk, but I'm not sure if we need it
[14:58:37] isaranto: I think it is not needed if we set num threads; lemme check the specs for that var
[15:00:54] yeah, I think both are not needed together, but lemme know if you think otherwise
[15:01:25] are we doing the sync with research?
[15:01:31] no, I just asked because I couldn't figure out if we needed it
[15:01:50] > are we doing the sync with research?
[15:01:50] No, since we have the staff meeting
[15:02:14] I see a lot of people declined, super
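On the Knative-less option discussed above: per the kserve docs page linked at 14:15, the raw Kubernetes deployment mode can be selected per InferenceService via an annotation. A hedged sketch using the kserve Python SDK follows; it is purely illustrative (Lift Wing deploys through helm charts, not the SDK), and the service name, namespace, and storage URI are made up.

```python
# Illustrative only: selecting kserve's RawDeployment mode (no Knative) via
# the serving.kserve.io/deploymentMode annotation. Name/namespace/storage_uri
# are hypothetical placeholders.
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
)
from kubernetes import client

isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=client.V1ObjectMeta(
        name="example-model",
        namespace="example-ns",
        annotations={"serving.kserve.io/deploymentMode": "RawDeployment"},
    ),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            # Without Knative there is no request-based autoscaling, so the
            # replica count has to be managed explicitly (or via an HPA).
            min_replicas=2,
            sklearn=V1beta1SKLearnSpec(storage_uri="s3://example-bucket/model"),
        )
    ),
)
KServeClient().create(isvc)
```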
[15:32:49] (CR) Jsn.sherman: [C:+1] "This looks reasonable to me; I'm going to surface this to my team as part of a discussion about instrumentation for our work. Really my on" [extensions/ORES] - https://gerrit.wikimedia.org/r/994194 (https://phabricator.wikimedia.org/T356158) (owner: Kosta Harlan)
[15:51:17] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9631576 (isarantopoulos) I've managed to make it work with a model available on disk (which means no connection to HF repo). The issues I faced were specific to the example...
[16:40:44] I ended up being unable to build images locally anymore
[16:41:03] `OSError: [Errno 28] No space left on device`
[16:41:33] so after pruning all the images: `Total reclaimed space: 184.2GB` 🔄
[16:42:26] Machine-Learning-Team, Observability-Metrics: SLO dashboards for Lift Wing showing unexpected values - https://phabricator.wikimedia.org/T359879#9631756 (elukey)
[16:42:48] :D
[16:42:53] klausman: fyi --^
[16:43:28] while checking some Thanos metrics, I noticed that the evaluation time for the Istio latency SLI is around 30s every time :(
[16:43:34] it is probably due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/989458
[16:43:46] we are crunching too many metrics on the Thanos siee
[16:43:48] *side
[16:44:12] it shouldn't be the cause of the issue that we are seeing, but I'm not sure it is sustainable long term
[16:44:13] Mh, that's hard to address
[16:44:46] Unless we want to rewrite metrics at scrape time, but that is only moving the problem from Thanos to our Prometheus
[16:44:48] we can keep fewer le buckets, to avoid having all of them
[16:45:11] it was around a few seconds in Dec, IIRC
[16:45:17] Do you think reducing the number of buckets would be enough?
[16:45:29] it will improve things for sure
[16:45:48] every le bucket greatly increases the number of metrics that we process in Thanos
[16:46:49] We just have to make sure not to drop +Inf
[16:47:02] yes, that one for sure
[16:47:19] And then up to what? 30s?
[16:47:26] 5s was definitely too low
[16:48:05] 5s was chosen since we wanted to have SLOs for HTTP calls of up to 5 seconds; we can add 30s as well
[16:48:25] more is probably not needed
[16:48:28] There may be interesting buckets in between, lemme check
[16:52:22] So it's (in ms): 0.5 1 5 10 25 50 100 250 500 1000 2500 5000 10000 30000 60000
[16:52:58] I am not sure we care about <100ms, but I could be convinced to start with 50ms.
[16:53:29] so (50?) 100 250 500 1000 2500 5000 10000 30000?
[16:53:43] And +Inf, of course
[16:54:06] That would get us from 20 buckets to 9 or 10
[16:56:08] any specific requirement when you re-added them all?
[16:56:58] We should aim for 7/8 in my opinion
[16:57:02] I think it was mostly that a) we didn't have +Inf at all, and b) a service I was looking at was just over 5s in a non-trivial number of cases, so it was kinda useless
[16:57:28] Thing is, with +Inf we can drop the RR on like 170~172 in the patch you linked
[16:57:33] line*
[16:57:38] sure, +Inf was missing, but going from 4 to 20 buckets was a big jump; this is why I am asking :)
[16:57:41] I'll make a patch with an explanation
[16:58:28] can you please sync in here first? Then we can decide what to do
[17:01:18] (PS3) Ilias Sarantopoulos: resource_utils.py: add a function to automatically set OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011130 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[17:01:30] I think we should have the discussion on the patch, so it's not lost (IRC is hard to search)
[17:02:17] ...
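To make the +Inf point above concrete: histogram buckets are cumulative, and the +Inf bucket carries the total observation count, so quantile and ratio math breaks without it. Below is a toy Python re-implementation of histogram_quantile-style interpolation over cumulative buckets; the counts are made up, and PromQL's edge-case handling differs slightly.

```python
# Toy illustration of why +Inf must be kept: buckets are cumulative (le_ms,
# count), and the +Inf bucket holds the total count. Data is made up.
buckets = [(100, 40), (250, 70), (500, 90), (1000, 97), (float("inf"), 100)]


def quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    total = buckets[-1][1]  # the +Inf bucket count == total observations
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # quantile falls beyond the last finite bucket
            # linear interpolation inside the bucket, as PromQL does
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le


print(quantile(0.95, buckets))  # ~857.1 ms with this toy data
```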
[17:03:00] I mean, some form of reverting (ish) the patch you linked is necessary, no?
[17:03:58] my point is that we can just change the le filter to something less heavy, add the motivation in the commit msg, and submit to observability
[17:04:13] That is what I meant
[17:04:20] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1011146
[17:06:14] sure, but we didn't discuss any way forward
[17:06:17] I just now realized something: the superfluous rule on lines 170-172 meant that we were recording every bucket twice!
[17:06:45] we cannot remove it until we have it in the dashboards
[17:06:56] even if they are not 100% correct now
[17:07:06] Alright, will put it back in
[17:07:26] the other thing to discuss is
[17:07:27] le=~"(50|100|250|500|1000|2500|5000|10000|30000)"
[17:07:33] in my opinion they are too many
[17:07:54] and also, is +Inf accounted for?
[17:08:21] (CR) Ilias Sarantopoulos: [C:+1] "This is a nice idea!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011130 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[17:08:36] I added +Inf in an updated patchset
[17:09:14] I am also having second thoughts about the second rule. We actually need it.
[17:09:41] The first rule has the cumulative latency of the assorted buckets. To do the math right, we need the counts as well.
[17:11:59] for the buckets, we can drop 50ms and 30s, I think
[17:12:15] That would make eight buckets in all.
[17:12:42] also, afaics the "le" label is not present in _count, so it was probably added by mistake (almost surely by me)
[17:13:48] I can drop that as well
[17:14:13] let's do this - we can keep the current list of labels and see how it impacts the evaluation time in Thanos
[17:14:18] having it in the "sum by" doesn't break it, but it's misleading
[17:14:20] if it reaches a decent performance, we may keep them
[17:14:28] yes yes, I agree, you can remove it
[17:14:31] So including 50ms and 30s?
[17:14:47] seems ok for the moment, we can always trim it down
[17:14:52] ack
[17:15:28] oops, need to fix the commit msg
[17:15:45] (PS4) Elukey: resource_utils.py: add a function to automatically set OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011130 (https://phabricator.wikimedia.org/T360111)
[17:16:06] There. Should be ready.
[17:16:20] isaranto: o/ sorry, I rebased the change on top of the first one in the chain; not sure what happened, but when you rebased it got lost
[17:16:48] (PS3) Elukey: readability: set automatically OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011131 (https://phabricator.wikimedia.org/T360111)
[17:17:44] (PS2) Elukey: Fix lint issues highlighted by tox [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011129
[17:18:08] (PS5) Elukey: resource_utils.py: add a function to automatically set OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011130 (https://phabricator.wikimedia.org/T360111)
[17:18:18] (PS4) Elukey: readability: set automatically OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011131 (https://phabricator.wikimedia.org/T360111)
[17:18:57] sry, I just hit the rebase button
[17:19:44] np!
[17:19:57] this is weird: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1011129/2
[17:20:02] I haven't opened my patch yet, as I'm testing different environment variables so that both local and remote models work
[17:20:08] not sure why my local tox complained
[17:20:51] ah yes, it is because these files are probably not in the test image that CI checks
[17:21:00] ahh okok
[17:21:10] ideally we'd have it check all files in the repo
[17:21:20] (CR) Ilias Sarantopoulos: [C:+1] Fix lint issues highlighted by tox [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011129 (owner: Elukey)
[17:21:58] <#
[17:21:59] <4
[17:22:00] Machine-Learning-Team, observability: Istio recording rules for Pyrra and Grizzly - https://phabricator.wikimedia.org/T351390#9631971 (elukey)
[17:22:02] aaaahhhhh
[17:22:04] <3 :D
[17:22:06] Machine-Learning-Team, Observability-Metrics, Patch-For-Review: SLO dashboards for Lift Wing showing unexpected values - https://phabricator.wikimedia.org/T359879#9631972 (elukey)
[17:22:25] Time to put down the keyboard and enjoy the evening? ;)
[17:22:34] in a bit, yes :)
[17:22:48] I'm going afk, folks; I'll finish up the image work tomorrow. If you look at the patch you'll see there isn't much in it, but I've tried a gazillion things in the process: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1009783
[17:22:51] :)
[17:23:02] (CR) Elukey: [C:+2] Fix lint issues highlighted by tox [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011129 (owner: Elukey)
[17:23:05] enjoy your eveing, cu tomorrow!
[17:23:12] *evening/rest of day
[17:23:48] (Merged) jenkins-bot: Fix lint issues highlighted by tox [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011129 (owner: Elukey)
[17:23:54] enjoy your evening, Ilias
[17:25:20] going afk as well, cu tomorrow folks!
[17:27:16] \o heading out as well
[17:35:41] o/
[17:52:21] (CR) AikoChou: [C:+1] resource_utils.py: add a function to automatically set OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011130 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[18:51:07] (CR) AikoChou: "I tested it on stat1008 locally. It seems that the set_omp_num_threads() here was not executed. I didn't see any logs like "The OMP_NUM_TH" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011131 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[18:52:12] logging off! I'll keep working on the error handling for batch prediction tomorrow :)
[19:08:55] (PS1) Umherirrender: Type hint IReadableDatabase in WatchedItemQueryServiceExtension [extensions/ORES] - https://gerrit.wikimedia.org/r/1011172
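One possible explanation for the set_omp_num_threads() log never appearing in AikoChou's test above (an assumption, not a confirmed diagnosis): libomp reads OMP_NUM_THREADS when it is first initialized, so the variable only takes effect if it is set before torch (or any other OpenMP user) is imported anywhere in the process.

```python
# Assumption-based sketch: OMP_NUM_THREADS only takes effect if exported
# before libomp initializes, i.e. before the torch import.
import os

os.environ["OMP_NUM_THREADS"] = "4"  # must happen before importing torch

import torch  # noqa: E402  (libomp picks up the value during initialization)

print(torch.get_num_threads())  # reflects the intra-op thread pool size
```

If the helper runs after the model-server modules have already imported torch, the env var would be set too late, which would match the behavior seen on stat1008.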