[07:41:01] o/ good morning! [07:42:34] o/ [07:47:14] isaranto: if you agree I'd file later on a patch for inference-service to have a shared function that is able to return the number of cpus available in the cgroup [07:47:34] so we can use it to automatically set (if not already provided) OMP_NUM_THREADS and similar [07:47:53] afaics libgomp is not cgroup aware (I may be wrong but can't find anything) [07:48:07] since torch uses it by default, it will be a recurring issue [07:55:08] basically something like https://github.com/catboost/catboost/pull/2519 [08:13:51] (ErrorBudgetBurn) firing: - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:18:41] elukey: sure, that would be awesome! [08:25:55] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ci: refactor imports with isort [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/981718 (owner: 10Ilias Sarantopoulos) [08:35:07] (03Merged) 10jenkins-bot: ci: refactor imports with isort [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/981718 (owner: 10Ilias Sarantopoulos) [08:39:00] (03CR) 10AikoChou: [C: 03+1] ci: add isort to pre-commit to order imports [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/981717 (owner: 10Ilias Sarantopoulos) [08:39:32] * elukey bbl! [08:40:16] morning! [08:40:40] Hey Aiko! 
[08:40:55] (03PS4) 10Ilias Sarantopoulos: ci: add isort to pre-commit to order imports [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/981717 [08:45:13] hey Ilias :D [08:47:59] (03CR) 10Ilias Sarantopoulos: [V: 03+2 C: 03+2] ci: add isort to pre-commit to order imports [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/981717 (owner: 10Ilias Sarantopoulos) [08:51:45] (03PS15) 10Ilias Sarantopoulos: nllb: add cpu optimized version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/980015 (https://phabricator.wikimedia.org/T351740) [08:53:28] (03CR) 10CI reject: [V: 04-1] nllb: add cpu optimized version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/980015 (https://phabricator.wikimedia.org/T351740) (owner: 10Ilias Sarantopoulos) [08:53:57] 10Machine-Learning-Team: Optimize response performance for the article-descriptions model-server - https://phabricator.wikimedia.org/T353127 (10kevinbazira) [08:59:38] oops isort is failing :) [09:00:41] 10Machine-Learning-Team: Optimize response performance for the article-descriptions model-server - https://phabricator.wikimedia.org/T353127 (10kevinbazira) The article-descriptions model-server container hosted on LiftWing uses 1 CPU and 4GB of memory. Here are the results showing the response performance of re... [09:03:01] fixing it now [09:04:28] aiko: if you work on kserve 0.11 for outlink would it be easy for you to also test this change https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/982043? [09:06:15] this change should be able to work with kserve 0.11 when we don't pass a content type header in the request [09:06:51] isaranto: yes I can test that as well [09:07:00] thank youu [09:08:15] I'll work on that later today. Now I'm testing kserve batcher :) [09:17:12] whenever! 
there's no hurry I just opened the patch as WIP as it was the last model server left without validation [09:45:47] (03PS1) 10Ilias Sarantopoulos: ci: fix isort dependencies and add black profile [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982363 [09:46:15] fixed --^ [09:46:32] apologies, should have tested the image as well [09:58:45] (03CR) 10AikoChou: [C: 03+1] ci: fix isort dependencies and add black profile [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982363 (owner: 10Ilias Sarantopoulos) [10:11:03] back! [10:11:15] I got 6 vaccines today so I may be knocked out in the afternoon :D [10:16:39] I've read https://github.com/python/cpython/issues/80235, very interesting [10:16:53] IIUC os.cpu_count() doesn't take into account cgroups, and there is no plan [10:21:56] (ErrorBudgetBurn) resolved: - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:24:53] wow! 6! [10:27:47] (03PS1) 10Ilias Sarantopoulos: test: please disregard [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982369 [10:28:26] elukey: https://github.com/python/cpython/issues/80235#issuecomment-1294982568 seems to be working here in local docker testing [10:28:50] Haven't tested a container with <1 CPU yet [10:29:16] yes but it is not complete [10:29:40] <1cpu gives you 0 cpus back, something we should probably map to 1 for the openmp case. 
[10:29:57] https://facebookmicrosites.github.io/cgroup2/docs/cpu-controller.html [10:30:13] you can also have "max" in there [10:30:38] I'll add some unit tests [10:31:08] I find it both amazing and disappointing that there seems to be no standardised way of doing this [10:32:13] it is an approximation and there is no clear agreement for all use cases, so everybody needs to implement their own code [10:32:32] (03CR) 10CI reject: [V: 04-1] test: please disregard [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982369 (owner: 10Ilias Sarantopoulos) [10:34:19] There is also a comment in there that sets an env var from the requests.cpu. At least that way, we wouldn't have to keep it in sync [10:34:42] but it doesn't count limits [10:34:48] in theory [10:35:03] ah, it all is very messy. [10:35:09] so getting it from cpu.max seems to be a good way to know exactly what is needed [10:35:44] (there are also interesting questions about feedback loops between a container using resources and its limits being adjusted dynamically) [10:42:14] we don't have dynamic limits, never heard of them up to now [10:55:40] * isaranto afk lunch [11:06:06] (03CR) 10Elukey: [V: 03+2 C: 03+2] ci: fix isort dependencies and add black profile [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982363 (owner: 10Ilias Sarantopoulos) [11:17:04] hey folks one question - how do we run unit tests now? [11:17:14] for inference-services [11:23:56] elukey: /me lunch [11:37:46] thanks for letting me know :D [11:37:49] * elukey lunch as well! [11:37:53] oops :) [11:38:23] I had meant to follow up on the CPU/limits question, but what I meant to write has already been said [11:39:01] np I figured :) [11:39:17] I have the code ready with unit tests, once I get how to run them via tox etc.. I'll submit it [11:39:28] long term it may be good to have it on pywmflib [11:52:17] * aiko lunch! 
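For reference, a minimal sketch of the cgroup v2 approach discussed above: read `cpu.max`, treat "max" as uncapped, and clamp sub-1-CPU quotas to 1 so OMP_NUM_THREADS never ends up as 0. The function names and the fallback to `os.cpu_count()` are illustrative, not the actual patch:

```python
import os
from typing import Optional

# cgroup v2 path; assumed, and valid only inside a v2 cgroup hierarchy
CPU_MAX_PATH = "/sys/fs/cgroup/cpu.max"


def parse_cpu_max(content: str) -> Optional[int]:
    """Parse a cgroup v2 cpu.max line: "<quota> <period>" or "max <period>".

    Returns the usable CPU count, flooring the quota/period ratio and
    clamping fractional (<1 CPU) quotas to 1, or None when uncapped.
    """
    quota, period = content.split()
    if quota == "max":
        return None  # no CPU limit set for this cgroup
    return max(1, int(quota) // int(period))


def available_cpus() -> int:
    """Best-effort CPU count: the cgroup quota if capped, else os.cpu_count()."""
    try:
        with open(CPU_MAX_PATH) as f:
            limit = parse_cpu_max(f.read())
    except OSError:
        limit = None  # not running under cgroup v2, or file unreadable
    return limit if limit is not None else (os.cpu_count() or 1)
```

A caller would then do something like `os.environ.setdefault("OMP_NUM_THREADS", str(available_cpus()))` before importing torch, so an explicitly provided value still wins.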
[12:07:45] (03CR) 10Ilias Sarantopoulos: "nooo :) I was testing and I still found issues" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982363 (owner: 10Ilias Sarantopoulos) [12:11:52] and I figured out the reason I hadn't added isort before: there are some weird issues I haven't figured out yet. debugging inside the container to figure it out now [12:39:41] This is at CI/lint time, right? Not runtime [12:41:41] yes ci [12:57:14] isaranto: sorry about https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/982363/! I tested it in localhost and it fixed my issues, so I merged [12:58:06] no worries, I was joking with the "nooo" [12:58:28] isaranto: qq - how do I run pytest for inference services? [12:58:49] (03PS2) 10Ilias Sarantopoulos: test: please disregard [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982369 [13:00:38] when I run pre-commit locally it succeeds and I think it is just a configuration clash for line length [13:00:59] elukey: you can run "pytest test/unit" inside a virtualenv [13:01:08] I've been meaning to add a tox entry for that [13:02:32] actually I'll check cause I have it somewhere. I was trying to add it to run in ci but was getting some issues with virtualenvs so I suggested to revisit when we go to gitlab. 
But I can add it so that we can run it locally at least [13:03:11] it would be super great, I can help if you want [13:03:23] one thing that I noticed running pytest test/unit is that I get [13:03:30] from python.api_utils import get_rest_endpoint_page_summary [13:03:30] E ModuleNotFoundError: No module named 'python' [13:04:13] so I was wondering if pythonpath or similar needed to be tweaked in some way [13:04:19] (03CR) 10CI reject: [V: 04-1] test: please disregard [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982369 (owner: 10Ilias Sarantopoulos) [13:09:54] oh yes, then the top level directory of the repo needs to be added to pythonpath [13:10:13] e.g. `export PYTHONPATH=$PYTHONPATH:.` [13:10:23] lemme finish something and I'll find the branch I have it on [13:10:55] yep yep I ran PYTHONPATH=$(pwd):$PYTHONPATH pytest test/unit/ [13:18:56] (03PS3) 10Ilias Sarantopoulos: test: please disregard [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982369 [13:30:02] (03CR) 10CI reject: [V: 04-1] test: please disregard [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982369 (owner: 10Ilias Sarantopoulos) [13:33:45] (03PS4) 10Ilias Sarantopoulos: test: please disregard [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982369 [13:34:58] (03PS1) 10Ilias Sarantopoulos: ci: run tests [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982396 [13:39:39] (03CR) 10CI reject: [V: 04-1] ci: run tests [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982396 (owner: 10Ilias Sarantopoulos) [13:46:13] (03PS2) 10Ilias Sarantopoulos: ci: run tests [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982396 [13:47:22] (03PS1) 10Elukey: Fix imports in various modules after introducing isort [machinelearning/liftwing/inference-services] - 
10https://gerrit.wikimedia.org/r/982400 [13:47:24] (03PS1) 10Elukey: Add resource_utils shared module [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 [13:48:54] (03PS5) 10Ilias Sarantopoulos: ci: changes required for isort [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982369 [13:49:22] (03CR) 10CI reject: [V: 04-1] Add resource_utils shared module [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 (owner: 10Elukey) [13:50:59] weird, tox is clean in my localhost [13:52:09] (03CR) 10CI reject: [V: 04-1] Fix imports in various modules after introducing isort [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982400 (owner: 10Elukey) [13:52:25] elukey: the issue I have with isort is that I'm getting different results running it in my local env than when running it in docker/ci. I'm seeing the same thing for you [13:52:54] yeah sorry I wanted to help :( [13:54:59] different sorting results sound vaguely i18n-related [13:55:12] bbiab, doc wants me to put away my phone :D [13:55:22] ah you didn't do anything wrong. I do need help [13:56:11] I remembered the hard way why I didn't add isort to pre-commit last year [13:56:41] jumping in a meeting. If we can't resolve it today I suggest we deactivate it again. it is supposed to make our lives easier... [13:59:32] sorry for the mess! 
[14:03:46] nono all good changes bring a bit of trouble :) [14:04:59] the errors that I am getting are: [14:05:00] https://integration.wikimedia.org/ci/job/inference-services-pipeline-revertrisk/403/console [14:05:09] so two CR removals [14:06:07] Fixing /srv/revert_risk_model/model_server/base_model.py [14:06:08] Fixing /srv/revert_risk_model/model_server/batch_model.py [14:06:13] but I don't have them in my repo [14:08:55] no ok I have them, my IDE hates me [14:09:47] the only thing that I can think of is: [14:10:07] - in localhost, for some reason the path of the RR base/batch py modules is not scanned by isort [14:10:51] - in the Docker image it is, for some reason [14:11:30] mmm no, if I change something in base_models.py in my localhost isort complains [14:14:21] another example https://integration.wikimedia.org/ci/job/inference-services-pipeline-revscoring/389/console [14:14:45] they are all related to the python dir [14:15:13] I think that in CI the python dir is considered an external lib [14:15:33] like fastapi/kserve/etc.. 
[14:17:02] and the only one that succeeds is article-descriptions [14:19:57] ahhh ok wait [14:20:05] in the test docker image we don't copy "python" [14:20:15] except from article-descriptions [14:20:25] lemme try one thing [14:23:03] (03PS2) 10Elukey: Fix imports in various modules after introducing isort [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982400 [14:23:05] (03PS2) 10Elukey: Add resource_utils shared module [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 [14:23:07] (03PS1) 10Elukey: article-descriptions: set OMP_NUM_THREADS automatically [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982407 (https://phabricator.wikimedia.org/T343123) [14:23:30] let's see [14:24:45] I'll get a -1 because I missed two blubber files [14:25:37] (03CR) 10CI reject: [V: 04-1] Add resource_utils shared module [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 (owner: 10Elukey) [14:26:01] (03PS3) 10Elukey: Fix imports in various modules after introducing isort [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982400 [14:26:03] (03PS3) 10Elukey: Add resource_utils shared module [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 [14:26:05] (03PS2) 10Elukey: article-descriptions: set OMP_NUM_THREADS automatically [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982407 (https://phabricator.wikimedia.org/T343123) [14:28:18] what I'd like to do in all docker images is just add the whole repo and get done with it [14:28:47] nah I think we can just add what's necessary [14:28:47] https://www.irccloud.com/pastebin/x8y3uNPs/ [14:28:53] otherwise it may become confusing [14:29:26] if you are ok let's start with this [14:29:28] then we can refine [14:29:33] ok , agreed! 
at least keep exactly the same folder structure so there is no difference from local environment to deployment [14:30:01] I see some green lights in CI, this time it should be the good one [14:31:25] (03CR) 10CI reject: [V: 04-1] Add resource_utils shared module [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 (owner: 10Elukey) [14:34:58] 10Machine-Learning-Team, 10Patch-For-Review, 10Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing. - https://phabricator.wikimedia.org/T343123 (10Seddon) >>! In T343123#9395763, @isarantopoulos wrote: > Regarding latency: As... [14:35:05] 10Machine-Learning-Team: Optimize response performance for the article-descriptions model-server - https://phabricator.wikimedia.org/T353127 (10kevinbazira) The next optimization option I explored after increasing CPU and memory resources in the ML sandbox, was code profiling to figure out which parts of the cod... [14:49:38] isaranto: o/ [14:49:38] I have been exploring your suggestions to optimize the response time for the article-descriptions model-server. [14:49:38] so far I have tried the CPU and memory usage resources, a container with 8CPUs and 4GB of memory came closest to our 3s goal: https://phabricator.wikimedia.org/T353127#9398823 [14:49:38] since this is still resource intensive, I went ahead to do some code profiling and both the model loading and prediction methods have a significantly longer execution time. here are the code profiling results, please share your thoughts on the task whenever you get a minute: https://phabricator.wikimedia.org/T353127#9399942 [14:49:38] thanks! [15:00:02] 10Machine-Learning-Team: Test the kserve batcher for Revert Risk LA isvc - https://phabricator.wikimedia.org/T348536 (10achou) Revert Risk LA with kserve batcher is running in staging. ` aikochou@deploy2002:~$ kubectl get pods NAME READY STATUS RE... 
[15:00:10] kevinbazira: nice work! As FYI I am working on https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/982407 (still WIP) to facilitate the setting of OMP_NUM_THREADS [15:00:47] thanks elukey! [15:01:04] hope to merge before I am out [15:01:19] I already have a comment on that patch, but it's a minor nit :) [15:01:22] (03CR) 10Klausman: Add resource_utils shared module (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 (owner: 10Elukey) [15:02:33] nice work kevinbazira will take a look afterwards! [15:02:46] (03CR) 10Elukey: Add resource_utils shared module (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 (owner: 10Elukey) [15:16:22] 10Lift-Wing, 10Machine-Learning-Team, 10Patch-For-Review: Investigate increase p99 latencies in ml-serve-eqiad - https://phabricator.wikimedia.org/T352958 (10isarantopoulos) p:05Unbreak!→03High [15:16:57] 10Lift-Wing, 10Machine-Learning-Team, 10Patch-For-Review: Enforce json payload in existing kserve model servers - https://phabricator.wikimedia.org/T352834 (10isarantopoulos) p:05Unbreak!→03High [15:23:15] 10Machine-Learning-Team, 10SRE Observability (FY2023/2024-Q2): Gap in metrics rendered from Thanos Rules - https://phabricator.wikimedia.org/T352756 (10herron) >>! In T352756#9390492, @elukey wrote: > I tried to compare the various graphs for ores-legacy in https://w.wiki/8QkW > > I am very puzzled about the... 
[15:23:41] 10Machine-Learning-Team, 10serviceops: Bump istio and Cert Manager Docker images to Bullseye - https://phabricator.wikimedia.org/T351933 (10elukey) 05Open→03Resolved a:03elukey [15:24:06] 10Machine-Learning-Team: refactor revertrisk model server to run locally - https://phabricator.wikimedia.org/T352181 (10achou) 05Open→03Resolved [15:24:08] 10Machine-Learning-Team: Refactor inference services repo to allow local runs - https://phabricator.wikimedia.org/T347404 (10achou) [15:29:06] 10Machine-Learning-Team: Optimize response performance for the article-descriptions model-server - https://phabricator.wikimedia.org/T353127 (10isarantopoulos) p:05Triage→03Medium [15:29:46] 10Machine-Learning-Team: Optimize response performance for the article-descriptions model-server - https://phabricator.wikimedia.org/T353127 (10isarantopoulos) [15:30:36] 10Machine-Learning-Team: Reduce default API response fields for article-descriptions model-server - https://phabricator.wikimedia.org/T352959 (10isarantopoulos) 05Open→03Resolved p:05Triage→03Medium [15:30:46] 10Machine-Learning-Team, 10Patch-For-Review, 10Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing. - https://phabricator.wikimedia.org/T343123 (10isarantopoulos) [15:32:15] 10Machine-Learning-Team: Add support for multiple revisions in knowledge-integrity - https://phabricator.wikimedia.org/T352987 (10isarantopoulos) p:05Triage→03Medium [15:42:26] 10Machine-Learning-Team: Investigate LW Consumer lag alerts - https://phabricator.wikimedia.org/T351735 (10isarantopoulos) p:05Triage→03Medium [15:43:40] 10Machine-Learning-Team: Investigate LW Consumer lag alerts - https://phabricator.wikimedia.org/T351735 (10isarantopoulos) [16:02:10] 10Machine-Learning-Team: Investigate LW Consumer lag alerts - https://phabricator.wikimedia.org/T351735 (10isarantopoulos) [16:02:24] klausman: shall we try to drain ml-staging2001? 
[16:02:47] 10Machine-Learning-Team: Investigate LW Consumer lag alerts - https://phabricator.wikimedia.org/T351735 (10isarantopoulos) [16:04:25] elukey: yeah, let's do that [16:06:00] elukey: should I go ahead with cordoning or do you want to pull the trigger? [16:08:35] (03CR) 10Ilias Sarantopoulos: [C: 03+1] Fix imports in various modules after introducing isort [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982400 (owner: 10Elukey) [16:10:48] (03CR) 10Klausman: Add resource_utils shared module (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 (owner: 10Elukey) [16:10:55] (03CR) 10Klausman: [C: 03+1] Add resource_utils shared module [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 (owner: 10Elukey) [16:11:13] elukey: I'm still getting errors with this patch for isort locally but ¯\_(ツ)_/¯ [16:12:16] klausman: go ahead [16:12:24] ack [16:12:37] isaranto: what errors are you getting? [16:12:57] I'm getting some fixes from isort [16:13:21] but I'd go with what CI says and I'll figure out how to work locally [16:13:33] :( okok [16:13:34] * isaranto bbiab [16:13:39] (03CR) 10Elukey: [C: 03+2] Fix imports in various modules after introducing isort [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982400 (owner: 10Elukey) [16:20:14] elukey: 2001 is drained, but there are some pods now pending for >7m [16:20:31] Not sure yet why they are pending. 
2002 doesn't seem super busy or low on memory [16:21:03] Maybe the controller is waiting for the disruption budget to refill [16:21:58] checking [16:23:22] klausman: check kubectl get events on those namespaces [16:24:35] HTTP lifecycle hook (/wait-for-drain) for Container "kserve-container" in Pod "revertrisk-multilingual-predictor-default-00019-deploymenth7bv5_revertrisk(229782cb-fcff-4c78-9ef8-f848fd6756bb)" failed - error: Get "http://10.194.61.222:8022/wait-for-drain": EOF, message: "" [16:24:58] look for FailedScheduling [16:24:59] mh, that's a different pod than the pending one [16:25:17] Ah, not enough CPU. [16:25:30] Which is kinda odd. The machine is basically idle [16:25:56] (03Abandoned) 10Ilias Sarantopoulos: ci: changes required for isort [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982369 (owner: 10Ilias Sarantopoulos) [16:26:21] (03Merged) 10jenkins-bot: Fix imports in various modules after introducing isort [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982400 (owner: 10Elukey) [16:26:30] klausman: check `kubectl describe nodes` [16:26:43] what counts is requests/limits [16:26:47] and what we allocate [16:27:08] Ah, so even if the machine is idle, if the reservation reaches what it can handle... [16:27:26] So uncordon and we'll just have to sync with dcops for the GPU install? [16:28:01] we can proceed in my opinion, if we ping papaul in the task and/or on #wikimedia-dcops maybe they are able to do it today [16:28:13] ok, will do [16:28:15] if not, we can uncordon tomorrow morning if anybody needs to test etc.. [16:28:56] Which is the ticket we were coordinating in? 
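A toy illustration (made-up millicore numbers) of the point above: the scheduler compares the sum of the pods' CPU *requests* against the node's allocatable CPU, not against actual utilization, so an idle node can still refuse a new pod:

```python
from typing import List


def fits(allocatable_mcpu: int, requested_mcpu: List[int], new_pod_mcpu: int) -> bool:
    """Simplified version of the scheduler's CPU check: a pod fits only if
    its request fits under the node's allocatable total, regardless of how
    busy the node actually is."""
    return sum(requested_mcpu) + new_pod_mcpu <= allocatable_mcpu


# Hypothetical node with 8 allocatable cores: the second pod set leaves
# room for a 1-core request, the first does not, even if all pods are idle.
print(fits(8000, [4000, 3500], 1000))  # False: 8500m requested > 8000m
print(fits(8000, [4000, 3000], 1000))  # True: exactly 8000m requested
```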
[16:29:23] (03PS16) 10Ilias Sarantopoulos: nllb: add cpu optimized version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/980015 (https://phabricator.wikimedia.org/T351740) [16:29:50] klausman: I am not going to answer from now on :) [16:29:51] https://phabricator.wikimedia.org/T348118 probably [16:30:37] (remember also to downtime + shutdown) [16:31:15] (03PS3) 10Ilias Sarantopoulos: ci: run tests [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982396 [16:32:03] (03CR) 10CI reject: [V: 04-1] nllb: add cpu optimized version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/980015 (https://phabricator.wikimedia.org/T351740) (owner: 10Ilias Sarantopoulos) [16:32:07] and done. [16:33:36] (03PS4) 10Elukey: Add resource_utils shared module [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 [16:33:38] (03PS3) 10Elukey: article-descriptions: set OMP_NUM_THREADS automatically [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982407 (https://phabricator.wikimedia.org/T343123) [16:34:08] And 24h downtime is in [16:36:39] (03CR) 10Elukey: Add resource_utils shared module (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 (owner: 10Elukey) [16:39:06] wow the GPU should land in the next hour :) [16:39:25] klausman: tomorrow if everything goes as planned we can add the config together [16:39:41] one thing worth asking in the task - we should verify if two GPUs fit, while we are at it [16:39:53] and if they can take pictures etc.. [16:39:59] (03CR) 10CI reject: [V: 04-1] Add resource_utils shared module [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 (owner: 10Elukey) [16:40:06] so we have them saved and we keep them for future reference [16:40:09] wdyt? 
[16:40:49] Oh, right, will add a comment [16:42:00] Jenn has replied that she will start working on it in 1-2h [16:42:27] (03PS17) 10Ilias Sarantopoulos: nllb: add cpu optimized version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/980015 (https://phabricator.wikimedia.org/T351740) [16:42:59] (03PS5) 10Elukey: Add resource_utils shared module [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 [16:43:01] (03PS4) 10Elukey: article-descriptions: set OMP_NUM_THREADS automatically [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982407 (https://phabricator.wikimedia.org/T343123) [16:43:03] (03PS1) 10Elukey: blubber: add the transformer dir to outlink's transformer image [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982433 [16:43:12] Wow, great! [16:43:19] (03CR) 10CI reject: [V: 04-1] nllb: add cpu optimized version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/980015 (https://phabricator.wikimedia.org/T351740) (owner: 10Ilias Sarantopoulos) [16:43:57] (03CR) 10CI reject: [V: 04-1] blubber: add the transformer dir to outlink's transformer image [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982433 (owner: 10Elukey) [16:45:12] (03CR) 10Ilias Sarantopoulos: [C: 03+1] Add resource_utils shared module [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 (owner: 10Elukey) [16:45:35] (03CR) 10Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982396 (owner: 10Ilias Sarantopoulos) [16:47:51] isaranto: so outlink seems the last one complaining, I filed https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/982433 but it is wrong, we already copy "transformer" [16:48:18] but it wants [16:48:19] import pytest [16:48:19] + [16:48:19] from transformer import 
OutlinkTransformer [16:48:41] ah ok because it sees it as a separate module [16:49:03] (03CR) 10CI reject: [V: 04-1] Add resource_utils shared module [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 (owner: 10Elukey) [16:50:25] ah we have "transformer.py" and transformer dir, in the same spot [16:50:28] ufff [16:51:38] (03PS18) 10Ilias Sarantopoulos: nllb: add cpu optimized version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/980015 (https://phabricator.wikimedia.org/T351740) [16:58:12] (03CR) 10Elukey: [C: 04-1] "This is clearly not right, we need to figure out where to place the outlink's test in a way that we have the same import structure (CI vs " [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982433 (owner: 10Elukey) [17:01:42] ok so it seems that isort understands local vs third party modules from the environment and the directory structure, so unless we have the same structure in the repo and the docker image we are always going to have inconsistencies when pre-commit runs locally [17:02:53] yeah but we also have a weird test structure for outlink - it is not under tests/unit, but in the same dir [17:03:22] I tried to move it, but then when I try to import the transformer module I get into an issue - the main "outlink-topic-model" dir should have underscores :D [17:03:50] so I think that isort is surfacing some inconsistencies that have crept in recently [17:06:38] I suggest we deactivate it (leave it there but make the step manual if required) and reassess when we refactor the repo and the images a bit more. wdyt? 
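For the record, one environment-independent mitigation for the classification problem described above (untested against this repo) is to pin isort's first-party list in config instead of relying on path scanning. `profile` and `known_first_party` are real isort settings; the directory names below are assumptions based on the dirs mentioned in the discussion:

```toml
# pyproject.toml -- a sketch, not the repo's actual configuration
[tool.isort]
profile = "black"
# Classify the repo's shared code as first-party explicitly, so sorting
# results don't depend on which packages happen to be importable in the
# current environment (local venv vs the Docker/CI image).
known_first_party = ["python"]
```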
[17:07:11] sure, it is a bit sad, but more work is needed :( [17:07:21] for example as you mentioned, for the outlink dir to be treated like a python module it needs to have underscores [17:07:54] yeah, probably if we move it to the new structure that you proposed for testing locally it will work fine [17:08:07] it is not urgent, let's disable and open a task [17:08:11] sorry, I opened this can of worms but it is best we focus on other things [17:08:30] nono I think we discovered good things, it was sort-of a spike :) [17:08:30] I think I could end up spending the whole month working on things like this [17:08:40] let's open a task with everything summarized [17:08:48] actually rabbit hole not can of worms [17:10:24] (03PS1) 10Ilias Sarantopoulos: ci: disable isort [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982440 [17:10:52] (03CR) 10Elukey: [C: 03+1] ci: disable isort [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982440 (owner: 10Ilias Sarantopoulos) [17:11:30] (03CR) 10Ilias Sarantopoulos: [C: 03+2] "Rebased and merged" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/980015 (https://phabricator.wikimedia.org/T351740) (owner: 10Ilias Sarantopoulos) [17:12:31] (03Merged) 10jenkins-bot: nllb: add cpu optimized version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/980015 (https://phabricator.wikimedia.org/T351740) (owner: 10Ilias Sarantopoulos) [17:13:22] sorry the test_transformer.py for outlink was created a long time ago when there were no tests/unit. 
I will look into that and see how we can solve it [17:14:47] 10Machine-Learning-Team, 10ORES: Merge ORES precaching with ORESFetchScoreJob - https://phabricator.wikimedia.org/T201868 (10Ottomata) [17:14:50] and make things more consistent [17:15:34] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10Data-Engineering-Icebox, 10Event-Platform, 10Platform Team Initiatives (Modern Event Platform (TEC2)): ORES hook integration with EventBus - https://phabricator.wikimedia.org/T201869 (10Ottomata) 05Open→03Declined I don't think there is any plan... [17:15:46] 10Machine-Learning-Team: Enable isort in CI for inference-services repo - https://phabricator.wikimedia.org/T353281 (10isarantopoulos) [17:16:33] bye folks! logging off earlier today for an apartment viewing :) [17:16:45] aiko: no need to apologize, it is not your fault (and this wasn't the only issue we had). please don't spend time now we can focus on other things (unless you like to do it out of interest etc) [17:16:56] good afternoon! best of luck <3 [17:17:20] aiko: you added tests, absolutely no need to apologize, the contrary! I think we can just move outlink to the new dir structure for local testing, and we'll be done! [17:17:23] good luck! 
[17:21:22] (03PS4) 10Ilias Sarantopoulos: ci: run tests [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982396 [17:22:29] (03CR) 10Ilias Sarantopoulos: [V: 03+2 C: 03+2] ci: disable isort [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982440 (owner: 10Ilias Sarantopoulos) [17:22:35] (03PS2) 10Ilias Sarantopoulos: ci: disable isort [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982440 [17:22:40] (03PS6) 10Elukey: Add resource_utils shared module [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 [17:22:42] (03PS5) 10Elukey: article-descriptions: set OMP_NUM_THREADS automatically [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982407 (https://phabricator.wikimedia.org/T343123) [17:22:45] (03CR) 10Ilias Sarantopoulos: ci: disable isort [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982440 (owner: 10Ilias Sarantopoulos) [17:22:53] (03CR) 10Ilias Sarantopoulos: [V: 03+2 C: 03+2] ci: disable isort [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982440 (owner: 10Ilias Sarantopoulos) [17:23:10] (03PS7) 10Elukey: Add resource_utils shared module [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 [17:23:43] (03PS8) 10Ilias Sarantopoulos: Add resource_utils shared module [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982401 (owner: 10Elukey) [17:24:43] merged and rebased the open patches [17:25:07] (03PS5) 10Ilias Sarantopoulos: ci: run tests [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982396 [17:25:33] going afk! Have a nice rest of the day folks! 
[17:25:56] (03PS6) 10Ilias Sarantopoulos: article-descriptions: set OMP_NUM_THREADS automatically [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/982407 (https://phabricator.wikimedia.org/T343123) (owner: 10Elukey) [17:29:17] I'll be out for dinner in a bit, but will followup with Jen about the GPU [17:31:03] 10Lift-Wing, 10Machine-Learning-Team, 10Patch-For-Review: Investigate increase p99 latencies in ml-serve-eqiad - https://phabricator.wikimedia.org/T352958 (10isarantopoulos) We discussed on IRC that it wouldn't be a good idea to add custom configuration for these redirects as done in the [[ https://gerrit.wi... [17:31:18] ciao folks, good evening! [18:36:05] 10Machine-Learning-Team: Optimize response performance for the article-descriptions model-server - https://phabricator.wikimedia.org/T353127 (10isarantopoulos) Great work Kevin and thorough results! The profiling helps a lot to understand what is going on. So the issue is in the predict function. For things that... [18:36:15] going afk as well, cu tomorrow! [21:41:21] ml-staging2001 ~ $ lspci |grep Instinct [21:41:23] da:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Arcturus GL-XL [AMD Instinct MI100] (rev 01) [21:41:25] \o/ [21:41:50] Also, ml2001 uncordoned [21:42:02] Now going to do something else :D