[06:52:45] Mooorning ☀️
[10:13:00] Machine-Learning-Team, ORES: Inconsistent data type for articlequality score predictions on ptwiki - https://phabricator.wikimedia.org/T358953#9610360 (isarantopoulos) a: isarantopoulos
[10:20:19] (PS1) Ilias Sarantopoulos: ores-legacy: fix mixed boolean and string field [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009470 (https://phabricator.wikimedia.org/T358953)
[10:25:32] Machine-Learning-Team: Revert Risk multi-lingual model performance and reliability may need a review - https://phabricator.wikimedia.org/T340822#9610393 (klausman) Open→Resolved
[10:25:34] Machine-Learning-Team, Epic: Lift Wing improvements to get out of MVP state - https://phabricator.wikimedia.org/T333453#9610394 (klausman)
[10:50:56] (PS2) Ilias Sarantopoulos: ores-legacy: fix mixed boolean and string field [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009470 (https://phabricator.wikimedia.org/T358953)
[10:51:33] Machine-Learning-Team: Add Dragonfly to the ML k8s clusters - https://phabricator.wikimedia.org/T359416#9610511 (JMeybohm) I think it's fine to use the existing supernodes. They act as coordinators only, so there is not much load or network traffic even during mw-deployments.
[11:35:22] Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742#9610674 (kevinbazira) Following the workflow we use to build LiftWing model-servers, which involves installing pip dependencies listed in the requirements.txt...
[12:17:07] Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742#9610855 (isarantopoulos) This happens because the kserve repository is not a python package and as the error message tells us there is no setup.py file. The py...
[12:18:38] kevinbazira: o/
[12:19:00] isaranto: o/
[12:19:15] you can try the above in order to install kserve from a specific commit. let me know if it works!
[12:19:37] it should also work with older pip versions (I saw it existed in old documentation as well)
[12:21:55] okok let me give it a shot
[12:28:55] Morning all
[12:29:20] Hey Chris!
[12:31:00] I had a doc appt early this morning and the doctor was sick
[12:31:04] how meta is that?
[12:31:51] My physio had to skip one of my appointments because she had sprained her knee while skiing :)
[12:31:59] I guess it would be more meta if I was sick and needed to go to the doctor but the doctor was sick
[12:32:04] ouch :)
[12:32:21] we need to take good care of the doctors of this world <3
[12:32:24] We had a very "do as I say, don't do as I do" kinda convo after :)
[12:32:48] haha, totally!
[12:33:26] There is a saying in German: "The shoemaker's children go barefoot". I feel that, for a lot of professions, it is true :)
[12:33:42] so your physio had to do physiotherapy
[12:33:42] Go ask SREs if they back up everything ;)
[13:26:26] At the airport, sigh
[13:26:46] I wish you safe and low-stress travel, Chris
[13:30:04] have a safe flight!
[13:41:15] Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742#9611359 (kevinbazira) Thanks @isarantopoulos, earlier on I was missing the `python/kserve` subdirectory. After changing: ` kserve @ git+https://github.com/kser...
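For reference, the pip mechanism under discussion: a package that lives in a subdirectory of a git repository can be pinned to a commit with a `#subdirectory=` URL fragment in a direct reference. A minimal requirements.txt sketch — the commit hash below is a placeholder, not the one actually used on the task:

```
# Install kserve from a specific commit of the kserve monorepo. The Python
# package lives in the python/kserve subdirectory, so pip needs the
# subdirectory fragment. The commit hash here is a placeholder.
kserve @ git+https://github.com/kserve/kserve.git@0123456789abcdef0123456789abcdef01234567#subdirectory=python/kserve
```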
[13:41:25] hello folks
[13:43:51] Hey Luca :)
[13:44:00] (CR) Elukey: [C: +1] "Looks good!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009312 (https://phabricator.wikimedia.org/T356045) (owner: Ilias Sarantopoulos)
[14:07:41] Hey Luca!
[14:16:36] I'm merging the above patch so we can check the build, ok?
[14:17:07] tbh it was a rhetorical question :) wanted to show my intentions
[14:17:29] (CR) Ilias Sarantopoulos: [C: +2] revertrisk-multilingual: add extra index for torch rocm [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009312 (https://phabricator.wikimedia.org/T356045) (owner: Ilias Sarantopoulos)
[14:17:53] isaranto: definitely yes
[14:26:50] (Merged) jenkins-bot: revertrisk-multilingual: add extra index for torch rocm [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009312 (https://phabricator.wikimedia.org/T356045) (owner: Ilias Sarantopoulos)
[14:42:05] Machine-Learning-Team, ORES, Patch-For-Review: Inconsistent data type for articlequality score predictions on ptwiki - https://phabricator.wikimedia.org/T358953#9611709 (isarantopoulos)
[14:58:58] Machine-Learning-Team: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9611793 (isarantopoulos) I'm proceeding with creating a blubber image instead of going to production-images and handling dependencies in there. After running what upstream provides locally, I saw t...
[15:01:11] going afk folks, will check what happens with the image later o/
[15:01:31] o/
[15:54:03] elukey: the prometheus-thanos thing is very odd.
[15:54:43] sle
[15:54:45] err sorry
[15:55:02] something must have changed, but I didn't find anything, hopefully Keith knows
[15:55:22] maybe they consolidated a new label for thanos rules, it would make sense
[15:55:47] those time series are pulled from thanos, not prometheus-mlserve (seems more consistent)
[15:55:53] Yeah, I feel like dropping the information that the prom label used to carry is not really a good idea, so something else must be happening.
[15:56:12] what do you mean?
[15:56:40] The information "what prometheus instance collected this data in the first place" is useful to have
[15:56:56] I am not sure that the label indicates that, this is my point
[15:57:04] I read it as where the data is pulled from
[15:57:46] So out of curiosity, I queried `(sum_over_time(istio_sli_latency_request_duration_milliseconds_bucket:increase5m{site=~"eqiad", le="100", response_code=~"2..", destination_service_namespace="recommendation-api-ng", destination_canonical_service="recommendation-api-ng-main"}[90d]))` on Thanos and I get two results
[15:58:04] one with prom="k8s-mlserve" and one with prom="thanos-rule"
[15:58:23] With [1d] it's only the thanos one.
[15:59:04] So I guess computed metrics will always be labeled as coming from Thanos. Hm. ok. That makes a certain amount of sense
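A side note on the duplicate results above: if the two series really do differ only in the prom label, one query-time workaround is to aggregate that label away. A sketch reusing the metric and matchers from Klausman's query; using max rather than sum is a deliberate choice so the value is not double counted if both series carry the same data. Whether dropping the label is actually the right call is exactly the open question for Keith:

```
# Collapse series that differ only in prom= (k8s-mlserve vs thanos-rule)
# into a single result; max avoids double counting duplicated data.
max without (prom) (
  sum_over_time(
    istio_sli_latency_request_duration_milliseconds_bucket:increase5m{
      site=~"eqiad", le="100", response_code=~"2..",
      destination_service_namespace="recommendation-api-ng",
      destination_canonical_service="recommendation-api-ng-main"
    }[90d]
  )
)
```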
[15:59:35] I suspect Thanos could be computing metrics from more than one prom instance, and then the field becomes entirely meaningless (it might even sabotage computation)
[16:00:10] https://phabricator.wikimedia.org/rOPUP0ce254b6eec08b2e5af3164c42e8d345e3152619 may be the source of the label
[16:00:40] but I'll defer to Keith, maybe I'll ask a specific question about whether we should keep the prometheus label or not
[16:00:52] as you pointed out it may be relevant
[16:00:56] ah, scope= would be the equivalent of the old prom= label, I guess
[16:01:35] wer site=
[16:02:33] asked the question in the code review :)
[16:03:05] :+1:
[16:03:09] thanks for the review!
[16:03:11] also +1 on the Dragonfly change
[16:03:22] yw <3
[16:09:06] (CR) Kevin Bazira: [C: +1] ores-legacy: fix mixed boolean and string field [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009470 (https://phabricator.wikimedia.org/T358953) (owner: Ilias Sarantopoulos)
[16:33:19] ok so after a quick check on the rocm libs, I think that (at least up to 5.4.0) symbols are added (Tobias raised this point previously) and rocblas, the major disk space consumer, has files related to all the GPUs supported (of course)
[16:34:03] the filenames have clear gfx_number names so it may be ok to just drop the GPUs that we don't have, but it feels really dirty
[16:35:37] now I am curious to know what layers we have in https://docker-registry.wikimedia.org/amd-gpu-tester/tags/
[16:35:49] but we installed tensorflow in there
[16:35:50] ah, yes, dropping unneeded GPU support was something I'd hoped we could do
[16:38:26] I didn't find any way to do it at build time though
[16:38:38] and also building pytorch on our own would be really painful
[16:39:30] Could we derive an image from the built one and just rm -f the unneeded drivers?
[16:40:23] we could have a build image in which we do all the horrors that we want (not paying attention to layers etc..) and then the final image would just copy what's needed
[16:40:51] I don't love the solution since it can break anytime :(
[16:40:54] Yeah, that is likely preferable to building pytorch.
[16:41:36] And yeah, the fragility of the whole thing is a major downside
[16:42:03] I am reading https://docs.amd.com/en/docs-5.0.0/how_to/pytorch_install/pytorch_install.html and there seems to be a PYTORCH_ROCM_ARCH
[16:44:57] ok so something interesting
[16:45:00] I think the MI100 is arch 908
[16:45:12] docker-registry.wikimedia.org/amd-gpu-tester:0.0.9-1 has a layer of ~8G
[16:45:26] it was built via docker-pkg, so in theory it bypasses nginx
[16:45:42] so the base image idea may work, if serviceops is ok
[16:46:00] we can work on reducing pytorch etc..
[16:50:48] heya! came back for a bit!
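A sketch of the two-stage idea discussed above: install torch in a throwaway build stage, delete the rocblas kernel files for GPU architectures we don't own, and copy only the result into the final image. This is a plain-Dockerfile illustration (the real setup would go through blubber); the base image names, install path, and index URL are assumptions, and as noted in the conversation this kind of pruning can silently break across ROCm releases:

```
# Hypothetical multi-stage build; image names and paths are illustrative.
FROM docker-registry.wikimedia.org/python3-build AS build
RUN pip install --target /opt/torch \
        --extra-index-url https://download.pytorch.org/whl/rocm5.4.2 torch
# rocblas ships one set of kernel files per supported GPU arch; keep only
# gfx908 (MI100) and gfx90a (MI210), drop the rest. Fragile across releases.
RUN find /opt/torch -path '*rocblas/library*' -name '*gfx*' \
        ! -name '*gfx908*' ! -name '*gfx90a*' -delete

FROM docker-registry.wikimedia.org/python3
# Only the pruned site-packages tree ends up in the final image's layers.
COPY --from=build /opt/torch /opt/torch
ENV PYTHONPATH=/opt/torch
```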
[16:52:43] o/
[16:52:56] still can't figure out though why docker-pkg bypasses nginx
[16:53:13] I was checking the rocm/pytorch images on dockerhub and they are all huge
[16:53:28] https://hub.docker.com/r/rocm/pytorch/tags
[16:53:46] yes it makes sense, they offer a torch version supporting all architectures
[16:54:10] in our case, we already have MI100 and MI210, plus in theory the old 16GB one
[16:54:49] My guess is that the 100 is 908 and the 200 is 90a
[16:55:00] Or they're both 908
[16:55:09] I think they are separate yes
[16:55:46] so looking at the layers described here https://hub.docker.com/layers/rocm/pytorch/latest/images/sha256-cfc5bfe46ad5d487ef9a928f50d1f2ff0941b724a6978f6d6350d13ce2c6ca88?context=explore
[16:55:46] I guess we can define PYTORCH_ROCM_ARCH as you mentioned elukey: only for our gpus
[16:56:02] IIUC it is meant for one single arch :(
[16:56:16] but worth a try for sure
[16:56:18] maybe 908:90a would work
[16:56:34] (plus `gfx`)
[16:57:37] the example in the link has this `ENV PYTORCH_ROCM_ARCH=gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx940;gfx941;gfx942`
[16:57:38] so perhaps `ENV PYTORCH_ROCM_ARCH=gfx908;gfx90a` might work (?)
[16:58:16] ahhh lovely
[17:00:35] https://lernapparat.de/pytorch-rocm is also nice
[17:00:46] I think at this point I can open a task only for this
[17:00:54] so we don't pollute Aiko's task
[17:03:36] ack!
[17:04:02] isaranto: if we have a base image with torch installed somewhere, would it help with your HF one?
[17:04:02] a nice resource, I think you shared it a while ago but I had forgotten about it
[17:08:24] (PS3) Ilias Sarantopoulos: ores-legacy: fix mixed boolean and string field [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009470 (https://phabricator.wikimedia.org/T358953)
[17:09:09] I rebased the above patch. I'll add some httpbb tests for ores-legacy as they are missing
[17:11:05] Machine-Learning-Team, ORES, Patch-For-Review: Inconsistent data type for articlequality score predictions on ptwiki - https://phabricator.wikimedia.org/T358953#9612363 (isarantopoulos) The attached patch solves the issue. I will deploy it to staging and add some httpbb tests that capture this behavi...
[17:11:33] going afk for the evening, cu tomorrow folks o/
[17:13:12] Machine-Learning-Team: Investigate if it is possible to reduce torch's package size - https://phabricator.wikimedia.org/T359569 (elukey) NEW
[17:13:16] opened --^
[17:13:18] me too o/
[17:22:48] \o
[17:23:04] I'll hack away a bit at the pytorch build process, see if I can figure out more space savings
[22:25:42] (CR) He7d3r: [C: +1] ores-legacy: fix mixed boolean and string field [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009470 (https://phabricator.wikimedia.org/T358953) (owner: Ilias Sarantopoulos)
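The actual fix for the mixed boolean/string field is in the Gerrit change above; as a general illustration of the bug class (a model returning a boolean prediction where other models return strings, so clients see inconsistent types), here is a minimal Python sketch. The field name, helper name, and coercion rule are hypothetical, not taken from the patch:

```python
# Hypothetical sketch; the real patch lives in Gerrit change 1009470.
from typing import Any, Dict


def normalize_prediction(score: Dict[str, Any]) -> Dict[str, Any]:
    """Coerce a boolean prediction to a lowercase string, so the
    prediction field always has a single data type in responses."""
    prediction = score.get("prediction")
    if isinstance(prediction, bool):
        score["prediction"] = str(prediction).lower()  # True -> "true"
    return score


# Example: a binary classifier answering with a boolean instead of a string.
assert normalize_prediction({"prediction": False})["prediction"] == "false"
```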