[06:24:02] Good morning \o/
[09:12:27] (PS5) Ilias Sarantopoulos: revertrisk: remove obsolete step from README [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009723
[09:12:36] (PS7) Ilias Sarantopoulos: huggingface: add huggingface image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986)
[09:21:14] (CR) Kevin Bazira: huggingface: add huggingface image (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos)
[09:25:26] morning Ilias!
[09:35:31] Machine-Learning-Team: Error handling in Batch Predictions for RevertRisk Models - https://phabricator.wikimedia.org/T360406 (achou) NEW
[09:36:07] Machine-Learning-Team: Deploy RR-language-agnostic batch version to prod - https://phabricator.wikimedia.org/T358744#9641150 (achou)
[09:36:09] Machine-Learning-Team: Error handling in Batch Predictions for RevertRisk Models - https://phabricator.wikimedia.org/T360406#9641149 (achou)
[09:41:04] hey Aiko!
[10:21:15] Mornin'!
[10:26:35] o/ Tobias!
[10:29:38] Machine-Learning-Team: Support building and running of articletopic-outlink model-server via Makefile - https://phabricator.wikimedia.org/T360177#9641327 (achou) >>! In T360177#9636947, @kevinbazira wrote: > Running into the error below which is caused by a missing `events` module. This module is used to [[...
[11:38:03] (CR) Ilias Sarantopoulos: huggingface: add huggingface image (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos)
[11:46:54] (PS1) Kevin Bazira: articletopic-outlink: run model server as python module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012647 (https://phabricator.wikimedia.org/T360177)
[11:48:41] (CR) CI reject: [V:-1] articletopic-outlink: run model server as python module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012647 (https://phabricator.wikimedia.org/T360177) (owner: Kevin Bazira)
[11:55:08] (PS8) Ilias Sarantopoulos: huggingface: add huggingface image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986)
[11:58:11] (CR) Ilias Sarantopoulos: huggingface: add huggingface image (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos)
[11:58:54] (PS9) Ilias Sarantopoulos: huggingface: add huggingface image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986)
[11:59:09] (CR) Ilias Sarantopoulos: [V:+2 C:+2] revertrisk: remove obsolete step from README [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009723 (owner: Ilias Sarantopoulos)
[12:00:38] Machine-Learning-Team: Deploy RevertRisk language-agnostic with knowledge integrity v0.6.0 - https://phabricator.wikimedia.org/T360423 (achou) NEW
[12:02:17] Machine-Learning-Team, Patch-For-Review: Support building and running of articletopic-outlink model-server via Makefile - https://phabricator.wikimedia.org/T360177#9641641 (kevinbazira) @achou I figured out a way to run both the transformer and predictor locally in the same container. All we have to do i...
[12:05:41] Machine-Learning-Team: Deploy RevertRisk language-agnostic with knowledge integrity v0.6.0 - https://phabricator.wikimedia.org/T360423#9641644 (achou) It would be nice to wait for an additional [[ https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1011305 | patch ]] (improving err...
[12:06:37] fyi: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)#AI_for_WP_guidelines/_policies
[12:08:33] (PS10) Ilias Sarantopoulos: huggingface: add huggingface image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986)
[12:13:25] (CR) CI reject: [V:-1] huggingface: add huggingface image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos)
[12:14:26] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos)
[12:14:33] * isaranto lunch!
[12:31:00] Morning all
[12:31:52] morning Chris o/
[12:49:47] (PS2) Kevin Bazira: articletopic-outlink: run model server as python module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012647 (https://phabricator.wikimedia.org/T360177)
[12:51:22] (CR) Kevin Bazira: "To make it easier to review this patch, I added instructions to the README file which I used to test the articletopic-outlink model server" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012647 (https://phabricator.wikimedia.org/T360177) (owner: Kevin Bazira)
[13:05:19] (CR) Kevin Bazira: huggingface: add huggingface image (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos)
[13:06:43] Hey Chris o/
[13:12:06] Machine-Learning-Team, Epic: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines - https://phabricator.wikimedia.org/T360428 (klausman) NEW
[13:14:52] hello folks
[13:19:59] elukey: \o Ok for me to submit my Prom change for the buckets?
[13:20:06] (PS4) Elukey: readability: add entrypoint to set environment variables [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012398 (https://phabricator.wikimedia.org/T360111)
[13:20:21] klausman: o/ ok for me, but you should also verify with observability
[13:20:39] hey Luca
[13:20:53] hello hello
[13:21:28] elukey: sure
[13:25:59] I also have a weird linter failure for an unrelated deployment charts change: https://integration.wikimedia.org/ci/job/helm-lint/16202/console
[13:27:15] Mh, I think it might have been an indentation problem (it's always indentation)
[13:35:50] kevinbazira: o/
[13:36:09] isaranto: o/
[13:36:38] regarding the hf image: did you run the docker image? the current patch is intended to run on LW through the image, not locally
[13:36:54] elukey: bucket change is all merged
[13:37:08] if you want to run it locally you'll have to git clone the kserve repo in the same directory (in the same way that happens in the image)
[13:37:31] klausman: ack thanks!
[13:38:40] isaranto: I did clone the kserve repo. I've shared my workflow in a comment I added to the patch.
[13:41:55] hiii Luca!
[13:43:18] yes I'm just discussing here for speed.
[13:44:32] what is your pip version?
[13:45:56] I am using pip 20.3.4 with python 3.9
[13:46:02] the patch is not intended to run locally, although for me the instructions that you sent work fine
[13:46:15] try upgrading pip to a newer version and check again
[13:46:39] v20 is quite old and I think there have been changes for editable installs
[13:48:56] sure sure let me upgrade
[13:50:53] folks not sure if you saw https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1012398, it is a proposal that we may apply to all model servers for simplicity
[13:50:58] not sure if you like it or not
[13:51:05] not urgent, check anytime :)
[13:52:01] it is a hack but I don't feel so bad about it
[13:52:27] and it should, in theory, remove all concerns
[13:53:20] elukey: one quick question I have: I see the Python code to determine CPU count. I presume nproc (from coreutils) gives the wrong value?
[13:54:15] (if we don't know, I'm fine with the Py code, just curiosity)
[13:55:02] looking at it now. I think it is a good idea (no hack at all). All other alternatives would end up being more hacky (run this in an init container or similar)
[13:56:57] (CR) Kevin Bazira: [C:+1] huggingface: add huggingface image (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos)
[13:58:29] klausman: so the python code that I call in the script is the one that should be cgroupsv2 aware, so once it runs in a container it picks up the various sysfs files and processes them etc.. IIRC nproc is not container aware and it may be misleading inside a container
[13:58:59] basically the goal is to automatically set OMP_NUM_THREADS with what the container has to offer
[13:59:05] does it make sense?
[13:59:06] Ack, I thought that was the case, but wasn't sure if I remembered correctly.
[13:59:35] yep yep please raise any question, I make pebcak a lot of times and double checking is always a good safety net :D
[14:00:06] (CR) Klausman: readability: add entrypoint to set environment variables (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012398 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[14:00:17] originally I hoped to have a way to set the variable in python directly
[14:00:21] but it turned out to be a mess
[14:02:56] sorry folks, okta issues
[14:35:30] Machine-Learning-Team, Epic, Patch-For-Review: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines - https://phabricator.wikimedia.org/T360428#9642293 (klausman)
[14:37:54] Machine-Learning-Team, Patch-For-Review: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines - https://phabricator.wikimedia.org/T360428#9642298 (klausman)
[14:39:01] Machine-Learning-Team: Error handling in Batch Predictions for RevertRisk Models - https://phabricator.wikimedia.org/T360406#9642329 (calbon)
[14:47:22] Machine-Learning-Team: Add GPU check in all images - https://phabricator.wikimedia.org/T360212#9642431 (calbon) a: isarantopoulos
[14:47:30] Machine-Learning-Team: Add GPU check in all images - https://phabricator.wikimedia.org/T360212#9642432 (calbon)
[14:48:33] Machine-Learning-Team, Patch-For-Review: Support building and running of articletopic-outlink model-server via Makefile - https://phabricator.wikimedia.org/T360177#9642439 (calbon)
[14:50:07] Machine-Learning-Team: Run unit tests for the inference-services repo in CI - https://phabricator.wikimedia.org/T360120#9642458 (calbon) a: elukey
[14:55:11] Machine-Learning-Team, Patch-For-Review: Set automatically libomp's num threads when using Pytorch - https://phabricator.wikimedia.org/T360111#9642466 (calbon) a: elukey
[14:55:39] Machine-Learning-Team, Observability-Metrics: SLO dashboards for Lift Wing showing unexpected values - https://phabricator.wikimedia.org/T359879#9642486 (calbon) a: elukey
[14:57:20] Machine-Learning-Team: Error handling in Batch Predictions for RevertRisk Models - https://phabricator.wikimedia.org/T360406#9642506 (achou)
[15:07:22] elukey: what's the correct way to address a broken disk? ml-serve2008 has had one of its unused ones fail
[15:07:46] i.e. do we make a ticket for DCops, do they see those alerts by themselves? Something else?
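The nproc-vs-cgroups point discussed above ([13:53:20]–[13:59:35]) can be sketched roughly like this. This is a hedged illustration, not the actual script from the patch: the entrypoint asks a cgroupsv2-aware Python snippet for the CPU count, because `nproc` reports the host's CPUs and is misleading inside a container with a CPU limit.

```shell
# Hypothetical sketch of the entrypoint idea (the real Python call in the
# patch may differ): derive OMP_NUM_THREADS from the container's cgroupsv2
# CPU quota instead of nproc.
detect_container_cpus() {
    python3 - <<'PY'
import os

quota = period = None
try:
    # cgroupsv2 exposes the CPU limit as "<quota> <period>" (or "max <period>").
    quota, period = open("/sys/fs/cgroup/cpu.max").read().split()
except (FileNotFoundError, ValueError):
    pass
if quota and quota != "max":
    # Round up: a 2.5-CPU quota should still allow 3 threads.
    print(-(-int(quota) // int(period)))
else:
    # No quota visible: fall back to the CPUs this process may run on.
    print(len(os.sched_getaffinity(0)))
PY
}

export OMP_NUM_THREADS="$(detect_container_cpus)"
```

Inside a container with a CPU limit this yields the quota (rounded up) rather than the host's core count, which is what `nproc` would return.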
[15:07:55] https://alerts.wikimedia.org/?q=alertname%3DSmartNotHealthy&q=cluster%3Dml_serve&q=team%3Dsre&q=%40receiver%3Ddefault
[15:11:00] klausman: so in theory there should already be a task opened to dcops automagically if there is a clear failure, ops-codfw should have it
[15:11:07] if not, yeah let's open a task
[15:12:40] ack, will see to it
[15:13:20] (PS5) Elukey: readability: add entrypoint to set environment variables [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012398 (https://phabricator.wikimedia.org/T360111)
[15:13:33] (CR) Elukey: readability: add entrypoint to set environment variables (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012398 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[15:13:47] (CR) Elukey: readability: add entrypoint to set environment variables (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012398 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[15:14:31] kevinbazira: really nice follow-ups for the logo detection model, thanks a lot :)
[15:18:33] (CR) Ilias Sarantopoulos: [C:+1] "This looks nice!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012398 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[15:29:08] klausman: ok if I merge --^ ?
[15:29:31] (PS6) Elukey: readability: add entrypoint to set environment variables [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012398 (https://phabricator.wikimedia.org/T360111)
[15:29:32] Looking
[15:29:54] Yes!
[15:30:00] super thanks :)
[15:31:39] isaranto: https://phabricator.wikimedia.org/P58818
[15:31:58] (CR) Elukey: [C:+2] readability: add entrypoint to set environment variables [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012398 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[15:32:00] isaranto: ---^ I ran into an error when I tried to run the hf server with the local bloom model. am I missing sth?
[15:32:12] lemme look
[15:32:44] (Merged) jenkins-bot: readability: add entrypoint to set environment variables [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012398 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[15:35:01] Machine-Learning-Team, DC-Ops, ops-codfw: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) - https://phabricator.wikimedia.org/T360446 (klausman) NEW
[15:37:11] ^^^ that's the broken disk ticket.
[15:37:32] I'll move it to "watching" on our board
[15:41:43] super :)
[15:41:58] ok, finally the patch for OMP_NUM_THREADS seems to work fine in staging :)
[15:51:22] (PS1) Elukey: Rename entrypoint.sh to ci_entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012701 (https://phabricator.wikimedia.org/T360111)
[15:54:46] \o/
[16:02:14] nice work elukey!
[16:25:55] thanksss
[16:26:04] I am sending a couple of patches to extend it to other model servers
[16:26:23] so the OMP threads question (and possibly others) will be fixed once and for all
[16:28:31] (PS1) Elukey: Set most of the model servers to run a specific entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111)
[16:30:44] all right, patches ready to go!
[16:30:49] I haven't tested them all of course
[16:30:55] lemme know first if you like the idea
[16:36:04] I like it but I have one consideration. shall we discuss here instead of the patch?
[16:37:32] sure!
[16:39:19] what happens if I want to run a different command as an entrypoint? (other than python3 **/model.py). I see 2 options:
[16:39:19] 1. accept the command as an argument for the script (or just the path)
[16:39:19] 2. make the script only detect cpus and then define the command in the blubber entrypoint directive, e.g. `entrypoint: "./entrypoint.sh;python3 model_server/model.py"`
[16:40:28] isaranto: the latter may not work, not sure if the export will apply, we need to check
[16:40:57] but in theory in case we want to define a custom command we can totally have a new entrypoint
[16:41:05] my main concern is what happens if I want to define a different entrypoint like I do in this horror https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1009783
[16:41:06] it will duplicate code though, not ideal
[16:41:27] yeah no big deal, exactly
[16:41:42] nono this is a good use case
[16:41:57] so instead of the one-liner entrypoint I'd use a bash script
[16:42:08] the hf blubber config I mean
[16:42:27] and inside it we could in theory add the OMP variable as well
[16:42:32] not great yeah
[16:43:42] but if we assume that 90% of the time we do python3 model_server/model.py and the other 10% something custom, we could live with a little duplication
[16:44:31] or, we could do something like this
[16:44:42] 1) we create a bash script with only env variables
[16:45:11] 2) we create entrypoint.sh, maybe a "standard" one and other ones that at the beginning do "source bash-file-with-variables"
[16:45:44] so we have the flexibility of creating custom entrypoints in bash scripts for every model
[16:45:55] while keeping a single file with env variables
[16:46:01] would it be ok?
[16:48:02] I think the env file being separate is a good idea.
[16:48:33] yes I agree as well. I mean I don't want to be picky about it, but since we're discussing it... :)
[16:48:35] super lemme amend the whole mess
[16:48:44] nono please, I didn't think about the use case
[16:48:47] not picky at all
[16:54:54] (PS2) Elukey: Set most of the model servers to run a specific entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111)
[16:55:01] aiko: I didn't forget about the above error message with the hf image. I am checking now locally
[16:56:37] isaranto: updated, tested for readability and it works :)
[16:56:47] lemme know (even tomorrow) if it solves the concern
[16:57:39] Machine-Learning-Team: Investigate if it is possible to reduce torch's package size - https://phabricator.wikimedia.org/T359569#9643104 (elukey) a: elukey→klausman
[17:00:40] klausman: eval times for the istio latency SLI are now ~18s https://thanos.wikimedia.org/rules#istio_slos
[17:00:48] (was ~30+)
[17:01:03] still not great but a big improvement
[17:02:33] isaranto: thanks! I'm also looking at the kserve code to figure it out
[17:05:02] (CR) Ilias Sarantopoulos: [C:+1] "Nice work!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[17:05:59] (CR) Ilias Sarantopoulos: [C:+1] Set most of the model servers to run a specific entrypoint.sh (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[17:06:24] elukey: +1 with a quick question, more out of curiosity
[17:07:52] (CR) Klausman: [C:+1] Set most of the model servers to run a specific entrypoint.sh (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[17:12:17] (CR) Elukey: Set most of the model servers to run a specific entrypoint.sh (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012711 (https://phabricator.wikimedia.org/T360111) (owner: Elukey)
[17:12:26] thanks a lot for the review folks!
[17:12:33] I'll do more tests tomorrow and I'll merge in case
[17:14:29] aiko: I think I messed up the local run in the end. lemme work on it a bit and retest it. I'm currently building the image again and it takes time :(
[17:15:31] going afk folks! Have a nice rest of the day!
[17:15:32] Initially I was explicitly setting the local dir to `/mnt/models` and then removed it as kserve uses that as default. However as your error shows it requires a cmd argument model_dir if model_id doesn't exist
[17:16:22] Machine-Learning-Team, Research: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455 (Isaac) NEW
[17:17:12] Machine-Learning-Team, Research: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455#9643206 (Isaac) Task created -- @isarantopoulos just let me know if any details are missing or anything I can do to help with next steps when you are ready!
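The env-file split agreed on above ([16:44:42]–[16:46:01]) can be sketched like this. File names and the default value are illustrative assumptions, not taken from the patch: one shared file holds the environment variables, and every entrypoint, standard or model-specific, sources it before exec'ing its own command.

```shell
# Illustrative sketch (paths and the OMP default are hypothetical).
# env.sh: the single shared file with environment variables.
cat > /tmp/env.sh <<'EOF'
export OMP_NUM_THREADS="${OMP_NUM_THREADS:-4}"
EOF

# entrypoint.sh: the "standard" entrypoint for the common case.
cat > /tmp/entrypoint.sh <<'EOF'
#!/bin/bash
# Load the shared variables, then replace the shell with the real command;
# exec keeps the exported variables in the child's environment.
. /tmp/env.sh
exec "$@"
EOF
chmod +x /tmp/entrypoint.sh

# A custom entrypoint (e.g. for the huggingface image) would likewise begin
# with ". /tmp/env.sh", so the variable logic lives in exactly one place.
/tmp/entrypoint.sh sh -c 'echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"'
```

Because the sourcing and the `exec` happen in the same shell, this also sidesteps the [16:40:28] concern about whether exports survive a `script.sh;command` one-liner in the blubber entrypoint directive.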
[17:18:29] I'm heading out now. Have a nice evening, everyone! \o
[17:20:28] have a nice evening folks!
[17:23:30] so we'll need either model_id or model_dir
[17:24:21] I saw the model_dir's default is None https://github.com/kserve/kserve/blob/master/python/huggingfaceserver/huggingfaceserver/__main__.py#L30
[17:25:38] night klausman, night isaranto
[17:26:31] I'm sending a patch and I'm out for the evening :)
[17:30:40] (CR) Ilias Sarantopoulos: "Nice work on using the new error messages!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011305 (https://phabricator.wikimedia.org/T351278) (owner: AikoChou)
[17:34:27] (PS11) Ilias Sarantopoulos: huggingface: add huggingface image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986)
[17:35:25] (PS12) Ilias Sarantopoulos: huggingface: add huggingface image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986)
[17:36:49] (PS13) Ilias Sarantopoulos: huggingface: add huggingface image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986)
[17:37:21] aiko: I added the MODEL_DIR in the path so it should be working, but lemme test it tomorrow morning first
[17:38:05] I left a comment regarding the revertrisk error messages. I think this is a good topic for discussion tomorrow so that we can decide what approach we want to take
[17:38:45] what ORES does (sending a 200 response for a failing request) is not ideal and misleading in some cases (not that there is an ideal approach to it)
[17:39:07] have a nice rest of day/evening!
[18:04:24] o/
[18:05:36] ok! let's discuss tomorrow
[18:07:13] logging off as well :)
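The "either model_id or model_dir" requirement discussed above can be illustrated with the two invocation styles below. Treat this as a hedged usage fragment: the `--model_id`/`--model_dir` flag names come from the kserve huggingfaceserver `__main__.py` linked above, while the model name and paths are purely illustrative.

```shell
# The huggingface server needs a model source, so pass one of the two:

# 1. let the server fetch the model from the Hugging Face Hub by id...
python3 -m huggingfaceserver --model_name bloom --model_id bigscience/bloom-560m

# 2. ...or point it at an already-downloaded copy, e.g. the /mnt/models
#    directory that Lift Wing mounts into the container.
python3 -m huggingfaceserver --model_name bloom --model_dir /mnt/models
```

With neither flag given, `model_dir` defaults to None (per the linked `__main__.py`), which matches the startup error seen in the paste above.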