[01:14:48] 10Machine-Learning-Team, 10Add-Link, 10Chinese-Sites, 10Growth-Team (Sprint 0 (ending Oct 16, 2023)), 10User-notice: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 (10Etonkovidova) Selectively checked some wikis from the list: `xalwiki` has only 6 suggested artic... [06:32:26] (03CR) 10Elukey: "I think that we have to explicitly include asgi in requirements.txt, there is a variant for kserve to use IIRC." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964568 (https://phabricator.wikimedia.org/T333804) (owner: 10Ilias Sarantopoulos) [06:37:13] (03PS1) 10Ilias Sarantopoulos: langid: allow local runs [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965622 (https://phabricator.wikimedia.org/T347404) [06:44:53] 10Lift-Wing, 10Machine-Learning-Team, 10I18n, 10NewFunctionality-Worktype, 10Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (10elukey) >>! In T340507#9245912, @santhosh wrote: > @elukey If I understood that documentation correctly, if the ser... 
[06:48:22] (03PS2) 10Ilias Sarantopoulos: langid: allow local runs [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965622 (https://phabricator.wikimedia.org/T347404) [06:57:55] (03PS5) 10Ilias Sarantopoulos: revscoring: customize kserve logs [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964568 (https://phabricator.wikimedia.org/T333804) [06:58:04] (03PS6) 10Ilias Sarantopoulos: revscoring: customize kserve logs [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964568 (https://phabricator.wikimedia.org/T333804) [07:12:18] o/ [07:21:13] morning folks [07:22:10] (03CR) 10Ilias Sarantopoulos: "It seems that the unit tests are currently failing because the section `endpoint_host_headers` cannot be found in the test configuration f" [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [07:24:06] morning! [07:25:12] kevinbazira: it seems that the unit tests are not failing because of envoy, but because of the missing config section in the test config [07:25:37] lemme know if what I'm saying makes sense or if you need anything else. I will be able to dedicate some time later in the day [07:26:47] sure, Ilias. let me have a look. I'll ping you in case something is not clear [07:26:54] <3 [07:27:15] elukey: I still have the alert patches open as I had some issues writing the tests. I plan to try to finish the kafka lag one at least [07:32:11] super lemme know if you want me to take over [07:32:18] you have a lot of things in parallel :) [07:32:19] 10Machine-Learning-Team: Upgrade Revert Risk Language-agnostic docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347550 (10elukey) @achou found a regression in latency when load testing RR-LA with KServe 0.11 on ml-staging. After some digging, we found out that the Python process running KServe 0... 
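For context on the throttling investigation referenced in the task above: inside a pod, CPU throttling shows up in the cgroup v2 `cpu.stat` file (fields like `nr_throttled` and `throttled_usec`). A minimal sketch of reading it; the helper names are ours and the parser is illustrative:

```python
# Parse cgroup v2 cpu.stat to spot CPU throttling. Inside a container the
# file is typically at /sys/fs/cgroup/cpu.stat; here we only parse the text.
def parse_cpu_stat(text: str) -> dict:
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats

def is_throttled(stats: dict) -> bool:
    # nr_throttled counts scheduling periods in which the cgroup hit its quota
    return stats.get("nr_throttled", 0) > 0
```

Running this against a pod's `cpu.stat` before and after a load test gives the same signal the Grafana container dashboard shows.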
[07:32:36] I added a summary of yesterday's investigation in --^ (Cc: aiko ) [07:32:59] sure, for the moment I'm ok, unless I block things [07:33:03] thanks for the summary [07:33:04] 10Machine-Learning-Team: Upgrade Revert Risk Language-agnostic docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347550 (10elukey) a:03achou [07:33:25] it seems I have an issue running helmfile diff for articlequality for ml-staging [07:33:32] I get an error `Error: Failed to render chart: exit status 1: WARNING: Kubernetes configuration file is group-readable. This is insecure` [07:36:26] the error must be below, after the warning [07:36:40] is there anything else written? [07:39:16] oh yes sorry [07:39:26] ah, the resources seem to be misplaced [07:42:03] 10Machine-Learning-Team: Upgrade Revert Risk Language-agnostic docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347550 (10elukey) My theory may not be correct, I see https://github.com/dmlc/xgboost/pull/7654 that should be [[ https://github.com/dmlc/xgboost/blob/master/NEWS.md#v160-2022-apr-16 |... [07:42:54] yeah I put them under predictor/config/resources following an old patch (which was probably wrong as well) while they should be under predictor/container/resources [07:45:16] +1ed [07:54:44] ok, all good now! [08:15:16] interesting, I am checking https://github.com/dmlc/xgboost/pull/7654/files [08:15:53] so the threading_utils.cc has some logic to get the cpus assigned to the cgroup [08:16:02] but it seems that it supports cgroups v1 [08:16:06] and we use v2 [08:16:27] so I am now wondering if the patch simply returns -1, namely "unlimited" [08:17:34] ahhh see https://github.com/dmlc/xgboost/blob/81a059864aafafa49f5d6bbc27560e74a722f939/src/common/threading_utils.cc#L77 [08:17:41] the last HEAD mentions v2 [08:17:45] ok so this may be the issue [08:18:09] o/ morning [08:18:20] lol https://github.com/dmlc/xgboost/pull/9651 [08:18:22] 2 days ago! [08:18:35] elukey: thanks for the summary! 
reading [08:18:47] and it may be released in https://github.com/dmlc/xgboost/issues/9657 [08:19:18] ohhh so it will solve our issue? [08:19:54] lol, great timing! [08:20:51] 10Machine-Learning-Team: Upgrade Revert Risk Language-agnostic docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347550 (10elukey) I found https://github.com/dmlc/xgboost/pull/9651, released 2 days ago, that is what it would work in our use case. The code that gets the max number of CPUs that a c... [08:21:18] aiko: I think so, at least it will not cause all those threads to be created, limiting the throttling [08:21:33] but I am still wondering why xgboost now works that way [08:25:15] aiko: read the description in https://github.com/dmlc/xgboost/issues/9622 [08:25:20] seems basically what we are seeing [08:26:45] to double check we could add OMP_NUM_THREADS and OMP_THREAD_LIMIT [08:26:49] as env variables [08:28:22] if we see better perf then we've found the issue [08:28:27] * isaranto afk running an errand [08:28:40] yeah, looks like the same issue we have [08:29:11] manually setting the OMP_NUM_THREADS and OMP_THREAD_LIMIT could solve the problem [08:30:09] how many threads should we limit it to? 10? [08:31:01] checking openmp docs [08:31:43] checking also what xgboost should return [08:32:04] so it reads [08:32:05] /sys/fs/cgroup/cpu.max [08:36:02] aiko: I think it should be 1 [08:36:17] root@ml-staging2001:/# cat /sys/fs/cgroup/cpu.max [08:36:18] 100000 100000 [08:36:35] and the code seems to just do 100000/100000 [08:36:49] IIRC we should have two CPUs assigned though [08:38:13] we have 2 cpus? I thought we only assigned 1 [08:39:17] yes yes my bad [08:39:25] 2G of memory, I misremembered [08:39:45] so yes I think xgboost would now set nthreads in DMatrix to 1 [08:39:56] so I'd say OMP_NUM_THREADS=1 [08:40:09] not sure what THREAD_LIMIT does [08:40:38] THREAD_LIMIT is an upper limit [08:41:04] yeah but is it needed if we set OMP_NUM_THREADS? 
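The `cpu.max` logic discussed above can be sketched in a few lines. This mirrors what the cgroup-v2-aware xgboost change is described as doing (quota divided by period, with `max` meaning unlimited), though the helper itself is ours, not taken from the xgboost source:

```python
import math

# Derive the usable CPU count from cgroup v2's /sys/fs/cgroup/cpu.max,
# whose format is "<quota> <period>"; a quota of "max" means unlimited.
def cpus_from_cpu_max(content: str) -> int:
    quota_s, _, period_s = content.strip().partition(" ")
    if quota_s == "max":
        return -1  # unlimited: caller should fall back to the host CPU count
    quota, period = int(quota_s), int(period_s)
    # 100000/100000 -> 1 CPU, matching the ml-staging pod above
    return max(1, math.ceil(quota / period))
```

With the "100000 100000" value seen on ml-staging2001 this yields 1, which is why a cgroup-aware xgboost would set nthreads to 1 instead of spawning one thread per host core.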
[08:41:09] we could also set THREAD_LIMIT to 1 [08:41:21] let's start with OMP_NUM_THREADS=1, what do you think? [08:41:48] thinking out loud, it may also be a way to improve RR performance [08:41:49] yeah [08:41:53] in the future I mean [08:42:02] maybe simply scaling to more cpus will help [08:42:17] anyway, future work, let's try this :) [08:42:20] I read "If OMP_THREAD_LIMIT is set to a value lower than what is specified in OMP_NUM_THREADS, OpenMP will not create more threads than the limit imposed by OMP_THREAD_LIMIT." [08:42:38] okok [08:42:46] but yeah we can start with OMP_NUM_THREADS=1 [08:42:58] let's do it, are you going to file the change? [08:43:08] yes [08:55:36] the patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/965666 [08:57:11] aiko: the CI diff looks wrong :( [08:59:16] custom_env seems to override the base env [08:59:18] that is weird [09:00:07] so to avoid changing the templates etc.. for this test, we should add the other two variables as well [09:00:10] wdyt? [09:00:51] yeah ok [09:02:28] updated, is it correct now? [09:07:00] yeah +1 [09:10:49] deployed [09:11:23] super, let's load test [09:11:49] running load test [09:15:04] seems to have worked, I don't see throttling now [09:15:06] https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=revertrisk&var-pod=revertrisk-language-agnostic-predictor-default-00015-deplobhbt7&var-container=All&viewPanel=5&from=now-15m&to=now [09:18:07] latencies and rps are good? Like before I mean [09:19:08] yes! back to normal [09:19:12] \o/ [09:19:14] \o/\o/\o/ [09:19:50] at this point I'd ask Muniza if it is possible to use xgboost 2.0.1 (when out) [09:20:11] or if they use 1.7.6 for some reason (maybe API etc..) 
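One detail worth noting about the env-var fix deployed above: OpenMP reads `OMP_NUM_THREADS` when its runtime initializes, so the variable has to be present before xgboost (or any other OpenMP-backed library) is imported. Setting it in the pod spec, as the deployment-charts patch does, guarantees that; a rough local-testing equivalent would be:

```python
import os

# Mirror the pod env vars locally. setdefault keeps any value already
# exported in the shell; the "1" matches the staging experiment above.
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("OMP_THREAD_LIMIT", "1")

# import xgboost  # must happen only after the env vars are in place
```

The commented-out import is deliberate: importing xgboost first and setting the variables afterwards has no effect on an already-initialized OpenMP runtime.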
[09:20:37] I am not 100% sure what triggered this behavior, but we should make sure that cgroups v2 are accounted for [09:20:58] for the kserve 0.11 rollout we can use the OMP env var [09:21:08] what do you think folks? [09:21:34] agree! [09:23:01] elukey: q- are we always using cgroups v2? or did we change cgroups v1 to v2 before? [09:23:59] aiko: we have been using v2 for a while yes [09:24:34] ok I see [09:25:35] thank you luca for helping with this issue <3 [09:25:58] <3 [09:27:02] really nice discoveries (for me) about perf etc.. (in containers land) [09:28:33] next I'll talk to Muniza about this issue and suggest they upgrade xgboost when it's out [09:28:35] I thought there were more limitations [09:28:41] super [09:29:20] aiko: I'd also chat with Muniza about the perf of DMatrix, maybe with more cores it could work better [09:29:35] once we upgrade to 2.0.1 we should be able to transparently do it [09:30:44] elukey: yeah makes sense [09:32:23] (I guess that xgboost can leverage more cpus since it is mostly python calling c++ code) [09:33:05] I think if we want to support batch prediction, that is something that could help [09:35:32] we now only use xgb.DMatrix for one set of features for a single prediction. It could be a list of feature sets for batch prediction [09:37:20] that's what I think, but not sure if it's feasible. I'll discuss it further with Muniza [09:46:24] super [09:49:43] aiko: it depends again how we do batch prediction. 
but the DMatrix is just an xgboost data structure similar to a numpy array but more performant with xgboost [09:50:13] so its intention is to be used for multiple feature instances (as happens when training a model, all training data are in a DMatrix) [09:53:46] (03PS2) 10Kevin Bazira: Use envoy proxy to access endpoints external to k8s/LiftWing [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) [09:55:25] (03CR) 10CI reject: [V: 04-1] Use envoy proxy to access endpoints external to k8s/LiftWing [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [10:00:35] (03CR) 10Kevin Bazira: Use envoy proxy to access endpoints external to k8s/LiftWing (031 comment) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [10:07:14] * elukey errand for a bit [10:19:24] (03PS3) 10Kevin Bazira: Use envoy proxy to access endpoints external to k8s/LiftWing [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) [10:24:14] (03CR) 10Kevin Bazira: "Finally got the tests to run without skipping them. Thanks to your suggestions, Ilias!" [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [10:33:42] argh it seems that deploying another service is not as easy as I thought for revscoring. I'm talking about enwiki-mp-articlequality. mwapi fails when I do that [10:34:40] there are parts in the code where we do a .split('-') which means that some config variables aren't (only) what they say they are :) [10:35:09] anyway for now I'm just going to replace the enwiki one in staging to test. 
running 1-2 load tests with the current version to compare [10:45:16] 10Machine-Learning-Team, 10Section-Level-Image-Suggestions, 10Patch-For-Review, 10Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (10CodeReviewBot) mfossati merged https://gitlab.wikimedia.org/repos/data-engineering/air... [10:51:39] kevinbazira: nice work, I'll review! [10:58:35] 10Machine-Learning-Team, 10Section-Level-Image-Suggestions, 10Patch-For-Review, 10Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (10CodeReviewBot) mfossati merged https://gitlab.wikimedia.org/repos/data-engineering/air... [11:07:59] 10Machine-Learning-Team, 10Section-Level-Image-Suggestions, 10Patch-For-Review, 10Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (10CodeReviewBot) mfossati merged https://gitlab.wikimedia.org/repos/structured-data/seal... [11:10:27] 10Machine-Learning-Team, 10Patch-For-Review: [revscoring] Fix Multiprocessing code - https://phabricator.wikimedia.org/T348265 (10isarantopoulos) Initially I tried to deploy a second articlequality model for enwiki in staging but this isnt trivial to do at the moment as we craft the response by extracting thin... [11:21:39] 10Machine-Learning-Team, 10Goal: Establish a standard load testing procedure - https://phabricator.wikimedia.org/T348850 (10isarantopoulos) [11:23:29] I created a task to discuss load testing. It seems that with the work we are doing a standard process is needed. The recent spike with revertrisk, but also the work I'm doing on revscoring, would be a lot easier/faster I think [11:25:27] I got some nice results with articlequality. I'll post all the details on the task after lunch and we can discuss [11:28:11] * isaranto lunch! 
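As a rough illustration of what a wrk-output interpreter for the load-testing task could look like: the sketch below pulls the average latency and requests/sec out of wrk's default text summary, so runs against different model servers become directly comparable. The regexes target wrk's usual output format and should be treated as illustrative, not a finished tool:

```python
import re

# Parse the interesting numbers out of wrk's plain-text summary.
_LAT = re.compile(r"Latency\s+([\d.]+)(us|ms|s)")
_RPS = re.compile(r"Requests/sec:\s+([\d.]+)")
_TO_MS = {"us": 0.001, "ms": 1.0, "s": 1000.0}

def parse_wrk(output: str) -> dict:
    result = {}
    m = _LAT.search(output)
    if m:
        result["latency_avg_ms"] = float(m.group(1)) * _TO_MS[m.group(2)]
    m = _RPS.search(output)
    if m:
        result["requests_per_sec"] = float(m.group(1))
    return result
```

Feeding each wrk run through this and printing one row per isvc would give the comparable summary table discussed below.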
[11:58:11] isaranto: my impression was that we already had a standard load testing strategy [11:58:24] with the work that Aiko did under the tests dir in inference services [11:58:39] we have standard wrk scripts to use etc.. [12:00:40] 10Lift-Wing, 10Machine-Learning-Team, 10Patch-For-Review: kserve CORS error - https://phabricator.wikimedia.org/T348511 (10elukey) p:05Triage→03Medium a:05isarantopoulos→03None [12:00:45] 10Lift-Wing, 10Machine-Learning-Team, 10Patch-For-Review: kserve CORS error - https://phabricator.wikimedia.org/T348511 (10elukey) The api-gateway change was merged, I'll deploy it on Monday :) [12:00:59] 10Lift-Wing, 10Machine-Learning-Team, 10Patch-For-Review: kserve CORS error - https://phabricator.wikimedia.org/T348511 (10elukey) a:03elukey [12:02:10] the CORS patch is merged, I'll deploy api-gw on monday [12:02:15] after that we should be ok in theory :) [12:04:10] (03CR) 10Elukey: [C: 03+1] "I think that we may end up in some weird escaping corner case with helm + env variables + logging format, but it is worth to test!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964568 (https://phabricator.wikimedia.org/T333804) (owner: 10Ilias Sarantopoulos) [12:18:38] nice! [12:21:29] elukey: regarding the load testing: indeed Aiko did great work, but still if I want to test let's say all model servers I need to run each command manually, wait for the results, run the next one, etc. 
And results are printed in a format which is not comparable [12:22:32] so perhaps modifying the lua scripts or adding a python one that interprets wrk output but ofc if the rest of you don't see it as an issue we can skip it [12:23:08] it is surely a good point [12:23:15] my fear is that we over engineer it [12:24:58] but, more than happy to help if we want to do something simple first :) [12:25:17] one downside of the current procedure is that the knowledge about what latency is acceptable is not widespread [12:25:28] for RR Aiko knows it by heart, but the rest don't [12:25:33] so +1 to improve this point [12:25:40] (not sure if what I wrote makes sense) [12:40:18] yes def. I'm all in for something simple, I'm not talking about automating or sth like that [12:40:46] something that would be done in 0.5-1 days of work [12:40:55] famous last words :D [12:43:04] always! [12:44:26] but it depends on how we approach and scope work [12:46:36] (03CR) 10Elukey: Use envoy proxy to access endpoints external to k8s/LiftWing (035 comments) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [12:47:46] isaranto: yeah this is what I am worried about, we are not doing it at the moment :D [12:47:57] we plan for some goals, and work on other tasks that are not scoped/scheduled [12:51:08] 10Machine-Learning-Team, 10Patch-For-Review: Upgrade the readability model server to KServe 0.11.1 - https://phabricator.wikimedia.org/T348664 (10elukey) New Docker image in https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/964561 [12:51:09] this is another topic, not related though. Since we're discussing a task and we create it, it doesn't mean we do it now. 
It can go into the backlog and be planned for another time, or in the end we decide we don't do it because things have changed and we don't need it [12:51:50] true, but usually we do things the other way around :) [12:52:03] I am not saying it as a bad thing, I was just raising a point [12:52:18] and it wasn't directed to that particular task [12:52:32] but I feel, in general, that we don't work on what we plan as we should [12:52:43] (and we don't scope the work at all) [12:52:56] it is a long process but we should start [12:53:14] and it was a response since you mentioned "it depends on how we approach and scope work [12:53:17] " [12:55:13] ok, understood! well then we can just start estimating tasks from next week. before a task leaves the unsorted column it should have an estimate [12:56:08] some weeks ago I started writing sth on estimating tasks but never finished it! [12:57:58] exactly yes I think it would be a great start [12:58:46] Chris proposed something interesting as a high level "size" (or points or whatever) that indicates how big it is (2 weeks, 1 month, etc.. or whatever other units we care about) [12:59:06] then planning should be easier [12:59:10] at least in my opinion [12:59:29] 10Machine-Learning-Team, 10Patch-For-Review: Upgrade the readability model server to KServe 0.11.1 - https://phabricator.wikimedia.org/T348664 (10elukey) Next steps: - Deploy to staging - Run a basic load test to ensure that no regression is introduced - Rollout to prod [13:00:26] yes, I'm totally onboard with this [13:00:49] +1 estimate tasks [13:01:40] elukey: do you want me to help with deployment and test for readability model? [13:01:54] aiko: I was about to ask! Do you want to load test staging? [13:02:06] I am going to deploy readability in a min [13:02:12] yeah I can do that :) [13:02:26] will tell you when I am done :) [13:04:41] aiko: done! [13:05:40] ah snap crash loop [13:06:01] ModuleNotFoundError: No module named 'pkg_resources' :_ [13:06:22] filing the patch.. 
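The crash loop above comes from `pkg_resources`, which is shipped by setuptools rather than the Python standard library, so a slim Debian-based image needs `python3-setuptools` installed. A quick way to check an image before deploying (the helper name is ours):

```python
import importlib.util

# pkg_resources is provided by setuptools; if find_spec returns None,
# the image is missing python3-setuptools and kserve imports will fail.
def has_pkg_resources() -> bool:
    return importlib.util.find_spec("pkg_resources") is not None
```

Running this inside the built container (e.g. `docker run IMAGE python3 -c ...`) would catch the ModuleNotFoundError before a staging deploy.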
[13:07:18] ah same issue that the RRLA had [13:16:22] Morning all! [13:17:39] aiko: I see the python3-setuptools is not in RR-multilingual,wikidata, do you want me to add it? [13:19:35] (03PS1) 10Elukey: blubber: add python3-setuptools to readability [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965736 (https://phabricator.wikimedia.org/T348664) [13:20:09] elukey: I didn't add it because I didn't see the same issue there [13:20:18] aiko: ah really? [13:20:20] weird... [13:20:21] okok [13:20:25] filed the readability one :) [13:20:59] (03CR) 10Ilias Sarantopoulos: revscoring: customize kserve logs (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964568 (https://phabricator.wikimedia.org/T333804) (owner: 10Ilias Sarantopoulos) [13:21:20] o/ Chris [13:21:21] (03CR) 10AikoChou: [C: 03+1] blubber: add python3-setuptools to readability [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965736 (https://phabricator.wikimedia.org/T348664) (owner: 10Elukey) [13:22:44] elukey: is this caused by blubber or bullseye update? (I mean the fact that we need python3-setuptools) [13:26:12] isaranto: kserve upgrade, Aiko had the same issue with RR LA [13:26:48] ok! thanks [13:26:53] lemme give you the full stacktrace [13:27:17] I remember the RR patch from aiko [13:27:18] isaranto: https://phabricator.wikimedia.org/P52931 [13:27:48] 10Machine-Learning-Team, 10Patch-For-Review: Upgrade the readability model server to KServe 0.11.1 - https://phabricator.wikimedia.org/T348664 (10elukey) Found the following issue: ` Traceback (most recent call last): File "/srv/readability/model-server/model.py", line 6, in import kserve Fil... [13:27:52] it seems coming from Ray [13:28:10] ok to proceed with my change? 
[13:28:15] I can wait otherwise [13:31:11] (03PS3) 10Ilias Sarantopoulos: langid: allow local runs [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965622 (https://phabricator.wikimedia.org/T347404) [13:31:29] (03CR) 10Ilias Sarantopoulos: [C: 03+1] blubber: add python3-setuptools to readability [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965736 (https://phabricator.wikimedia.org/T348664) (owner: 10Elukey) [13:31:43] (03CR) 10Elukey: [C: 03+2] blubber: add python3-setuptools to readability [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965736 (https://phabricator.wikimedia.org/T348664) (owner: 10Elukey) [13:31:50] thanks :) [13:42:48] 10Machine-Learning-Team: [revscoring] Fix Multiprocessing code - https://phabricator.wikimedia.org/T348265 (10isarantopoulos) Some results from running load tests **Single Process** ` isaranto@deploy2002:~/load_testing$ wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:304... [13:44:21] I updated the results for mp above. Your input is more than welcome! [13:44:52] 10Machine-Learning-Team: Establish a standard load testing procedure - https://phabricator.wikimedia.org/T348850 (10isarantopoulos) [13:46:59] 10Machine-Learning-Team: Establish a standard load testing procedure - https://phabricator.wikimedia.org/T348850 (10isarantopoulos) My initial motivation came from having more interpretable results when running comparison that just pasting results like I did in this [[ https://phabricator.wikimedia.org/T348265#... [13:47:09] 10Machine-Learning-Team: [revscoring] Fix Multiprocessing code - https://phabricator.wikimedia.org/T348265 (10elukey) +1, looks great! I would even go further and test it with 4 cores, to see if it improves :) [13:47:13] isaranto: great work! I wrote that we could do a load test with 4 cpus assigned too [13:47:42] mmm now that I think about it [13:47:46] greedy! 
[13:47:47] hehe [13:47:54] with two extra workers, we should set 3 cpus [13:48:05] one for the eventloop, and two others for the processes [13:48:13] ok [13:48:30] I can quickly patch the isvc in staging if you are up for an extra test [13:48:40] yes please! [13:49:39] I'm not sure if it will improve but worth to try. The main goal is for the latencies to be stable, which seems to be done with 2 workers [13:49:48] lemme know when to run the tests [13:50:08] did you check by any chance the kubernetes container dashboard to see if there was throttling etc.? [13:50:28] nope...! [13:50:38] checking now [13:51:02] new articlequality pod coming up [13:51:10] ah snap no wrong settings [13:51:11] uff [13:51:12] lemme fix [13:51:45] ok up! [13:51:49] So 3 cpus and 2 workers [13:51:55] let's see if it is better [13:51:58] * elukey brb [13:53:34] thanks! [13:53:53] I saw some throttling which was improved when we added 2 workers [13:56:02] elukey: wouldn't we want to test 3 cpus 4 workers? I see ASYNCIO_AUX_WORKERS is set to 2 , as it was before [13:57:46] (03PS4) 10Ilias Sarantopoulos: langid: allow local runs [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/965622 (https://phabricator.wikimedia.org/T347404) [13:58:17] isaranto: I would like to see how it goes with 3 cpus and 3 processes total (2 workers and 1 ioloop thread/process) [13:58:42] to remove any contention [13:59:29] ok! results are the same with the above setup. Improved very little [13:59:35] okok [13:59:59] I would like to keep cpus == processes if possible anyway [14:00:07] so I'd go 4 workers and 5 cpus [14:00:08] 10Machine-Learning-Team: [revscoring] Fix Multiprocessing code - https://phabricator.wikimedia.org/T348265 (10isarantopoulos) With 3 cpus and 2 workers we have the following results ` isaranto@deploy2002:~/load_testing$ wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443... [14:00:08] ok? [14:02:06] ok! 
I'll try a change with 5 cpus and 4 workers and we can discuss again when you're back [14:02:16] nono lemme patch it [14:02:20] so you can test it [14:03:29] isaranto: ready to go [14:03:41] aa ok super thanks! [14:04:11] really curious [14:09:32] aiko: readability ready in staging! Don't feel that you need to test now, we can do it on monday :) [14:10:36] 10Machine-Learning-Team: [revscoring] Fix Multiprocessing code - https://phabricator.wikimedia.org/T348265 (10isarantopoulos) With 5 cpus and 4 workers ` isaranto@deploy2002:~/load_testing$ wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-articlequali... [14:11:11] 10Machine-Learning-Team: Upgrade Revert Risk Multilingual docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347551 (10elukey) a:05elukey→03None [14:11:12] updated with results, I added a final table ppl can use as a summary. clearly show sth that scales [14:11:34] wow [14:11:44] this is really great [14:11:54] so MP was useful but we didn't know how! :) [14:11:58] on the top of the table I show total responses received [14:12:37] I think that we could move the most horrible isvcs to MP configs on monday [14:12:47] maybe one for each ns [14:12:51] and see how it goes [14:12:59] wdyt? [14:13:06] the difference is in the samples that we tested. now we had some that took a lot of time [14:13:15] I would do just one for the time being and monitor it [14:13:31] we are also enabling MP only for preprocess, that is great [14:13:53] do you think we should do more at once? [14:14:20] one for each ns seems safe enough, but we can start with enwiki-articlequality [14:14:42] (back in a bit) [14:15:08] ok! on monday I will be deploying all revscoring so it is a good day to do it [14:17:31] wow nice results and the summarised table! [14:18:13] elukey: ack! 
I'll work on it on monday :) [14:30:18] 10Machine-Learning-Team: Investigate procuring and installing GPUs on Lift Wing - https://phabricator.wikimedia.org/T327923 (10elukey) 05Open→03Resolved a:03elukey We are in the process to order an AMD Instinct MI100, we'll open new tasks to test it :) [14:30:20] 10Machine-Learning-Team, 10Epic: Experiment with GPUs in the Machine Learning infrastructure - https://phabricator.wikimedia.org/T333462 (10elukey) [14:32:23] 10Machine-Learning-Team, 10Wikilabels: Update wikilabel's dependencies - https://phabricator.wikimedia.org/T325367 (10elukey) Next step is to figure out if we need Wikilabels, if not I'd just remove the cloud config. @calbon @kevinbazira I don't recall if we decided to keep https://labels.wmflabs.org/ or not... [15:09:57] Got a test for the Kafka consumer lag alert that works \o/ [15:12:03] woohoo \o/ [15:13:12] nice!! [15:13:19] going afk folks! Have a nice weekend! [15:14:03] bye Luca! have a great weekend :) [15:22:07] ciao! Have a great weekend! [15:22:58] kevinbazira: regarding the patch for rec-api I don't have anything to add after Luca's comments. If you fix the replication issues by creating a function we would be good to go [15:23:11] if you haven't done so please feel free to do it on Monday <3 [15:23:24] logging off as well folks \o/ [18:02:54] 10Machine-Learning-Team, 10Section-Level-Image-Suggestions, 10Patch-For-Review, 10Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (10mfossati) ==Update== - All requests are merged - manually released & packaged [version... [18:05:12] 10Machine-Learning-Team, 10Section-Level-Image-Suggestions, 10Patch-For-Review, 10Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (10mfossati) Moving away from code review, pending deployment & production monitoring. 
[18:46:47] 10Machine-Learning-Team, 10Section-Level-Image-Suggestions, 10Patch-For-Review, 10Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (10xcollazo) > (one is 1.22 GB, the other 1.64) !! Maybe we can add them manually to unb... [19:01:02] 10Machine-Learning-Team, 10Section-Level-Image-Suggestions, 10Patch-For-Review, 10Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (10mfossati) >>! In T325316#9250803, @xcollazo wrote: >> (one is 1.22 GB, the other 1.64)... [19:20:31] 10Machine-Learning-Team, 10Section-Level-Image-Suggestions, 10Patch-For-Review, 10Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (10xcollazo) Let's first copy the artifacts to HDFS manually: My script, for reference:... [19:24:30] 10Machine-Learning-Team, 10Section-Level-Image-Suggestions, 10Patch-For-Review, 10Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (10xcollazo) > Would it be viable to bump the memory for future deployments? I think so?... [20:01:21] (03PS1) 10Umherirrender: Use options-messages to delay message parsing on Special:Preferences [extensions/ORES] - 10https://gerrit.wikimedia.org/r/965799 [23:03:25] (03CR) 10Jforrester: [C: 03+2] Use options-messages to delay message parsing on Special:Preferences [extensions/ORES] - 10https://gerrit.wikimedia.org/r/965799 (owner: 10Umherirrender) [23:37:48] (03Merged) 10jenkins-bot: Use options-messages to delay message parsing on Special:Preferences [extensions/ORES] - 10https://gerrit.wikimedia.org/r/965799 (owner: 10Umherirrender)