[06:17:03] Good morning folks!
[06:49:55] (PS1) Ilias Sarantopoulos: redability: trigger new build [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010450 (https://phabricator.wikimedia.org/T353461)
[06:50:28] CI changes were deployed yesterday, so I'm building a new image for readability so that we can deploy it today
[06:59:51] (CR) Hashar: [C: +2] SqlModelLookup.php: Document that empty cache bypasses is intentional [extensions/ORES] - https://gerrit.wikimedia.org/r/1010236 (https://phabricator.wikimedia.org/T184938) (owner: Krinkle)
[07:02:46] (Merged) jenkins-bot: SqlModelLookup.php: Document that empty cache bypasses is intentional [extensions/ORES] - https://gerrit.wikimedia.org/r/1010236 (https://phabricator.wikimedia.org/T184938) (owner: Krinkle)
[07:45:10] (CR) Ilias Sarantopoulos: [C: +2] redability: trigger new build [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010450 (https://phabricator.wikimedia.org/T353461) (owner: Ilias Sarantopoulos)
[07:46:58] Machine-Learning-Team, ORES, Patch-For-Review: Add httpbb tests for ores-legacy - https://phabricator.wikimedia.org/T359871#9622579 (isarantopoulos) In the attached patch I brought some old ores httpbb tests back to life. httpbb doesn't seem to support having a boolean in the body's response e.g. `tr...
[07:58:59] new readability image is here -> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1010483
[08:53:01] Machine-Learning-Team: Add a util function in python to detect GPU - https://phabricator.wikimedia.org/T359793#9622632 (achou) a: achou
[09:13:13] Machine-Learning-Team, Research: Explore using revertrisk language agnostic API in a pre-save context - https://phabricator.wikimedia.org/T356102#9622715 (kostajh)
[09:22:12] Morning!
[09:31:10] Machine-Learning-Team, Research: Explore using revertrisk language agnostic API in a pre-save context - https://phabricator.wikimedia.org/T356102#9622766 (kostajh)
[09:32:59] o/ Tobias!
[09:57:42] (PS1) AikoChou: Makefile: install requirements.txt for python/*_utils [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010493
[10:50:38] * aiko lunch
[11:09:18] (CR) Ilias Sarantopoulos: [C: +1] Makefile: install requirements.txt for python/*_utils [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010493 (owner: AikoChou)
[11:09:34] (PS3) Ilias Sarantopoulos: revertrisk: remove obsolete step from README [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009723
[11:10:58] (PS2) Ilias Sarantopoulos: revertrisk-batch: add env var CLASSIFIER_BATCH_SIZE to batch model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008837 (https://phabricator.wikimedia.org/T355656) (owner: AikoChou)
[11:25:26] * klausman lunch
[11:58:09] * isaranto lunch!
[13:06:44] (PS1) AikoChou: Add a util function to detect GPU in resource_utils module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010515 (https://phabricator.wikimedia.org/T359793)
[13:07:51] hello folks!
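The GPU-detection util being reviewed above is not shown in the log. As a minimal sketch of what such a helper could look like, assuming detection via Linux device nodes (`/dev/kfd` for the AMD/ROCm driver, `/dev/nvidia*` for the NVIDIA driver) — the function name and approach here are illustrative, not the code that was actually merged:

```python
import glob
import os


def gpu_is_available() -> bool:
    """Best-effort GPU detection via Linux device nodes.

    /dev/kfd is created by the amdgpu/ROCm kernel driver;
    /dev/nvidia0, /dev/nvidia1, ... by the NVIDIA driver.
    Returns False on hosts with no GPU driver loaded.
    """
    if os.path.exists("/dev/kfd"):
        return True
    if glob.glob("/dev/nvidia[0-9]*"):
        return True
    return False
```

Checking device nodes avoids importing a heavy framework like torch just to answer a yes/no question, which matters given the torch package-size concerns discussed later in the day.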
[13:08:20] (CR) AikoChou: [C: +1] revertrisk: remove obsolete step from README [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009723 (owner: Ilias Sarantopoulos)
[13:08:54] (CR) AikoChou: [C: +2] Makefile: install requirements.txt for python/*_utils [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010493 (owner: AikoChou)
[13:09:21] hiii Luca o/
[13:11:11] (CR) AikoChou: [V: +2 C: +2] Makefile: install requirements.txt for python/*_utils [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010493 (owner: AikoChou)
[13:13:22] as FYI: https://phabricator.wikimedia.org/T359879
[13:13:24] (PS2) AikoChou: Add a util function to detect GPU in resource_utils module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010515 (https://phabricator.wikimedia.org/T359793)
[13:13:43] this is the task from Observability to investigate the weird results in the SLO dashboards
[13:13:48] hello aiko o/
[13:13:58] o/ Luca and aiko!
[13:14:13] (PS4) Ilias Sarantopoulos: revertrisk: remove obsolete step from README [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009723
[13:14:34] isaranto: o/ ok if I file a change to deploy readability?
[13:14:43] so we can test both dragonfly and kserve 0.11
[13:14:58] I was just going to say that!
[13:14:59] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1010483
[13:15:09] ah!
[13:15:16] we can deploy that now if you want
[13:15:35] +1ed!
[13:15:39] super, we can do it
[13:15:57] lemme grab my last coffee of the day and I'm on it
[13:19:59] (CR) Elukey: Add a util function to detect GPU in resource_utils module (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010515 (https://phabricator.wikimedia.org/T359793) (owner: AikoChou)
[13:21:35] (CR) Klausman: Add a util function to detect GPU in resource_utils module (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010515 (https://phabricator.wikimedia.org/T359793) (owner: AikoChou)
[13:23:31] I'm deploying readability on staging now
[13:23:36] super
[13:23:40] checking dragonfly logs
[13:24:54] 2024-03-12 13:24:16.540 INFO sign:2010962 : dfget url:https://docker-registry.discovery.wmnet/v2/wikimedia/machinelearning-liftwing-inference-services-readability/blobs/sha256:74b022bb79e87f21256273490fdc55c781ccf33b1bc120529df4dd4e6715b15c [SUCCESS] cost:12.887s
[13:25:09] from logstash or from the deployment/pod?
[13:25:18] \o/
[13:25:28] on the host, directly via /var/lib/dragonfly-dfdaemon/logs/dfdaemon.log
[13:25:33] (CR) CI reject: [V: -1] Add a util function to detect GPU in resource_utils module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010515 (https://phabricator.wikimedia.org/T359793) (owner: AikoChou)
[13:26:26] so in theory, if we kill the pod now and the scheduler places it on ml-staging2002, we should see it pulling the image from ml-staging2001
[13:27:32] trying to do it
[13:27:59] could you also check the number of threads it uses?
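The dfdaemon log line quoted at 13:24:54 encodes the blob URL, the result, and the download time. A throwaway parser for lines of that shape (the regex is inferred from the single line shown in this log, not from Dragonfly's documented log format):

```python
import re

# Matches lines like:
# 2024-03-12 13:24:16.540 INFO sign:2010962 : dfget url:https://... [SUCCESS] cost:12.887s
DFGET_RE = re.compile(
    r"dfget url:(?P<url>\S+) \[(?P<status>\w+)\] cost:(?P<cost>[\d.]+)s"
)


def parse_dfget(line: str):
    """Return (url, status, cost_seconds) from a dfget log line, or None."""
    m = DFGET_RE.search(line)
    if not m:
        return None
    return m.group("url"), m.group("status"), float(m.group("cost"))


line = (
    "2024-03-12 13:24:16.540 INFO sign:2010962 : dfget "
    "url:https://docker-registry.discovery.wmnet/v2/wikimedia/"
    "machinelearning-liftwing-inference-services-readability/blobs/"
    "sha256:74b022bb79e87f21256273490fdc55c781ccf33b1bc120529df4dd4e6715b15c "
    "[SUCCESS] cost:12.887s"
)
print(parse_dfget(line))
```

Something like this could turn an eyeball check of `/var/lib/dragonfly-dfdaemon/logs/dfdaemon.log` into a quick per-blob timing summary when comparing pulls on 2001 vs 2002.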
[13:28:23] it got re-created on 2001
[13:28:24] :D
[13:28:30] yes sure
[13:29:04] You could cordon (but not drain) 2001 and kill the pod
[13:29:20] That way, it must schedule on 2002
[13:29:53] yep yep makes sense, the scheduler keeps using 2001
[13:31:09] it may be that the scheduler sees the image already in storage on 2001, or it knows the last run was there and was externally terminated (i.e. the host didn't fail and it wasn't removed for priority reasons), and thus tries to schedule it there.
[13:33:40] sure
[13:34:05] so I see some logs on 2002 but it is not clear where the image was downloaded from
[13:35:26] I wonder if one could see it in netstat as it is happening?
[13:35:38] in theory it seems that some chunks were served from 2001
[13:39:49] isaranto: I see ~165 threads
[13:40:12] ... :(
[13:45:51] isaranto: wait a sec, we are still manually setting the thread count via code though, no?
[13:46:00] thread_count = int(os.environ.get("NUM_THREADS", get_cpu_count()))
[13:46:19] in fact I see INFO:root:ReadabilityModel initialized with 1 threads
[13:46:47] so no idea what all those threads are doing, but we shouldn't be testing the new catboost auto-recognize cpus in theory
[13:46:54] that is an attempt that didn't really work, but you're right it is used in catboost
[13:47:51] how come it didn't really work?
[13:48:06] I recall that we didn't see the cpu throttling problems
[13:49:47] never a joy in this line of work
[13:50:23] iirc it didn't do anything
[13:51:20] > but we shouldn't be testing the new catboost auto-recognize cpus in theory
[13:51:20] I didn't understand what you're saying. can you clarify?
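The `thread_count` one-liner quoted at 13:46:00 takes `NUM_THREADS` from the environment and falls back to a CPU count. A self-contained sketch of that pattern — `get_cpu_count` here is a stdlib stand-in for the repo's helper, which may differ (e.g. by also reading cgroup CPU quotas inside a pod):

```python
import os


def get_cpu_count() -> int:
    """Stand-in for the repo's helper: CPUs visible to this process."""
    # os.sched_getaffinity respects CPU pinning, unlike os.cpu_count().
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:  # platforms without sched_getaffinity
        return os.cpu_count() or 1


# NUM_THREADS, if set, wins; otherwise fall back to the detected CPU count.
thread_count = int(os.environ.get("NUM_THREADS", get_cpu_count()))
```

Note the subtlety discussed in the log: this value is what gets passed explicitly to the model (hence "initialized with 1 threads"), while OpenMP worker threads are governed separately by `OMP_NUM_THREADS`, so the two counts can disagree.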
[13:52:10] in https://gitlab.wikimedia.org/trokhymovych/readability-liftwing/-/blob/main/readability/models/readability_bert/model.py?ref_type=heads#L97 we don't use the -1 value for num_threads, but we force a number (1 in this case)
[13:52:21] -1 triggers the auto-recognize code
[13:53:26] yes
[13:53:59] but I see that a curl for readability staging hangs now
[13:54:00] sigh
[13:55:28] so, I think that OMP_NUM_THREADS is not governed by catboost's cpu recognize code
[13:56:07] at this point omp threads and catboost threads are separate?
[13:56:50] ¯\_(ツ)_/¯
[13:56:56] let's talk about it in a bit
[13:57:05] in the meeting I mean
[13:57:12] I'll ask upstream
[14:09:42] Machine-Learning-Team, ORES, Patch-For-Review: Add httpbb tests for ores-legacy - https://phabricator.wikimedia.org/T359871#9623592 (isarantopoulos)
[14:10:14] Machine-Learning-Team, ORES, Patch-For-Review: Add httpbb tests for ores-legacy - https://phabricator.wikimedia.org/T359871#9623594 (isarantopoulos) a: isarantopoulos
[14:39:24] Machine-Learning-Team: Investigate if it is possible to reduce torch's package size - https://phabricator.wikimedia.org/T359569#9623730 (isarantopoulos)
[14:43:07] Machine-Learning-Team: Add Dragonfly to the ML k8s clusters - https://phabricator.wikimedia.org/T359416#9623734 (isarantopoulos)
[14:59:16] Machine-Learning-Team, Continuous-Integration-Infrastructure, Release-Engineering-Team: Python torch fills disk of CI Jenkins instances - https://phabricator.wikimedia.org/T338317#9623848 (hashar) Resolved→Open And that broke multiple CI Jenkins agents again: ` hashar@integration-agent-docker...
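The "~165 threads" above was read on the host. One way to get the same number from inside the pod, to confirm whether OMP/catboost settings actually take effect — a minimal sketch, assuming a Linux container where each OS thread gets a directory under /proc/self/task:

```python
import os
import threading


def current_thread_count() -> int:
    """Number of OS threads in this process.

    On Linux, /proc/self/task has one directory per thread, which also
    counts native threads (e.g. OpenMP workers) invisible to Python.
    Elsewhere, fall back to the Python-level count.
    """
    try:
        return len(os.listdir("/proc/self/task"))
    except FileNotFoundError:
        return threading.active_count()


print(current_thread_count())
```

Comparing this number before and after model load, with different `OMP_NUM_THREADS` values, would show whether the extra threads belong to OpenMP or to something else entirely.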
[15:00:03] (CR) Kevin Bazira: [C: +1] revertrisk: remove obsolete step from README [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009723 (owner: Ilias Sarantopoulos)
[15:05:59] Machine-Learning-Team: Add Dragonfly to the ML k8s clusters - https://phabricator.wikimedia.org/T359416#9623895 (elukey) Today we tested the deployment of a new image in staging, and everything worked as expected. Some notes: * The new image was correctly downloaded from the Registry the first time. * I cor...
[15:14:28] elukey: so far the AMD build for rocm-torch is taking 25 minutes... on a 16-core machine :D
[15:14:59] 7000/8600 files compiled and who knows what comes after
[15:16:26] it takes a ton of time :(
[15:16:38] I ended up with the wheel but it didn't contain the rocm libs
[15:16:50] no idea if they are added in a separate step or not
[15:18:11] (PS3) AikoChou: Add a util function to detect GPU in resource_utils module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010515 (https://phabricator.wikimedia.org/T359793)
[15:30:14] (PS4) AikoChou: Add a util function to detect GPU in resource_utils module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010515 (https://phabricator.wikimedia.org/T359793)
[15:31:58] Machine-Learning-Team, Continuous-Integration-Infrastructure, Release-Engineering-Team: Python torch fills disk of CI Jenkins instances - https://phabricator.wikimedia.org/T338317#9623973 (dancy) We could configure buildkit gc rules for the Docker daemon: https://docs.docker.com/build/cache/garbage-c...
[15:32:05] (CR) AikoChou: Add a util function to detect GPU in resource_utils module (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010515 (https://phabricator.wikimedia.org/T359793) (owner: AikoChou)
[16:02:23] (CR) Ilias Sarantopoulos: Add a util function to detect GPU in resource_utils module (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010515 (https://phabricator.wikimedia.org/T359793) (owner: AikoChou)
[16:02:35] going afk folks, cu tomorrow o/
[16:03:01] have a nice evening, Ilias \o
[16:14:29] bye Ilias o/
[16:25:20] o/
[16:56:32] Machine-Learning-Team, Patch-For-Review: Add Dragonfly to the ML k8s clusters - https://phabricator.wikimedia.org/T359416#9624256 (elukey) dfdaemon logs on 2001 (first pull of the image in the cluster): ` [..] 2024-03-12 13:24:04.218 INFO sign:2010962 : dfget url:https://docker-registry.discovery.wmnet/...
[17:03:34] going afk for today folks!
[17:03:35] o/
[17:03:42] elukey: just a sec
[17:03:48] sure
[17:03:54] elukey: I am nearing the end of the compile, 7765/8569
[17:04:08] And so far I have seen several specific mentions of the arch I selected (gfx900)
[17:04:14] So there's hope :)
[17:04:37] and now enjoy your evening :)
[17:04:40] o/
[17:04:49] \o
[17:30:05] * klausman heading out now as well