[03:26:26] 07artificial-intelligence, 06Machine-Learning-Team, 10ORES, 10MediaWiki-Patrolling: Selective patrol: an AI-based system to prioritize patrolling of edits - https://phabricator.wikimedia.org/T157715#9785018 (10Pppery) [03:27:24] 06Machine-Learning-Team, 10Wikimedia-Site-requests, 07Community-consensus-needed, 07Turkish-Sites: Enable RC patrolling on trwiki - https://phabricator.wikimedia.org/T140475#9785019 (10Pppery) [05:06:43] (03CR) 10Kevin Bazira: "Thank you for the comments, Ilias and Luca. I see your point of view. It would be great if you shared these with the Structured Content te" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1028937 (https://phabricator.wikimedia.org/T363449) (owner: 10Kevin Bazira) [05:43:18] (03PS3) 10DannyS712: Replace custom test mocks with trivial value holders [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1029510 (owner: 10Thiemo Kreuz (WMDE)) [05:43:24] (03CR) 10DannyS712: "resubmit" [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1029510 (owner: 10Thiemo Kreuz (WMDE)) [05:43:39] (03PS2) 10Thiemo Kreuz (WMDE): Add missing type declarations to DB-related class properties [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1029505 [05:43:42] (03CR) 10DannyS712: "resubmit" [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1029505 (owner: 10Thiemo Kreuz (WMDE)) [05:43:46] (03PS2) 10Thiemo Kreuz (WMDE): Make all @covers tags in tests absolute [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1029502 [05:43:48] (03CR) 10DannyS712: "resubmit" [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1029502 (owner: 10Thiemo Kreuz (WMDE)) [05:46:01] Good morning! 
[06:07:09] (03Merged) 10jenkins-bot: Replace custom test mocks with trivial value holders [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1029510 (owner: 10Thiemo Kreuz (WMDE)) [06:07:17] (03Merged) 10jenkins-bot: Add missing type declarations to DB-related class properties [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1029505 (owner: 10Thiemo Kreuz (WMDE)) [06:10:35] (03Merged) 10jenkins-bot: Make all @covers tags in tests absolute [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1029502 (owner: 10Thiemo Kreuz (WMDE)) [06:18:17] (03CR) 10Thiemo Kreuz (WMDE): "Sure, thanks!" [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1029510 (owner: 10Thiemo Kreuz (WMDE)) [06:59:49] (03CR) 10Thiemo Kreuz (WMDE): Replace expensive explode/implode with string manipulation (031 comment) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1029513 (owner: 10Thiemo Kreuz (WMDE)) [08:44:42] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9785333 (10isarantopoulos) @mfossati is there any other way to access the images in the upload stash other than using a cookie. Using a user cookie to access an API do... [09:33:16] isaranto: o/ [09:34:26] I have a thought for the hugging face testing - not sure if you saw it but I deployed nllb-gpu to staging, and it works fine (no error showed etc..). Would it be possible to upgrade the llm image to bookworm? Because we'd be able to test libdrm in that case [09:34:45] if the image shows the error, we would have the root cause [09:34:51] otherwise we rule out another variable [09:34:55] wdyt? [09:34:59] Hey! Ah I missed that [09:35:44] Yes I'll do it.for the moment I was updating torch base image + hf image with latest version(torch 2.3.0). 
But it can wait [09:38:23] isaranto: I can take care of nllb this afternoon, didn't mean to push you to do it now :) [09:39:01] let's do this - go ahead with hf on pytorch 2.2, I'll do nllb on bookworm. Does it sound ok? [09:43:20] elukey: cool! But pytorch 2.3.0 ok? [09:44:23] isaranto: ah snap I thought 2.2, that is the newer base image already present on the docker registry (IIUC we are testing 2.1 with hf atm) [09:44:33] we can create a 2.3 base image in case [09:45:19] (back to Alessandro, will check later <3) [09:46:54] Ok,thanks! let's sync again in the afternoon [10:10:38] \o [10:10:47] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9785501 (10mfossati) >>! In T362749#9785333, @isarantopoulos wrote: > @mfossati is there any other way to access the images in the upload stash other than using a cook... [10:11:00] elukey: I think the `watch` verb somehow got there from the node labeller cmdline (see https://github.com/ROCm/k8s-device-plugin/blob/master/cmd/k8s-node-labeller/README.md) [10:18:40] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9785550 (10MatthewVernon) The short answer is: no, I don't think giving privileged swift access to Lift Wing (or anything else) based on IP is a possible (or really ap... [10:30:58] o/ Tobias [10:42:50] klausman: could be yes! [10:43:21] thanks for the review [10:47:58] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9785581 (10isarantopoulos) Regarding the nllb-gpu deployment: we have successfully tested it when we first obtained the MI100. The deployment was just removed at some poi... 
[10:50:52] (03PS1) 10Ilias Sarantopoulos: llm: bump torch and rocm version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030059 (https://phabricator.wikimedia.org/T362984) [10:52:42] installed the new version of the device plugin to ml-staging2001, all good [10:52:47] * klausman lunch and an errand [10:56:14] * isaranto ditto! [11:51:51] (03CR) 10Thiemo Kreuz (WMDE): "Breakage was apparently because of T364569." [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1029511 (owner: 10Thiemo Kreuz (WMDE)) [11:51:54] (03PS3) 10Thiemo Kreuz (WMDE): Use correct IReadableDatabase interface in queryCallable callbacks [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1029511 [11:53:44] aand also installed on ml-serve1001 [12:20:56] (03PS1) 10Elukey: llm: update to Bookworm [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030084 (https://phabricator.wikimedia.org/T362984) [12:21:37] filed the change to update the llm image to bookworm [12:26:31] (03CR) 10Klausman: [C:03+1] llm: update to Bookworm [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030084 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey) [12:26:39] did a driveby-lgtm :) [12:28:14] thanks! [12:28:28] elukey: o/ in the context of gpu: I wrote this earlier https://phabricator.wikimedia.org/T362984#9785581 [12:28:57] I was confused as I thought you said you tried with a different rocm version [12:30:43] isaranto: nono I didn't know the llm image was tested with the MI100, so I just added the isvc to test.. But we didn't test bookworm, this is what I want to do now [12:31:02] if it works, libdrm is probably not the culprit [12:31:04] yes, I agree on that one [12:31:33] I say probably because there is always the possibility of a weird interaction between rocm version and libdrm etc.. [12:31:39] isaranto: ok if I proceed? 
[12:32:02] (03CR) 10Ilias Sarantopoulos: [C:03+1] llm: update to Bookworm [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030084 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey) [12:32:09] elukey: yes! [12:33:39] if it works, lets also try the torch2.3.0 and rocm 5.7 in the llm image to see if it works as well. I am currently testing that locally [12:36:16] (03CR) 10Elukey: [C:03+2] llm: update to Bookworm [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030084 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey) [12:36:33] super proceeding [12:37:00] (03Merged) 10jenkins-bot: llm: update to Bookworm [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030084 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey) [12:39:50] new kserve version is expected in may https://github.com/kserve/kserve/issues/3648#issue-2268883538 (0.13) [13:07:25] nice! [13:13:11] I manually changed the llm image on staging, it works [13:14:00] so it shouldn't be the os or libdrm [13:15:30] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9785902 (10isarantopoulos) I got an error when trying llm image locally with bullseye-torch2.3.0-rocm5.7 ([[ https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inf... [13:20:34] isaranto: one qs - have you tried pytorch 2.2 with rocm 5.7? Because we'd already have a base image ready to be used [13:21:21] No,but this is what I'm trying now [13:23:54] I guess it is nice that bookworm works! [13:28:06] yes yes! 
[13:28:32] the kernel is always the same as the kubernetes worker one, so the difference between bullseye and bookworm is about deb packages [13:28:49] in theory the drivers are the same, because they are provided by the kernel [13:29:03] at this point it may be ROCm version + Pytorch version [13:29:16] or a PEBCAK from me when building the base image, that is probable [13:30:04] Still puzzling that the symptom is a permission error. [13:32:18] I checked the docker config for the running container (https://phabricator.wikimedia.org/T362984#9784022) and the devices are correctly readable/writable [13:32:24] and the fs perms are ok [13:33:11] yeah, that's what makes the failure so puzzling. [13:35:58] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9785929 (10klausman) >>! In T362984#9785902, @isarantopoulos wrote: > I got an error when trying llm image locally with bullseye-torch2.3.0-rocm5.7 ([[ https://gerrit.wik... [13:40:38] 06Machine-Learning-Team: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#9785936 (10kostajh) >>! In T356102#9747778, @achou wrote: > Hi @kostajh, yes, this is something we can work on this quarter. wonderful, thank yo... [13:47:14] building those images takes soo much time :( [13:47:19] :( [13:47:32] need to step afk for probably a couple of hours, left a message on slack! ttl [13:47:40] take care Luca <3 [14:02:48] Good morning all [14:06:56] hey Chris! [14:12:36] good morning Chris o/ [14:46:38] klausman: thanks for your comment on the task above (torch/rocm issues). it is most likely m1 related. I'm rebuilding the image with the 5.4.2 version cause afair I had been running this [14:46:55] ack!
[14:53:33] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9786161 (10Ladsgroup) Yes, Upload stash shouldn't be accessed directly or indirectly. It is internal to mediawiki and private. You can do it post-upload and add a comm... [15:07:54] (03PS2) 10Ilias Sarantopoulos: llm: bump torch and rocm version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030059 (https://phabricator.wikimedia.org/T362984) [15:10:06] (03PS3) 10Ilias Sarantopoulos: llm: bump torch and rocm 5.7 versions (2.2.1-rocm5.7) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030059 (https://phabricator.wikimedia.org/T362984) [15:11:33] (03PS4) 10Ilias Sarantopoulos: llm: bump torch and rocm 5.7 versions (2.2.1-rocm5.7) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030059 (https://phabricator.wikimedia.org/T362984) [15:36:03] (03CR) 10Klausman: [C:03+1] llm: bump torch and rocm 5.7 versions (2.2.1-rocm5.7) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030059 (https://phabricator.wikimedia.org/T362984) (owner: 10Ilias Sarantopoulos) [15:36:15] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9786270 (10isarantopoulos) >>! In T362749#9786161, @Ladsgroup wrote: > Yes, Upload stash shouldn't be accessed directly or indirectly. It is internal to mediawiki and... [15:52:37] (03CR) 10CI reject: [V:04-1] ores-legacy: add deprecation message to UI endpoint [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030178 (https://phabricator.wikimedia.org/T349996) (owner: 10Ilias Sarantopoulos) [16:00:42] (03CR) 10Elukey: "sanity check: How big is the new image? And the layers?" 
[machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030059 (https://phabricator.wikimedia.org/T362984) (owner: 10Ilias Sarantopoulos) [16:00:44] (03CR) 10Ilias Sarantopoulos: "I am getting some errors related to ctranslate when trying to run it. Perhaps it just needs a version bump as well." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030059 (https://phabricator.wikimedia.org/T362984) (owner: 10Ilias Sarantopoulos) [16:01:54] elukey: unfortunately I'm not done with the above. Will update with the relevant info regarding docker layer sizes etc once it is fixed [16:02:53] (03PS2) 10Ilias Sarantopoulos: ores-legacy: add deprecation message to UI endpoint [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030178 (https://phabricator.wikimedia.org/T349996) [16:03:58] also WIP --^ as I want to test the app first. but if anyone has a suggestion for the deprecation message you're more than welcome [16:04:30] got to go afk folks, have a nice weekend [16:05:51] isaranto: ack! [16:05:56] * elukey back [16:25:25] elukey: unless you need me for anything, I'll head out as well. [16:26:06] yep please! Enjoy your weekend! [16:26:25] You too, once you head out! [16:26:32] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9786444 (10AUgolnikova-WMF) Product wise, it would not be a reasonable option for us and would defeat the goal of preventing bad uploads from coming during the upload... [17:12:52] going afk as well! have a nice rest of the day! [17:23:12] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9786631 (10Ladsgroup) It's not an impossible problem to fix but we should have been informed about this way sooner to be able to come up with a solution that doesn't c... 
[18:02:58] (03CR) 10Bartosz Dziewoński: [C:03+2] Use correct IReadableDatabase interface in queryCallable callbacks [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1029511 (owner: 10Thiemo Kreuz (WMDE)) [18:10:57] (03Merged) 10jenkins-bot: Use correct IReadableDatabase interface in queryCallable callbacks [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1029511 (owner: 10Thiemo Kreuz (WMDE)) [22:42:44] FIRING: LiftWingServiceErrorRate: ... [22:42:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate [22:57:44] RESOLVED: LiftWingServiceErrorRate: ... [22:57:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate