[07:16:08] Good morning! [07:17:28] morning Ilias :D [07:26:42] hey Aiko, welcome back! [07:37:59] 06Machine-Learning-Team: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#9788933 (10kostajh) [07:38:30] 06Machine-Learning-Team: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#9788935 (10kostajh) >>! In T356102#9785936, @kostajh wrote: >>>! In T356102#9747778, @achou wrote: >> Hi @kostajh, yes, this is something we can... [07:40:20] back with full energy! ^^ [07:54:18] that's great! [08:20:15] Welcome back, Aiko! [09:04:19] o/ hi Tobias! [09:09:57] (03CR) 10AikoChou: [C:03+1] "Agree! Let's remember to follow up. Maybe add a comment to note we need at least 1 user per model server." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1025805 (https://phabricator.wikimedia.org/T361881) (owner: 10Ilias Sarantopoulos) [09:10:46] (03CR) 10AikoChou: [C:03+1] "LGTM!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1024425 (https://phabricator.wikimedia.org/T362663) (owner: 10Ilias Sarantopoulos) [09:57:22] * klausman lunch [10:06:55] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9789570 (10AUgolnikova-WMF) Yes, please refer to the requirements in the initial ticket discussed in March https://phabricator.wikimedia.org/T358676#9637065 [10:09:08] (03PS5) 10Ilias Sarantopoulos: revertrisk: update locust results [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1025805 (https://phabricator.wikimedia.org/T361881) [10:28:22] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9789674 (10MatthewVernon) >>! 
In T362749#9789570, @AUgolnikova-WMF wrote: > Yes, please refer to the requirements in the initial ticket discussed in March https://phab... [10:36:35] kevinbazira: o/ I think we can pause the work on logo detection until we figure out how the images are going to be accessed. wdyt? [10:53:11] * isaranto afk lunch [10:59:30] isaranto: o/ I agree. currently following the discussion between the Structured Content team and Data Persistence team as they devise the best solution to access images from the UploadStash. [11:04:55] +1 it seems necessary to have some discussion among us, structured data team, and data persistence team to reach a consensus on a proper solution [11:15:26] (03PS1) 10Tchanders: ContributionsHooksHandler: Inherit documentation from interfaces [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1030911 (https://phabricator.wikimedia.org/T364569) [11:16:33] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9789782 (10Cparle) >>! In T362749#9785550, @MatthewVernon wrote: > If I understand correctly, this is an admin tool that is able to look at what would otherwise be pri... [11:18:33] aiko: o/ the Structured Content team has set up a meeting tomorrow between 10:30 - 10:55 GMT to discuss the next steps. Ilias and I are attendees. Would you like to be added to the meeting? [12:11:17] kevinbazira: yes, please add me in :) [12:16:23] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9789915 (10Ladsgroup) It's not just the privacy. For example, we want to eventually implement uploading from URLs via chunks which means a file being uploaded can have... [12:23:29] aiko: sure sure [12:23:32] done [12:33:57] o/ [12:34:22] \o [12:35:21] 'ello Luca [12:35:39] elukey: I have a question (or five) about the Logstash UI, if-when you have some time. 
[12:37:02] sure, shoot [12:37:27] So I tried to find the bad revids in the viwiki logs, but I can't seem to find them on logstash [12:37:43] I can find slow requests, but LS does not show any request info, as far as I can tell [12:38:25] https://logstash.wikimedia.org/app/dashboards#/view/fa21f5e0-42ef-11ed-ae81-bb78ac0690d3?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:'2024-05-12T03:43:00.000Z',to:'2024-05-12T03:43:59.000Z'))&_a=(description:'',filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:kubernetes.namespace_name,negate:!f,params:(query:revscoring-editquality-revert [12:38:27] ed),type:phrase),query:(match_phrase:(kubernetes.namespace_name:revscoring-editquality-reverted))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:kubernetes.labels.serving_kserve_io%2Finferenceservice.keyword,negate:!f,params:(query:viwiki-reverted),type:phrase),query:(match_phrase:(kubernetes.labels.serving_kserve_io%2Finferenceservice.keyword:viwiki-reverted))) [12:38:29] ),fullScreenMode:!f,options:(hidePanelTitles:!f,useMargins:!t),query:(language:kuery,query:''),timeRestore:!f,title:'KServe%20container%20logs',viewMode:view) [12:38:39] usually what I do is to check the slow preprocess timing and annotate the x-request-id value [12:38:41] gah, that's not useful. But the shortener doesn't work :-/ [12:38:46] then I look for it [12:39:12] How does annotation work? [12:39:13] because the logs about x-request-id -> {input-json} and the slow preprocess logs are separated [12:39:19] but they share the x-request-id [12:39:22] annotation? 
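The correlation described here — slow-preprocess trace lines and input-JSON lines living in separate log entries that share a request id — can be sketched in a few lines. This is a minimal illustration, not team tooling: the two sample lines below are in the shapes kserve emits as pasted later in this log, and the exact surrounding text may differ between kserve versions.

```python
import re

# Example log lines in the shapes seen later in this chat (trace line and
# payload line share the same request id but arrive separately).
LOGS = [
    "INFO:root:JSON payload for the request-id a794750e-9c3e-4cc4-9d89-af8e30bd3ab8: "
    "{'rev_id': 71367241, 'extended_output': False}",
    "2024-05-13 12:58:03.801 kserve.trace requestId: a794750e-9c3e-4cc4-9d89-af8e30bd3ab8, "
    "preprocess_ms: 117.455244064, explain_ms: 0, predict_ms: 2.17628479, postprocess_ms: 0.000953674",
]

TRACE_RE = re.compile(
    r"kserve\.trace requestId: (?P<rid>[0-9a-f-]+), preprocess_ms: (?P<pre>[0-9.]+)")
PAYLOAD_RE = re.compile(
    r"JSON payload for the request-id (?P<rid>[0-9a-f-]+): (?P<payload>.+)")

def slow_requests(lines, threshold_ms=100.0):
    """Map request-id -> input payload for requests whose preprocess time
    exceeded the threshold, joining the two log-line kinds on request id."""
    payloads, slow = {}, {}
    for line in lines:
        if m := PAYLOAD_RE.search(line):
            payloads[m["rid"]] = m["payload"]
        elif (m := TRACE_RE.search(line)) and float(m["pre"]) > threshold_ms:
            slow[m["rid"]] = payloads.get(m["rid"])
    return slow
```

In practice the same join can be done by hand in Logstash/Lucene by pasting the requestId as a search term, which is what the conversation converges on.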
[12:39:36] you said you annotate the x-request-id [12:39:56] no I meant I annotate it somewhere :D [12:40:09] the x-request-id is a request header that istio injects [12:40:13] and kserve logs it [12:40:26] we also log it together with the input json [12:40:56] I'm not even sure I am using the right dashboard :-/ [12:41:09] I was looking at "Kserve container logs" [12:41:28] ah yes that one is fine, you should be able to check everything [12:41:31] but there is also [12:41:44] (03CR) 10Ilias Sarantopoulos: [V:03+2 C:03+2] "Added a comment about the users variable in the config" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1025805 (https://phabricator.wikimedia.org/T361881) (owner: 10Ilias Sarantopoulos) [12:41:49] * elukey finds the link sorry [12:42:17] https://logstash.wikimedia.org/app/dashboards#/view/5ab748b0-d228-11ee-985c-97a00bd32564?_g=h@c823129&_a=h@476e85d [12:42:25] that should lead to something similar [12:42:46] sorry this is better: https://logstash.wikimedia.org/app/dashboards#/view/5ab748b0-d228-11ee-985c-97a00bd32564?_g=h@c823129&_a=h@7884544 [12:42:55] I had some filters set in the other one [12:43:19] mmmm why do I get the superset dashboard [12:43:36] That gives me a superset app logs (kubernetes) page and a whole lot of "error restoring state from URL" messages [12:44:03] ok PEBCAK probably [12:44:04] what's the dashboard's name? I can search in the list of DBs [12:44:07] let's retry! [12:44:08] https://logstash.wikimedia.org/app/dashboards#/view/7f883390-fe76-11ea-b848-090a7444f26c?_g=h@c823129&_a=h@55a2764 [12:44:58] yeah, that works [12:45:58] btw I think the autoscaling worked during the weekend, the problem auto solved quickly [12:46:52] ack, just wanted to make sure I can extract QoD if I ever need to [12:47:43] o/ hiii Luca! [12:48:12] So on that DB, what am I looking for. 
I've narrowed it down to viwiki-pred-019 in codfw and found a bunch of slow requests [12:48:39] Is the _id field the one to filter by? [12:52:42] try to check the x-request-id and how it is logged, it is easier than explaining it over IRC [12:52:56] ack [12:53:01] it is contained in the log itself [12:58:36] I'm struggling to understand how one goes from a message like "INFO:root:Function get_revscoring_extractor_cache took 587.3833 seconds to execute." or "kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 587.0403200000001" to finding the original request. Or do I need to combine searches in more than one dashboard? [13:00:37] I am checking `kubectl logs viwiki-reverted-predictor-default-00019-deployment-bcd64fbxqtnw -n revscoring-editquality-reverted` live atm [13:01:00] for example, this is not a slow request but let's assume so [13:01:01] 2024-05-13 12:58:03.801 kserve.trace requestId: a794750e-9c3e-4cc4-9d89-af8e30bd3ab8, preprocess_ms: 117.455244064, explain_ms: 0, predict_ms: 2.17628479, postprocess_ms: 0.000953674 [13:01:17] in our case, the preprocess_ms value would be super high [13:01:43] the requestId value can be used to find the log with the input json to repro, usually it is logged some lines before [13:01:52] in this case INFO:root:JSON payload for the request-id a794750e-9c3e-4cc4-9d89-af8e30bd3ab8: {'rev_id': 71367241, 'extended_output': False} [13:02:00] this is what I meant [13:02:23] I wrote "usually it is logged some lines before" since it may be a lot before, with batches etc.. [13:02:48] I would suspect that with Lucene mode on LS, one could just paste the reqId and find all relevant entries [13:02:57] klausman: not sure if you've read the backlog from friday, but IIRC the viwiki problem was related to a single IP making batch requests from ores-legacy [13:03:05] there was also the procedure about how I found it [13:05:07] Ah, I think I've figured out what's going on. 
My time window was too narrow, so I only found the "took this long" message or the JSON payload log [13:06:00] With the window set to 48h ago until now, I can find both slow requests and then the corresponding JSON payload entry [13:06:35] okok [13:07:51] One day, me and Logstash will be friends :) [13:08:02] it is like wishing to understand puppet :D [13:08:07] I stopped wishing a long time ago :D [13:08:47] I have to say though, I am pleasantly surprised how quickly Lucene search works once narrowed down to a pod instance [13:12:44] (03CR) 10Jforrester: [C:03+2] ContributionsHooksHandler: Inherit documentation from interfaces [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1030911 (https://phabricator.wikimedia.org/T364569) (owner: 10Tchanders) [13:13:15] anyway, thanks for the help, Luca [13:16:05] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9790147 (10mfossati) >>! In T362749#9789915, @Ladsgroup wrote: > you can just send over the file to liftwing maybe? (we should consider alternative designs and so on).... [13:17:00] (03PS4) 10Ilias Sarantopoulos: llm: bump torch and rocm 5.7 versions (2.2.1-rocm5.7) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030059 (https://phabricator.wikimedia.org/T362984) [13:17:00] (03CR) 10Ilias Sarantopoulos: "Unfortunately there is a really big layer (11.2GB), so this wouldn't work. The new versions of torch+rocm are even bigger." 
[machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030059 (https://phabricator.wikimedia.org/T362984) (owner: 10Ilias Sarantopoulos) [13:21:37] (03CR) 10Elukey: "Try to docker save it and then gzip it, to see more or less how the total size looks gzipped (we cannot see the gzipped size of a single l" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030059 (https://phabricator.wikimedia.org/T362984) (owner: 10Ilias Sarantopoulos) [13:32:58] (03Merged) 10jenkins-bot: ContributionsHooksHandler: Inherit documentation from interfaces [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1030911 (https://phabricator.wikimedia.org/T364569) (owner: 10Tchanders) [13:34:58] Good morning! [13:43:33] \o [13:43:56] o/ [14:05:18] (03CR) 10Ilias Sarantopoulos: "The gzip is 2.28GB!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030059 (https://phabricator.wikimedia.org/T362984) (owner: 10Ilias Sarantopoulos) [14:05:30] o/ Chris! [14:06:36] I am figuring out how I am getting to Google I/O tomorrow. I think I'm riding my bike there? [14:07:19] how far away is it? [14:11:21] mountain view, ouch! sounds like a really really good workout :P [14:13:39] o/ [14:19:06] Well, at least from The Mission to MTV it's mostly flat, so only wind would be a problem there. But I dunno if there are any useful bike routes/paths [14:20:26] Google Maps has a 3h23m route, with assorted hazards :-| [14:23:26] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9790461 (10mfossati) >>! In T362749#9786269, @isarantopoulos wrote: >>>! In T362749#9786161, @Ladsgroup wrote: >> Yes, Upload stash shouldn't be accessed directly or i... 
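Elukey's "docker save it and then gzip it" suggestion is a one-pipe check of roughly what a registry transfer would cost (an approximation only, since registries gzip each layer separately). The image tag below is a placeholder, not the team's actual image name; the second command demonstrates the same pipe on synthetic bytes so it runs anywhere:

```shell
# Estimate the compressed size of a locally built image (placeholder tag):
#   docker save my-llm-image:2.2.1-rocm5.7 | gzip | wc -c
# The same pipe measures any byte stream; e.g. 1 MiB of zeros compresses well:
head -c 1048576 /dev/zero | gzip | wc -c
```

This matches the 2.28GB figure reported in the review: the uncompressed 11.2GB layer shrinks considerably once the whole tarball is gzipped.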
[14:24:10] (03CR) 10Elukey: "Super then this is not a concern :)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030059 (https://phabricator.wikimedia.org/T362984) (owner: 10Ilias Sarantopoulos) [14:24:22] (03CR) 10Elukey: [C:03+1] llm: bump torch and rocm 5.7 versions (2.2.1-rocm5.7) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030059 (https://phabricator.wikimedia.org/T362984) (owner: 10Ilias Sarantopoulos) [14:29:31] (03CR) 10Ilias Sarantopoulos: [C:03+2] "Done" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030059 (https://phabricator.wikimedia.org/T362984) (owner: 10Ilias Sarantopoulos) [14:39:00] (03Merged) 10jenkins-bot: llm: bump torch and rocm 5.7 versions (2.2.1-rocm5.7) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1030059 (https://phabricator.wikimedia.org/T362984) (owner: 10Ilias Sarantopoulos) [14:54:46] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9790620 (10elukey) It seems not to be related to the OS, since nllb-gpu on Bookworm ran fine on ml-staging2001 (with the GPU). Something worth to notice: I tried to set... [15:16:41] (e)Lucky Luke - "reviews faster than his shadow" [15:19:54] klausman: re: istio cassandra changes - what was the error that you saw when deploying? It feels strange that we have to use ips [15:20:23] I'd have guessed that SNI could have been used [15:23:19] have you tried with protocol: TLS ? 
[15:23:51] I'll add to the review [15:25:17] isaranto: ahhaha no your reviews were one-liners [15:25:28] The error I got is in the revert-change https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1028959 [15:25:50] ah yes TLS is mentioned [15:25:52] I might try TLS [15:26:12] yes yes let's hope we can use SNI, otherwise raw IPs are really not great :( [15:26:38] Thing is, the protocol is not HTTP, and I dunno if Istio will understand the Cassandra wire protocol [15:27:01] But maybe it's plain TLS for connection startup and after that, Istio stops caring. [15:27:53] klausman: I think that Istio just needs to be instructed that it is TLS, so it needs to check the SNI value, no cassandra protocol involved [15:28:07] it will just see a TLS connection going through [15:28:57] `resolution` should then also be `DNS`, I figure [15:29:04] isaranto: deployed to staging experimental, let's see if it works [15:29:08] klausman: yes yes [15:30:08] thanks! I was just gonna do that [15:31:51] elukey: I deleted the previous revision for mistral so the pod is now terminating (so the new one can get the GPU) [15:33:38] ah yes I was waiting for the new mixtral one to come up [15:33:52] anyway, if it works without any issues I'll cry [15:35:49] 06Machine-Learning-Team, 13Patch-For-Review: Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons - https://phabricator.wikimedia.org/T363449#9790809 (10kevinbazira) Work on this task has been paused for now as the ML team, Structured Content team, and Data Pers... [15:37:28] elukey: ok for me to merge that change now but defer actual deployment until tomorrow? I ask just in case you may have pending changes for admin_ng that would be hard to deploy separately. [15:37:53] klausman: sure sure, I don't have anything pending [15:37:58] ack, thanks! 
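For context on the `protocol: TLS` / SNI / `resolution: DNS` idea being discussed: an Istio ServiceEntry along these lines would let Envoy route on the TLS handshake's SNI without understanding the Cassandra wire protocol. This is a hypothetical sketch, not the actual deployment-charts change — the hostname is made up, and 9042 is simply Cassandra's standard CQL port:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: cassandra-external
spec:
  hosts:
    - cassandra.example.wikimedia.org   # placeholder; real service names would go here
  location: MESH_EXTERNAL
  ports:
    - number: 9042
      name: tls-cassandra
      protocol: TLS    # Istio only inspects the SNI of the handshake, then passes bytes through
  resolution: DNS      # resolve the hostnames instead of pinning raw IPs
```

With `protocol: TLS`, the connection is opaque to Envoy after the ClientHello, which matches the "plain TLS for connection startup and after that, Istio stops caring" hypothesis in the chat.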
[15:38:15] also you are in charge now, so anything that you decide is ok for me [15:38:19] :) [15:38:27] Sure, just coordinating :) [15:39:07] isaranto: [15:39:07] amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1) [15:39:08] amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1) [15:39:17] so it is ROCm version related! [15:39:39] not a great relief but at least we have some clue [15:40:07] the upside is we didn't do anything wrong [15:40:15] the downside is we need to figure this out :) [15:40:46] isaranto: this https://phabricator.wikimedia.org/T362984#9790620 now makes some sense [15:41:02] * isaranto nods [15:41:11] at this point the only hope is that ROCm 6.x fixes it [15:41:14] shall we give 5.6 a try? [15:41:29] lol. exactly [15:41:43] I tried it with the 2.1 base image IIRC [15:41:45] same error [15:42:27] ah yes you said 6.x. ok yeah 5.6 wouldn't make sense cause also torch 2.3.0 isn't there for 5.6 [15:46:38] also strace reports the same problem [15:46:39] 10Lift-Wing, 06Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9790880 (10elukey) Ok finally something that is consistent: NLLB with pytorch 2.2.1 and ROCm 5.7 shows: ` amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1) amdgpu_device_initialize: a... 
[15:55:09] (03CR) 10Bartosz Dziewoński: [C:03+2] Migrate IReadableDatabase::buildGroupConcatField to SelectQueryBuilder [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1027597 (owner: 10Umherirrender) [16:09:32] isaranto: I think that if we solve why print(os.access('/dev/dri/renderD128', os.F_OK)) returns False we should find the root cause [16:09:38] permissions are ok [16:09:50] os.access to /dev/dri and /dev seems good [16:10:07] but renderD128 yields False [16:11:04] and if I do `"import os; open('/dev/dri/renderD128', 'r')"` it works [16:11:13] namely no exception returned [16:11:44] ack [16:12:42] The weird thing is that access("...", F_OK) _only_ checks if the file exists, nothing more [16:12:48] (03CR) 10Ilias Sarantopoulos: [C:03+2] utils: slow function execution wrapper (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1024425 (https://phabricator.wikimedia.org/T362663) (owner: 10Ilias Sarantopoulos) [16:12:54] (cf. https://man7.org/linux/man-pages/man2/access.2.html) [16:13:45] There is also a note about suid-running programs in that section, but AIUI, that is not relevant to us. But it makes me wonder about container stuff [16:15:40] also {R,W,X}_OK lead to consistent results (True, True, False in our case) [16:15:51] that is totally strange [16:18:30] 10Lift-Wing, 06Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9791075 (10klausman) From https://man7.org/linux/man-pages/man2/access.2.html > access() checks whether the calling process can access the file pathname. If pathname is a symbolic link, it i... [16:18:43] 10Lift-Wing, 06Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9791079 (10elukey) This is totally strange: ` root@deploy1002:~# kubectl exec nllb-200-gpu-predictor-00001-deployment-7fc4b4f798-9sv28 -n experimental -- ls -l /dev/dri total 0 crw-rw---- 1... 
[16:20:12] elukey: One detail: what changed between the working model and the nonworking one is that access() is called at all. It might well be that if the working one did an access() call, it'd presume the file didn't exist, either [16:21:42] klausman: yep yep I agree [16:21:51] this is why I think it is ROCm related [16:21:58] Unfortunately, googling for the relevant terms so far yields nothing [16:22:15] I am pretty sure we are the first one experiencing the issue [16:22:41] (03Merged) 10jenkins-bot: utils: slow function execution wrapper [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1024425 (https://phabricator.wikimedia.org/T362663) (owner: 10Ilias Sarantopoulos) [16:24:05] and on the bare metal node, 'nobody' (somebody doesn't exist there) returns True for F_OK [16:24:27] regarding rocm: it might also be that linking against a newer glibc/vdso adds that extra access() call, i.e. it's not something rocm changed consciously. [16:27:03] could be yes, this is a good point [16:27:24] but the main question to answer is, in my opinion, why F_OK in the container leads to that result [16:29:09] once more I've done an extensive github/google search on the matte with no results :P [16:29:35] *matter [16:29:48] so I'm just thinking to try torch 2.3.0 with rocm 6.0. wdyt? 
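The F_OK puzzle being debugged here is easy to reproduce as a self-contained probe: compare what os.access() claims against what open() actually allows. On the GPU pod the interesting path is /dev/dri/renderD128 (where the chat reports access() returning False while open() succeeds — exactly the disagreement this probe would surface); the demo below uses a temp file so the sketch runs anywhere. Why the two calls disagree inside the container is the open question in the chat, not something this snippet answers.

```python
import os
import tempfile

def probe(path):
    """Compare os.access() answers with an actual open() attempt for a path.

    os.access() asks the kernel using the real uid/gid, while open() uses the
    effective ids and the full permission machinery, so the two can diverge.
    """
    flags = {name: os.access(path, flag)
             for name, flag in [("F_OK", os.F_OK), ("R_OK", os.R_OK),
                                ("W_OK", os.W_OK), ("X_OK", os.X_OK)]}
    try:
        with open(path, "rb"):
            openable = True
    except OSError:
        openable = False
    return flags, openable

# On the pod one would call probe("/dev/dri/renderD128"); demo on a temp file:
with tempfile.NamedTemporaryFile() as f:
    flags, openable = probe(f.name)
```

For an ordinary readable file both views agree; the bug report in this log is precisely the case where they do not.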
[16:30:38] I haven't found anything relevant in the release notes https://github.com/ROCm/ROCm/releases, although a lot of them just say "several fixes to HIP" which ofc doesn't help [16:33:44] yes yes let's try [16:34:33] docker info says [16:34:33] "PathOnHost": "/dev/dri/renderD128", [16:34:33] "PathInContainer": "/dev/dri/renderD128", [16:34:34] "CgroupPermissions": "rw" [16:34:51] now maybe "rw" is not enough for F_OK, for whatever reason [16:35:13] (03Merged) 10jenkins-bot: Migrate IReadableDatabase::buildGroupConcatField to SelectQueryBuilder [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1027597 (owner: 10Umherirrender) [16:36:06] (03PS1) 10Ilias Sarantopoulos: llm: bump torch and rocm 5.7 versions (2.2.1-rocm5.7) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1031020 (https://phabricator.wikimedia.org/T362984) [16:36:34] elukey: are there any special values for CgroupPermissions? [16:36:45] can't find the docs for that [16:36:53] Lovely. [16:37:03] those values are set by the k8s device plugin and the kubelet, in theory [16:37:22] I created the patch above --^ will wait for the image to build to report on layer sizes etc. 
[16:37:56] I'm going afk for the day folks, enjoy the rest of your day/evening [16:39:42] I see some mentions of `rwm`, but no explanation what the m is for [16:40:52] isaranto: o/ [16:41:03] have a nice evening, Ilias [16:41:31] (03CR) 10Elukey: "Nit on the commit msg, the versions are wrong :)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1031020 (https://phabricator.wikimedia.org/T362984) (owner: 10Ilias Sarantopoulos) [16:42:55] ok my brain is totally fried at this point [16:43:59] will restart tomorrow :) Have a nice rest of the day folks [16:44:10] \o cya tomorrow [16:44:56] (03PS2) 10Ilias Sarantopoulos: llm: bump torch and rocm 6.0 versions (2.3.0-rocm6.0) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1031020 (https://phabricator.wikimedia.org/T362984) [16:46:03] ah, found what m stands for: mknod, i.e. the container is allowed to create devices. So not useful for us. [16:46:14] (found on https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt) [16:46:40] Heading out now as well \o