[06:53:54] good morning!
[06:54:42] Good morning
[07:00:59] kalimera! (good morning!)
[07:01:19] isaranto: do you have more info about why outlink got to zero pods yesterday?
[07:04:09] btw I rechecked helmfile diff for all ml-services and nothing seccomp-related is pending
[07:04:45] no, I didn't have the chance to dig into it. I was just going afk so I was focused on just fixing it. I saw that there was a revision but no pod.
[07:04:54] I don't recall anything about the deployment though
[07:15:15] very weird, maybe it was something autoscaling-related
[07:15:31] anyway, we should be good now
[07:19:18] sorry, I should have kept some logs. At first sight I didn't see anything though; it seemed weird, so I guess it required more digging
[07:22:12] nono, it was just curiosity, it is totally my bad to have missed the seccomp rollout in outlink
[07:22:53] I'll add a note to double-check all namespaces in the procedure that we'll follow for eqiad
[07:24:43] no worries at all, thanks for all the work!
[07:24:56] np! <3
[07:26:15] 06Machine-Learning-Team, 10Editing-team (Tracking): Peacock detection model GPU deployment returns inconsistent results - https://phabricator.wikimedia.org/T393154#10798989 (10isarantopoulos) Let's create a plan and test some things in order to debug this. I'm starting with some suggestions. In every test we s...
[07:37:20] good morning
[07:37:54] good morning!
[07:44:04] georgekyz: o/ check my comment in https://phabricator.wikimedia.org/T393154#10798989 and let me know if you agree or have a different plan for this
[07:44:22] isaranto: I am on it
[07:45:01] my main suggestion is to try to capture all our efforts while debugging this, so that we reach a proper conclusion and learnings which we'll use in the future
[07:45:14] and ofc make some documentation out of it
[07:45:18] isaranto: looks good, I think we have already proven that we have deterministic results both on gpu and cpu
[07:45:36] ok! thanks
[07:45:42] I will document everything in each experiment
[08:20:56] 10Lift-Wing, 06Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10799184 (10kevinbazira) In T385173#10737743, we ran inference latency benchmarks using the upstream ROCm-vLLM image to understand how vLLM performs when serving the `aya-expanse-32b` model on an...
[08:22:36] o/ morning morning
[08:22:36] here are the benchmarking results of vLLM serving the `aya-expanse-32b` model in the `wmf-debian-vllm` image: https://phabricator.wikimedia.org/T385173#10799184
[08:22:36] the ported image has similar performance to the upstream image.
[08:23:14] I am trying to build the blubber image in ml-lab but I am getting a permission denied error:
[08:23:30] https://www.irccloud.com/pastebin/kaAbM0tH/
[08:24:11] georgekyz: you need to sudo :)
[08:24:40] elukey@ml-lab1001:~$ ls -l /var/run/docker.sock
[08:24:40] srw-rw---- 1 root docker 0 Apr 29 08:15 /var/run/docker.sock
[08:24:51] thnx elukey
[08:24:55] so either root or a member of the docker group can use it
[08:25:27] I did not know that I could use sudo powers in ml-lab
[08:25:28] :p
[08:25:39] use it with extreme care :D
[08:26:51] I am not sure what my sudo pass is tho :P
[08:27:10] it is passwordless, your ssh key basically grants you the sudo
[08:27:11] so I am using it with extreme care :P :P
[08:27:51] it doesn't seem to work like that :(
[08:28:02] ah wait no, I thought that ml-team-admins granted sudo
[08:28:06] I am checking puppet and it doesn't
[08:28:13] okok now I get it, it asks for the password
[08:28:19] yeap....
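For context, the socket listing above means only root or members of the docker group can talk to the Docker daemon. A minimal sketch of checking and manually granting that access follows; the usermod invocation is an illustration of a local, non-Puppet fix (the discussion below moves toward doing it properly via Puppet instead):

    # The socket is owned by root:docker, so access needs root or docker-group membership
    ls -l /var/run/docker.sock   # srw-rw---- 1 root docker ... /var/run/docker.sock
    id -nG                       # list the groups the current user belongs to
    # Manual way to grant access (an assumption; takes effect only after re-logging in):
    sudo usermod -aG docker "$USER"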
[08:28:49] but I never set up a password for sudo
[08:29:17] okok, so the easiest would probably be to change the /var/run/docker.sock ownership, or to automatically add folks to the docker group
[08:30:40] is it something that I could do?
[08:31:04] lemme check puppet; on another host (build2001) where we have docker I am in the docker group, so there may be a setting
[08:32:41] okok, there is a quick way in admin/data.yaml, but I think we need to first figure out what the plan is for using docker on ml-lab
[08:32:52] georgekyz: why not just pull the image from the registry to test first? unless you want to try sth else
[08:53:22] https://www.irccloud.com/pastebin/6K5nLNqn/
[09:12:02] hmm wait lemme check
[09:13:29] ok found it. georgekyz: are you using ml-lab1002 or ml-lab1001? docker is only installed on ml-lab1002, there you shouldn't have an issue
[09:15:18] it is installed on 1001 as well
[09:16:23] on both the problem is the same, namely the docker socket needs to be accessed by root or members of the docker group
[09:16:42] we already have automation in puppet to say "this group of uids will get access to docker"
[09:16:50] ack, you're right, the error message george posted indicates that as well
[09:17:03] but IIRC the docker install step was a test, we never really decided if we wanted to officially install it
[09:17:14] if so, there are some decisions to make, namely who can run docker etc..
[09:17:29] because running docker means escalating to root
[09:17:55] and we cannot allow, imho, everybody using ml-lab to escalate to root that way
[09:18:06] maybe ml-team-admins only
[09:18:17] but it needs to be confined to some people
[09:19:16] 10Lift-Wing, 06Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10799385 (10kevinbazira) Hi @elukey, following your suggestion in T385173#10538744, we ported the upstream Ubuntu based [[ https://hub.docker.com/layers/rocm/vllm/rocm6.3.1_mi300_ubuntu22.04_py3.1...
[09:23:27] elukey: how about (for now) adding the ML team people to the docker group?
[09:30:09] 10Lift-Wing, 06Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10799408 (10isarantopoulos) I would suggest the following in order to make reviewing a bit easier: @kevinbazira can you open a new MR in the same repo and add a "polished" version of the dockerfil...
[09:37:51] klausman: you can do it properly via admin's data.yaml, we did it for other use cases; my main point is that we need to figure out what we want to do
[09:37:59] make a case etc..
[09:38:08] otherwise we add something manually and we forget
[09:38:40] yeah, that was my bad. I had added Ilias initially and then never made it a Puppet thing.
[09:39:00] that's fine, a small initial test is ok
[09:39:01] I think having the ml admin members be in the docker group is the easiest approach for now.
[09:39:06] but now it is becoming bigger :)
[09:40:34] Yeah, if I hadn't _forgotten about it_...
[09:43:54] let's open a task for the access request etc..
[09:49:25] 06Machine-Learning-Team: Add the ML team to the POSIX group `docker` on the ML lab machines. - https://phabricator.wikimedia.org/T393566 (10klausman) 03NEW
[09:49:31] Opened ^^^ for now
[09:52:14] ack
[10:16:22] sorry folks, I was in a meeting. I am using `ml-lab1001`, which also has docker on it. On both machines I am getting the same error.
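As an aside on isaranto's 08:32 suggestion to pull a prebuilt image from the registry instead of building locally, a hedged sketch of what that would look like; the image name and tag below are hypothetical placeholders, not a real artifact in the WMF registry:

    # Test with a prebuilt image from the WMF registry instead of building on ml-lab
    # (image name/tag are hypothetical -- check the registry for the actual one):
    docker pull docker-registry.wikimedia.org/wikimedia/edit-check:latest
    docker run --rm -it docker-registry.wikimedia.org/wikimedia/edit-check:latest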
[10:18:02] I can quickly add you to the docker group, but you will have to re-ssh in
[10:18:08] ok
[10:18:11] thnx
[10:18:32] done on both machines
[10:19:27] danke (thanks)
[10:20:32] Παρακαλώ! ("parakalo", i.e. "you're welcome")
[10:21:18] Ah wait, that's the same false friend as "bitte" <-> "you're welcome"/"please"
[10:21:37] Or is it? Now I have confused myself :D
[10:53:02] hahaha, no, you are correct, it is used mainly like "bitte", so your answer was right
[10:53:22] so as an answer to "thank you", you say parakalw.
[10:54:25] It is used mainly like the Dutch "Alstublieft", which you can use as a response to "dankje" but also as "please".
[11:00:02] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10799694 (10kevinbazira) >>! In T385173#10799408, @isarantopoulos wrote: > I would suggest the following in order to make reviewing a bit easier: @kevinbazira can you open a...
[12:13:02] georgekyz: interesting, I did not know that languages other than German (and Swamp German ;)) had the please/you're-welcome overlap
[12:37:07] georgekyz: o/ for archival purposes, here is the command we used to run the edit-check model-server on ml-lab1002 with GPU and a mounted model volume: https://phabricator.wikimedia.org/P75872
[12:37:40] kevinbazira: thnx so much mate! I will include it in the phab ticket with the experiments
[12:38:06] sure sure. np!
[13:01:59] 06Machine-Learning-Team, 10LDAP-Access-Requests, 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10800232 (10isarantopoulos)
[13:04:50] 06Machine-Learning-Team, 10LDAP-Access-Requests, 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10800237 (10isarantopoulos)
[13:07:31] 06Machine-Learning-Team, 10LDAP-Access-Requests, 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10800246 (10isarantopoulos) I approve adding Bartos...
[20:50:52] 06Machine-Learning-Team, 06Discovery-Search, 10MediaWiki-Search: Build and enable thesaurus / synonym list for search - https://phabricator.wikimedia.org/T85770#10802450 (10TJones) @Jack_who_built_the_house, as things currently stand, I don't think this is the right ticket for what you are proposing. This ti...
[22:27:22] 07artificial-intelligence, 10WikiCite: Reference recommender system - https://phabricator.wikimedia.org/T155846#10802749 (10SEgt-WMF) Hi @Harej ! I'm interested in both the status of this amazing proposal (and if there are any plans of continuing the work for Librarybase as well!)
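Referring back to the 12:37:07 message: the actual command is recorded in the P75872 paste. Purely as an illustration of its general shape, a model-server run on an AMD GPU host with a mounted model volume tends to look like the sketch below; the image name, paths, and port are assumptions, not the recorded command:

    # Hypothetical shape of the GPU model-server run (the real command is in P75872):
    #   --device=/dev/kfd --device=/dev/dri  exposes the AMD (ROCm) GPU to the container
    #   -v host-path:container-path          mounts the model files into the container
    docker run --rm \
      --device=/dev/kfd --device=/dev/dri \
      -v /srv/models/edit-check:/mnt/models \
      -p 8080:8080 \
      edit-check-model-server:latest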