[06:05:19] Machine-Learning-Team: [LLM] Use Flash attention 2 for inference - https://phabricator.wikimedia.org/T371344 (isarantopoulos) NEW
[06:05:21] aloha!
[07:01:47] Machine-Learning-Team: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10026270 (isarantopoulos)
[07:39:30] Machine-Learning-Team, Content-Transform-Team, Research, Patch-For-Review: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455#10026369 (isarantopoulos) a:isarantopoulos
[07:39:47] kevinbazira: o/ is there anything I can help with for logo detection?
[07:41:02] isaranto: o/ yes, I am pushing a patch in a bit. your review will be helpful. thanks!
[07:41:11] ok!
[07:44:24] here is the patch: https://gerrit.wikimedia.org/r/1058031
[07:53:56] looks good! let's wait for Tobias to review later as well cause he'll also have to create the namespace
[08:07:52] yep, I noticed the ns is missing: https://phabricator.wikimedia.org/P67036
[09:00:34] Machine-Learning-Team, Content-Transform-Team, Research, Patch-For-Review: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455#10026587 (isarantopoulos) Update: I'm having some issues while building the Lift Wing service which are caused by dependencies. I'm getting t...
[09:14:20] おはよう! (good morning!)
[09:25:15] o/
[09:26:23] I'm convinced to take mandarin classes!
[09:26:49] I found this to get started https://www.coursera.org/learn/learn-chinese , don't know if it is good. but suggestions welcome!
[09:31:27] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1058097 is ready for review
[09:38:59] thanks, lgtm!
[09:45:31] klausman: o/ thank you for creating the ns. should I proceed with the deployment?
[09:45:50] No, there is still a part missing (secrets)
[09:46:37] ok ok. I'll wait for your confirmation before proceeding ...
[09:47:06] the longest bit is running the puppet agent on the deployment servers :)
[09:48:47] isaranto: I personally love https://www.chineasy.com/
[09:49:43] also IIUC Tobias' morning is japanese :D
[09:49:53] Yes, it is :)
[09:49:57] 早上好! (good morning!)
[09:50:19] this is mandarin, basically 90% of what I can say now :D
[09:50:30] elukey: thanks! I'll give that a try. I do prefer interactive learning over videoz
[09:50:57] elukey: while you're here, how do the secrets from the PM make it to the deployment servers? run-p-a should be enough, right?
[09:51:00] well I would be happy if I can just manage to distinguish characters. I feel sooo dumb
[09:51:43] isaranto: it takes a ton of time, but it opens the door to exploring a big new culture, this is the interesting bit
[09:52:06] klausman: puppet private repo -> deploy100x?
[09:52:11] yeah
[09:52:21] well, no, the actual secrets on the Puppetmaster
[09:52:26] yep they get rendered under /etc/helmfile-something/tec..
[09:52:55] those secrets are on the private repo though
[09:53:29] check /etc/helmfile-defaults on a deployment server
[09:53:45] they are puppet templates rendered in there, using secrets from the private repo
[09:53:54] you have that path in your helmfile.yaml config
[09:54:00] so it gets picked up etc..
[09:54:06] (when you helmfile-deploy I mean)
[09:54:30] So we have a new service and new NS, logo-detection
[09:54:46] and when I try to diff it, I get an error that User "logo-detection-deploy" cannot list resource "secrets" in API group "" in the namespace "logo-detection"
[09:55:14] The NS doesn't exist yet, but even if I create it manually, it doesn't seem to matter
[09:55:20] okok I think you probably forgot to add secrets to the puppet private repo
[09:55:29] (this is all in staging, so far)
[09:55:39] as in the mock-private one?
[09:56:39] The actual secrest repo I edit (and committed, e2d0e7d13c56f05a9407df08333c8795433f29ea)
[09:57:12] what is secrest?
[09:57:21] secrets*
[09:57:32] /srv/private on pm1001
[09:57:44] okok lemme check
[10:00:34] ah right I think the namespace config is missing from admin_ng's ml-serve.yaml
[10:00:49] ah, good point, I'll make a patch
[10:07:16] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1058105
[10:09:06] I wonder if we also need extra SANs (l.99 in that file)
[10:09:48] ah, but changeprop won't call this service, so we good.
[10:09:55] mm good point
[10:09:57] wait a sec
[10:10:19] okok looks good, even if I'd add it for consistency
[10:10:28] will add it, sec
[10:10:32] if we'll ever use changeprop with it then finding the bug will be a nightmare
[10:11:45] yeah
[10:12:05] and it's not like having it in there costs us much.
[10:24:00] kevinbazira: logo-detection NS should be ready for deployment in staging
[10:31:16] klausman: thanks. going to deploy now ..
[10:39:02] * isaranto afk lunch
[10:40:02] 69/me same
[10:40:05] oops :)
[10:43:15] logo-detection deployed on staging codfw: https://phabricator.wikimedia.org/P67055
[10:43:27] going to deploy to prod ...
[10:51:43] klausman: looks like there are missing secrets in prod. running `helmfile -e ml-serve-eqiad diff` returns an error.
[10:51:59] yeah, I hadn't deployed there yet, doing that now
[10:53:53] and it failed, because it pulled in the broken knative change from yesterday
[10:55:51] trying to finagle this...
[10:56:41] ok, eqiad should be good
[11:03:27] yep, I've deployed on eqiad and the pod is up and running.
[11:04:10] codfw should be good as well
[11:08:59] hi folks o/
[11:09:47] hi Aiko :)
[11:22:15] aiko: o/
[11:22:55] klausman: isaranto: thanks for your help. the logo-detection isvc is up and running in prod: https://phabricator.wikimedia.org/P67063
[11:23:09] most excellent
[11:23:40] I shall return to my lunch now :)
[11:26:11] Machine-Learning-Team, Structured-Data-Backlog: Deploy logo-detection model-server to LiftWing production - https://phabricator.wikimedia.org/T370757#10026957 (kevinbazira) a:kevinbazira
[11:42:20] niiice!
[11:49:01] Machine-Learning-Team, Structured-Data-Backlog: Deploy logo-detection model-server to LiftWing production - https://phabricator.wikimedia.org/T370757#10027062 (kevinbazira) @mfossati, the logo-detection inference service is now [[ https://phabricator.wikimedia.org/P67063 | live ]] in LiftWing production....
[13:01:37] So in order to solve the issue with numpy described in the article quality task https://phabricator.wikimedia.org/T360455#10026587 I was thinking the following:
[13:03:38] TL;DR kserve requires numpy <2.0.0 and we need at least 2.0.0 to be compatible and not have issues.
[13:06:24] - use wmf kserve fork and bump numpy requirement to "^2.0.0"
[13:06:24] - use this to install kserve as it would allow latest versions of numpy to be installed
[13:06:24] - now the issue is with the numpy version pyopencl -> https://phabricator.wikimedia.org/P67075 so I'm going to upgrade that as well and test it but it affects all services
[13:07:11] the alternative would be to request the model to be pickled using an older version of numpy but that feels like taking a step backwards
[13:08:12] ok this works
[13:09:33] an alternative will be to just update the pyopencl reqs.txt in the model instead of the python/reqs.txt
[13:09:35] (end of monologue :P ) - I'll add this to the task
[13:14:07] ah, the neverending tribulations of dep hell;
[13:49:51] Good morning all
[13:50:06] Morning, Chris
[13:54:20] mooorning!
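(editor's note on the numpy constraint discussed at 13:03-13:06) A minimal sketch of why the upstream kserve pin and the model's requirement cannot both be met, and why bumping the fork to "^2.0.0" resolves it. It assumes Poetry-style caret semantics, where "^2.0.0" means ">=2.0.0,<3.0.0". The helper names and the tiny constraint parser are hypothetical and only handle plain numeric versions; this is not how pip or Poetry resolve dependencies.

```python
import operator

# Comparison operators supported by this illustrative parser.
OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
}


def parse_version(text):
    """Turn '2.0.0' into (2, 0, 0); only plain numeric versions are handled."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, constraint):
    """Check a version against a comma-separated constraint such as '>=2.0.0,<3.0.0'."""
    candidate = parse_version(version)
    for clause in constraint.split(","):
        clause = clause.strip()
        # Two-character operators come first in OPS so '<=' is not read as '<'.
        for symbol, compare in OPS.items():
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):].strip())
                if not compare(candidate, bound):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


def compatible(version, *constraints):
    """True only if the version meets every constraint."""
    return all(satisfies(version, c) for c in constraints)


UPSTREAM_KSERVE = "<2.0.0"      # as stated in the chat
FORK_KSERVE = ">=2.0.0,<3.0.0"  # assumed expansion of "^2.0.0" (Poetry caret)
MODEL_NEEDS = ">=2.0.0"         # "we need at least 2.0.0", per the chat

for candidate in ("1.26.4", "2.0.0"):
    print(
        candidate,
        "upstream+model:", compatible(candidate, UPSTREAM_KSERVE, MODEL_NEEDS),
        "fork+model:", compatible(candidate, FORK_KSERVE, MODEL_NEEDS),
    )
```

Under these assumptions no numpy version passes the upstream pin and the model's requirement together, while 2.0.0 passes once the fork's range is used. The pyopencl pin mentioned at 13:06 would be one more constraint passed to `compatible`.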
[13:55:47] Machine-Learning-Team, Content-Transform-Team, Research, Patch-For-Review: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455#10027725 (Isaac) > After checking Isaac's notebook I found that the model has been trained using numpy 2.0.0, so ideally this would be the n...
[13:58:55] Machine-Learning-Team: [LLM] Multi-GPU Inference - https://phabricator.wikimedia.org/T371384 (isarantopoulos) NEW
[14:28:38] Machine-Learning-Team: Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that uses an inference optimization engine in production. - https://phabricator.wikimedia.org/T371395 (calbon) NEW
[14:28:42] Machine-Learning-Team: Goal 2: People outside the ML team can ssh into an ml-lab machine, run a Jupyter Notebook, and run PyTorch powered by a GPU. - https://phabricator.wikimedia.org/T371396 (calbon) NEW
[14:28:43] Machine-Learning-Team: Goal 3: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services. - https://phabricator.wikimedia.org/T371397 (calbon) NEW
[14:28:44] Machine-Learning-Team: Goal 4: Support product teams in deploying production models. - https://phabricator.wikimedia.org/T371398 (calbon) NEW
[14:37:19] Machine-Learning-Team: [LLM] Gemma2 in staging: HIP out of memory - https://phabricator.wikimedia.org/T370615#10028055 (isarantopoulos) p:Triage→High
[14:39:03] Machine-Learning-Team: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10028059 (isarantopoulos) p:Triage→High
[14:39:06] Machine-Learning-Team: [LLM] Explore low_cpu_mem_usage option when loading model in transformers - https://phabricator.wikimedia.org/T370935#10028063 (isarantopoulos) p:Triage→High
[14:56:53] Lift-Wing, Machine-Learning-Team: Request to update Readability model on Lift Wing - https://phabricator.wikimedia.org/T369712#10028191 (calbon) a:AikoChou
[17:04:55] Machine-Learning-Team, Content-Transform-Team, Research, Patch-For-Review: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455#10028826 (isarantopoulos) @Isaac We're going to solve the numpy issue by relaxing the kserve restriction by using our [[ https://github.com/...
[17:05:50] * isaranto afk - have a nice evening!
[17:37:14] Machine-Learning-Team, Content-Transform-Team, Research, Patch-For-Review: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455#10028957 (FNavas-foundation) @Isaac ` In theory we could include all of these in every response but getting the score and label does requi...