[03:34:46] (CR) AikoChou: [C:+1] Revert "llm: remove model-server" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1092873 (owner: Ilias Sarantopoulos)
[03:40:24] Machine-Learning-Team, Recommendation-API, LPL Essential (LPL Essential 2024 Nov-Dec): Create logstash dashboard for recommendation-api-ng - https://phabricator.wikimedia.org/T380146#10338960 (santhosh) Open→Resolved Thanks @isarantopoulos For our immediate needs this seems sufficient.
[03:52:42] Machine-Learning-Team, Recommendation-API, LPL Essential (LPL Essential 2024 Nov-Dec): Create logstash dashboard for recommendation-api-ng - https://phabricator.wikimedia.org/T380146#10338965 (abi_) p:Triage→High
[04:21:34] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Log and export preprocess size in inference services as a prometheus metric - https://phabricator.wikimedia.org/T374034#10338970 (achou) I tried to request the articlequality model server's metrics and received the following: ` $ curl "https://infer...
[08:13:53] o/ good morning
[09:20:30] Lift-Wing, Machine-Learning-Team, OKR-Work, Patch-For-Review: Request to host article-country model on Lift Wing - https://phabricator.wikimedia.org/T371897#10339186 (kevinbazira) > I saw the comment in the schema about `/* can't be one for all countries */` -- can you tell me more about that? I...
[09:27:13] o/ In order to share this --^ communication with Isaac to test the latest endpoint, I have edited the article-country isvc deployment config directly in the experimental ns.
[09:27:13] I pushed a patch for this change here: https://gerrit.wikimedia.org/r/1093006
[09:34:33] ack!
[09:39:48] thanks for the review, Ilias :)
[09:47:39] :D
[10:15:24] https://grafana.wikimedia.org/goto/ocpdvInHR?orgId=1 Btw, I have fixed the GPU dashboard to only show our GPU machines now. The link I shared shows e.g. ml-lab1001 during the time we had our exploration session with Research
[10:27:31] nice, thanks!
[10:35:19] Also mentioned it to Martin on Slack, since he was looking for a dash
[11:38:52] * isaranto lunch!
[11:52:15] Machine-Learning-Team, MediaWiki-extensions-ORES, Edit-Review-Improvements-RC-Page, MediaWiki-Recent-changes, Moderator-Tools-Team: Expose ORES topics in recent changes filters - https://phabricator.wikimedia.org/T245906#10339601 (Samwalton9-WMF) Lift Wing API for this - https://api.wikimedia...
[12:11:04] * klausman lunch as well
[14:29:52] (CR) Klausman: [C:+1] Revert "llm: remove model-server" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1092873 (owner: Ilias Sarantopoulos)
[14:50:34] isaranto: I have made a fix for the wrong limitRanges thing (also verified): https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1093354
[15:43:43] klausman: o/ be careful that you also have "experimental" settings in ml-serve.yaml, not sure if it is needed in both places. It may get confusing in the future to have the settings in two files etc.
[15:52:00] Good point. I think Ilias was only needing them in staging for now, but yes, for the long term it should be more unified.
[16:03:58] thanks for that. yeah we only need it for ml-staging. we wouldn't want to allow a 75GB pod in prod (at least not for now)
[16:10:01] sure but do we have 'experimental' in prod nowadays?
[16:10:28] if yes then we can keep both, otherwise it is surely confusing to keep configs in two files
[16:10:42] for sure somebody will make a change on one forgetting the other at some point :D
[16:12:58] we do have it in prod but we don't use it or intend to use it
[16:16:17] is there a way to just delete the prod experimental ns? I feel that would end any confusion
[16:20:43] I vaguely remember we created it because there was one corner case for testing which we could only do in prod.
[16:26:27] we created it the last time since we wanted to test the only GPU on ml-serve1001
[16:26:45] we can probably drop experimental in prod then
[16:26:56] ah, right.
[16:27:16] I'll make a patch tomorrow to remove the extra bits, and delete the NSes in prod
[16:28:18] ack. thanks!
[16:30:44] Machine-Learning-Team: Update output schema for reference risk model - https://phabricator.wikimedia.org/T378939#10340820 (achou) Open→Resolved New changes have been deployed to production.
[16:30:56] okkk
[16:32:41] thanks for the support Luca!
[16:36:33] <#
[16:36:35] <3
[18:06:10] I've been building this docker image for over 90 minutes! https://github.com/ROCm/vllm/blob/main/Dockerfile.rocm
[18:06:11] ouch
[18:06:33] going afk - have a nice evening/night/rest of day!