[06:22:21] Good morning o/
[07:08:21] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9709131 (isarantopoulos) I managed to deploy [[ https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 | Mistral-7B-Instructv0.2 ]] on ml-staging using the GPU and 35GB o...
[09:13:28] (PS5) Kevin Bazira: logo-detection: add KServe custom model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1017453 (https://phabricator.wikimedia.org/T361803)
[09:14:04] (CR) Kevin Bazira: logo-detection: add KServe custom model-server (12 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1017453 (https://phabricator.wikimedia.org/T361803) (owner: Kevin Bazira)
[09:25:40] morning o/
[09:29:06] \o
[10:07:58] * isaranto afk for a bit
[10:09:55] (CR) Elukey: "Thanks for the follow up! Asked a follow up question, we can chat over IRC in case you prefer!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1017453 (https://phabricator.wikimedia.org/T361803) (owner: Kevin Bazira)
[10:19:53] (CR) AikoChou: logo-detection: add KServe custom model-server (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1017453 (https://phabricator.wikimedia.org/T361803) (owner: Kevin Bazira)
[10:32:34] * isaranto back!
[11:34:46] (CR) AikoChou: logo-detection: add KServe custom model-server (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1017453 (https://phabricator.wikimedia.org/T361803) (owner: Kevin Bazira)
[11:53:27] hello folks!
[11:55:06] hey Luca!
[11:56:05] I am going to the physio in an hour, a bit unexpected but my knee started to act funky and I am going to get it checked :D
[11:58:31] hope it is nothing to worry about! 🤞
[11:58:52] hope not, it hurts a little, and it has been a while so better to get it checked
[12:44:37] * elukey afk for a bit!
[13:12:35] o/
[13:22:11] (PS6) Kevin Bazira: logo-detection: add KServe custom model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1017453 (https://phabricator.wikimedia.org/T361803)
[13:23:14] (CR) Kevin Bazira: logo-detection: add KServe custom model-server (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1017453 (https://phabricator.wikimedia.org/T361803) (owner: Kevin Bazira)
[13:56:32] back!
[14:06:54] folks if you are ok I'd deploy the new istio changes to move ml-staging from api-ro to the new k8s endpoint
[14:07:01] mw-api-int-ro.discovery.wmnet
[14:07:46] due to how things are wired, I need to:
[14:07:58] 1) upgrade the virtual service config for mediawiki (istio)
[14:08:32] the above will break the isvcs basically, since we'll need to deploy something like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1018959/3/helmfile.d/ml-services/article-descriptions/values-ml-staging-codfw.yaml for all isvs
[14:08:36] *isvcs
[14:10:25] 2) deploy all the isvc changes for ml-staging
[14:10:42] that is fine, shouldn't take much, but we'll have staging broken for the next say one/two hours
[14:10:45] is it ok?
[14:10:49] Cc: aiko, isaranto, kevinbazira
[14:12:22] elukey: all ok from my side! I was working on ml-staging but I'm going to work locally for the time being and I can continue there on Monday. Is there anything I can do to help?
[14:12:52] elukey: o/ I am not using staging atm. so it's ok with me.
[14:12:58] ack thanks!
[14:13:18] elukey: just saw your comment on the environment variables :'(
[14:13:23] all good I should just test if it works, nothing horribly complicated (hopefully)
[14:13:38] claime: don't worry I'll fix the changes, you already did too much :)
[14:13:49] <3
[14:22:17] elukey: I didn't ask but do you know roughly the RPS from your services?
[14:24:26] claime: don't recall exactly but below 100 rps IIRC
[14:24:31] ok cool
[14:24:36] Nothing to do on my end then
[15:10:34] it is so bad in staging that deploying article-descriptions fail due to not enough cpus :D
[15:31:09] kevinbazira, isaranto - one thing to double check - I see article-descr pods both in experimental and article-description namespaces (on ml-staging)
[15:31:17] can we drop the one in experimental?
[15:31:24] to gather some free cpus
[15:31:56] I was aware, left it there to experiment and forgot it eventually. Yes we can drop it
[15:31:58] elukey: yes we can drop experimental
[15:32:38] super filing a patch!
[15:33:57] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1019294
[15:34:18] if anybody has a min :) --^
[15:37:16] +1
[15:37:53] thanksss
[15:41:03] (CR) Ilias Sarantopoulos: "Thanks for the fixes Kevin!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1017453 (https://phabricator.wikimedia.org/T361803) (owner: Kevin Bazira)
[15:41:51] elukey: I can deploy the experimental ns (if you already haven't)
[15:42:06] isaranto: ah yes go ahead!
[15:42:11] also if mistral is failing (it will because of the GPU ) don't worry about it I'll fix it
[15:42:12] ack
[15:42:19] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1018963 is pending
[15:42:29] needs to be rebased on top of the removal of article-descr
[15:43:00] shall I do it?
[15:43:06] if you want <3
[15:43:22] on it
[15:45:55] elukey: done!
[15:46:53] +1ed!
[15:48:18] shall I merge and deploy? any way to validate the CIDRs? (other than checking files in the repo?)
[15:49:33] elukey: --^
[15:50:08] you can merge, the ips are good
[15:50:23] if you want to double check you can do a dig -x $ips on deploy1002
[15:51:24] ok all revscorings deployed to staging
[15:51:27] ack, thanks
[15:54:03] I deployed both changes in experimental
[15:54:29] super
[15:55:44] I'm logging off, have a nice weekend folks o/
[15:55:51] the prod upgrade will be interesting, I think we'll need to depool one dc at the time
[15:55:54] o/
[15:55:55] you too!
[15:58:56] Machine-Learning-Team, MW-on-K8s, serviceops, SRE, Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9710449 (elukey) Current status: * all services deployed in ml-staging, need to double check that all the pods are running but so far I didn't no...
[15:59:09] Machine-Learning-Team, MW-on-K8s, serviceops, SRE, Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9710450 (elukey) a:Clement_Goubert→None
[16:18:30] going afk for the weekend, have a nice rest of the day (and weekend) folks!
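
Note on the CIDR check discussed at 15:48-15:50: the suggestion in the log is to run `dig -x $ips` on deploy1002. As a rough offline complement (a sketch, not the team's tooling), the Python below parses each CIDR strictly and prints the PTR name that a reverse lookup such as `dig -x` would query for its first address. The function name `check_cidrs` and the sample addresses (taken from the RFC 5737 documentation range) are assumptions for illustration only; the real CIDRs are the ones in the deployment-charts patch.

```python
import ipaddress


def check_cidrs(cidrs):
    """Return (normalized_network, ptr_name) pairs for each CIDR string.

    Raises ValueError if a CIDR is malformed or has host bits set,
    which catches typos such as 192.0.2.1/24 before a deploy.
    """
    results = []
    for cidr in cidrs:
        # strict=True rejects networks whose host bits are not zero.
        net = ipaddress.ip_network(cidr, strict=True)
        # reverse_pointer is the DNS name used for a PTR (reverse) query.
        ptr_name = net.network_address.reverse_pointer
        results.append((str(net), ptr_name))
    return results


if __name__ == "__main__":
    # Placeholder addresses, NOT the real mw-api-int-ro IPs.
    for net, ptr in check_cidrs(["192.0.2.10/32", "198.51.100.0/24"]):
        print(f"{net}: PTR name {ptr}")
```

This only confirms that the entries are well formed; whether they resolve to the expected hosts still needs the actual reverse lookup from a host with access to the internal DNS.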