[04:26:39] (PS4) Santhosh: Add language identification service [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/932828 (https://phabricator.wikimedia.org/T340507)
[04:34:00] (CR) CI reject: [V: -1] Add language identification service [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/932828 (https://phabricator.wikimedia.org/T340507) (owner: Santhosh)
[04:37:27] (PS5) Santhosh: Add language identification service [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/932828 (https://phabricator.wikimedia.org/T340507)
[04:37:33] (CR) Santhosh: Add language identification service (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/932828 (https://phabricator.wikimedia.org/T340507) (owner: Santhosh)
[04:44:31] (CR) CI reject: [V: -1] Add language identification service [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/932828 (https://phabricator.wikimedia.org/T340507) (owner: Santhosh)
[05:52:24] Machine-Learning-Team, WMF-General-or-Unknown, I18n, NewFunctionality-Worktype, Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (elukey)
[08:06:12] I've installed Java 8 security updates on ml-cache, can you please take care of roll-restarting Cassandra?
[08:06:46] (I added a generic sre.cassandra.roll-restart cookbook some time ago, can be used for this)
[08:09:40] moritzm: sure!
[08:10:03] ack
[09:23:13] Machine-Learning-Team, Gerrit-Privilege-Requests: Grant ML Team members +2 rights to the recommendation-api repository - https://phabricator.wikimedia.org/T340531 (kevinbazira)
[09:25:26] elukey: want me to do the roll restart or are you doing it?
[09:26:54] o/
[09:26:56] klausman: I need to change some certs and see if the keystore gets reloaded, I'll take it
[09:27:08] Alrighty
[09:32:35] the change is https://gerrit.wikimedia.org/r/c/operations/puppet/+/933224
[09:32:49] that is hopefully the last one
[09:50:36] LGTM
[09:55:59] ok I finally found some swagger api docs for https://recommend.wmflabs.org/types/translation/
[09:58:49] I wonder what "campaign" is for
[09:59:51] there is also https://recommend.wmflabs.org/api/ but requires a post
[10:08:52] ah I see from the swagger ui there is also /spec
[10:11:32] but we don't have anything for https://recommend.wmflabs.org/api/spec
[10:11:43] I am wondering if it is due to the nginx config
[10:11:47] for gap finder
[10:14:23] I see
[10:14:24] [enabled_services]
[10:14:24] gapfinder = True
[10:14:24] translation = True
[10:14:24] related_articles = False
[10:19:51] confronting the APIs (nodejs vs python) they seem to be somehow different
[10:20:31] "You're in a maze of twisty little APIs, all different."
[10:20:39] (I may be showing my age with that quote)
[10:21:01] <- Lunch plus citizenship errand
[10:28:13] https://phabricator.wikimedia.org/T288470#8966938 sigh
[10:38:53] https://arxiv.org/abs/2305.17493v2?utm_source=www.turingpost.com&utm_medium=referral&utm_campaign=generative-ai-hype-is-overwhelming
[10:38:56] wow
[10:39:57] * elukey lunch
[11:09:08] --^ this is interesting! it is a major debate on where things should go. different models rather than just bigger ones
[11:10:03] also there is the concern about new training data. if everyone uses GPT models instead of crowd sourced QA (e.g. stackoverflow)
[11:10:12] anyway lunch for me as well
[12:11:34] Lift-Wing, Machine-Learning-Team, I18n, NewFunctionality-Worktype, Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (Aklapper)
[12:35:13] this seems niiiice https://github.com/vllm-project/vllm
[12:47:51] although not many models are supported at the moment
[12:50:26] I'm seeing something inconsistent in the bloom-3b deployment in ml-staging
[12:53:31] --verbose :)
[12:53:42] it is using an old image version although the latest one has been deployed. Looking a bit deeper the isvc seems to be failing. running `kubectl describe isvc bloom-3b` I figure out an 137 exit code has been issued which concludes to an OOM issue and has a label "ModelLoadFailed". What would be a way to redeploy? manually delete the isvc and resync/redeploy?
[12:54:12] u got me as I was writing :) `isaranto explaine --verbose`
[12:54:18] *explain
[12:54:43] :)
[12:54:55] bloom-3b is running fine though no?
[12:55:32] kubectl get revision -n experimental shows some revisions in Deploying though
[12:55:47] ah I see the generation 4+ have problems
[12:57:02] Error creating: pods "bloom-3b-predictor-default-00006-deployment-ccdfd996b-477rq" is forbidden: [maximum memory usage per Container is 14Gi, but limit is 20Gi, maximum memory usage per Pod is 20Gi, but limit is 22705864704]
[12:58:08] there was a namespace diff not synced
[12:58:10] let's see
[13:00:12] aa
[13:01:30] usually it takes a bit for this kind of changes, I still see the errors
[13:07:26] ack
[13:11:11] isaranto: cleaned up old revisions in a weird state, the pods are coming up
[13:11:21] ok, thanks!
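Note on the numbers in the 12:53 and 12:57 messages: exit code 137 is conventionally 128 + 9, i.e. the process was ended by SIGKILL, which is consistent with the OOM reading in the chat. The pod limit in the error is given in raw bytes, so it is not obvious at a glance how far it sits above the 20Gi cap. The following is only a rough sketch to check both figures; the helper names `to_bytes` and `signal_from_exit_code` are made up for illustration, only the binary suffixes Ki/Mi/Gi/Ti are handled, and a POSIX platform is assumed for the signal lookup.

```python
import signal

# Binary suffixes used in Kubernetes memory quantities (subset, enough for this sketch).
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '20Gi', or a plain byte count, to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def signal_from_exit_code(code: int):
    """Return the signal name behind an exit code above 128, else None."""
    if code > 128:
        return signal.Signals(code - 128).name
    return None


# Values copied from the chat messages.
pod_max = to_bytes("20Gi")
pod_limit = to_bytes("22705864704")
container_max = to_bytes("14Gi")
container_limit = to_bytes("20Gi")

print(signal_from_exit_code(137))            # signal behind exit code 137
print((pod_limit - pod_max) // 2**20)        # MiB by which the pod limit exceeds the cap
print(container_limit > container_max)       # container limit also above its cap
```

On these inputs the pod limit comes out 1174 MiB above the 20Gi per-pod maximum, and the 20Gi container limit is above the 14Gi per-container maximum, which matches both halves of the "forbidden" error.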
[13:19:03] lol I think we are hitting the kubelet partition problem in staging
[13:19:04] ufff
[13:21:37] ok so I am draining ml-staging2001, expanding the partition, and same for 2002
[13:46:52] ok done, let's see if the pods come up
[13:50:53] isaranto: falcon gets oom killed, not sure if there is any difference with the prod setup
[13:52:26] it is too demanding and I think we can kill it as we don't need it at the moment, but we have the dict issue, so we need to override it somehow
[13:55:13] we could add a special value (like empty dict) to skip isvcs in the template
[13:56:11] (PS6) AikoChou: readability: add readability model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931987 (https://phabricator.wikimedia.org/T334182)
[13:58:47] aha! I'm on it let me try
[14:02:24] (CR) CI reject: [V: -1] readability: add readability model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931987 (https://phabricator.wikimedia.org/T334182) (owner: AikoChou)
[14:44:44] Lift-Wing, Machine-Learning-Team, I18n, NewFunctionality-Worktype, Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (isarantopoulos) @santhosh Thanks for creating the task and taking the time to read docs and create a patch! Could y...
[15:19:09] (CR) AikoChou: [C: +2] "Thanks for the review! :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931987 (https://phabricator.wikimedia.org/T334182) (owner: AikoChou)
[15:29:32] (CR) CI reject: [V: -1] readability: add readability model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931987 (https://phabricator.wikimedia.org/T334182) (owner: AikoChou)
[15:31:03] (CR) AikoChou: [V: +2 C: +2] readability: add readability model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931987 (https://phabricator.wikimedia.org/T334182) (owner: AikoChou)
[15:37:39] (CR) CI reject: [V: -1] readability: add readability model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931987 (https://phabricator.wikimedia.org/T334182) (owner: AikoChou)
[16:04:54] seems if there is any pipeline failed, CI can't proceed to the publish step :( llm and langid pipeline failed
[16:56:36] :(
[16:56:43] let's chat with releng tomorroW!
[16:56:51] going afk folks, have a nice one
[19:49:40] Machine-Learning-Team, Foundational Technology Requests: Content Translation Recommendations API - https://phabricator.wikimedia.org/T293648 (leila) @elukey I acknowledge your comment and working on it. We will get back to you.
[20:37:50] Machine-Learning-Team, Continuous-Integration-Infrastructure, Release-Engineering-Team: Python torch fills disk of CI Jenkins instances - https://phabricator.wikimedia.org/T338317 (hashar) Reopening since the disks keep filing and I also reopened the task to resize the instances (T340070). Not sure...