[06:20:48] (PS2) Kevin Bazira: outlink_topic_model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054526 (https://phabricator.wikimedia.org/T369344)
[07:12:55] o/
[07:25:25] (PS1) Kevin Bazira: langid: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054801 (https://phabricator.wikimedia.org/T369344)
[07:25:49] (CR) CI reject: [V:-1] langid: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054801 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[07:38:16] (CR) Kevin Bazira: "This failing CI pipeline: https://integration.wikimedia.org/ci/job/inference-services-pipeline-langid/144/execution/node/48/log/" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054801 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[07:46:47] this is an interesting approach to solve the problem of long startup times on LLMs in kserve https://docs.google.com/document/d/1nao8Ws3tonO2zNAzdmXTYa0hECZNoP2SV_z9Zg0PzLA/edit
[07:55:31] Ben is experimenting with Ceph CSI on DSE to get PV mounted, there is a ton of config/etc.. to deploy to a cluster to make it work, maybe long term it could be reused
[07:59:30] o/ Luca
[08:01:03] yeah, just dropping it as an idea. the above is also a proposal (there is also a PR in WIP https://github.com/kserve/kserve/pull/3781) but I doubt it will end up in kserve anytime soon
[08:03:16] it will take a bit yes :) o/
[08:03:30] the size of the model binaries is getting really gigantic
[08:12:42] iirc the gemma2 model we deployed in staging is 51GB
[08:19:55] sigh
[09:29:22] Morning!
[09:29:50] I (think I) found the reason for the KubernetesDeploymentUnavailableReplicas error yesterday
[09:31:05] First I looked at the Grafana DB in the link, and around the time the alert fired (13:30 UTC): https://grafana.wikimedia.org/goto/JTwyUmuSR?orgId=1
[09:31:59] I did this mostly to get a more precise timing, since searching even 10m of Logstash lines is going to take forever. Since the alert might fire only after a prolonged breakage (like 30m), this helps narrow down the timeframe quite a bit
[09:34:17] Then I went to the Logstash DB for Kubernetes Events, narrowed down things to 13:00-13:04, eqiad and revscoring-aq. https://logstash.wikimedia.org/goto/f6b1cec5d27435b94b8567a845c76eb3 That's still 107 messages, so I narrowed it further with a custom filter (`{"query":{"prefix":{"k8s_event.involvedObject.name":"enwiki"}}}`): https://logstash.wikimedia.org/goto/5643b75d8f0e34daa44dc87ea6f8f0b0
[09:34:50] And there, right at the top, we see: Error creating: pods "enwiki-articlequality-predictor-default-00020-deployment-dhtb6t" is forbidden: exceeded quota: quota-compute-resources, requested: limits.cpu=6, used: limits.cpu=118, limited: limits.cpu=120
[09:35:06] So the cluster was full CPU-wise.
[09:35:34] Eventually, something else scaled down and enwiki-aq could scale its replicaset again.
[09:43:55] ο/
[09:44:16] nice work!
[09:48:08] we'll have to look into the cpu allocation and the utilization of the replicas especially since we want to enable mp in some services
[09:54:33] Agreed. I had meant to do some work on that this Q, but looking at my plate ... :D
[09:54:43] :D
[09:55:19] we can do that! i think reducing the minreplicas for some services would be the first thing to do
[09:56:02] ack
[09:57:55] very nice log digging :)
[09:58:34] one detail to check - the error seems to indicate that we hit the namespace resource limits, so it may be just a matter of bumping the resources allowed in deployment-charts
[10:08:06] ah, good point
[10:30:29] Machine-Learning-Team, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Structured-Data-Backlog (Current Work): [SPIKE] Send an image thumbnail to the logo detection service within Upload Wizard - https://phabricator.wikimedia.org/T364551#9989502 (isarantopoulos) In the past we have used the envoy proxy to...
[10:31:37] * isaranto lunch!
[10:54:08] (PS10) Santhosh: major: modernize the codebase, keep only translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052445 (https://phabricator.wikimedia.org/T369484)
[10:56:50] (CR) Santhosh: major: modernize the codebase, keep only translation recommendations (3 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052445 (https://phabricator.wikimedia.org/T369484) (owner: Santhosh)
[10:57:13] * klausman lunch
[11:16:16] (CR) Kevin Bazira: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054801 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[11:24:45] (PS2) Kevin Bazira: langid: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054801 (https://phabricator.wikimedia.org/T369344)
[11:38:48] (CR) Ilias Sarantopoulos: [C:+1] langid: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054801 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[12:46:32] (CR) Kevin Bazira: [C:+2] outlink_topic_model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054526 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[12:50:40] (Merged) jenkins-bot: outlink_topic_model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054526 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[12:55:05] (PS3) Kevin Bazira: langid: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054801 (https://phabricator.wikimedia.org/T369344)
[12:56:32] (CR) Kevin Bazira: [C:+2] langid: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054801 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[12:57:16] (Merged) jenkins-bot: langid: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054801 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[13:48:50] Lift-Wing, Machine-Learning-Team: Use huggingface text generation interface (TGI) on huggingface image. - https://phabricator.wikimedia.org/T370271 (isarantopoulos) NEW
[14:08:08] klausman: feel free to go ahead and deploy the ns cpu change :)
[14:08:20] ack, will do, after mtg
[16:03:40] going afk folks, cu tomorrow!
[16:25:59] \o
[19:08:22] Machine-Learning-Team, Content-Transform-Team, Research: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455#9992313 (Isaac) Ok, for the V1 of the model, I have everything ready to go! Specifically: * Normalization values for all language editions ** Artifact: https://an...
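
Note on the quota error quoted at 09:34:50: the pod was rejected because used + requested would have exceeded the namespace limit (118 + 6 = 124 > 120), leaving only 2 CPUs of headroom. A minimal Python sketch of that arithmetic follows. It assumes the CPU values are whole cores (quota messages can also carry millicore values such as `500m`, which this does not handle), and `parse_quota_error` and `fits_quota` are hypothetical helper names for illustration, not part of any Kubernetes tooling.

```python
import re

# Matches the three CPU figures in the quota error text, e.g. "used: limits.cpu=118".
# Assumption: values are whole cores, not millicores.
QUOTA_FIELD = re.compile(r"(requested|used|limited): limits\.cpu=(\d+)")


def parse_quota_error(message: str) -> dict[str, int]:
    """Extract requested/used/limited CPU limits from the error message."""
    return {name: int(value) for name, value in QUOTA_FIELD.findall(message)}


def fits_quota(requested: int, used: int, limited: int) -> bool:
    """A new pod is admitted only if used + requested stays within the limit."""
    return used + requested <= limited


message = (
    'Error creating: pods "enwiki-articlequality-predictor-default-00020-deployment-dhtb6t" '
    "is forbidden: exceeded quota: quota-compute-resources, "
    "requested: limits.cpu=6, used: limits.cpu=118, limited: limits.cpu=120"
)

fields = parse_quota_error(message)
headroom = fields["limited"] - fields["used"]

print(f"headroom: {headroom} CPUs, pod needs: {fields['requested']}")
print("fits:", fits_quota(**fields))
```

With these numbers the namespace limit would need to be at least 124 for the pod to be admitted, or other workloads would need to release at least 4 CPUs of limits, which matches both options discussed above (bumping the namespace resources or lowering minreplicas).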