[06:42:17] good morning!
[06:51:05] Machine-Learning-Team, Language-Team, Epic: Migrate Content Translation Recommendation API to Lift Wing - https://phabricator.wikimedia.org/T308164#9905922 (kevinbazira) Super! The deprecation message you prepared is thorough. It will help users transition smoothly. Regarding other tools that use th...
[07:26:41] Machine-Learning-Team: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) - https://phabricator.wikimedia.org/T365246#9905967 (isarantopoulos) The model server has been successfully upgraded to kserve v0.13.0 and uses the pytorch 2.3.0 - rocm 6.0 base image.
[07:44:46] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9906000 (isarantopoulos) Huggingface image is now shipped with v0.13.0 of kserve and this is the one we are using. This task is considered done and this is the summary: - t...
[07:46:58] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9906001 (klausman) >>! In T357415#9905563, @Papaul wrote: > **Information2** > The server has only the SFT-OOB-LIC license which is the Supermicro Out of band OOB li...
[08:06:24] Machine-Learning-Team: Deploy 7b parameter models from HF - https://phabricator.wikimedia.org/T354870#9906050 (isarantopoulos) I have deployed [[ https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct | llama3-8B-instruct ]] on ml-staging. making a request using the OpenAI API completions endpoint: ` t...
[08:09:09] morning :)
[08:09:49] o/ Aiko
[08:12:49] Dobré ráno! (Good morning!)
[08:17:47] TIL --^
[08:17:54] o/ Tobias :)
[08:56:02] \o
[09:01:19] I updated the deployment to use llama3-instruct https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1047448
[09:01:35] this is the one that is actually deployed so this is a no-op patch
[09:02:13] drive-by +1~
[09:03:14] Thanks!
[10:13:34] * klausman lunch
[10:14:40] * isaranto lunch
[10:37:09] Machine-Learning-Team: Return response time as part of the logo-detection response object - https://phabricator.wikimedia.org/T367962 (kevinbazira) NEW
[11:00:01] (PS1) Kevin Bazira: logo-detection: log response time [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1047478 (https://phabricator.wikimedia.org/T367962)
[11:07:19] Machine-Learning-Team, Patch-For-Review: Return response time as part of the logo-detection response object - https://phabricator.wikimedia.org/T367962#9906623 (kevinbazira) I have tested the implementation locally and below are the results with the `latency` key that shows execution times for both `prep...
[11:09:21] (CR) Kevin Bazira: "I have tested the logo-detection model-server response time locally and here are the results: https://phabricator.wikimedia.org/T367962#99" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1047478 (https://phabricator.wikimedia.org/T367962) (owner: Kevin Bazira)
[11:38:20] fyi: wiki workshop is happening tomorrow!
https://wikiworkshop.org/
[11:54:35] Machine-Learning-Team: Test Revert Risk model with the transparent config - https://phabricator.wikimedia.org/T366250#9906693 (achou) Open→Resolved
[11:55:17] Machine-Learning-Team: Deploy RR-language-agnostic batch version to prod - https://phabricator.wikimedia.org/T358744#9906694 (achou) Open→Resolved
[12:16:27] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9906779 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-staging2003.codfw.wmnet with OS bookworm
[13:00:14] Machine-Learning-Team: Allow setting huggingfaceserver cmd args from deployment-charts - https://phabricator.wikimedia.org/T365842#9906910 (isarantopoulos) Open→Resolved
[13:00:23] Machine-Learning-Team: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) - https://phabricator.wikimedia.org/T365246#9906913 (isarantopoulos) Open→Resolved
[13:00:33] Machine-Learning-Team: Add pydantic validation to revertrisk model in liftwing-python package - https://phabricator.wikimedia.org/T366015#9906915 (isarantopoulos) Open→Resolved
[13:31:13] isaranto: I have a minor fix for you to review :)
[13:31:17] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047508
[13:31:44] on it!
[13:33:15] ty <3
[13:40:05] (CR) Kevin Bazira: "Yes, Ilias. Returning the `latency` key is a requirement asked for by Cormac from the Structured Content team. Marco has also mentioned it" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1047478 (https://phabricator.wikimedia.org/T367962) (owner: Kevin Bazira)
[13:52:11] (CR) Ilias Sarantopoulos: "I don't see it anywhere written as a requirement neither in the link you shared, this is why I was asking."
[machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1047478 (https://phabricator.wikimedia.org/T367962) (owner: Kevin Bazira)
[14:05:50] Machine-Learning-Team: Set automatically libomp's num threads when using Pytorch - https://phabricator.wikimedia.org/T360111#9907076 (elukey) Open→Resolved
[14:06:06] Machine-Learning-Team: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing - https://phabricator.wikimedia.org/T353622#9907077 (elukey) Open→Resolved
[14:07:27] Machine-Learning-Team: Test if we can avoid ROCm debian packages on k8s nodes - https://phabricator.wikimedia.org/T363191#9907091 (elukey) Open→Resolved
[14:07:29] Lift-Wing, Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9907078 (elukey) Open→Resolved The test was successful, ml-staging2001 doesn't show the same issue anymore when running Bookworm.
[14:10:59] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9907108 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-staging2003.codfw.wmnet with OS bookworm completed: - ml-stag...
[14:23:41] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9907185 (klausman)
[15:11:18] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9907339 (klausman)
[15:12:59] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9907350 (klausman) Machine is imaged and running.
The PXE boot was "fixed" by an ugly hack mentioned in T304483#9906962 While the firmware problem remains, at least we a...
[15:22:46] logging off folks, have a nice evening/rest of day o/
[15:22:51] \o
[15:28:09] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9907386 (Jhancock.wm) The server has had the DIMM reseated.
[15:37:40] bye Ilias!
[15:37:55] logging off as well o/