[06:54:31] 10Machine-Learning-Team, 10Data-Engineering: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (10elukey)
[06:56:13] 10Machine-Learning-Team, 10Data-Engineering: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (10Iqbal011299)
[07:41:39] 10Machine-Learning-Team, 10Data-Engineering, 10Patch-For-Review: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (10elukey) From https://github.com/RadeonOpenCompute/ROCm/issues/761 it seems that `hsa-ext-rocr-dev` is not a concern anymore, so we can simplify the deployment procedure even fur...
[07:51:13] morning!
[07:51:24] after a couple of years, no more binary-only packages for ROCm https://github.com/RadeonOpenCompute/ROCm/issues/761#issuecomment-968613956
[07:51:27] \o/
[09:58:37] I was able to add knative's queue proxy metrics to prometheus, and added graphs to https://grafana-rw.wikimedia.org/d/Rvs1p4K7k/kserve?orgId=1
[10:13:57] added request count + latencies
[10:14:34] we also have the knative activator reporting requests, but IIUC that is because under some circumstances knative adds it as a load balancer in front of the pods
[10:15:05] most of the time it is to queue requests while pods are coming up (say because they have the scale-to-zero serverless option enabled)
[10:15:17] but sometimes it also does it to optimize low traffic (like in our case)
[10:15:41] so the queue proxy's metrics (it runs as a sidecar on every service pod) are surely more complete
[10:19:16] I feel that we have metrics for all layers, at least the minimum needed to help debug problems
[11:34:45] * elukey lunch!
[12:15:17] Can anyone help me with using LUDWIG AI ... made by Uber AI
[12:19:36] I need to make a multilingual Automatic Speech Recognition system. I have collected a data set of about 200 GB. I want to know how to prepare/preprocess the data for training it offline on my system. After training I want to implement it on live calls and on recorded ones as well. Can anyone help me do this with LUDWIG-AI, made by UBER AI?
[12:22:43] I need to make a multilingual Automatic Speech Recognition system. I have collected a data set of about 200 GB. I want to know how to prepare/preprocess the data for training it offline on my system. After training I want to implement it on live calls and on recorded ones as well. Can anyone help me do this with LUDWIG-AI, made by UBER AI?
[15:25:52] 10Machine-Learning-Team, 10Data-Engineering, 10Patch-For-Review: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (10elukey) ROCm 4.5 imported in apt. Next steps: - Wait for the release of the pypi package `tensorflow-io` - Test the new suite on one node (will need the help of @Miriam)
[17:00:35] o/
[17:01:09] o/
[17:02:25] elukey: nice work on the metrics, looks great!
[17:04:23] thanks! It may need some refinement of course, but it should be enough for the MVP :)
[17:18:25] accraze: I didn't change anything in the ml sandbox's settings, but if you need anything ping me
[17:18:52] I left some notes during the weekend about where to find our changes/overrides on the upstream releases
[17:19:14] at some point we'll probably need to write some docs about bootstrapping a local kserve cluster
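To make the queue-proxy metrics discussed above (09:58-10:19) a bit more concrete, here is a minimal sketch of pulling them out of Prometheus for ad-hoc debugging. The Prometheus URL, the namespace label value and the metric/label names (revision_request_count, revision_request_latencies_bucket, namespace_name) are assumptions and may not match what the kserve Grafana dashboard actually scrapes.

# Minimal sketch: query Prometheus for knative queue-proxy request metrics.
# Endpoint, namespace and metric names below are assumptions, not the real setup.
import requests

PROMETHEUS = "http://prometheus.example.org:9090"  # hypothetical endpoint

def instant_query(promql):
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Per-revision request rate over the last 5 minutes.
rps = instant_query(
    'sum by (revision_name) (rate(revision_request_count{namespace_name="kserve-test"}[5m]))'
)

# Approximate p95 latency from the queue-proxy latency histogram.
p95 = instant_query(
    'histogram_quantile(0.95, sum by (le, revision_name) '
    '(rate(revision_request_latencies_bucket{namespace_name="kserve-test"}[5m])))'
)

for sample in rps:
    print(sample["metric"].get("revision_name"), sample["value"][1], "req/s")

The same PromQL expressions, if the metric names check out, could be pasted straight into the Grafana dashboard linked at 09:58.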
[17:32:43] elukey: thanks! just reviewed the notes, i think i have an idea of what i need to fix re: the knative gw config map
[17:33:03] also yes, i agree on docs for the local kserve setup, keeping some notes as i go through this
[17:33:40] it would be cool to write a script/puppet config eventually
[17:33:43] accraze: there is a specific bit for the knative setting in the README IIRC
[17:36:25] ahhhh i wish i had seen the README on Friday lol :D
[17:36:50] super sorry, I felt bad when I saw your questions on IRC :(
[17:36:57] hahaha
[17:37:11] no worries! i have a stronger understanding of our stack now :)
[17:37:36] I tried to keep track of all the custom changes done, since we need to apply them every time we import the yaml files from upstream
[17:37:47] and there is zero chance of remembering them all
[21:34:37] 10Machine-Learning-Team, 10artificial-intelligence, 10editquality-modeling, 10Hindi-Sites: Train and test editquality models for Hindi Wikipedia - https://phabricator.wikimedia.org/T252581 (10Halfak) @Nikhil1194 and I are working on an iteration. So I don't think we should resolve this quite yet.
[22:36:24] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:(Need By: TBD) rack/setup/install ml-train100[1-4] - https://phabricator.wikimedia.org/T291579 (10Jclark-ctr)
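On the 17:33-17:37 point about scripting the re-application of our custom changes every time the upstream yaml files are imported, here is a minimal sketch of what such a helper could look like. The directory layout (upstream/, overrides/, rendered/), the one-override-file-per-manifest convention and the positional doc-by-doc merge are all hypothetical, not the actual deployment layout.

# Minimal sketch: re-apply tracked local overrides on top of upstream manifests.
# Layout and override format are assumptions; the real repo will differ.
from itertools import zip_longest
from pathlib import Path

import yaml  # PyYAML

def deep_merge(base, override):
    """Recursively merge override dicts into base (override wins on conflicts)."""
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_merge(base[key], value)
        else:
            base[key] = value
    return base

def apply_overrides(upstream_dir, overrides_dir, out_dir):
    """For every upstream manifest, merge a same-named override file if present."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for manifest in sorted(upstream_dir.glob("*.yaml")):
        docs = list(yaml.safe_load_all(manifest.read_text()))
        override_file = overrides_dir / manifest.name
        if override_file.exists():
            overrides = list(yaml.safe_load_all(override_file.read_text()))
            # Assumes override docs line up one-to-one, by position, with the
            # upstream docs they patch; unmatched docs pass through unchanged.
            docs = [
                deep_merge(doc or {}, override or {})
                for doc, override in zip_longest(docs, overrides, fillvalue={})
            ]
        (out_dir / manifest.name).write_text(yaml.safe_dump_all(docs, sort_keys=False))

if __name__ == "__main__":
    apply_overrides(Path("upstream"), Path("overrides"), Path("rendered"))

A smarter version would match docs by kind and metadata.name instead of position, but even this rough shape would beat trying to remember every override by hand.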