[06:54:31] 10Machine-Learning-Team, 10Data-Engineering: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (10elukey)
[06:56:13] 10Machine-Learning-Team, 10Data-Engineering: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (10Iqbal011299)
[07:41:39] 10Machine-Learning-Team, 10Data-Engineering, 10Patch-For-Review: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (10elukey) From https://github.com/RadeonOpenCompute/ROCm/issues/761 it seems that `hsa-ext-rocr-dev` is not a concern anymore, so we can simplify the deployment procedure even fur...
[07:51:13] morning!
[07:51:24] after a couple of years, no more binary-only packages for ROCm https://github.com/RadeonOpenCompute/ROCm/issues/761#issuecomment-968613956
[07:51:27] \o/
[09:58:37] I was able to add knative's queue proxy metrics to prometheus, and added graphs to https://grafana-rw.wikimedia.org/d/Rvs1p4K7k/kserve?orgId=1
[10:13:57] added request count + latencies
[10:14:34] we also have the knative activator reporting requests, but IIUC that is because under some circumstances knative adds it as a load balancer in front of the pods
[10:15:05] most of the time it is to queue requests while pods are coming up (say because they have the scale-to-zero serverless option enabled)
[10:15:17] but sometimes it also does it to optimize low traffic (like in our case)
[10:15:41] so the queue proxy's metrics (it runs as a sidecar on every service pod) are surely more complete
[10:19:16] I feel that we have metrics for all layers, at least the minimum needed to help debug problems
[11:34:45] * elukey lunch!
[12:15:17] Can anyone help me with using LUDWIG AI ... made by Uber AI
[12:19:36] I need to make a multilingual Automatic Speech Recognition system. I have collected a data set of about 200 GB. I want to know how to prepare/preprocess the data for training it offline on my system. After training I want to implement it on live calls and on recorded ones as well. Can anyone help me do this with LUDWIG-AI, made by UBER AI?
[12:22:43] I need to make a multilingual Automatic Speech Recognition system. I have collected a data set of about 200 GB. I want to know how to prepare/preprocess the data for training it offline on my system. After training I want to implement it on live calls and on recorded ones as well. Can anyone help me do this with LUDWIG-AI, made by UBER AI?
[15:25:52] 10Machine-Learning-Team, 10Data-Engineering, 10Patch-For-Review: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (10elukey) ROCm 4.5 imported in apt. Next steps: - Wait for the release of the pypi package `tensorflow-io` - Test the new suite on one node (will need the help of @Miriam)
[17:00:35] o/
[17:01:09] o/
[17:02:25] elukey: nice work on the metrics, looks great!
[17:04:23] thanks! It may need some refinement of course, but it should be enough for the MVP :)
[17:18:25] accraze: I didn't change anything in the ml sandbox's settings, but if you need anything ping me
[17:18:52] I left some notes during the weekend about where to find our changes/overrides on the upstream releases
[17:19:14] at some point we'll probably need to write some docs about bootstrapping a local kserve cluster
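To make the queue-proxy metrics discussed above (09:58-10:19) a bit more concrete, here is a minimal sketch of pulling them out of Prometheus for ad-hoc debugging. The Prometheus URL, the namespace label value and the metric/label names (revision_request_count, revision_request_latencies_bucket, namespace_name) are assumptions and may not match what the kserve Grafana dashboard actually scrapes.

# Minimal sketch: query Prometheus for knative queue-proxy request metrics.
# Endpoint, namespace and metric names below are assumptions, not the real setup.
import requests

PROMETHEUS = "http://prometheus.example.org:9090"  # hypothetical endpoint

def instant_query(promql):
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Per-revision request rate over the last 5 minutes.
rps = instant_query(
    'sum by (revision_name) (rate(revision_request_count{namespace_name="kserve-test"}[5m]))'
)

# Approximate p95 latency from the queue-proxy latency histogram.
p95 = instant_query(
    'histogram_quantile(0.95, sum by (le, revision_name) '
    '(rate(revision_request_latencies_bucket{namespace_name="kserve-test"}[5m])))'
)

for sample in rps:
    print(sample["metric"].get("revision_name"), sample["value"][1], "req/s")

The same PromQL expressions, if the metric names check out, could be pasted straight into the Grafana dashboard linked at 09:58.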
[17:32:43] elukey: thanks! just reviewed the notes, i think i have an idea of what i need to fix re: the knative gw config map
[17:33:03] also yes, i agree on docs for the local kserve setup, keeping some notes as i go through this
[17:33:40] it would be cool to write a script/puppet config eventually
[17:33:43] accraze: there is a specific bit for the knative setting in the README IIRC
[17:36:25] ahhhh i wish i had seen the README on Friday lol :D
[17:36:50] super sorry, I felt bad when I saw your questions on IRC :(
[17:36:57] hahaha
[17:37:11] no worries! i have a stronger understanding of our stack now :)
[17:37:36] I tried to keep track of all the custom changes done, since we need to apply them every time we import the yaml files from upstream
[17:37:47] and there is zero chance of remembering them all
[21:34:37] 10Machine-Learning-Team, 10artificial-intelligence, 10editquality-modeling, 10Hindi-Sites: Train and test editquality models for Hindi Wikipedia - https://phabricator.wikimedia.org/T252581 (10Halfak) @Nikhil1194 and I are working on an iteration. So I don't think we should resolve this quite yet.
[22:36:24] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:(Need By: TBD) rack/setup/install ml-train100[1-4] - https://phabricator.wikimedia.org/T291579 (10Jclark-ctr)
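On the 17:33-17:37 point about scripting the re-application of our custom changes every time the upstream yaml files are imported, here is a minimal sketch of what such a helper could look like. The directory layout (upstream/, overrides/, rendered/), the one-override-file-per-manifest convention and the positional doc-by-doc merge are all hypothetical, not the actual deployment layout.

# Minimal sketch: re-apply tracked local overrides on top of upstream manifests.
# Layout and override format are assumptions; the real repo will differ.
from itertools import zip_longest
from pathlib import Path

import yaml  # PyYAML

def deep_merge(base, override):
    """Recursively merge override dicts into base (override wins on conflicts)."""
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_merge(base[key], value)
        else:
            base[key] = value
    return base

def apply_overrides(upstream_dir, overrides_dir, out_dir):
    """For every upstream manifest, merge a same-named override file if present."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for manifest in sorted(upstream_dir.glob("*.yaml")):
        docs = list(yaml.safe_load_all(manifest.read_text()))
        override_file = overrides_dir / manifest.name
        if override_file.exists():
            overrides = list(yaml.safe_load_all(override_file.read_text()))
            # Assumes override docs line up one-to-one, by position, with the
            # upstream docs they patch; unmatched docs pass through unchanged.
            docs = [
                deep_merge(doc or {}, override or {})
                for doc, override in zip_longest(docs, overrides, fillvalue={})
            ]
        (out_dir / manifest.name).write_text(yaml.safe_dump_all(docs, sort_keys=False))

if __name__ == "__main__":
    apply_overrides(Path("upstream"), Path("overrides"), Path("rendered"))

A smarter version would match docs by kind and metadata.name instead of position, but even this rough shape would beat trying to remember every override by hand.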