[09:22:39] prometheus_amd_rocm_stats.service is failing on dse-k8s-worker1001 due to temperature metrics being reported as N/A https://phabricator.wikimedia.org/P76464
[09:22:45] does that ring a bell for anyone? Thanks!
[09:24:59] it is weird, rocm-smi doesn't seem to recognize some stuff. Has anything changed recently? reimage etc.?
[09:25:17] IIRC Ben rebooted it to load a new kernel
[09:25:26] according to https://github.com/ROCm/ROCm/issues/4268 "a reboot fixes it"
[09:26:06] sigh
[09:26:12] at least there is a report about it
[09:26:22] shall we try a drain + reboot to check?
[09:26:52] yep, on it
[09:28:34] I've cordoned it, I have to wait for an airflow task pod to finish until I can reboot it, as I don't want to impact user jobs
[09:33:26] reboot ongoing
[09:39:16] I'm still seeing the same issue post reboot
[09:41:25] I'm seeing these messages when running `rocm-smi`
[09:41:25] ERROR: 2 GPU[0]: power: RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.
[09:47:54] I know nothing about rocm. All I know is that that worker has an "old" GPU (something like > 3, 4 years). Is there a way to upgrade rocm in any way?
[09:56:17] back sorry
[09:56:22] np
[09:56:23] lemme check on the node
[09:56:27] <3
[09:57:28] ok so the node seems to follow the recent layout, namely no ROCm packages installed
[09:57:38] rocm-smi is from bookworm's upstream repos
[09:57:47] we just install it to get the info for the metrics
[09:58:06] the idea is that the .so rocm libs are stored in the Docker images themselves, so we can vary the OS etc.
[09:58:36] so upgrading ROCm atm is not an option, but I am a little bit puzzled that with a change in the kernel nothing works anymore
[09:59:11] the drivers are shipped by the kernel, so maybe they dropped something, but it seems strange
[10:02:18] (btw, small sidetrack. We're now included the kadmin and kerberos servers hostnames in the general-$env.yaml files, to avoid hardcoding the hostnames in configmap. The deprecation of krb1001 would have led to an interesting outage if I had applied admin_ng, which would have removed the egress rule to the currently configured kadmin)
[10:05:04] cf https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1151131
[10:07:47] *including
[10:17:50] brouberol: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151153
[10:18:25] approved thank you!
[10:18:35] brouberol: interesting about krb1001, have you notified Moritz about it? He'll be interested for sure
[10:18:51] I was about to :)
[10:18:55] super :)
[10:19:05] we have the same GPU on ml-serve1001, but different kernels
[10:19:27] 6.1.137-1 on ml-serve1001, 6.12.22-1~bpo12+1 on dse
[10:20:10] at some point we'll probably need to remove the old GPUs
[10:20:22] we added them as a test on dse, but never used them
[10:20:24] cc: klausman:
[10:20:55] (nothing urgent, but let's keep it in mind)
[10:21:40] re Moritz: {{done}}
[10:28:13] ack. re: gpu removal
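
Side note on the failure from 09:22: the service fails because temperature readings come back as N/A. Below is a minimal sketch of how a metrics collector could tolerate that by skipping unreadable values instead of failing. It is not the code of prometheus_amd_rocm_stats; the function names, the metric names and the sample readings are all hypothetical, and it assumes the readings have already been extracted from `rocm-smi` output as strings.

```python
import math
from typing import Dict, Optional


def parse_reading(raw: str) -> Optional[float]:
    """Return the reading as a float, or None when it is missing or not numeric."""
    text = raw.strip()
    if not text or text.upper() == "N/A":
        return None
    try:
        value = float(text)
    except ValueError:
        return None
    # Treat NaN and infinities as unusable too.
    return value if math.isfinite(value) else None


def usable_metrics(readings: Dict[str, str]) -> Dict[str, float]:
    """Keep only the readings that parsed to a finite number."""
    parsed = {name: parse_reading(raw) for name, raw in readings.items()}
    return {name: value for name, value in parsed.items() if value is not None}


# Hypothetical readings: the metric names and values are made up for illustration.
sample = {
    "temperature_edge_celsius": "N/A",
    "power_watts": "N/A",
    "gpu_use_percent": "12",
}

if __name__ == "__main__":
    print(usable_metrics(sample))
```

With something along these lines the unit would still export the metrics that are readable, and the missing ones would show up as gaps rather than as a failed service.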