[08:28:20] morning o/
[09:11:42] happy new year folks!
[09:11:53] ml-lab1001's /srv is filled up again
[09:13:00] o/ IIRC you consume the mediawiki.page_change.v1 stream to produce some prediction via eventgate for the outlink topic model, how and where do you consume mediawiki.page_change.v1 topics?
[09:13:24] dcausse: o/ via Changeprop
[09:13:37] elukey: hey! thanks, looking
[09:13:50] there is a rule to call liftwing that in turn sends the event to Eventgate
[09:14:00] ok makes sense
[09:21:40] Morning!
[09:21:56] elukey: I see /srv has 12G free, which should be enough usually
[09:22:32] I've deduped some stuff and it's back at 18G free
[09:26:40] 18G free seems a little low for the purpose of that host, but as long as the alerts are clear I am fine :)
[09:32:55] (namely https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DDiskSpace&q=team%3Dmachine-learning)
[09:34:18] I'll have to look up the thresholds. Larger disk arrays often have weird effects where "95% used" means "only 0.5T free", which may be fine (or not) depending on the usage patterns
[09:34:57] At any rate, Ben already set up a test mount (at /mnt) on lab1002, which we'll use for /home soon (and then convert 1001), so these will go away either way.
[09:35:11] ^^ a _Ceph_ test mount, that is
[09:37:10] ah, team-sre/resources.yaml has `node_filesystem_avail_bytes{...}/ node_filesystem_size_bytes) ... < 0.06`, so anything <6% free would trigger
[09:40:50] I've put in a silence for ml-lab.* /srv
[09:47:15] Lift-Wing, Discovery-Search (Current work), Documentation: The Search/articletopic page at Wikitech appears to be out of date - https://phabricator.wikimedia.org/T382620#10436618 (dcausse) a:dcausse Thanks for pointing this out, you were correct, `mediawiki.page_outlink_topic_prediction_change.v1...
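Editor's note on the DiskSpace threshold quoted at 09:37:10: the rule compares available bytes to filesystem size and fires when the ratio drops below 0.06. The following is only a minimal sketch of that arithmetic, not the actual alert rule. The function names and the 10 TB array size are hypothetical, chosen to match the "95% used means only 0.5T free" remark at 09:34:18.

```python
# Sketch of the ratio check described in the log: alert when
# avail_bytes / size_bytes < 0.06. Names and sizes are illustrative
# assumptions, not taken from team-sre/resources.yaml.

TB = 10**12  # decimal terabyte, used only for the example figures


def free_fraction(avail_bytes: int, size_bytes: int) -> float:
    """Return the fraction of the filesystem that is still available."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return avail_bytes / size_bytes


def would_alert(avail_bytes: int, size_bytes: int, threshold: float = 0.06) -> bool:
    """True when the free fraction is below the threshold (6% by default)."""
    return free_fraction(avail_bytes, size_bytes) < threshold


# Hypothetical 10 TB array that is 95% used: 0.5 TB free is 5% free,
# which is below the 6% threshold even though 0.5 TB may be plenty.
print(would_alert(TB // 2, 10 * TB))
# The same array with 1 TB free (10% free) stays clear of the alert.
print(would_alert(1 * TB, 10 * TB))
```

Because the check is relative, the absolute free space at which it fires grows with the size of the array, which is the effect described at 09:34:18.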
[10:08:28] Lift-Wing, Discovery-Search (Current work), Documentation: The Search/articletopic page at Wikitech appears to be out of date - https://phabricator.wikimedia.org/T382620#10436670 (Urbanecm_WMF) Thanks for the update, this is very useful to know. The API works for me, and I can also see the data in th...
[11:18:29] Machine-Learning-Team: [onboarding] Update revertrisk to kserve 0.14.1 - https://phabricator.wikimedia.org/T383119 (isarantopoulos) NEW
[11:18:59] Lift-Wing, Machine-Learning-Team: [onboarding] Update revertrisk to kserve 0.14.1 - https://phabricator.wikimedia.org/T383119#10436826 (isarantopoulos) a:gkyziridis
[11:20:06] this will be the onboarding task for George --^ we can discuss it later today
[11:22:42] Lift-Wing, Machine-Learning-Team: [onboarding] Update revertrisk to kserve 0.14.1 - https://phabricator.wikimedia.org/T383119#10436831 (isarantopoulos)
[11:22:53] * klausman lunch
[11:29:29] Lift-Wing, Discovery-Search (Current work), Documentation: The Search/articletopic page at Wikitech appears to be out of date - https://phabricator.wikimedia.org/T382620#10436846 (dcausse) >>! In T382620#10436670, @Urbanecm_WMF wrote: > Thanks for the update, this is very useful to know. The API work...
[11:30:53] * isaranto lunch!
[12:02:38] Lift-Wing, Discovery-Search (Current work), Documentation: The Search/articletopic page at Wikitech appears to be out of date - https://phabricator.wikimedia.org/T382620#10436926 (Urbanecm_WMF) Thanks for the tip! The dumping URL is useful to know about.
[12:59:09] Machine-Learning-Team: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10437036 (isarantopoulos) >>! In T371344#10408723, @isarantopoulos wrote: > I rebuilt a wheel with pytorch 2.5.1(rocm) using `python setup.py bdist_wheel` and was able to successfully deploy it on...
[13:01:12] klausman: o/ could you verify the ROCm version that is installed on ml-staging?
[13:01:26] sure, sec
[13:01:32] I see 5.4 on prod https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/hosts/ml-serve1001.yaml#2
[13:01:52] should the profile be here or is it configured somewhere else? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/hosts/ml-staging2003.yaml
[13:02:07] jumping in a meeting, brb!
[13:05:57] I think the former file (1001) is a leftover from when we experimented with rocm-on-host (as opposed to using it from the container)
[13:06:39] As for the installed version: drwxr-xr-x 35 root root 4096 Jun 5 2023 rocm-5.4.0
[13:07:30] It probably could/should be removed these days, as it serves no function and can be confusing
[13:09:01] k8s nodes don't really need a local rocm installation (besides rocm-smi for sysadmining/debugging and monitoring purposes)
[13:29:51] +1
[13:30:11] change is already merged, doing the cleanup now
[13:30:16] if you clean up those nodes it may be good to reimage to have a clean start, but not really mandatory
[13:31:05] Since I plan on reimaging all of the old bullseye workers to bookworm soon™, I think I'll just do a local apt cleanup of the stuff in /opt
[13:33:49] are you also planning to move to containerd in the same move?
[13:33:55] that would be a nice coupling
[13:34:07] IIRC ml-serve has not migrated yet
[13:34:17] Yeah, depending a bit on how the current wikikube containerd migration goes
[13:34:50] Like, if there are any colorful explosions there, I'd hold off (on both reimage and containerd)
[13:36:33] aux is already migrated, all good
[13:36:52] it is very stable now, the only big change is the tooling (nerdctl vs dockerctl etc..)
[13:36:52] ah, I wasn't aware of that
[13:37:07] (aux, I mean)
[13:47:51] I'm back. ack on all of the above
[13:49:17] on rocm: I realized today that ml-lab and liftwing have different rocm versions so that is (probably) why some of our efforts were failing
[13:55:58] Machine-Learning-Team: [LLM] ML-lab benchmarking - https://phabricator.wikimedia.org/T382343#10437184 (kevinbazira) **optimum-benchmark** The [[ https://gitlab.wikimedia.org/repos/machine-learning/huggingface-optimum-benchmark-automation | benchmark automation tool ]] now supports running benchmarks with [[...
[14:15:54] Good morning all
[14:16:00] I slept terribly lol
[14:18:02] (PS1) Gkyziridis: add lalalala to ci_entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1108752
[14:30:11] o/ Chris
[14:42:11] (PS1) Gkyziridis: revert [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1108758
[14:55:53] on dse-k8s-worker1001 the amd rocm prometheus exporter is broken, since it misses rocm-smio
[14:55:56] *smi
[14:57:08] same on ml-serve1001
[14:59:24] Looking
[14:59:55] I don't recall exactly what was the thinking at the time, but on bookworm we use rocm-smi from debian upstream
[15:00:02] before we expect to have it installed
[15:00:22] ml-serve1001 ~ $ apt-cache search rocm-smi
[15:00:24] ml-serve1001 ~ $
[15:00:45] same 1002
[15:00:50] yes the rocm component is removed and bullseye doesn't offer rocm packages
[15:00:56] ah, but 1002 doesn't have a GPU, of course
[15:01:36] I think I had a brainfart then with the earlier cleanup
[15:02:05] the host override in hiera does install all of rocm, _but also_ rocm-smi. On the Bookworm hosts, we get it from Debian upstream
[15:06:19] I'll revert that cleanup change to fix monitoring
[15:07:48] it is probably fine to leave things as they are, do we use the old GPUs ?
[15:08:02] if not we can fix puppet to not deploy gpu monitoring unless we are on bookworm
[15:11:09] I think it's fine to have the extra files until we move to bookworm
[16:27:23] Machine-Learning-Team, MediaWiki-extensions-ORES, Edit-Review-Improvements-RC-Page, MediaWiki-Recent-changes, Moderator-Tools-Team: [SPIKE] How could we add topic filtering to Recent Changes? [16H] - https://phabricator.wikimedia.org/T381569#10437747 (jsn.sherman) it looks like the proposed t...
[16:30:24] Machine-Learning-Team, MediaWiki-extensions-ORES, Edit-Review-Improvements-RC-Page, MediaWiki-Recent-changes, Moderator-Tools-Team: [SPIKE] How could we add topic filtering to Recent Changes? [16H] - https://phabricator.wikimedia.org/T381569#10437765 (Scardenasmolinar) p:Triage→High
[16:31:23] Machine-Learning-Team, MediaWiki-extensions-ORES, Edit-Review-Improvements-RC-Page, MediaWiki-Recent-changes, Moderator-Tools-Team: [SPIKE] How could we add topic filtering to Recent Changes? [8H] - https://phabricator.wikimedia.org/T381569#10437775 (Scardenasmolinar)
[17:06:36] Machine-Learning-Team, MediaWiki-extensions-ORES, Edit-Review-Improvements-RC-Page, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban): [SPIKE] How could we add topic filtering to Recent Changes? [8H] - https://phabricator.wikimedia.org/T381569#10437891 (Samwalton9-WMF)
[17:36:32] as we discussed in the meeting I'm going to investigate which rocm version we should focus on in lift wing at the moment (6.1 vs 6.2 vs 6.3)
[17:36:38] going to start that tomorrow!
[17:36:42] going afk folks!
[18:31:02] (CR) Eamedina: [C:+2] Fix: 'list' object has no attribute 'add' [research/recommendation-api] - https://gerrit.wikimedia.org/r/1108471 (owner: Sbisson)
[18:32:23] (Merged) jenkins-bot: Fix: 'list' object has no attribute 'add' [research/recommendation-api] - https://gerrit.wikimedia.org/r/1108471 (owner: Sbisson)