[05:56:18] (PS1) Tim Starling: Remove unused WatchedItemQueryService hooks [extensions/ORES] - https://gerrit.wikimedia.org/r/1200750 (https://phabricator.wikimedia.org/T407087)
[06:08:10] (CR) CI reject: [V:-1] Remove unused WatchedItemQueryService hooks [extensions/ORES] - https://gerrit.wikimedia.org/r/1200750 (https://phabricator.wikimedia.org/T407087) (owner: Tim Starling)
[07:12:52] good morning
[08:02:50] good morning! :)
[08:56:36] good morning
[09:38:57] good morning!
[10:25:24] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11334869 (achou) > I think so, yes. If you have specific mock data in mind, a...
[10:31:19] (PS2) Gkyziridis: haggingface_models: Update kserve to 0.15.2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1195632 (https://phabricator.wikimedia.org/T367048)
[11:35:26] artificial-intelligence, Reconciliation, Technical-Tool-Request: Alternative, affordable, lower-barrier approach(es) to reconciliation - https://phabricator.wikimedia.org/T362149#11335159 (Spinster) I am wondering if the [[ https://www.wikidata.org/wiki/Wikidata:Embedding_Project | Wikidata:Embedding...
[12:09:33] (CR) AikoChou: "LGTM!! Thanks for adding the README, unit tests, and entries in docker-compose. This makes it really easy to test the code :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1199806 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[12:12:46] (CR) AikoChou: [C:+1] revise-tone: Add Tune Suggestion Generator. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1199806 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[12:16:18] Machine-Learning-Team, SRE, SRE-Access-Requests: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11335277 (hnowlan) Open→Stalled Blocked on approval from @mark.
[12:21:34] (PS4) Bartosz Wójtowicz: revise-tone: Add Revise Tone Task Generator. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1199806 (https://phabricator.wikimedia.org/T408538)
[12:37:13] (CR) Bartosz Wójtowicz: [C:+2] revise-tone: Add Revise Tone Task Generator. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1199806 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[12:38:27] (Merged) jenkins-bot: revise-tone: Add Revise Tone Task Generator. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1199806 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[13:01:27] Machine-Learning-Team, Essential-Work, Patch-For-Review: Update kserve to 0.15.2 - https://phabricator.wikimedia.org/T367048#11335401 (gkyziridis) Thank you so much for working on that one @klausman! Since the `ml-lab1001` now uses the big TB filesystem, would it be possible to enable the `buildkit`...
[13:02:51] Machine-Learning-Team, Essential-Work, Patch-For-Review: Update kserve to 0.15.2 - https://phabricator.wikimedia.org/T367048#11335405 (klausman) >>! In T367048#11335401, @gkyziridis wrote: > Thank you so much for working on that one @klausman! > > Since the `ml-lab1001` now uses the big TB filesyste...
[13:08:55] Machine-Learning-Team, Discovery-Search (2025.10.20 - 2025.11.07): Initial task generation and ingestion to Cassandra and Search weight tags - https://phabricator.wikimedia.org/T408533#11335420 (pfischer)
[13:17:43] Machine-Learning-Team, Patch-For-Review: Create a Revise Tone Task Generator in LiftWing - https://phabricator.wikimedia.org/T408538#11335446 (achou)
[13:38:10] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11335508 (dcausse) >>! In T401021#11334869, @achou wrote: > @dcausse, regardin...
[14:18:40] artificial-intelligence, Reconciliation, Technical-Tool-Request: Alternative, affordable, lower-barrier approach(es) to reconciliation - https://phabricator.wikimedia.org/T362149#11335731 (Abbe98) This might also be of interest: https://byabbe.se/2024/08/05/reconcile-against-any-mediawiki-instance
[14:27:13] aiko: o/ so ml-serve1012 (8xMI300 GPUs) is currently available on ml-serve-eqiad for tests. Pods can only be scheduled there if we add the right "tolerations" (a couple of yaml settings when deploying), so only what we want gets executed on it
[14:27:32] we should probably come up with a plan about what/how to test those hosts
[14:28:12] (this could be something really interesting to do for dpogorzelski)
[14:29:03] can do for sure, Tobias suggested scheduling current workloads there to see if they work
[14:29:17] just waiting on my root access
[14:30:44] I think we should also plan what kind of tests to do, since those hosts have GPUs that can be partitioned
[14:31:05] at the moment it seems that we can split each one of them into 8 smaller GPUs or into 2 smaller GPUs
[14:31:37] and the kubelet, with the amd-gpu-plugin, is able to see the partitions as separate devices
[14:31:42] at least from my tests
[14:31:57] the main issue is that after a reboot, the partitioning settings are gone
[14:32:42] so we'd need to figure out how we want to use those gpus and possibly test both partition types, and how to automate the partition settings (puppet? Other?)
[14:33:08] so the plan is something that we can do even before the root access, but I think Aiko should chime in with some ideas/directions :D
[14:34:34] what's the tool to partition the gpu?
[14:35:27] aiko: 2 or 8 partitions?
[14:35:36] it is called amd-smi, I left some notes in https://phabricator.wikimedia.org/T403697
[14:36:16] to use it we need to live-hack it though, see https://github.com/ROCm/amdsmi/issues/132
[14:36:42] upstream didn't repro yet, not sure why
[14:39:48] should the partitioning happen before the kubelet starts?
[14:39:53] or doesn't it matter?
[14:42:45] if it doesn't matter i guess puppet could run it, but workloads do need correct partitioning, so perhaps it's good if it happens before the kubelet and any workload starts. perhaps a systemd unit that is a dependency for anything that has to come after is easiest
[14:43:16] the unit can be just a one-liner that partitions the gpu
[14:47:27] I haven't tested it, but in theory the amd gpu plugin picks up the partitioning changes and publishes them to the kubelet, which then sends them to the kube scheduler
[14:47:44] I am not 100% sure about the first bit though, maybe the plugin needs a restart
[14:47:53] but from my tests I didn't have to do it
[14:49:09] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11335930 (Ottomata) >> Could we go with page_paragraph_tone_scores? > I think...
[14:54:29] kk
[14:54:39] is puppet running on a scheduled basis on these hosts?
[14:55:45] Puppet runs on every host every 30 minutes
[14:56:31] (except when we explicitly disable it for maintenance)
[14:57:05] hmmm the puppet unit seems to be a oneshot, is that 30 min run managed elsewhere?
[14:57:33] nvm my statement about the unit, that's a different thing
[14:58:47] it's started via a systemd timer, see modules/profile/manifests/puppet/agent.pp
[15:05:32] gotcha
[15:12:44] seems that the timer is configured to run about 1 min after boot, besides the 30 min schedule, which is nice. but overall we could ship another systemd unit via puppet to make sure that we partition the gpu on boot
[15:13:36] when i get root i can start rolling with this
[15:40:25] aiko: which services can be used to test these nodes?
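
The boot-time partitioning idea discussed above (a oneshot systemd unit ordered before the kubelet and shipped via puppet) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `kubelet.service` unit name and the `/usr/local/sbin/partition-gpus` wrapper path are hypothetical, and the actual amd-smi invocation is deliberately left out since it is not given in the discussion.

```python
from textwrap import dedent


def render_partition_unit(
    partition_command: str,
    before_units: tuple[str, ...] = ("kubelet.service",),
) -> str:
    """Render a oneshot systemd unit that runs `partition_command` at boot.

    `partition_command` is a hypothetical absolute path to a wrapper that
    applies the GPU partitioning; the kubelet unit name is an assumption.
    """
    before = " ".join(before_units)
    return dedent(f"""\
        [Unit]
        Description=Apply GPU partitioning before Kubernetes workloads start
        # Ordering only: start this unit before the listed units.
        Before={before}

        [Service]
        # Run once and stay "active" so dependants see it as satisfied.
        Type=oneshot
        RemainAfterExit=yes
        ExecStart={partition_command}

        [Install]
        WantedBy=multi-user.target
        # Hard dependency: the listed units are not started if this fails.
        RequiredBy={before}
        """)


if __name__ == "__main__":
    # Hypothetical wrapper script path, for illustration only.
    print(render_partition_unit("/usr/local/sbin/partition-gpus"))
```

`Before=` only controls ordering, while `RequiredBy=` turns the unit into a real dependency, so the kubelet would not start on a host whose partitioning failed. Whether that strictness is wanted is a design choice for the test plan.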
[15:48:11] as a side note i wonder if we could keep ml servers off the production network and just streamline access and workflow for the rest of the team
[15:52:53] dpogorzelski: we can start with this llm model server we have https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/src/models/llm/ which can load and test different llms
[15:53:56] we previously only used it in the experimental namespace, e.g. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/de08590e5e6c60e4ccd22461675d65cf5eb1bb8b/helmfile.d/ml-services/experimental/values-ml-staging-codfw.yaml#59
[15:54:41] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, PersonalDashboard: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11336193 (DMburugu)
[15:56:55] elukey: do we need a new version of the amd-pytorch image for ROCm 7.0.2?
[15:57:34] aiko: good point, the newer the drivers the better, but we can start with what we have and see how it goes
[15:57:52] in theory on the pods we'll just see a gpu offering X VRAM
[15:58:59] IIRC Ilias mentioned the vLLM image but that is still blocked by https://phabricator.wikimedia.org/T394778
[16:01:00] aiko: 2 or 8 partitions?
[16:02:14] also is that already built and available in the registry?
[16:02:29] nvm, it is
[16:02:57] Machine-Learning-Team, Essential-Work, Patch-For-Review: Update kserve to 0.15.2 - https://phabricator.wikimedia.org/T367048#11336250 (elukey) Hey folks, to re-iterate another time - please stop making manual configurations to these hosts and puppetize what is needed before proceeding further. The pu...
[16:10:03] do any of the machines we use as build machines have the capability of pushing to the registry? or is that still not possible?
[16:11:18] dpogorzelski: https://phabricator.wikimedia.org/T394778
[16:11:34] it is not possible yet, there are some security issues to address before being able to do it
[16:12:00] I left some ideas in the task, but a formal proposal needs to be written and submitted to the k8s sig etc.
[16:12:35] (you probably want to join #wikimedia-k8s-sig, it is the group of folks working on k8s. We meet once a month, I can add you to the meeting invite)
[16:18:22] i saw it in my calendar i think but i declined, timing wise unfortunately it doesn't work for me
[16:19:06] ah snap no problem, you can read the notes and post in the chan
[16:26:43] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11336471 (DMburugu)
[17:09:12] sorry was in meetings. we could test both 2 and 8 partitions
[17:32:19] ok we need to create a test plan for GPUs/LLM testing
[17:32:27] this is also helpful for the coming suggestions mode project
[17:33:07] we have this task https://phabricator.wikimedia.org/T403599
[17:34:44] I'll prepare something to discuss at Wednesday's meeting
[17:40:00] btw found this interesting article today: https://publish.obsidian.md/felafax/pages/Tune+Llama3+405B+on+AMD+MI300x+(our+journey)
[19:03:37] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11337255 (Eevans) >>! In T401021#11335930, @Ottomata wrote: >>> Could we go wi...
[19:53:23] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11337441 (Ottomata) > If that is not the case then I think we have to also con...
[21:45:28] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11337888 (Eevans) Fyi, I've rearranged some of what I'm quoting here (I hope t...
[23:21:20] (PS2) Tim Starling: Remove unused WatchedItemQueryService hooks [extensions/ORES] - https://gerrit.wikimedia.org/r/1200750 (https://phabricator.wikimedia.org/T407087)