[02:45:45] artificial-intelligence, Machine-Learning-Team, Data-Engineering, Data-Engineering-Icebox, draftquality-modeling: Productionize monthly article quality prediction datasets - https://phabricator.wikimedia.org/T194741#10447039 (Ottomata)
[02:51:42] artificial-intelligence, Machine-Learning-Team, ORES, Data-Engineering, and 2 others: [Investigate] Use PMML for prediction model serialization - https://phabricator.wikimedia.org/T173244#10447117 (Ottomata)
[02:52:38] Machine-Learning-Team, ORES, Data-Engineering, Data-Engineering-Icebox: Emit synthetic mediawiki.revision-score events for both datacenters - https://phabricator.wikimedia.org/T214545#10447132 (Ottomata)
[02:52:54] Machine-Learning-Team, ORES, Data-Engineering, Data-Engineering-Icebox: Purge ORES scores from Hadoop and begin backfill when model version changes - https://phabricator.wikimedia.org/T209742#10447134 (Ottomata)
[02:53:00] artificial-intelligence, Machine-Learning-Team, ORES, Data-Engineering, and 4 others: Decide whether we will include raw features - https://phabricator.wikimedia.org/T211069#10447133 (Ottomata)
[02:53:06] Machine-Learning-Team, ORES, Data-Engineering, Data-Engineering-Icebox: Include feature values in ORES changeprop stream - https://phabricator.wikimedia.org/T209734#10447136 (Ottomata)
[02:53:12] Machine-Learning-Team, ORES, Data-Engineering, Data-Engineering-Icebox: Wire ORES recent_score events into Hadoop - https://phabricator.wikimedia.org/T209732#10447137 (Ottomata)
[02:53:19] Machine-Learning-Team, ORES, Data-Engineering, Data-Engineering-Icebox, Dumps-Generation: Produce dump files for ORES scores - https://phabricator.wikimedia.org/T209739#10447135 (Ottomata)
[02:53:39] artificial-intelligence, Machine-Learning-Team, ORES, [DEPRECATED] wdwb-tech, and 6 others: [Epic] Make ORES scores for wikidata available as a dump - https://phabricator.wikimedia.org/T209611#10447138 (Ottomata)
[02:54:05] artificial-intelligence, Machine-Learning-Team, Data-Engineering, Data-Engineering-Icebox, draftquality-modeling: Productionize monthly article quality prediction datasets - https://phabricator.wikimedia.org/T194741#10447145 (Ottomata) Open→Declined
[07:11:05] Guten tag!
[07:30:56] isaranto: o/ kaliméra
[07:31:13] trying to understand T371344#10437036, the image hosting LLMs on LW targets rocm6.1: https://github.com/wikimedia/machinelearning-liftwing-inference-services/blob/main/.pipeline/llm/blubber.yaml#L3
[07:31:13] the fa2 wheel built on ml-lab also targets rocm6.1, torch 2.5.1+rocm6.1 works on LW but the fa2 wheel doesn't?
[07:46:12] Exactly. It doesn't work and the reason may be that liftwing has rocm 5.4 installed
[09:09:13] I created a patch to add the reference quality models to the API GW https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1109666
[09:09:34] I was pinged about this task from enterprise https://phabricator.wikimedia.org/T378495
[09:09:40] they are ready to consume it
[09:26:55] Morning!
[09:27:06] isaranto: I'll get to the apigw change in a minute
[09:35:12] Tobias o/
[09:36:10] Ilias: ack, I have +1'ed
[09:36:10] on LW staging, I checked and the fa2 wheel is installed and was able to import it without errors: https://phabricator.wikimedia.org/P71956
[09:36:10] are there any errors this model-server throws when it tries to use fa2?
[09:44:43] the key difference I am seeing in the requirements is that ml-lab uses torch 2.4.1+rocm6.1: https://phabricator.wikimedia.org/P71677$48
[09:44:43] while the llm model-server on LW uses torch 2.5.1+rocm6.1: https://phabricator.wikimedia.org/P71956$105
[09:52:02] klausman: it is not urgent it can wait for monday as well
[09:52:38] I just created the change and wanted to wait for Aiko to return to make sure everything is set for these models before we give wme the green light to proceed
[09:52:55] ah, roger
[09:55:35] kevinbazira: thanks for checking that! the only difference between ml-lab and LW is that LW has rocm 5.4 drivers installed while ml-lab has 6.1. the pytorch versions are both the same
[09:55:58] I used an environment with pytorch 2.5.1 to build flash attention and not the one you mention in the paste
[09:56:17] this is the wheel that is used https://github.com/isaranto/flash-attention/releases/tag/v2.7.0-py3.11
[09:57:51] so we'd need to upgrade the ROCm version on LW and see if this error goes away https://phabricator.wikimedia.org/T379052#10394897
[10:17:56] thank you for the clarification, Ilias.
[10:17:56] when inside the llm model-server container, it looks like I can't confirm the ROCm version used on LW
[10:30:09] that is true. the container doesn't have access to /opt/rocm on the nodes and running rocminfo on the container will just give the dummy results we used in https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1101491
[10:35:09] afaik access to /opt/rocm is not required since everything is set up using the k8s device plugin for the amd gpus. I think we do need to allow rocminfo execution inside the containers to be able to validate the version but it seems that it is not required (I may be wrong)
[10:35:56] bitsandbytes only needs rocminfo to extract the GPU architecture so it doesn't actually need any info other than that
[10:39:06] makes sense. other packages besides bitsandbytes might need more info ...
[10:39:44] Machine-Learning-Team, Data-Platform-SRE (2025.01.11 - 2025.01.31): Move Lab machines into analytics net for DL access and switch to homedirs on Ceph - https://phabricator.wikimedia.org/T380279#10447521 (Gehel)
[10:49:58] georgekyz: I have added you to the machine learning team on gitlab so you should be able to see the projects
[13:13:38] isaranto: thnx I can see the projects in gitlab
[13:13:45] \ο/
[13:26:02] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Build and Publish ROCm-Compatible Python Packages - https://phabricator.wikimedia.org/T381859#10448020 (MunizaA) >>! In T381859#10443645, @kevinbazira wrote: > > This process failed for bitsandbytes(P71788) wheels. The problem here is that `hip` h...
[13:43:13] Lift-Wing, Discovery-Search (Current work), Documentation: The Search/articletopic page at Wikitech appears to be out of date - https://phabricator.wikimedia.org/T382620#10448046 (Gehel) Open→Resolved
[14:20:28] Lift-Wing, Machine-Learning-Team, OKR-Work: Create event stream for article-country model-server hosted on LiftWing - https://phabricator.wikimedia.org/T382295#10448195 (isarantopoulos) Update: we are starting this work next week so we'll be providing updates on this task.
[15:04:46] going afk folks a little bit earlier today, have a nice weekend!
[16:59:07] Good morning all
[16:59:12] I slept 9.5 hours last night!
[17:04:00] Machine-Learning-Team, [DEPRECATED] wdwb-tech, API Platform, cloud-services-team, and 15 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953#10449301 (Gehel)
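
Note on the ROCm discussion above (07:31 to 10:39): the version strings quoted there (torch 2.5.1+rocm6.1, torch 2.4.1+rocm6.1) carry the ROCm build target as a local version tag, so a wheel's target can be compared against the ROCm version on a host without running rocminfo. A minimal sketch, assuming the tag always has the form `+rocmMAJOR.MINOR`; the helper names `rocm_build_target` and `host_is_older` are hypothetical and are not taken from the Lift Wing repos:

```python
import re
from typing import Optional, Tuple

# Assumption: the ROCm target appears as a "+rocmMAJOR.MINOR" local version tag,
# as in the strings quoted in the log ("2.5.1+rocm6.1").
_ROCM_TAG = re.compile(r"\+rocm(\d+)\.(\d+)")


def rocm_build_target(version: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) of the ROCm tag in a version string, or None."""
    match = _ROCM_TAG.search(version)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def host_is_older(build_target: Tuple[int, int], host_rocm: Tuple[int, int]) -> bool:
    """True when the host ROCm version is lower than the wheel's build target."""
    return host_rocm < build_target


# Values quoted in the log; the host version (5.4) is the one reported for LW.
lw_torch = rocm_build_target("2.5.1+rocm6.1")
lab_torch = rocm_build_target("2.4.1+rocm6.1")
print(lw_torch == lab_torch)            # both wheels target the same ROCm release
print(host_is_older(lw_torch, (5, 4)))  # LW host ROCm is older than the target
```

With torch installed, the same string is available as `torch.__version__`, and ROCm builds also expose the HIP version the wheel was built against as `torch.version.hip`. Neither reports the ROCm version installed on the node, which is the part that cannot be confirmed from inside the container.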