[06:54:45] good morning.
[06:58:02] good morning!
[07:06:56] good morning :)
[08:15:43] (PS9) Bartosz Wójtowicz: outlink-topic-model: Merge transformer and predictor pods. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1187739 (https://phabricator.wikimedia.org/T404294)
[08:20:02] ^ the patch for combining the transformer and predictor pods is ready for review again. I have kept the transformer code as-is along with its blubber setup to keep the CI happy, but the preprocessing functionality has already been added to the predictor part, so the whole service can be run as a single pod.
[09:00:04] can someone review that please?
[09:00:48] bartosz: iiuc in https://phabricator.wikimedia.org/T401778 we need to provide a final summary of the discussion so that the Data Persistence team can proceed. is that correct?
[09:05:11] I can review the pipeline bits, but I'd rather have an MLE look at the Python parts. I know Python well enough, but not necessarily FastAPI et al.
[09:06:31] isaranto: Yes, I'm working on the final design proposal including all discussed points
[09:07:09] ack, thanks!
[09:08:33] (CR) Klausman: [C:+1] "LGTM for everything about the Python source changes, I defer to proper MLEs for review on those. 😊" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1187739 (https://phabricator.wikimedia.org/T404294) (owner: Bartosz Wójtowicz)
[09:21:38] klausman: o/ when you have a moment could you complete the rollout of https://phabricator.wikimedia.org/T398600 ?
[09:21:53] not urgent, even next week
[09:22:02] ack, will do
[09:22:43] super thanks
[09:25:02] isaranto: I think there's still one thing I'd like to discuss with the team. I was thinking about today's ML meeting, but we can also discuss it in IRC - using page_id vs page_title in the article topic model. We've talked about this and I explored it - technically it's easy to modify the model code to use page_id instead of the title when searching for outlinks.
[09:25:18] I'm wondering if we should do it before introducing caching as well - this would allow us to use page_id as the cache index, and it'd be easier to do backfilling as the current hive snapshots also use page_id as the index
[09:25:46] there are also questions about how we would roll out this change: do we want to support both the page_id and page_title parameters, or only page_id?
[09:33:06] bartosz: I don't think there is a reason to change the existing functionality. We can just allow both options via different POST arguments (page_title & page_id), so there would be no need to migrate existing users
[09:34:37] klausman: the kernel 6.16 should be available in backports for trixie, ok if I reimage ml-serve1012 to clean up the current state?
[09:35:03] I wrote a comment on that task a long while ago: https://phabricator.wikimedia.org/T371021#10170457. We can change the title & description of the task to reflect that we are not switching. If you tested it, please add your input on the task and we can go ahead and implement that
[09:35:17] https://packages.debian.org/trixie-backports/linux-image-amd64
[09:35:25] elukey: yep, sgtm
[09:41:40] isaranto: I see, will add a comment there! I can see one potential downside of this approach, not sure yet how big it is - for every request using page_title, we'd need to do 1 additional query to mwapi to get the page_id from the page_title so that we can use it in the cache. If YiR queries based on page_title, this additional query could slow down our throughput.
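For context on the trade-off discussed above, here is a minimal sketch of how a service could accept either page_id or page_title while always keying its cache on page_id, paying one extra MediaWiki API query only when a title is supplied. This is illustrative only: the function names, cache object, and payload shape are assumptions, not the actual outlink-topic-model code.

```python
# Illustrative sketch only, not the actual inference-services implementation.
# Shows one way to accept either page_id or page_title while keying the cache
# on page_id; resolving a title costs one extra MediaWiki API round trip.
import requests

CACHE = {}  # stand-in for the real cache backend under discussion (T402984)


def resolve_page_id(wiki_host: str, page_title: str) -> int:
    """Look up the page_id for a title via the MediaWiki Action API."""
    resp = requests.get(
        f"https://{wiki_host}/w/api.php",
        params={"action": "query", "titles": page_title, "format": "json"},
        timeout=5,
    )
    resp.raise_for_status()
    pages = resp.json()["query"]["pages"]
    # The "pages" object is keyed by page_id ("-1" means the title does not exist).
    return int(next(iter(pages)))


def get_topics(wiki_host: str, page_id: int | None = None, page_title: str | None = None) -> dict:
    if page_id is None:
        if page_title is None:
            raise ValueError("either page_id or page_title is required")
        page_id = resolve_page_id(wiki_host, page_title)  # the extra query
    if page_id in CACHE:
        return CACHE[page_id]  # cache hit: no further requests at all
    # Placeholder for fetching outlinks by page_id and running the model.
    prediction = {"page_id": page_id, "topics": []}
    CACHE[page_id] = prediction
    return prediction
```

If the caller queries by page_id directly, the title lookup is never needed and a cache hit can be served with no outgoing requests at all, which matches the goal mentioned just below.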
[09:49:12] let's coordinate with the Apps team and use whatever they are going to use
[09:50:08] when using the cache we don't want to make any requests at all
[09:52:23] Lift-Wing, Machine-Learning-Team: [articletopic-outlink] fetch data from mwapi using revid instead of article title - https://phabricator.wikimedia.org/T371021#11184095 (BWojtowicz-WMF) I've tested the option to use `page_id` in the model and found out that it's straightforward to modify the current outl...
[09:53:56] isaranto: I agree, will ask under the YiR goal ticket
[09:56:08] thanks! you can ping Dbrant, I think he is the lead engineer on that project
[10:10:41] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review - https://phabricator.wikimedia.org/T392833#11184216 (BWojtowicz-WMF) Hello @Dbrant! We have 1 technical question about the way the Apps side will query our LiftWing model to retriev...
[11:07:22] * klausman lunch
[12:02:17] ml-serve1012 seems stuck in booting and a powercycle gets stuck as well..
[12:02:20] I'll check after lunch sigh
[12:27:13] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Enable volunteer evaluation of Tone Check model in additional languages - https://phabricator.wikimedia.org/T400423#11184761 (gkyziridis) ==Update== **Datasets uploaded for the following wikis:** | Wiki | Project number | Translations | Labels ad...
[12:54:22] Machine-Learning-Team: Fix CI/CD on ml-pipelines repository - https://phabricator.wikimedia.org/T404717 (gkyziridis) NEW
[13:54:26] Machine-Learning-Team, Essential-Work: Incorporate notebook into Tone-Check data generation ml-pipeline - https://phabricator.wikimedia.org/T404722 (kevinbazira) NEW
[14:05:19] Machine-Learning-Team, Essential-Work: Incorporate notebook into Tone-Check data generation ml-pipeline - https://phabricator.wikimedia.org/T404722#11185239 (kevinbazira) Since the test/dev iteration cycles take a really long time, I added a development limit ([[ https://gitlab.wikimedia.org/kevinbazira/ml...
[14:13:19] Machine-Learning-Team, Essential-Work: Incorporate notebook into Tone-Check data generation ml-pipeline - https://phabricator.wikimedia.org/T404722#11185284 (kevinbazira) I ended up removing the dev-limits as the small sample size results in no rows making it through the end of the pipeline as shown belo...
[14:40:00] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Airflow training pipeline for Tone check model - https://phabricator.wikimedia.org/T398970#11185395 (kevinbazira) Started working on tone-check data generation job logic in T404722: * test/dev iteration cycles take a really long time * added logs at major p...
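As a side note on the dev-limit discussion in T404722 above, a tiny sketch of the trade-off: down-sampling speeds up test/dev iterations, but after aggressive filtering a small sample can leave zero rows for the rest of the pipeline. The function and filter below are hypothetical, not the actual ml-pipelines code.

```python
# Hypothetical illustration of the dev-limit trade-off, not the ml-pipelines code.
import random


def run_pipeline(rows: list[dict], dev_limit: int | None = None) -> list[dict]:
    """Optionally down-sample the input for faster dev runs, then filter."""
    if dev_limit is not None:
        rows = random.sample(rows, min(dev_limit, len(rows)))
    # Stand-in for the real tone-check filtering steps: with the full dataset
    # enough rows survive, but with a small dev_limit the same filters can
    # easily leave nothing for the downstream steps to work on.
    return [r for r in rows if r.get("label") is not None and len(r.get("text", "")) >= 200]
```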
[14:44:29] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11185422 (Eevans)
[14:44:51] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11185425 (Eevans)
[15:40:40] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review - https://phabricator.wikimedia.org/T392833#11185679 (Ottomata) @BWojtowicz-WMF we should probably sync up about this kind of requirement (and also data modeling when you work on...
[15:50:12] Machine-Learning-Team: Experiment with amd-smi and the new AMD GPUs MI300x - https://phabricator.wikimedia.org/T403697#11185767 (elukey) @klausman ml-serve1012 is up and running with 6.16 from backports, and nvtop seems to work without horrors in the dmesg. Also please note that `rocm-smi` is now `/opt/rocm-...
[19:24:46] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review - https://phabricator.wikimedia.org/T392833#11186679 (Dbrant) >>! In T392833#11184216, @BWojtowicz-WMF wrote: > To make sure we optimize our solution for Year in Review processing...