[05:39:42] good morning!
[06:16:55] Good morning
[06:42:41] morning folks!
[07:05:20] morniiiing
[07:23:14] o/ georgekyz
[07:23:47] I have a follow-up question after following the convo on slack about s3 etc
[07:24:28] or maybe a couple
[07:24:49] why do we need to download the existing model to start a new training run? since we gather the new data the easiest thing to do is to train a new model from scratch each time
[07:26:36] regarding the models in https://analytics.wikimedia.org/published/wmf-ml-models these don't have anything to do with swift/s3, it is an additional step that happens during model upload. We upload to swift & we publish to the public repo (more info here https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Deploy#How_to_upload_a_model_to_Swift)
[07:32:24] When I started working on this initially the idea was to create a new model which means that we download the model inside the docker image. That produced many issues:
[07:32:24] 1. There were errors during downloading the image inside the container
[07:32:24] 2. The image ended up huge
[07:33:40] So after syncing up with Aiko and Kevin we thought that we needed to avoid doing this and just download the model using the s3 client.
[07:34:27] why do we need the old model? we just need the data and then we train a new model
[07:37:32] https://www.irccloud.com/pastebin/36kUCRRb/
[07:38:11] because the training starts using `AutoModelForSequenceClassification.from_pretrained()`
[07:39:14] oh got it. you're right. never mind, thanks for explaining
[07:39:48] I was thinking we were trying to download the edit check model but we're trying to download mbert
[07:40:15] The model name could be either a model from huggingface (as it is in the notebook where the edit-check model was initially trained using the aya model), or it needs to be a local model. Unfortunately we cannot start initially by downloading the aya model inside the container because it crashes, and then it is too large to be pushed.
So we need to find a way to have a model downloaded "locally on airflow" on the fly during the pipeline
[07:42:28] isaranto: Oh yes that was probably my bad on how I communicated... I am trying to download the LLM safetensors binaries from https://analytics.wikimedia.org/ not the edit-check model-server
[07:42:57] I will update after the sync meeting with Balthazar today
[07:44:43] Machine-Learning-Team: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301#10957324 (BWojtowicz-WMF) I've started work on this ticket and I've reimplemented the bash script in Python, where I take advantage of `boto3` to handle connection t...
[07:48:18] ^ If you have some free time, I'd love to hear your takes on those questions: https://phabricator.wikimedia.org/T394301#10957324
[07:56:21] bartosz: I will leave a comment on the ticket.
[07:56:36] georgekyz: to train a new edit check model you'd need the base model which in this case is https://huggingface.co/google-bert/bert-base-multilingual-cased . So we would need to add that to swift and then download it in the airflow task
[07:56:44] It doesn't need to be in analytics.wikimedia.org. For production we should not rely on analytics.wikimedia.org
[07:57:38] I don't get where the aya model fits in this context unless I'm still missing something
[07:58:47] let me know if I can help to clear things regarding analytics, swift etc. I'm available for a quick call if you want as well
[08:04:23] Machine-Learning-Team: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301#10957356 (gkyziridis) Thnx for working on this initiative. I will share my thoughts which maybe answer some of your questions. 1. You can always test your python sc...
[08:07:14] isaranto: Aya indeed has nothing to do with it, I just confused the notebooks :P.
[08:07:14] The pretrained base model that is used for edit-check training is "markussagen/xlm-roberta-longformer-base-4096" right?
[08:08:03] This is the notebook: https://gitlab.wikimedia.org/repos/research/llm_evaluation/-/blob/ait/eval-datasets/notebooks/baseline-exp/binary_classification_lm.ipynb?ref_type=heads
[08:09:33] iirc the one we have now in prod is based on bert-base-multilingual-cased which is mentioned in that same notebook. aiko am I right?
[08:10:43] isaranto: I added you in the meeting today with Balthazar. I am not sure if that fits your schedule so I added you as optional. Feel free to join
[08:12:36] ok thanks but I won't make it to that meeting
[08:36:57] George and I just had a quick sync about airflow/s3. Thanks georgekyz for clearing things up for me!
[08:44:39] Machine-Learning-Team: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301#10957517 (elukey) Hi @BWojtowicz-WMF! Thanks for working on this :) My 2c: the script should be available only for ml-admins (namely, all members of your team) beca...
[09:00:15] Machine-Learning-Team, Add-Link, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10957578 (OKarakaya-WMF) I share an inconsistency between the repos here which I think leads to 5% difference in training positives....
[09:01:39] Machine-Learning-Team: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301#10957583 (BWojtowicz-WMF) Thank you both for the answers, this helps a lot! @elukey @gkyziridis I'll try to make the script as self-contained as possible, ideally...
[09:09:48] Machine-Learning-Team, Add-Link, Growth-Team, Goal: FY2024-25 Q4 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10957609 (Aklapper)
[09:36:41] isaranto: yes, the one in prod is based on mbert. the notebook has training code for more than one model
[09:36:54] georgekyz: there is some info here https://phabricator.wikimedia.org/T388211
[09:37:57] aiko: thank youuuu
[09:39:51] thanks aiko! I suggest that we create a clear notebook that has all the steps we'd like to put in airflow
[09:48:58] I agree!
[10:47:40] Machine-Learning-Team, Data-Platform-SRE (2025.06.13 - 2025.07.04): Create a connection in the airflow-ml instance that permits access to the current thanos-swift user and its buckets - https://phabricator.wikimedia.org/T398186 (BTullis) NEW
[10:47:51] Machine-Learning-Team, Data-Platform-SRE (2025.06.13 - 2025.07.04): Create a connection in the airflow-ml instance that permits access to the current thanos-swift user and its buckets - https://phabricator.wikimedia.org/T398186#10958011 (BTullis) p:Triage→High
[11:14:23] Machine-Learning-Team, Add-Link, Growth-Team, Goal: FY2024-25 Q4 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10958106 (OKarakaya-WMF) Update for the issue above is [here](https://gitlab.wikimedia.org/repos/research/research-datasets/-/com...
[11:24:45] Machine-Learning-Team, Data-Platform-SRE (2025.06.13 - 2025.07.04): Create a connection in the airflow-ml instance that permits access to the current thanos-swift user and its buckets - https://phabricator.wikimedia.org/T398186#10958133 (BTullis) a:BTullis→brouberol
[12:15:56] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10958319 (Jclark-ctr)
[12:37:14] Machine-Learning-Team, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Create a connection in the airflow-ml instance that permits access to the current thanos-swift user and its buckets - https://phabricator.wikimedia.org/T398186#10958406 (brouberol) The egress traffic between airflow...
[12:53:12] Machine-Learning-Team, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Create a connection in the airflow-ml instance that permits access to the current thanos-swift user and its buckets - https://phabricator.wikimedia.org/T398186#10958455 (brouberol) ` airflow@airflow-scheduler-cd68f6...
[12:56:31] Machine-Learning-Team, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Create a connection in the airflow-ml instance that permits access to the current thanos-swift user and its buckets - https://phabricator.wikimedia.org/T398186#10958480 (brouberol) @gkyziridis This is what you need...
[12:56:41] Machine-Learning-Team, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Create a connection in the airflow-ml instance that permits access to the current thanos-swift user and its buckets - https://phabricator.wikimedia.org/T398186#10958482 (brouberol) Open→Resolved
[12:59:41] Machine-Learning-Team, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Create a connection in the airflow-ml instance that permits access to the current thanos-swift user and its buckets - https://phabricator.wikimedia.org/T398186#10958497 (gkyziridis) >>! In T398186#10958480, @bro...
[13:25:10] Machine-Learning-Team, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Create a connection in the airflow-ml instance that permits access to the current thanos-swift user and its buckets - https://phabricator.wikimedia.org/T398186#10958578 (brouberol) Pleasure :)
[16:02:16] * isaranto afk!
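
Note: a minimal sketch of the approach discussed in the morning conversation, where the base model files are fetched from Swift through an S3-style client inside the airflow task and then loaded from a local directory with `AutoModelForSequenceClassification.from_pretrained()`. The helper name `download_prefix`, the bucket, the prefix, the endpoint and the local path are all hypothetical and do not come from the log. The `client` argument is assumed to be a boto3 S3 client (or anything exposing the same `list_objects_v2` and `download_file` methods). This is not the team's actual pipeline code.

```python
from pathlib import Path


def download_prefix(client, bucket: str, prefix: str, dest_dir: str) -> list[str]:
    """Download every object under `prefix` into `dest_dir` and return the local paths.

    `client` is assumed to behave like a boto3 S3 client.
    """
    dest = Path(dest_dir)
    downloaded = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = client.list_objects_v2(**kwargs)
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Path of the object relative to the prefix, e.g. "config.json".
            relative = key[len(prefix):].lstrip("/")
            # Skip "directory" placeholder keys that carry no file content.
            if not relative or key.endswith("/"):
                continue
            target = dest / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            client.download_file(bucket, key, str(target))
            downloaded.append(str(target))
        # Listings are paginated; follow the continuation token if there is one.
        if not page.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
    return downloaded


# Hypothetical usage inside a task (needs boto3 and transformers installed;
# endpoint, bucket, prefix and path below are made up for illustration):
#
#   import boto3
#   from transformers import AutoModelForSequenceClassification
#
#   client = boto3.client("s3", endpoint_url="https://swift.example.invalid")
#   download_prefix(client, "example-bucket", "bert-base-multilingual-cased/", "/tmp/mbert")
#   model = AutoModelForSequenceClassification.from_pretrained("/tmp/mbert")
```

Passing the client in as an argument, rather than creating it inside the helper, keeps credentials and endpoint configuration with the caller (for example an airflow connection) and lets the helper be exercised with a stub client.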