
Automatically generate training config files with the task config-generator #620

Merged 7 commits into mozilla:main on May 24, 2024

Conversation

@gregtatum (Member) commented May 21, 2024

This PR adds automatic generation of training configs based on the production config.
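As a rough sketch of the idea (the function and config names here are hypothetical, not the PR's actual implementation): load the parsed production config, substitute the language pair, and derive a per-pair training config from it.

```python
import copy

# Hypothetical stand-in for taskcluster/configs/config.prod.yml after YAML parsing.
PROD_CONFIG = {
    "experiment": {"name": "prod", "src": "ru", "trg": "en"},
    "datasets": {"train": [], "devtest": [], "test": []},
}

def generate_config(src: str, trg: str) -> dict:
    """Derive a per-language-pair training config from the production config."""
    config = copy.deepcopy(PROD_CONFIG)  # never mutate the shared prod config
    config["experiment"]["src"] = src
    config["experiment"]["trg"] = trg
    config["experiment"]["name"] = f"{src}-{trg}"
    return config

config = generate_config("en", "lt")
print(config["experiment"]["name"])  # en-lt
```

The real generator also fills in dataset lists per language pair; this only shows the load-override-emit shape.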

@gregtatum force-pushed the util-config-gen branch 4 times, most recently from 13bef4d to e1fc4fe on May 23, 2024 16:30
@gregtatum marked this pull request as ready for review May 23, 2024 16:31
@gregtatum requested a review from a team as a code owner May 23, 2024 16:31
@gregtatum requested review from jcristau and eu9ene May 23, 2024 16:31
prod_config_path = root_dir / "taskcluster/configs/config.prod.yml"

pretrained_student_models = {
("ru", "en"): "https://storage.googleapis.com/releng-translations-dev/models/ru-en/better-teacher/student"
Collaborator:

This is a great idea btw! We just need to prepare those models (copy vocab etc)

Collaborator:

@eu9ene should ask @bhearsum where to find recently trained models

Collaborator:

OK, so the most recent ones are in the tasks, and slightly less recent ones are in https://console.cloud.google.com/storage/browser/moz-fx-translations-data--303e-prod-translations-data/models/en-hu/. I see that a model like https://console.cloud.google.com/storage/browser/moz-fx-translations-data--303e-prod-translations-data/models/en-hu/baseline_enhu_aY25-4fXTcuJNuMcWXUYtQ/student-finetuned?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&project=moz-fx-translations-data--303e needs a vocab copied into its folder to be used for training continuation, since we only recently started copying it in the pipeline.

Collaborator:

Your GCP account has credentials to do this, FYI. That's going to be the quickest way to unblock this.
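One way to do the copy is a one-off gsutil invocation. A minimal sketch, assuming the vocab lives under a sibling path in the same bucket (the `vocab/vocab.spm` source path is illustrative, not a confirmed layout):

```python
# Build the gsutil command that copies a vocab file next to a pretrained
# student model so it can be used for training continuation.
def vocab_copy_command(model_dir: str, vocab_src: str) -> list[str]:
    """Return the gsutil invocation that puts vocab.spm inside model_dir."""
    return ["gsutil", "cp", vocab_src, f"{model_dir.rstrip('/')}/vocab.spm"]

cmd = vocab_copy_command(
    "gs://moz-fx-translations-data--303e-prod-translations-data/models/en-hu/baseline_enhu_aY25-4fXTcuJNuMcWXUYtQ/student-finetuned",
    # Hypothetical source location for the vocab within the same model folder.
    "gs://moz-fx-translations-data--303e-prod-translations-data/models/en-hu/baseline_enhu_aY25-4fXTcuJNuMcWXUYtQ/vocab.spm",
)
# Execute with subprocess.run(cmd, check=True) once GCP credentials are set up.
```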

- opus_NeuLab-TedTalks/v1 # 111,107 sentences
- opus_ECB/v1 # 63,716 sentences
- opus_bible-uedin/v1 # 62,151 sentences
- opus_WMT-News/v2019 # 44,859 sentences
Collaborator:

We should remove https://opus.nlpl.eu/WMT-News. This is a dev/test set.

Collaborator:

Actually, for some languages it's pretty big, so maybe it's not the test set, but it's not clear from https://www.statmt.org/wmt19/translation-task.html whether an extra dataset was released that can be used for training.

Collaborator:

sacrebleu wmt19 for en-lt is 1,000 lines long and this one is 5,998, so it's likely something else.

Collaborator:

There's also:

  - mtdata_Statmt-newsdev_enlt-2019-eng-lit # ~402,756 sentences (45.5 MB)

@gregtatum (Member Author):

I wonder if we should have a step that removes any devtest or test datasets from the training data, so we make sure we're not lying to ourselves.
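The step described above could be sketched roughly like this (function and dataset names are illustrative, not the PR's code): strip anything that appears under devtest or test out of the training list, so evaluation data never leaks into training.

```python
# Remove held-out dev/test datasets from the training set of a config.
def deduplicate_train(datasets: dict) -> dict:
    held_out = set(datasets.get("devtest", [])) | set(datasets.get("test", []))
    cleaned = dict(datasets)  # shallow copy; only "train" is replaced
    cleaned["train"] = [d for d in datasets["train"] if d not in held_out]
    return cleaned

datasets = {
    "train": ["opus_ECB/v1", "sacrebleu_wmt19"],
    "devtest": ["sacrebleu_wmt19"],
    "test": [],
}
print(deduplicate_train(datasets)["train"])  # ['opus_ECB/v1']
```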

Collaborator:

That's what I usually do manually. I don't think those datasets are generally on OPUS, but some of them might be on mtdata. There's a dataset that we never use for evals that I left in the training data, but maybe it's also worth removing:

  - mtdata_UN-un_dev-1-eng-rus
  - mtdata_UN-un_test-1-eng-rus

@gregtatum (Member Author):

These are now automatically being moved to devtest and test.

Collaborator:

Is it possible to add the number of sentences for the sacrebleu and flores datasets, plus a total for the devset? We don't want the devset to be too big and slow down training.
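Annotating the devset with counts and a total could look like this, assuming sentence counts have already been collected (the wmt19/dev count below is illustrative; 997 is the flores dev size):

```python
# Emit devset config lines annotated with sentence counts, plus a total,
# so an oversized devset is easy to spot at a glance.
def annotate_devset(devset: list[str], sizes: dict[str, int]) -> list[str]:
    lines = [f"- {name} # {sizes[name]:,} sentences" for name in devset]
    total = sum(sizes[name] for name in devset)
    lines.append(f"# total devset: {total:,} sentences")
    return lines

sizes = {"flores_aug-mix_dev": 997, "sacrebleu_aug-mix_wmt19/dev": 2000}
for line in annotate_devset(["flores_aug-mix_dev", "sacrebleu_aug-mix_wmt19/dev"], sizes):
    print(line)
```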

@gregtatum gregtatum requested a review from eu9ene May 24, 2024 18:44
@eu9ene (Collaborator) left a comment:

There's still some uncertainty about dev/test datasets. We should investigate why mtdata_Neulab-tedtalks_test reports 3M sentences and add dataset sizes to dev/test datasets to understand what to include.

- opus_ELRC-3047-wikipedia_health/v1 # 205 sentences
- opus_ELRC-wikipedia_health/v1 # 205 sentences
- opus_ELRC_2922/v1 # 204 sentences
- mtdata_Neulab-tedtalks_test-1-eng-bos # ~3,117,009 sentences (352.2 MB)
Collaborator:

This is super weird. Without the size info, I would use this dataset for test. We should investigate why it's 3M sentences
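Once sizes are attached to dev/test datasets, a simple sanity check could flag entries like this one automatically. A sketch (the threshold is an arbitrary guess, not a project constant):

```python
# Flag dev/test datasets whose reported size is implausibly large for an
# evaluation set. Eval sets are typically a few thousand sentences at most.
MAX_EVAL_SENTENCES = 20_000  # arbitrary cutoff for illustration

def suspicious_eval_sets(sizes: dict[str, int]) -> list[str]:
    return [name for name, n in sizes.items() if n > MAX_EVAL_SENTENCES]

sizes = {
    "mtdata_Neulab-tedtalks_test-1-eng-bos": 3_117_009,
    "flores_aug-mix_dev": 997,
}
print(suspicious_eval_sets(sizes))  # ['mtdata_Neulab-tedtalks_test-1-eng-bos']
```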

- mtdata_Neulab-tedtalks_dev-1-eng-rus
- mtdata_UN-un_dev-1-eng-rus
- flores_aug-mix_dev
- sacrebleu_aug-mix_mtedx/test
Collaborator:

This dataset was not found for en-ru... Could it be because we use a different version of sacrebleu in find-corpus? https://firefox-ci-tc.services.mozilla.com/tasks/dKusTBI1Rg-UeXIJz_0amQ/runs/0/logs/public/logs/live.log

- sacrebleu_aug-mix_wmt18
- sacrebleu_aug-mix_wmt17
- sacrebleu_aug-mix_wmt15
- sacrebleu_aug-mix_wmt14/full
Collaborator:

I would not include those older ones unless only flores is available. I did this for en-lt, for example:

  devtest:
  - flores_aug-mix_dev
  - sacrebleu_aug-mix_wmt19/dev
  - mtdata_aug-mix_Neulab-tedtalks_dev-1-eng-lit

I guess we can either comment out all such datasets or just leave it up to the user to remove them.
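The comment-out option could be as simple as prefixing matching entries when emitting the config. A sketch, with a hypothetical predicate for older WMT sets:

```python
# Emit dataset lines, commenting out the ones a predicate flags so the user
# can opt back in by uncommenting rather than re-discovering the dataset.
def comment_out(datasets: list[str], should_comment) -> list[str]:
    return [f"# - {d}" if should_comment(d) else f"- {d}" for d in datasets]

# Illustrative predicate: flag wmt14 through wmt18 entries.
old_wmt = lambda d: any(f"wmt1{y}" in d for y in "45678")
print(comment_out(["flores_aug-mix_dev", "sacrebleu_aug-mix_wmt15"], old_wmt))
# ['- flores_aug-mix_dev', '# - sacrebleu_aug-mix_wmt15']
```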

@eu9ene (Collaborator) left a comment:

As discussed, we'll continue iterating in follow-up PRs including manual modification of the training configs.

IMPORTANT: The generated configs are not ready for production, as some issues around dev/test datasets are not addressed and require manual modification.

@gregtatum (Member Author):

I made a meta issue to track the remaining issues: #633

@gregtatum gregtatum merged commit 56040c9 into mozilla:main May 24, 2024
6 checks passed
gabrielBusta pushed a commit that referenced this pull request Jun 13, 2024
Automatically generate training config files with the task `config-generator` (#620)

* Create a util to automatically generate configs

* Add the generated configs

* Update the config generation script

* Update the configs

* Update the configs

* Address review comments for the config generator

* Fix find_corpus test