
Automatically generate training config files with the task config-generator #620

Merged 7 commits into mozilla:main on May 24, 2024

Conversation

@gregtatum (Member) commented May 21, 2024

This PR adds automatic generation of training configs based on the production config.
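As a rough sketch of the idea (the function and config names here are hypothetical, not the PR's actual implementation): load the parsed production config, substitute the language pair, and derive a per-pair training config from it.

```python
import copy

# Hypothetical stand-in for taskcluster/configs/config.prod.yml after YAML parsing.
PROD_CONFIG = {
    "experiment": {"name": "prod", "src": "ru", "trg": "en"},
    "datasets": {"train": [], "devtest": [], "test": []},
}

def generate_config(src: str, trg: str) -> dict:
    """Derive a per-language-pair training config from the production config."""
    config = copy.deepcopy(PROD_CONFIG)  # never mutate the shared prod config
    config["experiment"]["src"] = src
    config["experiment"]["trg"] = trg
    config["experiment"]["name"] = f"{src}-{trg}"
    return config

config = generate_config("en", "lt")
print(config["experiment"]["name"])  # en-lt
```

The real generator also fills in dataset lists per language pair; this only shows the load-override-emit shape.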

@gregtatum force-pushed the util-config-gen branch 4 times, most recently from 13bef4d to e1fc4fe on May 23, 2024 16:30
@gregtatum marked this pull request as ready for review May 23, 2024 16:31
@gregtatum requested a review from a team as a code owner May 23, 2024 16:31
@gregtatum requested review from jcristau and eu9ene May 23, 2024 16:31
prod_config_path = root_dir / "taskcluster/configs/config.prod.yml"

pretrained_student_models = {
("ru", "en"): "https://storage.googleapis.com/releng-translations-dev/models/ru-en/better-teacher/student"
Collaborator:

This is a great idea btw! We just need to prepare those models (copy vocab etc)

Collaborator:

@eu9ene should ask @bhearsum where to find recently trained models

Collaborator:

OK, so the most recent ones are in the tasks, and slightly less recent ones are in https://console.cloud.google.com/storage/browser/moz-fx-translations-data--303e-prod-translations-data/models/en-hu/. I see that a model like https://console.cloud.google.com/storage/browser/moz-fx-translations-data--303e-prod-translations-data/models/en-hu/baseline_enhu_aY25-4fXTcuJNuMcWXUYtQ/student-finetuned?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&project=moz-fx-translations-data--303e needs a vocab copied into its folder to be used for training continuation, since we only recently started copying it in the pipeline.

Collaborator:

Your GCP account has credentials to do this, FYI. That's going to be the quickest way to unblock this.
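One way to do the copy is a one-off gsutil invocation. A minimal sketch, assuming the vocab lives under a sibling path in the same bucket (the `vocab/vocab.spm` source path is illustrative, not a confirmed layout):

```python
# Build the gsutil command that copies a vocab file next to a pretrained
# student model so it can be used for training continuation.
def vocab_copy_command(model_dir: str, vocab_src: str) -> list[str]:
    """Return the gsutil invocation that puts vocab.spm inside model_dir."""
    return ["gsutil", "cp", vocab_src, f"{model_dir.rstrip('/')}/vocab.spm"]

cmd = vocab_copy_command(
    "gs://moz-fx-translations-data--303e-prod-translations-data/models/en-hu/baseline_enhu_aY25-4fXTcuJNuMcWXUYtQ/student-finetuned",
    # Hypothetical source location for the vocab within the same model folder.
    "gs://moz-fx-translations-data--303e-prod-translations-data/models/en-hu/baseline_enhu_aY25-4fXTcuJNuMcWXUYtQ/vocab.spm",
)
# Execute with subprocess.run(cmd, check=True) once GCP credentials are set up.
```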

- opus_NeuLab-TedTalks/v1 # 111,107 sentences
- opus_ECB/v1 # 63,716 sentences
- opus_bible-uedin/v1 # 62,151 sentences
- opus_WMT-News/v2019 # 44,859 sentences
Collaborator:

We should remove https://opus.nlpl.eu/WMT-News. This is a dev/test set.

Collaborator:

Actually, for some languages it's pretty big, so maybe it's not the test set, but it's not clear from https://www.statmt.org/wmt19/translation-task.html whether an extra dataset was released that can be used for training.

Collaborator:

sacrebleu wmt19 for en-lt is 1,000 lines long and this one is 5,998, so it's likely something else.

Collaborator:

There's also:

  - mtdata_Statmt-newsdev_enlt-2019-eng-lit # ~402,756 sentences (45.5 MB)

@gregtatum (Member Author):

I wonder if we should have a step that removes any devtest or test datasets from the training data, so we make sure we're not lying to ourselves.
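The step described above could be sketched roughly like this (function and dataset names are illustrative, not the PR's code): strip anything that appears under devtest or test out of the training list, so evaluation data never leaks into training.

```python
# Remove held-out dev/test datasets from the training set of a config.
def deduplicate_train(datasets: dict) -> dict:
    held_out = set(datasets.get("devtest", [])) | set(datasets.get("test", []))
    cleaned = dict(datasets)  # shallow copy; only "train" is replaced
    cleaned["train"] = [d for d in datasets["train"] if d not in held_out]
    return cleaned

datasets = {
    "train": ["opus_ECB/v1", "sacrebleu_wmt19"],
    "devtest": ["sacrebleu_wmt19"],
    "test": [],
}
print(deduplicate_train(datasets)["train"])  # ['opus_ECB/v1']
```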

Collaborator:

That's what I usually do manually. I don't think those datasets are generally on OPUS, but some of them might be on mtdata. There's a dataset that we never use for evals that I left in the training data, but maybe it's also worth removing:

  - mtdata_UN-un_dev-1-eng-rus
  - mtdata_UN-un_test-1-eng-rus

@gregtatum (Member Author):

These are now automatically being moved to devtest and test.

Collaborator:

Is it possible to add the number of sentences for the sacrebleu and flores datasets, plus a total for the devset? We don't want the devset to be too big and slow down training.
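Annotating the devset with counts and a total could look like this, assuming sentence counts have already been collected (the wmt19/dev count below is illustrative; 997 is the flores dev size):

```python
# Emit devset config lines annotated with sentence counts, plus a total,
# so an oversized devset is easy to spot at a glance.
def annotate_devset(devset: list[str], sizes: dict[str, int]) -> list[str]:
    lines = [f"- {name} # {sizes[name]:,} sentences" for name in devset]
    total = sum(sizes[name] for name in devset)
    lines.append(f"# total devset: {total:,} sentences")
    return lines

sizes = {"flores_aug-mix_dev": 997, "sacrebleu_aug-mix_wmt19/dev": 2000}
for line in annotate_devset(["flores_aug-mix_dev", "sacrebleu_aug-mix_wmt19/dev"], sizes):
    print(line)
```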

@gregtatum gregtatum requested a review from eu9ene May 24, 2024 18:44
@eu9ene (Collaborator) left a comment:

There's still some uncertainty about dev/test datasets. We should investigate why mtdata_Neulab-tedtalks_test reports 3M sentences and add dataset sizes to dev/test datasets to understand what to include.

- opus_ELRC-3047-wikipedia_health/v1 # 205 sentences
- opus_ELRC-wikipedia_health/v1 # 205 sentences
- opus_ELRC_2922/v1 # 204 sentences
- mtdata_Neulab-tedtalks_test-1-eng-bos # ~3,117,009 sentences (352.2 MB)
Collaborator:

This is super weird. Without the size info, I would use this dataset for test. We should investigate why it's 3M sentences
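Once sizes are attached to dev/test datasets, a simple sanity check could flag entries like this one automatically. A sketch (the threshold is an arbitrary guess, not a project constant):

```python
# Flag dev/test datasets whose reported size is implausibly large for an
# evaluation set. Eval sets are typically a few thousand sentences at most.
MAX_EVAL_SENTENCES = 20_000  # arbitrary cutoff for illustration

def suspicious_eval_sets(sizes: dict[str, int]) -> list[str]:
    return [name for name, n in sizes.items() if n > MAX_EVAL_SENTENCES]

sizes = {
    "mtdata_Neulab-tedtalks_test-1-eng-bos": 3_117_009,
    "flores_aug-mix_dev": 997,
}
print(suspicious_eval_sets(sizes))  # ['mtdata_Neulab-tedtalks_test-1-eng-bos']
```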

- mtdata_Neulab-tedtalks_dev-1-eng-rus
- mtdata_UN-un_dev-1-eng-rus
- flores_aug-mix_dev
- sacrebleu_aug-mix_mtedx/test
Collaborator:

This dataset was not found for en-ru... Could it be because we use a different version of sacrebleu in find-corpus? https://firefox-ci-tc.services.mozilla.com/tasks/dKusTBI1Rg-UeXIJz_0amQ/runs/0/logs/public/logs/live.log

- sacrebleu_aug-mix_wmt18
- sacrebleu_aug-mix_wmt17
- sacrebleu_aug-mix_wmt15
- sacrebleu_aug-mix_wmt14/full
Collaborator:

I would not include those older ones unless only flores is available. I did this for en-lt, for example:

  devtest:
  - flores_aug-mix_dev
  - sacrebleu_aug-mix_wmt19/dev
  - mtdata_aug-mix_Neulab-tedtalks_dev-1-eng-lit

I guess we can either comment out all such datasets or just leave it up to the user to remove them.
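The comment-out option could be as simple as prefixing matching entries when emitting the config. A sketch, with a hypothetical predicate for older WMT sets:

```python
# Emit dataset lines, commenting out the ones a predicate flags so the user
# can opt back in by uncommenting rather than re-discovering the dataset.
def comment_out(datasets: list[str], should_comment) -> list[str]:
    return [f"# - {d}" if should_comment(d) else f"- {d}" for d in datasets]

# Illustrative predicate: flag wmt14 through wmt18 entries.
old_wmt = lambda d: any(f"wmt1{y}" in d for y in "45678")
print(comment_out(["flores_aug-mix_dev", "sacrebleu_aug-mix_wmt15"], old_wmt))
# ['- flores_aug-mix_dev', '# - sacrebleu_aug-mix_wmt15']
```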

@eu9ene (Collaborator) left a comment:

As discussed, we'll continue iterating in follow-up PRs including manual modification of the training configs.

IMPORTANT: The generated configs are not ready for production, as some issues around dev/test datasets are not addressed and require manual modification.

@gregtatum (Member Author):

I made a meta issue to track the remaining issues: #633

@gregtatum gregtatum merged commit 56040c9 into mozilla:main May 24, 2024
6 checks passed
gabrielBusta pushed a commit that referenced this pull request Jun 13, 2024
Automatically generate training config files with the task `config-generator` (#620)

* Create a util to automatically generate configs

* Add the generated configs

* Update the config generation script

* Update the configs

* Update the configs

* Address review comments for the config generator

* Fix find_corpus test