Automatically generate training config files with the task config-generator
#620
Conversation
prod_config_path = root_dir / "taskcluster/configs/config.prod.yml"

pretrained_student_models = {
    ("ru", "en"): "https://storage.googleapis.com/releng-translations-dev/models/ru-en/better-teacher/student"
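A minimal sketch of how a mapping like this might be consumed when generating a config. The function and key names (`add_pretrained_model`, `continuation`, `pretrained-student`) are illustrative assumptions, not the actual utility added in this PR:

```python
# Hypothetical sketch: look up a pretrained student for a language pair and
# record it in the generated config for training continuation.
pretrained_student_models = {
    ("ru", "en"): "https://storage.googleapis.com/releng-translations-dev/models/ru-en/better-teacher/student",
}

def add_pretrained_model(config: dict, src: str, trg: str) -> dict:
    """If a pretrained student exists for this pair, add it to the config."""
    url = pretrained_student_models.get((src, trg))
    if url:
        # "continuation"/"pretrained-student" are placeholder keys for illustration.
        config.setdefault("continuation", {})["pretrained-student"] = url
    return config

config = add_pretrained_model({"experiment": {"src": "ru", "trg": "en"}}, "ru", "en")
```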
This is a great idea btw! We just need to prepare those models (copy vocab etc)
If you're talking about the ones I uploaded by hand, these are in https://console.cloud.google.com/storage/browser/moz-fx-translations-data--303e-prod-translations-data;tab=objects?forceOnBucketsSortingFiltering=true&project=moz-fx-translations-data--303e&prefix=&forceOnObjectsSortingFiltering=false.
Anything after I did those uploads only exists in Taskcluster artifacts.
OK, so the most recent ones are in the tasks, and the slightly less recent ones are in https://console.cloud.google.com/storage/browser/moz-fx-translations-data--303e-prod-translations-data/models/en-hu/. I see that a model like https://console.cloud.google.com/storage/browser/moz-fx-translations-data--303e-prod-translations-data/models/en-hu/baseline_enhu_aY25-4fXTcuJNuMcWXUYtQ/student-finetuned?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&project=moz-fx-translations-data--303e needs a vocab copied into its folder before it can be used for training continuation, since we only recently started copying it in the pipeline.
Your GCP account has credentials to do this, FYI. That's going to be the quickest way to unblock this.
- opus_NeuLab-TedTalks/v1 # 111,107 sentences
- opus_ECB/v1 # 63,716 sentences
- opus_bible-uedin/v1 # 62,151 sentences
- opus_WMT-News/v2019 # 44,859 sentences
We should remove https://opus.nlpl.eu/WMT-News This is a dev/test set
Actually, for some languages it's pretty big, so maybe it's not the test set, but it's not clear from https://www.statmt.org/wmt19/translation-task.html that there was an extra dataset released that can be used for training.
sacrebleu wmt19 for en-lt is 1,000 lines long and this one is 5,998, so it's likely something else.
There's also:
- mtdata_Statmt-newsdev_enlt-2019-eng-lit # ~402,756 sentences (45.5 MB)
I wonder if we should have a step that removes any devtest or test data from the training data, so we make sure we're not lying to ourselves.
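Such a decontamination step could be sketched roughly like this (an assumed approach for illustration, not the pipeline's actual implementation): drop any training pair whose source sentence also appears in a held-out dev/test set.

```python
# Sketch: filter training pairs whose source side occurs in dev/test data,
# so evaluation scores aren't inflated by training on the test sets.
def remove_heldout(train_pairs, heldout_sources):
    """Return training pairs whose source sentence is not in the held-out set."""
    heldout = {s.strip().lower() for s in heldout_sources}
    return [(src, trg) for src, trg in train_pairs
            if src.strip().lower() not in heldout]

train = [("Hello world", "Привет мир"), ("Good morning", "Доброе утро")]
dev_sources = ["hello world"]  # one dev sentence overlaps with training data
clean = remove_heldout(train, dev_sources)  # only ("Good morning", ...) survives
```

In practice a fuzzier match (normalized punctuation, n-gram overlap) would catch near-duplicates that exact string comparison misses.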
That's what I usually do manually. I don't think those datasets are generally on OPUS, but some of them might be on mtdata. There was a dataset that we never use for evals that I left in the training data, but maybe it's also worth removing:
- mtdata_UN-un_dev-1-eng-rus
- mtdata_UN-un_test-1-eng-rus
These are now automatically being moved to devtest and test.
Is it possible to add the number of sentences for the sacrebleu and flores datasets, plus a total for the devset? We don't want the devset to be too big and slow down training.
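Reporting those counts could look roughly like the sketch below, assuming each dataset is a plain-text file with one sentence per line (the file names here are illustrative stand-ins for flores/sacrebleu devsets):

```python
import os
import tempfile
from pathlib import Path

def devset_sizes(paths):
    """Count sentences (lines) per dataset file and the devset total."""
    sizes = {Path(p).name: sum(1 for _ in open(p, encoding="utf-8")) for p in paths}
    return sizes, sum(sizes.values())

# Tiny illustration with temporary files standing in for real devsets.
tmp = tempfile.mkdtemp()
files = []
for name, n in [("flores_dev.en", 3), ("sacrebleu_wmt19_dev.en", 2)]:
    path = os.path.join(tmp, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write("line\n" * n)
    files.append(path)

sizes, total = devset_sizes(files)  # total == 5
```

The per-dataset numbers could then be emitted as comments next to each devset entry in the generated config, the same way the training datasets already carry sentence counts.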
There's still some uncertainty about dev/test datasets. We should investigate why mtdata_Neulab-tedtalks_test reports 3M sentences, and add dataset sizes to dev/test datasets to understand what to include.
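A quick sanity check for a suspiciously large dataset like that could be sketched as follows (assumed diagnostic, not part of this PR): count total vs. unique lines to see whether the 3M figure comes from massive duplication rather than genuine test data.

```python
from collections import Counter

def dataset_stats(lines):
    """Total lines, unique lines, and duplication ratio for a dataset."""
    counts = Counter(line.strip() for line in lines)
    total = sum(counts.values())
    unique = len(counts)
    return {"total": total, "unique": unique, "dup_ratio": total / max(unique, 1)}

# A dup_ratio far above 1 suggests repeated entries, not real test sentences.
stats = dataset_stats(["a", "a", "a", "b"])  # dup_ratio == 2.0
```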
- opus_ELRC-3047-wikipedia_health/v1 # 205 sentences
- opus_ELRC-wikipedia_health/v1 # 205 sentences
- opus_ELRC_2922/v1 # 204 sentences
- mtdata_Neulab-tedtalks_test-1-eng-bos # ~3,117,009 sentences (352.2 MB)
This is super weird. Without the size info, I would use this dataset for test. We should investigate why it's 3M sentences.
- mtdata_Neulab-tedtalks_dev-1-eng-rus
- mtdata_UN-un_dev-1-eng-rus
- flores_aug-mix_dev
- sacrebleu_aug-mix_mtedx/test
This dataset was not found for en-ru. Can it be because we use a different version of sacrebleu in find-corpus? https://firefox-ci-tc.services.mozilla.com/tasks/dKusTBI1Rg-UeXIJz_0amQ/runs/0/logs/public/logs/live.log
- sacrebleu_aug-mix_wmt18
- sacrebleu_aug-mix_wmt17
- sacrebleu_aug-mix_wmt15
- sacrebleu_aug-mix_wmt14/full
I would not include those ones unless only flores is available. I did it for en-lt, for example:
devtest:
- flores_aug-mix_dev
- sacrebleu_aug-mix_wmt19/dev
- mtdata_aug-mix_Neulab-tedtalks_dev-1-eng-lit
I guess we can either comment out all such datasets or just leave it up to the user to remove them.
As discussed, we'll continue iterating in follow-up PRs, including manual modification of the training configs.
IMPORTANT!!! The generated configs are not ready for production, as some issues around dev/test datasets are not addressed and require manual modification.
I made a meta to track issues.
…nerator` (#620)
* Create a util to automatically generate configs
* Add the generated configs
* Update the config generation script
* Update the configs
* Update the configs
* Address review comments for the config generator
* Fix find_corpus test
This PR adds automatic generation of training configs based on the production config.