Rewrite the train scripts and add config support for ctranslate2 #922
The Taskcluster parts of this look straightforward and fine. Nice to see more shell code getting excised!
Nice work! It looks good overall, although I didn't dig deep into the implementation. I have only minor comments.
Given the number of changes and their complexity, I think the only way to test all of this is to do 3 full training runs with the same config (old implementation, new Marian decoding, ctranslate2 decoding) and compare the results. There shouldn't be any negative effects in the new Marian implementation, and only minor ones with ctranslate2. A medium-resource language with limited monolingual data would work. I don't know if you did this for the latest version of the code.
My experiments cover the 3 full distillation and student runs, plus a clean CI run. The results are all summarized in #931.
* WANDB Test failure
* Rename DataDir.load to DataDir.read_text and allow for reading compressed files
* Add compress and decompress common utilities
* Use decompression utilities everywhere
* Re-work the marian-decoder fixture to correctly output nbest
* Rewrite translate.sh to python
* Add a requirements file for ctranslate2
* Add support for ctranslate2
* Add gpustats to the train requirements
* Add logging for translations
* Remove old translate scripts
* Handle review feedback
This adds CTranslate2 support to the training config for the translate-* tasks. The patch stack is fairly large, but each commit is a self-contained logical change. I can break out the earlier commits to land first if needed. Resolves #933, #785.
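For readers who have not opened the diff, the commit that lets `DataDir.read_text` read compressed files can be pictured roughly as follows. This is only a sketch under stated assumptions, not the code in this PR: the standalone `read_text` function below is hypothetical, and it handles only gzip through the Python standard library, whatever formats the real utilities support.

```python
import gzip
from pathlib import Path


def read_text(path: Path) -> str:
    """Hypothetical helper: read a text file, decompressing it if it ends in .gz."""
    if path.suffix == ".gz":
        # gzip.open in "rt" mode decompresses and decodes the bytes to str.
        with gzip.open(path, "rt", encoding="utf-8") as f:
            return f.read()
    # Anything else is assumed to be plain UTF-8 text.
    return path.read_text(encoding="utf-8")
```

A test fixture can then compare expected output against either a plain or a gzipped artifact with the same call, which is presumably the convenience the rename is after.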