Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite the train scripts and add config support for ctranslate2 #922

Merged
merged 12 commits into from
Dec 20, 2024

Conversation

gregtatum
Copy link
Member

@gregtatum gregtatum commented Nov 6, 2024

This adds Ctranslate2 support to the training config for the translate-* tasks. The patch stack is a bit bigger, but the commits are all logically written. I can break out the earlier commits to land first if needed. Resolves #933, #785.

@gregtatum gregtatum force-pushed the ctranslate2 branch 7 times, most recently from 138ef16 to 0edc3c6 Compare November 13, 2024 14:45
@gregtatum gregtatum force-pushed the ctranslate2 branch 8 times, most recently from af669db to 5813a03 Compare December 16, 2024 19:01
@gregtatum gregtatum changed the title Ctranslate2 draft Rewrite the train scripts and add config support for ctranslate2 Dec 16, 2024
@gregtatum gregtatum marked this pull request as ready for review December 16, 2024 20:21
@gregtatum gregtatum requested review from a team as code owners December 16, 2024 20:21
@gregtatum gregtatum requested a review from jcristau December 16, 2024 20:21
Copy link
Collaborator

@bhearsum bhearsum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Taskcluster parts of this look straightforward and fine. Nice to see more shell code getting excised!

Copy link
Collaborator

@eu9ene eu9ene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! It looks good overall although I didn't dig deep into implementation. I have only minor comments.

Due to the number of changes and complexity I think the only way to test this all is to do 3 full training runs with the same config (old implementation, new marian decoding, ctranslate2 decoding) and compare results. There shouldn't be negative effects in the new Marian implementation and only minor ones in ctranslate2. Some medium resource languages with limited mono data would work. I don't know if you did this for the latest version of the code.

pipeline/common/marian.py Outdated Show resolved Hide resolved
tests/test_ctranslate2.py Show resolved Hide resolved
pipeline/common/datasets.py Show resolved Hide resolved
pipeline/translate/translate_ctranslate2.py Outdated Show resolved Hide resolved
pipeline/translate/translate_ctranslate2.py Show resolved Hide resolved
@gregtatum
Copy link
Member Author

do 3 full training runs

I have my experiments doing the 3 full distillation and student runs and a clean CI run. The results are all summarized in #931.

@gregtatum gregtatum merged commit 8977fbf into mozilla:main Dec 20, 2024
36 checks passed
ZJaume pushed a commit that referenced this pull request Jan 16, 2025
* WANDB Test failure

* Rename DataDir.load to DataDir.read_text and allow for reading compressed files

* Add compress and decompress common utilities

* Use decompression utilities everywhere

* Re-work the marian-decoder fixture to correctly output nbest

* Rewrite translate.sh to python

* Add a requirements file for ctranslate2

* Add support for ctranslate2

* Add gpustats to the train requirements

* Add logging for translations

* Remove old translate scripts

* Handle review feedback
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

decoding-teacher config property is not being used in translate.sh or translate-nbest.sh
3 participants