Rewrite the train scripts and add config support for ctranslate2 #922

gregtatum · 2024-11-06T16:49:03Z

This adds CTranslate2 support to the training config for the translate-* tasks. The patch stack is fairly large, but each commit is a self-contained logical change. I can break out the earlier commits to land first if needed. Resolves #933, #785.

bhearsum

The Taskcluster parts of this look straightforward and fine. Nice to see more shell code getting excised!

eu9ene

Nice work! It looks good overall, although I didn't dig deep into the implementation. I have only minor comments.

Given the number and complexity of the changes, I think the only way to test all of this is to do 3 full training runs with the same config (old implementation, new Marian decoding, CTranslate2 decoding) and compare the results. There shouldn't be any negative effects from the new Marian implementation, and only minor ones from CTranslate2. A medium-resource language with limited mono data would work. I don't know whether you did this for the latest version of the code.
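
The comparison suggested above could be checked mechanically once each run has produced a single evaluation score. The following is only a minimal sketch: the function name `find_regressions`, the run labels, and the idea of a fixed tolerance are assumptions made for illustration and are not part of this PR or the pipeline.

```python
def find_regressions(
    baseline: float,
    candidates: dict[str, float],
    tolerance: float,
) -> dict[str, float]:
    """Return the runs whose score falls below the baseline by more than `tolerance`.

    `baseline` is the score of the old implementation, and `candidates` maps a
    hypothetical run label (for example "marian" or "ctranslate2") to its score.
    The returned dict maps each regressing run to the size of its drop.
    """
    regressions = {}
    for name, score in candidates.items():
        drop = baseline - score
        # Only report drops that exceed the allowed tolerance.
        if drop > tolerance:
            regressions[name] = drop
    return regressions
```

An empty result would mean that no run regressed beyond the chosen tolerance; what counts as a "minor" drop is a judgment call that the reviewers would need to make.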

pipeline/common/marian.py

tests/test_ctranslate2.py

pipeline/common/datasets.py

pipeline/translate/translate_ctranslate2.py

gregtatum · 2024-12-20T13:57:08Z

> do 3 full training runs

I ran experiments covering the 3 full distillation and student runs, and I have a clean CI run. The results are all summarized in #931.

* WANDB Test failure
* Rename DataDir.load to DataDir.read_text and allow for reading compressed files
* Add compress and decompress common utilities
* Use decompression utilities everywhere
* Re-work the marian-decoder fixture to correctly output nbest
* Rewrite translate.sh to python
* Add a requirements file for ctranslate2
* Add support for ctranslate2
* Add gpustats to the train requirements
* Add logging for translations
* Remove old translate scripts
* Handle review feedback
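
Several of the commits above concern reading compressed files and shared compress/decompress utilities. As a rough illustration of the idea, here is a minimal sketch using only the Python standard library. It is not the code from this PR: the helper names `read_text` and `compress` and the gzip-only handling are assumptions, and the real utilities may support other formats and have different signatures.

```python
import gzip
import shutil
from pathlib import Path


def read_text(path: Path) -> str:
    """Read a text file, transparently decompressing it when it ends in .gz."""
    if path.suffix == ".gz":
        # Mode "rt" makes gzip decode the decompressed bytes as text.
        with gzip.open(path, "rt", encoding="utf-8") as file:
            return file.read()
    return path.read_text(encoding="utf-8")


def compress(source: Path) -> Path:
    """Write a gzip copy of `source` next to it and return the new path."""
    destination = source.with_name(source.name + ".gz")
    with open(source, "rb") as infile, gzip.open(destination, "wb") as outfile:
        # Stream the bytes across so large corpora are not loaded into memory.
        shutil.copyfileobj(infile, outfile)
    return destination
```

Dispatching on the file extension lets callers and tests read a file the same way regardless of whether an earlier step compressed it.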

gregtatum force-pushed the ctranslate2 branch 7 times, most recently from 138ef16 to 0edc3c6, November 13, 2024 14:45

gregtatum force-pushed the ctranslate2 branch from 0edc3c6 to 152a447, November 13, 2024 20:17

gregtatum force-pushed the ctranslate2 branch 8 times, most recently from af669db to 5813a03, December 16, 2024 19:01

gregtatum changed the title from "Ctranslate2 draft" to "Rewrite the train scripts and add config support for ctranslate2", Dec 16, 2024

gregtatum marked this pull request as ready for review December 16, 2024 20:21

gregtatum requested review from a team as code owners December 16, 2024 20:21

gregtatum requested a review from jcristau December 16, 2024 20:21

bhearsum approved these changes Dec 17, 2024

eu9ene approved these changes Dec 18, 2024

gregtatum mentioned this pull request Dec 20, 2024

Train a real smaller teacher to be used in CTranslate2 #970

Open

gregtatum added 5 commits December 20, 2024 08:56

WANDB Test failure

519d239

Rename DataDir.load to DataDir.read_text and allow for reading compressed files

116551a

Add compress and decompress common utilities

4e75162

Use decompression utilities everywhere

8250e0f

Re-work the marian-decoder fixture to correctly output nbest

92e9551

gregtatum added 7 commits December 20, 2024 08:56

Rewrite translate.sh to python

caeff70

Add a requirements file for ctranslate2

cff3855

Add support for ctranslate2

67b1cac

Add gpustats to the train requirements

e12edb7

Add logging for translations

4595979

Remove old translate scripts

28232e6

Handle review feedback

b693819

gregtatum force-pushed the ctranslate2 branch from 5813a03 to b693819, December 20, 2024 14:57

gregtatum merged commit 8977fbf into mozilla:main Dec 20, 2024
36 checks passed
