Improve translation quality of English to Chinese #1033
Labels: language-coverage (Issues related to covering specific languages), quality (Improving robustness and translation quality)
The quality of the teacher model doesn't look great: COMET 85.67 vs Google Translate's 89.45 on flores-test (a gap of -3.78, about -4.2% relative).
I suspect we have some issues in the training data:
mtdata_Statmt-backtrans_enzh-wmt20-eng-zho
which is the back-translated data from the WMT20 zh-en task, so its zh side might be of poor quality. We already generate our own back-translations. The question is which of these issues contributes most; answering that would require running a separate experiment for each one to see whether it moves the needle.
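One cheap way to probe whether the back-translated zh side is noisy is a crude length-ratio filter before training. The sketch below is illustrative only (the function name and thresholds are assumptions, not part of the actual pipeline); it drops pairs whose English-token to Chinese-character ratio is extreme or whose target is empty:

```python
def length_ratio_ok(src: str, tgt: str, low: float = 0.3, high: float = 3.0) -> bool:
    """Crude noise filter for en-zh pairs.

    English length is counted in whitespace tokens, Chinese length in
    characters (no spaces), so the acceptable band is wide. The thresholds
    here are illustrative, not tuned on real data.
    """
    src_len = len(src.split())
    tgt_len = len(tgt)
    if src_len == 0 or tgt_len == 0:
        return False
    ratio = tgt_len / src_len
    return low <= ratio <= high

pairs = [
    ("The cat sat on the mat.", "猫坐在垫子上。"),            # plausible pair: keep
    ("Hello", ""),                                            # empty target: drop
    ("Short.", "这是一个长得不成比例的目标句子" * 10),        # extreme ratio: drop
]
kept = [(s, t) for s, t in pairs if length_ratio_ok(s, t)]
print(len(kept))  # 1
```

A filter this blunt won't catch fluency problems in the zh text, but it is enough to measure how much of the corpus is trivially broken before spending GPU time on a full ablation.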
Another indicator that something is wrong with the data is that one of my experiments diverged completely and I had to reduce the learning rate. Also, when training in two stages, there is a big bump in the cost when switching to training on the original parallel corpus only. This typically indicates that the fine-tuning data is noisier than the pre-training data (a mix of back-translated and original corpus).
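The manual learning-rate reduction above could be automated by watching for cost spikes. A minimal sketch, assuming a simple moving-average heuristic (the function name, spike factor, and decay are all made up for illustration and are not from the actual training config):

```python
def adjusted_lr(lr: float, cost_history: list[float],
                spike_factor: float = 2.0, decay: float = 0.5) -> float:
    """Decay the learning rate when the latest training cost spikes
    well above the average of the preceding costs, as a crude
    divergence guard. Purely illustrative heuristic."""
    if len(cost_history) < 2:
        return lr
    recent_avg = sum(cost_history[:-1]) / (len(cost_history) - 1)
    if cost_history[-1] > spike_factor * recent_avg:
        return lr * decay  # halve the LR on a suspected divergence
    return lr

# A sudden jump from ~2.0 to 5.2 trips the guard and halves the LR.
print(adjusted_lr(3e-4, [2.1, 2.0, 1.9, 5.2]))  # 0.00015
```

In practice the trainer's built-in safeguards (gradient clipping, warm-up) are the first line of defense; this only shows the shape of the check, not a recommendation to hand-roll it.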