Improve translation quality of English to Chinese #1033

Open · Tracked by #425

eu9ene opened this issue on Feb 11, 2025 · 0 comments
Labels: quality (Improving robustness and translation quality), language-coverage (Issues related to covering specific languages)

eu9ene (Collaborator) commented on Feb 11, 2025

The quality of the teacher model doesn't look great: COMET 85.67 vs Google Translate 89.45 (-4.41%) on flores-test.

I suspect we have some issues in the training data:

  1. I found that we used the dataset mtdata_Statmt-backtrans_enzh-wmt20-eng-zho, which contains back-translations produced for the WMT20 zh-en task, so the zh side of it might be of poor quality (it is likely machine-translated). We already generate our own back-translations, so we could simply drop this dataset.
  2. We convert all the data from Traditional to Simplified Chinese with hanzi-convert. Such conversion is reportedly imprecise due to regional differences in vocabulary, word composition, etc., so it is not equivalent to a proper translation. It might pollute the data when Chinese is the target language. We could filter out all the data in Traditional script and see if it makes a difference (a filtering sketch follows this list).
  3. We use a 64k joint vocabulary for two very different scripts; it's worth running an experiment with two separate 32k vocabularies (see the SentencePiece sketch below). Depends on Allow for split vocabs #913.
  4. There are known issues with some datasets: UN, HPLT, desegmentation. We can address them with ad-hoc fixes through OpusCleaner and the mono-cleaning scripts (a desegmentation sketch is included below).
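
A minimal sketch of the Traditional-script filter from item 2, assuming the opencc Python package; the round-trip heuristic (a line that changes under a Traditional-to-Simplified conversion contains Traditional characters) and the file names are illustrative, not part of the pipeline:

```python
# Sketch: drop sentence pairs whose zh side appears to contain Traditional
# characters. Heuristic: if a Traditional -> Simplified conversion changes
# the line, it contained Traditional characters. File names are placeholders.
import opencc

t2s = opencc.OpenCC("t2s.json")  # Traditional -> Simplified converter

with open("corpus.en") as en_f, open("corpus.zh") as zh_f, \
     open("filtered.en", "w") as en_out, open("filtered.zh", "w") as zh_out:
    for en_line, zh_line in zip(en_f, zh_f):
        # Keep the pair only if the zh side is already fully Simplified.
        if t2s.convert(zh_line) == zh_line:
            en_out.write(en_line)
            zh_out.write(zh_line)
```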

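For item 3, a rough sketch of training two separate 32k vocabularies with the sentencepiece Python trainer; the paths and option values are assumptions rather than the pipeline's actual settings:

```python
# Sketch: train two separate 32k SentencePiece vocabularies instead of a
# joint 64k one. Paths and options are illustrative.
import sentencepiece as spm

for lang in ("en", "zh"):
    spm.SentencePieceTrainer.train(
        input=f"corpus.{lang}",        # one side of the parallel corpus
        model_prefix=f"vocab.{lang}",  # writes vocab.{lang}.model / .vocab
        vocab_size=32000,
        # 0.9995 is the coverage SentencePiece recommends for languages
        # with rich character sets such as Chinese; 1.0 otherwise.
        character_coverage=0.9995 if lang == "zh" else 1.0,
    )
```
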
The question is which of those issues contributes most. Answering it would require running separate experiments for most of them to see whether each one moves the needle.
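
As a concrete example of the ad-hoc fixes from item 4, a crude desegmentation pass of the kind that could run in the mono-cleaning scripts; the regex and the Unicode range are simplifications:

```python
# Sketch: remove spurious spaces between CJK characters left behind by word
# segmentation. Only the basic CJK Unified Ideographs block is covered here,
# which is a simplification.
import re

CJK = r"[\u4e00-\u9fff]"
DESEG = re.compile(rf"(?<={CJK}) +(?={CJK})")

def desegment(line: str) -> str:
    return DESEG.sub("", line)

assert desegment("我 爱 自然 语言 处理") == "我爱自然语言处理"
```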

Another indicator that something is wrong with the data is that one of my experiments diverged completely and I had to reduce the learning rate. Also, when training in two stages, there is a big bump in the cost when moving to training on the original parallel corpus only. This typically indicates that the fine-tuning data is noisier than the pre-training data (a mix of the back-translated and original corpora).

(Attached image: training cost curve illustrating the bump described above.)
eu9ene added the quality and language-coverage labels on Feb 11, 2025
eu9ene self-assigned this on Feb 11, 2025