Improve translation quality of English to Chinese #1033

Open · Tracked by #425

eu9ene opened this issue on Feb 11, 2025 · 0 comments
Labels: quality (Improving robustness and translation quality), language-coverage (Issues related to covering specific languages)

eu9ene (Collaborator) commented on Feb 11, 2025

The quality of the teacher model doesn't look great: COMET 85.67 vs Google Translate 89.45 (-4.41%) on flores-test.

I suspect we have some issues in the training data:

  1. I found that we used the dataset mtdata_Statmt-backtrans_enzh-wmt20-eng-zho, which contains back-translations produced for the WMT20 zh-en task, so the zh side of it might be of poor quality (it is likely machine-translated). We already generate our own back-translations, so we could simply drop this dataset.
  2. We convert all the data from Traditional to Simplified Chinese with hanzi-convert. Such conversion is reportedly imprecise due to regional differences in vocabulary, word composition, etc., so it is not equivalent to a proper translation. It might pollute the data when Chinese is the target language. We could filter out all the data in Traditional script and see if it makes a difference (a filtering sketch follows this list).
  3. We use a 64k joint vocabulary for two very different scripts; it's worth running an experiment with two separate 32k vocabularies (see the SentencePiece sketch below). Depends on Allow for split vocabs #913.
  4. There are known issues with some datasets: UN, HPLT, desegmentation. We can address them with ad-hoc fixes through OpusCleaner and the mono-cleaning scripts (a desegmentation sketch is included below).
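
A minimal sketch of the Traditional-script filter from item 2, assuming the opencc Python package; the round-trip heuristic (a line that changes under a Traditional-to-Simplified conversion contains Traditional characters) and the file names are illustrative, not part of the pipeline:

```python
# Sketch: drop sentence pairs whose zh side appears to contain Traditional
# characters. Heuristic: if a Traditional -> Simplified conversion changes
# the line, it contained Traditional characters. File names are placeholders.
import opencc

t2s = opencc.OpenCC("t2s.json")  # Traditional -> Simplified converter

with open("corpus.en") as en_f, open("corpus.zh") as zh_f, \
     open("filtered.en", "w") as en_out, open("filtered.zh", "w") as zh_out:
    for en_line, zh_line in zip(en_f, zh_f):
        # Keep the pair only if the zh side is already fully Simplified.
        if t2s.convert(zh_line) == zh_line:
            en_out.write(en_line)
            zh_out.write(zh_line)
```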

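For item 3, a rough sketch of training two separate 32k vocabularies with the sentencepiece Python trainer; the paths and option values are assumptions rather than the pipeline's actual settings:

```python
# Sketch: train two separate 32k SentencePiece vocabularies instead of a
# joint 64k one. Paths and options are illustrative.
import sentencepiece as spm

for lang in ("en", "zh"):
    spm.SentencePieceTrainer.train(
        input=f"corpus.{lang}",        # one side of the parallel corpus
        model_prefix=f"vocab.{lang}",  # writes vocab.{lang}.model / .vocab
        vocab_size=32000,
        # 0.9995 is the coverage SentencePiece recommends for languages
        # with rich character sets such as Chinese; 1.0 otherwise.
        character_coverage=0.9995 if lang == "zh" else 1.0,
    )
```
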
The question is which of those issues contributes most. Answering it would require running separate experiments for most of them to see whether each one moves the needle.
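
As a concrete example of the ad-hoc fixes from item 4, a crude desegmentation pass of the kind that could run in the mono-cleaning scripts; the regex and the Unicode range are simplifications:

```python
# Sketch: remove spurious spaces between CJK characters left behind by word
# segmentation. Only the basic CJK Unified Ideographs block is covered here,
# which is a simplification.
import re

CJK = r"[\u4e00-\u9fff]"
DESEG = re.compile(rf"(?<={CJK}) +(?={CJK})")

def desegment(line: str) -> str:
    return DESEG.sub("", line)

assert desegment("我 爱 自然 语言 处理") == "我爱自然语言处理"
```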

Another indicator that something is wrong with the data is that one of my experiments diverged completely and I had to reduce the learning rate. Also, when training in two stages, there is a big bump in the cost when moving to training on the original parallel corpus only. This typically indicates that the fine-tuning data is noisier than the pre-training data (a mix of the back-translated and original corpora).

(Attached image: training cost curve illustrating the bump described above.)
eu9ene added the quality and language-coverage labels on Feb 11, 2025
eu9ene self-assigned this on Feb 11, 2025