Lite Training for Portuguese-English Translation




On this page, we present our work, “Lite Training Strategies for Portuguese-English and English-Portuguese Translation”. To cite it, use:

@inproceedings{lopes-etal-2020-lite,
title = "Lite Training Strategies for {P}ortuguese-{E}nglish and {E}nglish-{P}ortuguese Translation",
author = "Lopes, Alexandre and
Nogueira, Rodrigo and
Lotufo, Roberto and
Pedrini, Helio",
booktitle = "Proceedings of the Fifth Conference on Machine Translation",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.wmt-1.90",
pages = "833--840",
}

We implemented T5 for the PT-EN and EN-PT translation tasks using a modest hardware setup. We propose some changes to the tokenizer and to the post-processing that improve the results, and we used a Portuguese pretrained model for the translation.

For the first part of the project, we used the ParaCrawl corpus (https://paracrawl.eu/). We trained on more than 5M examples from ParaCrawl. We stopped there only because training was taking too long; in theory, more data would improve the results.

After step one, we also fine-tuned on a Portuguese-English corpus of more than 6M examples of scientific text. We did not evaluate this model on ParaCrawl; feel free to do so and compare the results. This translator looks better than the ParaCrawl one, especially for Brazilian Portuguese sentences, so remember to test both translators in your project.

Another important contribution of this project is a corpus called ParaCrawl 99k. It is composed of two small subsets of ParaCrawl containing the Google Translate EN to PT-pt and PT-pt to EN translations of 99k sentences. Each one cost around $300 to produce, so you can save some money if you want to compare your results with Google Translate (GT). Be sure to remove these items from ParaCrawl before training on it. The description of this dataset, along with best practices and uses, is in the Readme file of the ParaCrawl99k folder.
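Removing the ParaCrawl 99k items from the training data can be sketched as a simple filter. This is only an illustration: the function names, the in-memory lists, and the whitespace and case normalization are assumptions, not the procedure used in the paper. Check the ParaCrawl99k Readme for the actual file layout.

```python
from typing import Iterable, Iterator, Tuple


def normalize(sentence: str) -> str:
    """Collapse whitespace and ignore case so near-identical copies still match.

    This normalization is an assumption made for the sketch.
    """
    return " ".join(sentence.split()).casefold()


def filter_held_out(
    pairs: Iterable[Tuple[str, str]],
    held_out_sentences: Iterable[str],
) -> Iterator[Tuple[str, str]]:
    """Yield only the (source, target) pairs that do not appear in the test set.

    `held_out_sentences` is a flat collection of test sentences in either
    language; a pair is dropped if either of its sides matches one of them.
    """
    blocked = {normalize(s) for s in held_out_sentences}
    for source, target in pairs:
        if normalize(source) in blocked or normalize(target) in blocked:
            continue  # this pair overlaps with the evaluation set
        yield source, target


if __name__ == "__main__":
    # Toy data standing in for ParaCrawl and ParaCrawl 99k.
    training_pairs = [
        ("Bom dia.", "Good morning."),
        ("Obrigado!", "Thank you!"),
    ]
    test_sentences = ["Obrigado!"]
    for pair in filter_held_out(training_pairs, test_sentences):
        print(pair)
```

Because the filter is a generator, it can be applied while streaming a large corpus from disk; only the 99k test sentences need to fit in memory.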




Installation and Usage

You have two options: GitHub and the Hugging Face pages (two links, one for each direction). The links are as follows:
