Jul 17 2014
How is the accuracy of Google Translate affected by the fact that text translated from German into English contains five times more exclamation marks (!) than text that was originally written in English? A new study by the Computer Science Department at the University of Haifa sheds light on this matter.
Have you ever entered text into “Google Translate” and received a translation that seemed too superficial? A new study conducted by the Department of Computer Science at the University of Haifa reports several discoveries about the unique linguistic features of text that has been translated by a person, and these discoveries can significantly improve the capabilities of computerized translation programs. “There are significant statistical differences between text that was originally written in a certain language, and text that was translated into that language by a person, no matter how talented the translator. The human reader may not be able to detect these differences but a computer can identify them with perfect accuracy,” says Professor Shuli Wintner, Head of the Department of Computer Science, who is heading this project.
Automatic translation software programs such as Google Translate have become useful tools in almost every home, and they yield translations ranging from reasonable to very good in a wide range of languages. However, they still produce quite a number of errors and inaccuracies, especially in long sentences, even when translating between languages that are close to each other. Attempts to develop translation software date back to the 1950s, when the predominant method was based on a large bilingual dictionary and a great number of grammar rules that characterize correlations between different languages.
This approach, however, failed to provide good results, and in the early 1990s researchers at IBM suggested a change of paradigm. Translation systems began to be based on two main statistical models that estimate two things: the probability of sequences of words in the target language, that is, the language we wish to translate into (“language model”), and the probability that a particular sequence of words in the source language will be translated into a particular sequence in the target language (“translation model”).
A statistical translation program needs to scan a vast number of texts in order to obtain good estimates: the language model is based on a large collection of texts in the target language, whereas the translation model is compiled from “parallel texts”. Parallel texts are texts that were translated (by professional translators) from the source language into the target language, and from which the model learns to match sequences of words in both languages. Translation programs combine these two models in order to determine which translation is the best for a given sentence: the translation model ensures a translation that is true to the source, and the language model ensures fluency in the target language.
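The way the two models are combined can be illustrated with a minimal sketch. It assumes each candidate translation already has a translation-model probability and a language-model probability, and it picks the candidate with the highest combined score. The function name `best_translation` and all the probabilities below are hypothetical, invented for illustration only; real systems estimate these values from large text collections.

```python
import math

def best_translation(candidates, translation_model, language_model):
    """Return the candidate with the highest combined log-probability."""
    def score(candidate):
        # Faithfulness to the source plus fluency in the target language.
        return (math.log(translation_model[candidate])
                + math.log(language_model[candidate]))
    return max(candidates, key=score)

# Hypothetical toy values, not taken from any real system.
candidates = ["the house is small", "small the house is"]
translation_model = {"the house is small": 0.4, "small the house is": 0.5}
language_model = {"the house is small": 0.02, "small the house is": 0.0001}

best = best_translation(candidates, translation_model, language_model)
print(best)
```

In this toy case the second candidate scores slightly higher on the translation model, but the language model penalizes its unnatural word order, so the fluent candidate wins overall.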
However, findings in translation studies indicate considerable differences between texts that were originally written in a given target language, and texts that were translated into that language from another. This study, conducted at the University of Haifa, found that these differences affect how accurately the translation program translates. “No matter how good and successful the human translator is, the language in which a given text is written, the source language, leaves ‘fingerprints’ on the resulting translation. There also seems to be a cognitive load during the translation process that leads to a final product that is significantly different from texts that were originally written in the same language. The human reader may not be able to tell the difference between a document originally written in Hebrew and one that was translated from English into Hebrew, but the computer can distinguish between them,” Prof. Wintner explained.
In earlier studies that were conducted as part of the project, Prof. Wintner and his research partners, Dr. Noam Ordan and research student Vered Valensky, identified the key linguistic features that distinguish original texts from translated ones. It turns out that the differences are not the result of language richness or of sentence length; they result from unexpected issues such as punctuation. “We discovered that text in English that was translated from German had five times more exclamation marks than text originally written in English,” he explained. “However, the most significant characteristics of translated text are the different syntactic structures.”
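A punctuation feature of this kind is simple to measure. The sketch below counts exclamation marks per 1,000 whitespace-separated tokens; the function name `exclamation_rate` and the two sample texts are made up for illustration and are not data from the study.

```python
def exclamation_rate(text, per=1000):
    """Exclamation marks per `per` whitespace-separated tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return text.count("!") * per / len(tokens)

# Invented sample texts, used only to show the computation.
original = "We arrived late. The hall was full. Nobody minded."
translated = "We arrived late! The hall was full. Nobody minded!"

rate_original = exclamation_rate(original)
rate_translated = exclamation_rate(translated)
print(rate_original, rate_translated)
```

A classifier that separates original from translated text would combine many such features, with syntactic ones carrying the most weight according to the researchers.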
New research results were obtained by Dr. Gennadi Lembersky in his doctoral dissertation, written under the direction of Prof. Wintner, with Dr. Ordan. The study found that for a program to be more precise, the direction of translation of the parallel text from which the translation model is compiled needs to match the direction in which we wish to translate. In other words, when we want to translate text from English into Hebrew, we need to compile a translation model from texts that were translated from English into Hebrew, not from texts that were translated from Hebrew into English. While this seems obvious, the second finding is more surprising: statistical translation programs are much more accurate when their language model is based on texts that have been translated into the target language. That is, translation from English into Hebrew by a program whose language model was compiled from Hebrew texts that had been translated from English was better and more accurate than translation by a program based on texts written originally in Hebrew. For these findings, the doctoral thesis received the Best Thesis Award for 2013 from the European Association for Machine Translation (EAMT).
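One common way to compare two language models is to measure how well each predicts a reference translation, for example by perplexity (lower means a better fit). The sketch below is an assumption about how such a comparison could look, not the method used in the thesis. It uses a unigram model with add-one smoothing, and the tiny corpora are invented and deliberately constructed, so the output illustrates the measurement only and is not evidence for the finding.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Count word frequencies; returns (counts, total number of tokens)."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    return counts, sum(counts.values())

def perplexity(model, sentence, vocab_size):
    """Perplexity of a sentence under an add-one-smoothed unigram model."""
    counts, total = model
    words = sentence.lower().split()
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab_size))
                   for w in words)
    return math.exp(-log_prob / len(words))

# Invented toy corpora standing in for original and translated text.
original_corpus = ["the meeting starts at noon", "we start the meeting now"]
translated_corpus = ["the meeting begins at noon",
                     "we begin with the meeting now"]
reference = "the meeting begins now"

# Shared vocabulary so both models are smoothed the same way.
vocab = {w for s in original_corpus + translated_corpus + [reference]
         for w in s.lower().split()}

ppl_original = perplexity(train_unigram(original_corpus), reference, len(vocab))
ppl_translated = perplexity(train_unigram(translated_corpus), reference, len(vocab))
print(ppl_original, ppl_translated)
```

Here the model trained on the translated toy corpus fits the reference better only because the sample sentences were chosen that way; a real comparison would need large corpora and a stronger model.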
Prof. Wintner says he believes that within ten years computerized translation programs will be so accurate for a number of language pairs that it will not be possible to distinguish computer-generated translations from human ones. “Over the past twenty years, computerized processing of languages has moved over to using only statistical models instead of the linguistic knowledge they generate. We have shown that awareness of the linguistic features of text, in our case linguistic features of human translation, can also significantly benefit applications that are essentially statistical. In the future, we will need to move towards a program that combines both these characteristics,” he concluded.