Gender Bias In Machine Translation

Machine translation models are trained on huge corpora of text made up of sentence pairs, where each sentence in a pair is a translation of the other. However, there are nuances in language that often make it difficult to provide an accurate and direct translation from one language to another.
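As a rough illustration (not any particular system's data format), the training data can be thought of as a list of source/target sentence pairs:

```python
# A minimal sketch of parallel training data for an English -> Spanish model.
# Illustrative sentences only; real corpora contain millions of aligned pairs.
parallel_corpus = [
    ("the book is on the table", "el libro está en la mesa"),
    ("we are leaving tomorrow",  "nos vamos mañana"),
]

# The model learns to map each source sentence to its paired target sentence,
# so its output reflects whatever patterns are most common in these pairs.
```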

When translating from English to languages such as French or Spanish, some gender-neutral nouns will be translated into gender-specific nouns. For example, the word “friend” in “his friend is kind” is gender neutral in English. However, in Spanish it is gender specific, either “amiga” (feminine) or “amigo” (masculine).

In Spanish the word “friend” is gender specific, either “amiga” or “amigo”

Another example is translation from Turkish to English. Turkish is an almost entirely gender-neutral language. The pronoun “o” in Turkish can be translated into English as any of “he”, “she” or “it”. Google claims that 10% of its Turkish Translate queries are ambiguous and could be correctly translated with either gender.
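A toy lookup makes the ambiguity concrete (the pronouns are real Turkish; the dictionary is of course not how a neural translator actually works):

```python
# Turkish third-person pronoun "o" has no single English equivalent:
# without further context, all three of these are valid translations.
TURKISH_PRONOUN_TRANSLATIONS = {
    "o": {"he", "she", "it"},
    "ben": {"I"},
    "sen": {"you"},
}

def english_options(turkish_pronoun: str) -> set[str]:
    """Return every English pronoun consistent with the Turkish source."""
    return TURKISH_PRONOUN_TRANSLATIONS.get(turkish_pronoun, set())

print(english_options("o"))  # {'he', 'she', 'it'} (set order may vary)
```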

In both these examples, we can see how a phrase in one language can be correctly translated into another language in different ways depending on gender. Neither is more correct than the other, and a human faced with the same translation task would encounter the same ambiguity unless given further context. (The only difference is that a human might know to ask for further context, or else provide both translations.) This means it is incorrect to assume that there is always a single correct translation for any given word, phrase or sentence when translating from one language to another.

It is now easy to understand why Google Translate was having issues with gender bias. If societal biases meant more men than women had historically become doctors, there would be more examples of male doctors than female doctors in the training data, which is simply an accurate historical record of that gender imbalance. The model would learn from this data, resulting in a bias: doctors are more likely to be male.
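A back-of-the-envelope sketch shows how that imbalance becomes a learned preference; the counts below are invented purely for illustration:

```python
from collections import Counter

# Hypothetical counts of gendered references to "doctor" in a training corpus.
corpus_counts = Counter({("doctor", "he"): 900, ("doctor", "she"): 100})

total = sum(corpus_counts.values())
p_he = corpus_counts[("doctor", "he")] / total    # 0.9
p_she = corpus_counts[("doctor", "she")] / total  # 0.1

# A model trained on this data will assign roughly these probabilities,
# so "doctor" ends up statistically associated with "he".
print(f"P(he | doctor) = {p_he:.2f}, P(she | doctor) = {p_she:.2f}")
```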

Now, when faced with the task of finding a single English translation for the Turkish phrase “o bir doktor” (“he/she is a doctor”), the model will assume “o” should be translated as “he”, since doctors are more likely to be male.
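If the system is forced to return a single output, the decision reduces to picking whichever candidate the model scores higher, and under skewed probabilities like those above that is always the masculine form. A hypothetical sketch:

```python
# Hypothetical model scores for the two valid translations of "o bir doktor".
# With skewed training data, the masculine form scores higher, so a system
# that returns only the single best candidate always outputs "he".
candidates = {
    "he is a doctor": 0.9,
    "she is a doctor": 0.1,
}

best_translation = max(candidates, key=candidates.get)
print(best_translation)  # "he is a doctor"
```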

One might see how the opposite could occur for nurses.
