Dressing up the woman of sticks: the delightful ambiguities of Czech machine translation

Ondřej Bojar, photo: David Vaughan

If you have ever used a computer translation program you will know what curious things machines can come up with, as they try to cope with the complexities and ambiguities of the language we use. So how are machines finding their way through the labyrinth of the Czech language? In this week’s Czech Books, David Vaughan talks to Ondřej Bojar from the Institute of Formal and Applied Linguistics at Prague’s Charles University.

When I first started enjoying poetry as a teenager I remember finding inspiration in William Empson’s classic, “Seven Types of Ambiguity”, which explores the wealth of meaning in poetic language. But while we humans might relish the suggestiveness of words, our more rational cousins the machines are rather less enthusiastic and the problems are only magnified when it comes to translating between two such different languages as English and Czech. Ondřej Bojar is part of a team at Charles University trying to polish the science of computer translation into and out of the Czech language. He is the author of “Czech and Machine Translation”, a summary of the current state of the art in computational linguistics. The moment you start talking to him, you are left with no doubt at all that Ondřej Bojar is a man who loves his subject.

“Machine translation is an intriguing field, which personally I really like because it’s touching the human brain and it’s what people produce, but the results of the human brain are so tangible in linguistics and computational linguistics. You can work with texts. They’re written down. That’s why I like this field of study, because I’m interested in how people think, what the mind does and how the mind processes words. Machine translation is one particular application of computational linguistics, where you can show whether you have successfully explained to the computer what the meaning is. Machine translation is a quickly evolving field and for various reasons, some more obvious, some less obvious, English has always been the main focus of computational linguistics, and so it is with machine translation. So machine translation into English operates much better than into other languages. Czech is a beautiful language because of its complexity, and the complexity is in some aspects very different from the complexity observed in English and in other languages.”

I should imagine that the complexity of Czech could in some ways make translating more difficult, but in other ways make it easier: more difficult in the sense that it’s complex, but easier in the sense that Czech is a very systematic language, a grammar-based language, where there are very strict rules.

“You’re definitely right that some things are easier and some things are harder. There are more people who work on English, and therefore the tools are more evolved, better fit for the specifics of English. So, if so much effort was devoted to Czech, then possibly English would appear more difficult these days.”

Photo: archive of Radio Prague
And so, what are the specifics of Czech that make it difficult to find ways of doing reliable machine translations?

“In short, its rich morphology. In English you have the word ‘cat’ and it has two morphological variations, the singular ‘cat’ and the plural, two ‘cats’, while for Czech we have seven cases, four genders, three numbers. It’s not that each of the different combinations would have a different ending, but still we have many more variants of the word ‘cat’. If you’re building a machine translation system and the system doesn’t have a specific component that would be capable of handling this morphology, the system has to learn independently all the morphological variations of ‘cat’. So ‘cat’ can be translated as ‘kočka’, ‘kočkou’, etc. and it’s only the context in the sentence that tells the system which of the cases is appropriate. In the book I have a couple of well-known examples, where linguists found funny sentences that have more meanings, more interpretations, than they have words. A very famous sentence is:

‘Ženu holí stroj.’

“The first word, ‘ženu’, can be either a particular case of the word ‘žena’ – a woman – or it can be a verb – ‘pushing’ or ‘chasing someone or something’. Then ‘holí’ is a highly ambiguous word because of the ending in ‘í’. That’s a very ambiguous ending in Czech in general. It can either be a particular case of the word ‘stick’ – as in walking stick – or it can be the adjective ‘hairless’ or ‘bald’. Then it can also be the verb ‘shaving’. The third word in the sentence, ‘stroj’, is, in its most common reading, ‘a machine’, but it can also be an imperative of the verb ‘to dress someone up, to put clothes on someone’. And the endings allow for quite a number of readings for the sentence. It can indicate that, ‘The woman is being shaved by the machine’. That is the most typical reading, given the word order. Then another meaning would be, ‘I’m chasing the machine with a stick’. Another meaning is the imperative of the word ‘stroj’ – to dress up. So the meaning of the whole sentence is, ‘Dress up the woman with a stick’ – i.e. use the stick to put the clothes on the woman. Related, but different, is a meaning where the stick belongs to a woman in the same sense as we have ‘The Lord of the Rings’. So there is a ‘woman of sticks’ and this is the woman that I should dress up! And then there’s one more meaning which a colleague of mine found only after the book was printed, that it’s the ‘machine of the sticks’ that I should be chasing! And now I’m sure that you’re confused by all the combinations.”
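
For readers who would like to see how the readings multiply, here is a minimal Python sketch. It is not taken from Ondřej Bojar's book or from any real analyser: the table of word analyses is hand-written and simplified from the description above, and the two filtering rules (exactly one verb per reading, and no adjective reading because this form of ‘holí’ agrees with neither noun) are assumptions made for the illustration.

```python
from itertools import product

# Hand-written, simplified analyses of each word form.
# This table is an illustrative assumption, not real analyser output.
ANALYSES = {
    "Ženu": [("žena", "noun"), ("hnát", "verb")],
    "holí": [
        ("hůl", "noun-instrumental"),   # "with a stick"
        ("hůl", "noun-genitive"),       # "of sticks"
        ("holý", "adjective"),          # "hairless, bald"
        ("holit", "verb"),              # "shaves"
    ],
    "stroj": [("stroj", "noun"), ("strojit", "verb")],
}


def plausible(reading):
    """Toy filter: exactly one verb, and no adjective reading."""
    verbs = sum(1 for _, tag in reading if tag == "verb")
    has_adjective = any(tag == "adjective" for _, tag in reading)
    return verbs == 1 and not has_adjective


def enumerate_readings(sentence):
    """Return every combination of analyses and the plausible subset."""
    options = [ANALYSES[word] for word in sentence.split()]
    combinations = list(product(*options))
    return combinations, [c for c in combinations if plausible(c)]


all_combinations, readings = enumerate_readings("Ženu holí stroj")
print(len(all_combinations), "combinations,", len(readings), "plausible readings")
for reading in readings:
    print(" + ".join(f"{lemma} ({tag})" for lemma, tag in reading))
```

Under these assumptions, the three words give sixteen raw combinations, of which five survive the filter: the same five readings listed in the interview, and more readings than the sentence has words.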

And presumably, even for the most sophisticated computer, all this is just a bit too much – at this stage.

“The problem is always the ambiguity and the context that disambiguates. Current systems operate at the level of sentences, and it’s only slowly becoming the trend that researchers focus on larger spans than individual sentences. Even within the sentences, the prevailing, so-called ‘phrase-based’ approach to machine translation, actually translates separate phrases independently of other phrases. So, when the system is translating the beginning of the sentence, it doesn’t consider the words that appear at the end and vice-versa. So current machine translation systems can very easily produce sentences that seem fluent from the beginning to the end, but never contain a verb. So they do not have any meaning for the reader. Another difficult problem of prevailing machine translation systems is that the sentence is perfectly fluent, does contain a verb, it’s a valid, perfect sentence, except that the negation is reversed, because the language model was so strong. Instead of saying that someone is a robber, it preferred the reading that was more frequent in the training data and said that person is not a robber.”
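
The reversed-negation effect can be pictured with a small Python sketch. It assumes a log-linear score that adds a translation-model term and a weighted language-model term, which is a common way of combining the two in phrase-based systems. All the probabilities below are invented for the illustration and the names `CANDIDATES` and `best_translation` are hypothetical.

```python
import math

# Invented probabilities for two candidate outputs (illustration only).
# "tm": how well the candidate matches the source sentence.
# "lm": how typical the candidate looks in the target-language training data.
CANDIDATES = {
    "he is a robber": {"tm": 0.6, "lm": 0.001},
    "he is not a robber": {"tm": 0.4, "lm": 0.01},
}


def score(features, lm_weight):
    """Log-linear score: translation model plus weighted language model."""
    return math.log(features["tm"]) + lm_weight * math.log(features["lm"])


def best_translation(lm_weight):
    """Pick the candidate with the highest combined score."""
    return max(CANDIDATES, key=lambda text: score(CANDIDATES[text], lm_weight))


print(best_translation(lm_weight=0.0))  # translation model alone decides
print(best_translation(lm_weight=1.0))  # a strong language model takes over
```

With the language model switched off, the sketch keeps the faithful sentence. With full weight, the more frequent negated sentence wins even though it says the opposite of the source.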

You talk about this training data. Is there some way of accumulating a huge corpus of sentences from real life, from real texts out in the real world, that can somehow make the computer more intelligent, or more able to think “that’s probably saying this” or “the context makes me think that this is what this sentence is saying”?

“The data is essential, so I think that almost the only proven approach to improve translation quality is to get more training data. More data is better data.”

You use this term “training data”. What actually is that?

“The training data are texts that have been previously translated by humans. For example, the European Union is an excellent institution in this respect. We, as computational linguists, love it, because the European institutions had the goal to have all their paperwork in all official languages, and also the very thing that people don’t like about the European Union, the fact that it regulates everything, is beneficial for us because the vocabulary in those texts is very wide. It describes all aspects of citizens’ lives. So it’s good for us.”

Photo: CTK
But it strikes me that you’re still a very, very long way from being able to get a machine that is going to translate reliably from English into Czech in the wide variety of different fields in which we generally tend to read texts.

“If the domain is reasonably narrow – if it’s a closed domain – then even current systems can reach near perfect quality.”

You mean specialist language in a very particular field…

“I mean rather repetitive domains. Something that already worked in the 1970s was a system that translated weather forecasts from English to French or vice-versa. That was perfect, because the domain was so restricted.”

Czech is a language spoken by only ten million people. It is surrounded by other languages. The need to translate into and out of Czech far outstrips the number of people who are able to keep pace with it. Do you think in a few years that there really will be sophisticated enough systems to make it easy to communicate into and out of Czech more or less reliably?

“You’ve mentioned the word ‘reliably’. Into English the systems have reached excellent fluency, but the better the fluency the higher the risk is that you will not notice a clear error in the meaning. So that’s why I would actually be more and more cautious as the systems evolve.”

So it’s going to be a long time before air-traffic controllers start using machine translation…

“I hope they never will! Personally, I think that computers or robots will have no chance of understanding our language without actually walking with us in the streets and having similar problems as we have, like buying food and preparing dinner. Only when the robots start living with us can they be trained to understand us in the real meaning of that word.”

And is that just sci-fi, or could it happen?

“I think it could happen.”

[And finally, in case you’re wondering why Ondřej says there are four genders in Czech – normally you would expect just three – feminine, masculine and neuter – the reason is that there are two types of masculine nouns – animate and inanimate. Grammatically they behave quite differently!]

The episode featured today was first broadcast on March 16, 2013.