Can a computer mimic you speaking a language you do not speak? Microsoft recently demonstrated their Speech-to-Speech translation system. This is a system that takes audio in one language, translates the speech into another language, and then outputs speech in the new language. In their case, the presenter spoke English and the audience heard Mandarin. As I understand it, there are three main steps.
- Speech Recognition: As the audio comes in, the recognizer converts the speech into text. Most people have probably used this at some point in their life, whether talking to Siri or interacting with their bank over the phone.
- Machine Translation: Given a sentence, the system translates the text into a different language, hopefully maintaining the original meaning. Google Translate is one obvious example.
- Speech Synthesis: A synthesis system takes text and converts it into speech. This is how both Siri and Stephen Hawking are able to communicate with the world.
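The three steps above chain together naturally. Here is a minimal sketch of the pipeline in Python; the stage functions are toy stand-ins I made up for illustration (a real recognizer, translator, and synthesizer are each large systems in their own right):

```python
def recognize(audio):
    """Speech recognition stub: source-language audio -> text.
    Toy assumption: the 'audio' bytes are just the transcript."""
    return audio.decode("utf-8")

def translate(text, source, target):
    """Machine translation stub: a one-entry toy lexicon."""
    toy_lexicon = {("en", "zh"): {"hello": "你好"}}
    return toy_lexicon[(source, target)].get(text, text)

def synthesize(text):
    """Speech synthesis stub: target-language text -> 'audio' bytes."""
    return text.encode("utf-8")

def speech_to_speech(audio, source="en", target="zh"):
    text = recognize(audio)                        # step 1: recognition
    translated = translate(text, source, target)   # step 2: translation
    return synthesize(translated)                  # step 3: synthesis

output_audio = speech_to_speech(b"hello")
```

The interesting engineering is hidden inside each stub, but the composition itself really is this simple: text is the interchange format between all three stages.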
None of those three steps is a fully solved problem, but systems exist that work reasonably well. Microsoft claims their system not only accomplishes those three steps, but that the translated speech will be in your own voice.
Synthesizing your Voice
The general principle behind a speech synthesis system—at least the type of system I believe they are using—is to build a general statistical model of speech. The details are not important for us, but the computer listens to speech from many speakers (in one language) and learns the general properties of speech. When the system needs to mimic your voice, it listens to a smaller amount of speech from you and adapts its model; maybe it learns you nasalize your vowels a bit or at least the general pitch of your voice.
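A toy way to picture that adaptation step: treat the "model" as nothing more than an average acoustic property, say pitch in Hz. Real adaptive synthesis systems adjust thousands of distribution parameters, but the shape of the idea is the same: start from a many-speaker average and nudge it toward the little data you have from the target voice. Everything below (the numbers, the `weight` knob) is an invented illustration, not how any production system is tuned:

```python
def average(values):
    return sum(values) / len(values)

def train_background_model(speakers_pitch):
    """Learn the 'general properties of speech' from many speakers:
    here, just the average of each speaker's average pitch."""
    return average([average(p) for p in speakers_pitch])

def adapt(background_pitch, target_samples, weight=0.8):
    """Interpolate from the background model toward the target speaker.

    With very little adaptation data you would trust the background
    model more (lower weight); with lots of data, trust the speaker."""
    return (1 - weight) * background_pitch + weight * average(target_samples)

# Three training speakers, a couple of pitch measurements each (Hz).
background = train_background_model([[120, 125], [210, 220], [150, 160]])

# A few samples of the new speaker pull the model toward their voice.
my_voice = adapt(background, [95, 100, 105])
```

The interpolation weight is the crux: it encodes how much of "speech in general" survives versus how much of "you" the model has actually heard.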
However, when it comes to synthesizing the voice in a different language, that is when I start to have questions. The novel part, to me, is that your voice supposedly speaks the foreign language. Let me try to explain why I find this so interesting.
Why is it Hard?
Not all languages contain the same sounds. Mandarin has a rounded vowel (the ü sound, IPA /y/) that does not exist in English. Try saying the word “see”, but round your lips. No matter how many hours I have of you speaking English, I will never get an example of you making that sound—a sound that is needed for Mandarin. From other speakers of Mandarin, the system can learn how to speak that sound, but not how you speak the sound.
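The coverage gap is easy to make concrete with a set difference. The vowel inventories below are deliberately simplified and incomplete (my rough sketch, not an authoritative phonology of either language), but they show the shape of the problem:

```python
# Simplified, illustrative vowel inventories in IPA -- not complete.
english_vowels = {"i", "ɪ", "ɛ", "æ", "ɑ", "ɔ", "ʊ", "u", "ʌ", "ə"}
mandarin_vowels = {"i", "y", "u", "a", "o", "ə", "ɤ"}

# Sounds Mandarin needs that no amount of your English audio contains:
missing = mandarin_vowels - english_vowels
print(sorted(missing))  # includes "y", the rounded vowel from the "see" example
```

However many hours of English you record, every sound in `missing` has zero examples in your voice; the system can only borrow those sounds from other speakers or extrapolate.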
This issue does not make the problem impossible, just difficult. Based on the way you speak other sounds, the system may be able to extrapolate. It might also be able to simply ignore these gaps without significantly affecting the result. But then, how can you really judge the quality of the synthesis?
What does it Mean?
Using your voice to speak a foreign language can mean several things. Obviously a person can just attempt to read the foreign language and produce something that is probably unintelligible to a native speaker. I assume the goal of the Microsoft system is something different.
Should it produce something completely fluent, without even a trace of accent? My guess is this would obliterate much of the qualities that make your voice recognizable as your own. Assume you spent the next few years diligently studying a particular foreign language. My ideal is that the system should mimic how your speech would sound after that study. If that sounds like an untestable and maybe impossible goal, it probably is.
How much does a person’s vocal characteristics vary from language to language? Would you even recognize the voice of someone you know speaking a different language, a language you have never heard them speak before? I had my suspicions, but it appears more plausible than I originally thought. I found a video online that contains clips of celebrities speaking languages other than English. Two of the early ones, Arnold Schwarzenegger and Barack Obama, are clearly recognizable even though they are not speaking English.
In the end, this is a very interesting problem that I had not considered before.