How Google Knows Who’s Talking, Even Among A Noisy Crowd

Google is pretty great at figuring out what a user is saying, but is it any good at knowing who's saying it? Just look at current smart speaker technology, which can be easily fooled.

Google might have a pretty simple solution, however. Its researchers have created a deep learning system that is able to single out voices. It does this by literally looking at people's faces when they're talking.

How Google Singles Out Voices From A Crowd

First, the researchers trained their system to recognize individual people speaking alone. They then created virtual noise, mixing in other speakers to simulate a crowd, to teach the artificial intelligence to split the combined audio into separate tracks and to recognize which track belongs to which speaker.
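The mixing step described above can be sketched in a few lines of Python. This is only an illustration of the general idea (sum clean recordings into one noisy track and keep the clean originals as training targets), not Google's actual pipeline, which also uses the video of each speaker's face. The function name make_mixture, the noise_gain parameter, and the sine tones standing in for real speech are all assumptions made for this example.

```python
import math

def make_mixture(clean_tracks, noise=None, noise_gain=0.3):
    """Sum clean single-speaker tracks into one synthetic "crowd" track.

    Hypothetical helper for illustration. Returns the mixture plus the
    clean tracks trimmed to a common length; a separation model would be
    trained to recover those clean tracks from the mixture.
    """
    # Trim everything to the shortest input so samples line up.
    n = min(len(track) for track in clean_tracks)
    if noise is not None:
        n = min(n, len(noise))
    targets = [list(track[:n]) for track in clean_tracks]

    # Mixing audio is sample-wise addition of the waveforms.
    mixture = [sum(samples) for samples in zip(*targets)]

    # Optionally add scaled background noise on top of the speakers.
    if noise is not None:
        mixture = [m + noise_gain * x for m, x in zip(mixture, noise)]
    return mixture, targets

# Stand-in signals: two sine tones instead of real speech recordings.
sample_rate = 16000
speaker_a = [math.sin(2 * math.pi * 220 * i / sample_rate) for i in range(16000)]
speaker_b = [math.sin(2 * math.pi * 330 * i / sample_rate) for i in range(12000)]

mixture, targets = make_mixture([speaker_a, speaker_b])
```

Because the clean tracks are known in advance, every synthetic mixture comes with its own answer key, which is what makes this kind of training data cheap to generate in bulk.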

The results are astounding. In a demonstration video, the AI separates the voices of two stand-up comedians even when their speech overlaps, and it does this just by looking at their faces. The trick works even when the comedians' faces are only partially visible, such as when they are slightly blocked by a microphone.

Google's research is detailed in a paper called "Looking to Listen at the Cocktail Party," named after the cocktail party effect in which people are able to focus on one audio source despite the surrounding noise and distractions.

"Our method works on ordinary videos with a single audio track, and all that is required from the user is to select the face of the person in the video they want to hear, or to have such a person be selected algorithmically based on context," the researchers write in a blog post.

Can This Be Useful?

The researchers are still trying to determine how this technology might be integrated into Google's products, but likely candidates are not hard to imagine. The most obvious are video services such as Hangouts or Duo, which could use the feature to amplify a person's voice when they are speaking against overwhelming crowd noise. There are also big implications for accessibility, as Engadget notes: AI-powered voice tracking may lead to camera-assisted hearing aids that make a voice louder when the speaker is in front of the wearer.

There are privacy implications as well, however. Imagine the technology advancing to the point where it can pinpoint a specific voice on a bustling street in a city such as New York. Combined with security cameras, Google's new tech adds yet more fuel to the panic over security. Time, however, will tell.