It is second nature for us humans to anticipate the actions of other people. When we meet a friend, do we hug, bump fists or shake hands? Whichever it is, our intuition tells us how to respond.
For an artificial intelligence, however, it is much more complicated. AI systems struggle to apply such nuanced social knowledge unless it is explicitly programmed into them.
So how could an AI system learn human behavior? By binge-watching TV shows and YouTube videos, as a newly developed algorithm demonstrates.
A True Couch Potato
Scientists from the Massachusetts Institute of Technology (MIT) developed an algorithm that can anticipate human interactions more accurately than previous systems.
The machine was trained on YouTube videos as well as 600 hours of clips from TV shows such as The Big Bang Theory, The Office and Desperate Housewives.
From this training, the algorithm can predict whether two people are about to shake hands, high-five, kiss or hug. In a second task, it anticipates what object could appear in a video five seconds later. It searches for patterns and recognizable objects such as human faces and hands.
After being fed this background material, the algorithm was shown new clips. Researchers froze each clip just before something was about to happen and asked the algorithm to predict what came next.
About 43 percent of the time, the computer was able to correctly identify the next action.
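The evaluation described above can be sketched in code. The snippet below is an illustrative toy, not the MIT model: each test clip is "frozen" just before a greeting, a stand-in model predicts which of four greetings follows, and accuracy is the fraction of correct predictions. The data, the feature vectors and the majority-class baseline are all hypothetical placeholders for a trained network.

```python
# Toy sketch of the freeze-and-predict evaluation protocol.
# The "model" here is a majority-class baseline, NOT the MIT network.
from collections import Counter

GREETINGS = ("handshake", "high-five", "hug", "kiss")

def train_majority_baseline(labeled_clips):
    """Return a model that always predicts the most frequent
    greeting seen in the training labels."""
    counts = Counter(label for _clip, label in labeled_clips)
    majority, _ = counts.most_common(1)[0]
    return lambda _frozen_clip: majority

def accuracy(model, test_clips):
    """Fraction of frozen clips whose next action the model gets right."""
    correct = sum(model(clip) == label for clip, label in test_clips)
    return correct / len(test_clips)

# Hypothetical data: (frozen-clip features, ground-truth next action).
train = [([0.1], "hug"), ([0.4], "hug"), ([0.2], "handshake"), ([0.9], "kiss")]
test = [([0.3], "hug"), ([0.5], "kiss"), ([0.7], "hug"), ([0.8], "high-five")]

model = train_majority_baseline(train)
print(accuracy(model, test))  # 0.5 -- two of the four test clips are hugs
```

A baseline like this is roughly what the reported 43 percent figure is measured against: with four possible greetings, random guessing lands near 25 percent, so any real anticipation model has to clear that bar.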
Although that rate is lower than for humans, who identified the correct action 71 percent of the time, the scientists say the result is still quite good for a computer: it beats the 36 percent rate achieved in earlier experiments.
Why The Research Is Important
It may seem like human greetings are too mundane or arbitrary to predict, but researchers say the task serves as a more easily controllable test case for them to investigate.
Carl Vondrick, a PhD student at MIT, says they wanted to show that just by binge-watching large amounts of video, computers can collect and absorb enough knowledge to accurately make predictions about their environment.
"Humans automatically learn to anticipate actions through experience," says Vondrick, "which is what made us interested in trying to imbue computers with the same sort of common sense."
Although it will be a long time before the algorithm is put to practical use, researchers say future, more sophisticated versions could be applied in different fields: from robots that create better action plans to security cameras that alert responders when a person is injured.
It could also be used to improve the navigational abilities of robots or in Google Glass-style headsets that can offer suggestions on what a person could do next.
Details of the MIT research, which was conducted at the Computer Science and Artificial Intelligence Laboratory (CSAIL), will be presented at the Conference on Computer Vision and Pattern Recognition (CVPR).
The work, which was supported by the National Science Foundation, was co-authored by MIT Professor Antonio Torralba and University of Maryland Professor Hamed Pirsiavash.