Computers can now identify objects in images, but can they tell you what those objects are doing?

For the most part, the answer is no. A team of researchers hopes that will change: they have devised a new "visual Turing test" that checks whether a computer can look at an image and understand it more like a human would.

For example, show an object recognition algorithm a photo, and it can tell you whether there is more than one person in the picture and even identify certain objects within it. What it can't tell you is whether two people in the photo are talking to each other.

Artificial intelligence keeps improving, though, so perhaps someday a computer will understand the action taking place in a photo, much like a human would. Perhaps a computer could eventually even understand emotion, such as whether the two people in the previous example are yelling at each other or laughing at something.

"There have been some impressive advances in computer vision in recent years," says Stuart Geman, the James Manning Professor of Applied Mathematics at Brown University. "We felt that it might be time to raise the bar in terms of how these systems are evaluated and benchmarked."

The test designed by Geman and his team probes for contextual understanding of photos. It goes something like this: in the photo with two people, the test might ask the computer whether there's a person in a certain area of the photo, and then whether there's another person in that area, too. If the computer answers "yes" to both, the test asks, "Are these two people talking?"

A computer that can correctly answer that question, followed by more in-depth questions, passes the test.

In essence, the test determines whether a computer has a contextual understanding of what it sees in a scene and can work out the scene's storyline. The questions themselves are generated by a computer, which makes the test more objective than having a human quiz a computer about an image. The only role humans play is deciding when a question is unanswerable. For example, if one person in the photograph has their body hidden, a computer could not answer a question about whether they're carrying something.
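To make the question-and-answer flow concrete, here is a rough Python sketch of how such a test loop might be organized. It is an illustration only, not the researchers' actual system: the names `Question`, `run_visual_test` and `answer_fn` are hypothetical, the hard-coded question list stands in for computer-generated questions, and the `truth` dictionary stands in for the correct answers plus the human judgment about which questions are unanswerable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """A yes/no question about a photo (hypothetical structure)."""
    text: str
    # Texts of earlier questions that must have been correctly
    # answered "yes" before this follow-up makes sense.
    requires: tuple = ()


def run_visual_test(questions, truth, answer_fn):
    """Ask the questions in order and return (asked, correct).

    truth maps question text to True/False, or to None when a human
    judge has marked the question unanswerable (assumed convention).
    answer_fn is the vision system under test: text -> bool.
    """
    confirmed = set()
    asked = correct = 0
    for q in questions:
        # Skip follow-ups whose prerequisites were not established.
        if any(r not in confirmed for r in q.requires):
            continue
        expected = truth.get(q.text)
        # Skip questions a human has ruled unanswerable.
        if expected is None:
            continue
        asked += 1
        answer = answer_fn(q.text)
        if answer == expected:
            correct += 1
        if answer and expected:
            confirmed.add(q.text)
    return asked, correct


q1 = "Is there a person in region A?"
q2 = "Is there a second person in region A?"
q3 = "Are these two people talking?"
q4 = "Is the first person carrying something?"

questions = [
    Question(q1),
    Question(q2, requires=(q1,)),
    Question(q3, requires=(q1, q2)),
    Question(q4, requires=(q1,)),
]

# q4 is unanswerable here: imagine the person's body is hidden.
truth = {q1: True, q2: True, q3: True, q4: None}


def detector_only(text):
    """Toy system that can spot people but not what they are doing."""
    return "person in region" in text


print(run_visual_test(questions, truth, detector_only))
```

In this toy run, the detector-style system handles the two "is there a person" questions but misses the one about talking, which is the kind of gap the test is meant to expose. The question about carrying something is never scored because it was marked unanswerable.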

Of course, few if any computers have this sort of contextual understanding just yet, but putting a test in place should spur those creating AI algorithms to develop machine learning techniques that allow computers to better understand what they see in photographs.

"As researchers, we tend to 'teach to the test,'" says Geman. "If there are certain contests that everybody's entering and those are the measures of success, then that's what we focus on. So it might be wise to change the test, to put it just out of reach of current vision systems."

Photo: Tech Cocktail | Flickr
