This new dataset shows that AI still lacks commonsense reasoning

Abductive reasoning, often confused with deductive reasoning, is the process of forming a plausible explanation from incomplete information. For example, given a photo showing a toppled truck and a police cruiser on a snowy freeway, abductive reasoning might lead someone to infer that dangerous road conditions caused an accident.

Humans can quickly weigh this sort of context to arrive at a hypothesis. But AI struggles, despite recent technical advances. Motivated to explore the challenge, researchers at the Allen Institute for Artificial Intelligence, the University of California, Berkeley, and the MIT-IBM Watson AI Lab created Sherlock, a dataset of more than 100,000 images of scenes, each paired with clues a viewer could use to answer questions about the scene. As project contributor Jack Hessel, a research scientist at the Allen Institute, explains, Sherlock is designed to test whether AI systems can reason abductively by observing visual and textual clues.
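To make that structure concrete, here is a minimal sketch of what a single clue-and-inference example might look like as data. The field names and values are purely illustrative, not the dataset’s actual schema.

```python
# A minimal sketch of a Sherlock-style example. Field names are hypothetical;
# the real dataset's schema and annotations may differ.
from dataclasses import dataclass

@dataclass
class AbductionExample:
    image_url: str   # the full scene photograph
    region: tuple    # (x, y, width, height) bounding box around the visual clue
    clue: str        # literal description of what is inside the region
    inference: str   # a plausible conclusion that goes beyond what is depicted

example = AbductionExample(
    image_url="https://example.com/snowy_freeway.jpg",
    region=(120, 80, 340, 210),
    clue="a toppled truck and a police cruiser on a snowy freeway",
    inference="dangerous road conditions likely caused an accident",
)
```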

“With Sherlock, we aimed to study visual abductive reasoning: i.e., probable and salient conclusions that go beyond what’s literally depicted in an image. We named the dataset after Sherlock Holmes, who iconically embodies abduction,” Hessel told VentureBeat via email. “This type of commonsense reasoning is an important part of human cognition; Jerry Hobbs, distinguished computational linguist and ACL 2013 Lifetime Achievement Award winner, perhaps put it best in his acceptance speech: ‘The brain is an abduction machine, continuously trying to prove abductively that the observables in its environment constitute a coherent situation.'”

An example of a scene in the Sherlock abductive reasoning dataset.

An AI system with commonsense reasoning has long been a goal of the AI research community, particularly for those who subscribe to the belief that artificial general intelligence — AI that can perform any task as well as a human — is possible. “Commonsense” usually refers to implicit background knowledge, like knowing that dropping a glass in a restaurant could cause it to loudly — and embarrassingly — shatter. While this background knowledge is easy for humans to pick up, it requires a sophisticated physical, social, and causal understanding of the world.

Even the best AI systems today fall short. For example, asking a language model like OpenAI’s GPT-3 — which can write poetry and short stories — to finish the sentence “If I drop a glass while eating in a restaurant” can lead to the implausible result “everyone around me will panic, and the restaurant will likely be shut down for the night.” Meanwhile, Macaw, a question-answering and question-generating model the Allen Institute recently open-sourced, absurdly answers the question “What happens if I drop a glass on a bed of feathers?” with “The glass shatters.”
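Probes like this amount to little more than feeding a model a commonsense prompt and inspecting its completion. The sketch below shows roughly how such a probe might be run, assuming the legacy (pre-1.0) openai Python client; the model name and sampling settings are illustrative, not the ones used in the examples above.

```python
# A rough sketch of probing a language model's commonsense with a completion
# prompt. Assumes the legacy (pre-1.0) openai client; model name is illustrative.
import openai

openai.api_key = "YOUR_API_KEY"

response = openai.Completion.create(
    model="text-davinci-002",  # illustrative model choice
    prompt="If I drop a glass while eating in a restaurant,",
    max_tokens=30,
    temperature=0.7,
)

# The completion may or may not describe a plausible real-world consequence --
# exactly the gap benchmarks like Sherlock try to measure.
print(response.choices[0].text.strip())
```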

“Sherlock … attempts to put a number to the gap between human understanding and machine performance,” Hessel said. “Our best [systems] can perform some abductive reasoning, although there is still a significant gap between human and machine performance.”

Over the course of several experiments, Hessel and colleagues tested AI systems’ ability to reason abductively over the Sherlock examples. They had the systems look at specific regions within images in the dataset — for example, a coffee cup sitting on a table in a café — and asked questions like “What do you think of this coffee cup?”
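One simplified way to picture this setup is as region-conditioned ranking: crop the region of interest, then score candidate inferences against it with an image-text model. The sketch below uses an off-the-shelf CLIP model from Hugging Face as a stand-in; the Sherlock paper’s actual models, candidates, and evaluation protocol differ, and the file name, crop box, and candidate sentences here are hypothetical.

```python
# Illustrative region-conditioned inference ranking with an off-the-shelf CLIP
# model. Not the Sherlock evaluation itself; inputs are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical inputs: a cafe photo and a bounding box around the coffee cup.
image = Image.open("cafe_scene.jpg").crop((120, 80, 460, 290))

candidates = [
    "this cup was served to a customer in a restaurant",
    "someone abandoned this cup on a park bench",
    "this cup is part of a store display",
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_candidates)

best = candidates[logits.softmax(dim=-1).argmax().item()]
print("Most plausible inference:", best)
```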

According to Hessel, the best systems could make abductions like a pair of running shoes might mean that the owner is a runner and that a cup of coffee neatly served on a plate might imply that it’s in a restaurant. But the systems fell short of human-level reasoning. In one instance, they couldn’t tell that a picture of a boat with a sign that read “buy now! plants and flowers” was likely a florist shop — perhaps because they’d never seen a florist on a boat before.

“AI models work much better than they did five years ago for commonsense reasoning and beyond; this is exciting progress! The current mainstream recipe for doing well on benchmarks (training enormous, billion-parameter models on giant datasets), to me, represents the furthest practical logical extension of learning from observation. But: 1) this is clearly not how humans learn about the world; and 2) training in this way raises new practical and ethical concerns,” Hessel said.

One proposed alternative is “neurosymbolic” reasoning, which combines symbolic reasoning with a neural network’s ability to learn from experience. In a recent study, researchers from MIT, IBM, and Alphabet’s DeepMind developed a system programmed to understand concepts like “objects” and “spatial relationships” in text. One part of the system observed many different scenes containing objects, while another learned to answer questions in English. The fully trained system could answer questions about different scenes by recognizing visual concepts in those questions and — as an added benefit — it required far less data than conventional models.
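A toy sketch of the neurosymbolic idea: perception produces a symbolic scene representation, and a question is answered by executing a small program over it. In real systems both the scene parser and the question-to-program mapping are learned from data; the scene, vocabulary, and program below are purely illustrative.

```python
# Toy neurosymbolic pipeline: a symbolic scene plus a small executable program.
# In practice, a neural module would produce the scene and parse the question.
scene = [
    {"shape": "cube", "color": "red", "position": (0, 0)},
    {"shape": "sphere", "color": "blue", "position": (2, 0)},
]

def filter_color(objects, color):
    return [o for o in objects if o["color"] == color]

def count(objects):
    return len(objects)

# "How many red objects are there?" parsed (by a learned model, in practice)
# into the program: count(filter_color(scene, "red"))
program = lambda s: count(filter_color(s, "red"))
print(program(scene))  # -> 1
```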

Hessel hopes that Sherlock will spur progress toward systems — neurosymbolic or no — with stronger commonsense reasoning. He concedes that the dataset contains best-guess, probable conclusions about scenes drawn by humans who have their own biases and perspectives (abduction, by its nature, carries with it uncertainty). But he believes that Sherlock could be an important step toward systems that communicate “fluently and meaningfully” with humans.

“While prior work has shown humans readily correct mistaken assumptions in the face of new information (for example, if we also notice in the ’20mph sign’ photo from before a ‘wildlife crossing’ warning, we might change our mind about the residential area), our models currently don’t have the same capacity, though,” Hessel said. “[T]his would make for promising future work!”

