Falsehoods more likely with large language models

The Transform Technology Summits start October 13th with Low-Code/No Code: Enabling Enterprise Agility. Register now!

There’s growing interest in using AI language models to generate text for business applications. Large companies are deploying their own systems while others are leveraging models like OpenAI’s GPT-3 via APIs. According to OpenAI, GPT-3 is now being used in over 300 apps by thousands of developers, producing an average of more than 4.5 billion novel words per day.

But while recent language models are impressively fluent, they have a tendency to write falsehoods ranging from factual inaccuracies to potentially harmful disinformation. To quantify the risks associated with “deceptive” models, researchers at the University of Oxford and OpenAI created a dataset called TruthfulQA containing questions that some humans might answer incorrectly due to false beliefs or misconceptions. They found that while the best-performing model was truthful on 58% of questions, it fell short of human performance at 94%.

TruthfulQA

In the subfield of AI known as natural language processing (NLP), robustness testing can be the exception rather than the norm. One report found that 60% to 70% of answers given by NLP models were embedded somewhere in the benchmark training sets, indicating that the models were usually simply memorizing answers. Another study found that metrics used to benchmark AI and machine learning models tended to be inconsistent, irregularly tracked, and not particularly informative.

TruthfulQA aims to avoid these benchmarking pitfalls with a bank of questions about health, law, finance, and politics that requires models to avoid generating false answers learned from text. The dataset spans 817 questions in 38 different categories, all of which were written by the researchers in way that some humans and models might answer falsely.

The researchers tested several different models on TruthfulQA, including GPT-3; GPT-3’s predecessor, GPT-2; open source versions of GPT-3 called GPT-Neo and GPT-J; and UnifiedQA, a model fine-tuned on question-answer tasks. To classify answers from the models as either true or false, they developed “GPT-judge,” an algorithm trained on answers to questions from TruthfulQA from all of the evaluated models.

Above: Examples of falsehoods generated by models tested on the dataset.

Interestingly, the results show that larger models generally perform worse than smaller models in the same family. The size of a model is measured by its number of parameters, the variables internal to the model that the model learns from historical training data. For example, the largest GPT-Neo and GPT-J models were 17% less truthful (as measured by TruthfulQA) than a model 60 times as small. Meanwhile, UnifiedQA did better on truthfulness than the three GPT families, with the largest model performing only slightly worse than the smallest.

When forced to choose from multiple answers rather than generate them, larger models also did worse on TruthfulQA than smaller ones. No models significantly outperformed random guessing. And even the “best” model gave false answers 42% of the time versus 6% for human participants. (Eighty-seven percent of the humans’ answers were true on TruthfulQA.)

The researchers speculate that the models haven’t learned the training distribution well enough or that the models’ training objectives actually incentivizes a false answer. “We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web,” the researchers wrote in a preprint paper, “TruthfulQA: Measuring How Models Mimic Human Falsehood.” They added: “[Our preliminary work] find[s] that today’s large models are much less truthful than humans.”

Large language models

The work adds to growing skepticism that the size of language models — and their training datasets — correspond to performance. Earlier this month, a team of Google researchers published a study claiming that a model much smaller than GPT-3, fine-tuned language net (FLAN), bests GPT-3 by a large margin on a number of challenging benchmarks. And scientists at the Institute for Artificial Intelligence at the Medical University of Vienna, Austria found that GPT-3 underperforms in domains like biomedicine compared with smaller, less architecturally complex but carefully fine-tuned model

Maria Antoniak, a natural language processing researcher and data scientist at Cornell University, says when it comes to natural language, it’s an open question whether larger models are the right approach. While some of the best benchmark performance scores today come from large datasets and models, the payoff from dumping enormous amounts of data into models is uncertain.

“The current structure of the field is task-focused, where the community gathers together to try to solve specific problems on specific datasets,” Antoniak told VentureBeat in a previous interview. “These tasks are usually very structured and can have their own weaknesses, so while they help our field move forward in some ways, they can also constrain us. Large models perform well on these tasks, but whether these tasks can ultimately lead us to any true language understanding is up for debate.”

VentureBeat

VentureBeat's mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact. Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

up-to-date information on the subjects of interest to you
our newsletters
gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
networking features, and more

Become a member