With Quiet-STaR, language models learn to think before speaking

Humans are gifted with the ability to reason: asking “if” and “why,” reading “between the lines” and inferring unstated information are all critical to our problem-solving capabilities. 

Until now, AI models have struggled in this area. But researchers from Stanford University and Notbad AI, Inc. report that they have taught AI models to think before they respond to prompts, just as (most) people consider what to say before speaking. 

The researchers have introduced Quiet-STaR, an extension of the Self-Taught Reasoner (STaR) method. It is trained on a broad corpus of internet text and learns to generate rationales at each token to explain future text and improve its predictions.

Quiet-STaR was applied to Mistral 7B, showing improvements to zero-shot direct reasoning abilities on the CommonsenseQA question-answering challenge (from 36.3% base to 47.2%) and the GSM8K grade school math word problems dataset (from 5.9% base to 10.9%). These improvements increased consistently with the number of tokens used in the model’s “internal thoughts.”
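To put the reported numbers in perspective, the short sketch below computes the absolute gain (in percentage points) and the relative gain for each benchmark. The accuracies are the ones quoted above; the helper name `gains` is purely illustrative.

```python
# Zero-shot accuracies (percent) as reported in the article: (base, Quiet-STaR).
results = {
    "CommonsenseQA": (36.3, 47.2),
    "GSM8K": (5.9, 10.9),
}


def gains(base, tuned):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = tuned - base
    relative = 100.0 * absolute / base
    return round(absolute, 1), round(relative, 1)


summary = {name: gains(base, tuned) for name, (base, tuned) in results.items()}
for name, (absolute, relative) in summary.items():
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

On these figures, CommonsenseQA improves by 10.9 points (about 30% relative) and GSM8K by 5.0 points (about 85% relative), although the GSM8K baseline is low to begin with.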

“Quiet-STaR marks a step towards LMs that can learn to reason in a more general and scalable way,” the researchers write. 

Where AI reasoning has so far come up short

Previous methods that have helped language models learn from their reasoning have been narrowly focused and less general: AIs have been trained to solve individual tasks or predefined sets of tasks that rely on carefully curated datasets. 

For instance, a pre-trained language model fine-tuned to output human reasoning traces before answering multiple-choice questions outperformed an AI trained directly on answers, the Quiet-STaR developers pointed out. Other models, when provided with “scaffolding,” can generate chain-of-thought solutions without additional supervision. Further, researchers have “forced” models to use chain-of-thought reasoning by preventing them from answering unless completely confident. 

“However, once again, these approaches only work for a question-answer dataset,” the Stanford University and Notbad AI, Inc., researchers contend. 

STaR, in particular, showed that models could “bootstrap” their reasoning abilities on question-answering datasets. They could sample rationales to attempt to answer questions, train on those rationales if they led to correct answers, and repeat the process iteratively to solve more and more difficult problems. 
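The bootstrapping loop described above can be sketched in a few lines of Python. This is a toy illustration, not the researchers' code: `ToyModel`, `star_round`, the addition questions and the "memorize what worked" training step are all hypothetical stand-ins for a real language model and fine-tuning run.

```python
import random


class ToyModel:
    """Hypothetical stand-in for a language model (an assumption for this sketch)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.memory = {}  # question -> (rationale, answer) learned so far

    def sample(self, question):
        # Reuse a learned rationale if there is one, otherwise make a noisy guess.
        if question in self.memory:
            return self.memory[question]
        a, b = map(int, question.split("+"))
        guess = a + b + self.rng.choice([-1, 0, 1])
        return f"{a} plus {b} is {guess}", guess

    def train(self, examples):
        # "Training" here just memorizes rationales that led to correct answers.
        for question, rationale, answer in examples:
            self.memory[question] = (rationale, answer)


def star_round(model, dataset, samples_per_question=4):
    """One STaR-style iteration: sample rationales, keep those whose answer
    matches the gold answer, then train on the kept examples."""
    kept = []
    for question, gold in dataset:
        for _ in range(samples_per_question):
            rationale, answer = model.sample(question)
            if answer == gold:
                kept.append((question, rationale, answer))
                break  # one correct rationale per question is enough here
    model.train(kept)
    return kept


dataset = [("2+3", 5), ("10+7", 17), ("4+4", 8)]
model = ToyModel(seed=0)
history = []
for _ in range(5):
    kept = star_round(model, dataset)
    history.append(len(kept))
print(history)  # number of questions solved in each round
```

Because a question stays solved once a correct rationale has been learned, the per-round count can only stay level or grow, which is the bootstrapping effect in miniature.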

However, the Quiet-STaR researchers point out that training from curated datasets limits the “scale and generalizability” of rationales. High-quality datasets will “inherently only ever cover a subset of reasoning tasks.”

Inferring rationales from few-shot examples in question-answering is a “highly-constrained setting,” the researchers assert. “Ideally, a language model could instead learn to infer unstated rationales in arbitrary text.”

By extending STaR, “we allow the LM to learn from the diverse tasks present in the language. To our knowledge, this is the first work explicitly training LMs to reason generally from text, rather than on curated reasoning tasks or collections of reasoning tasks.”

‘Quietly’ thinking

The Stanford University and Notbad AI, Inc. researchers refer to their technique as Quiet-STaR because it applies STaR “quietly.” 

The method generates many inner thoughts in parallel, at every token, to explain future text before responding to a prompt (i.e., the process of “thinking”). When the AI finally answers, it produces a mixture of predictions with and without rationales. 

The researchers then applied REINFORCE, a reinforcement learning algorithm that uses sampled episodes to update policy parameters; here it updates the model’s parameters as well as the start-of-thought and end-of-thought embeddings. The researchers explain that this helps increase the likelihood that the AI will accurately predict future text. As part of this, the model also discards incorrect predictions. 
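A minimal sketch of how a REINFORCE-style signal could be computed for sampled thoughts follows. It rests on stated assumptions: each thought is rewarded by how much it improves the log-likelihood of the true future text relative to the average over the sampled thoughts, and negative rewards are optionally clipped to zero. The function names and the numbers are illustrative, not taken from the researchers' code.

```python
def reinforce_weights(future_logps, clip_negative=True):
    """Reward each sampled thought by its improvement over the mean.

    future_logps[j] is the log-likelihood of the true future text given
    thought j. The mean over thoughts acts as a baseline (an assumption
    of this sketch) to reduce variance.
    """
    baseline = sum(future_logps) / len(future_logps)
    rewards = [lp - baseline for lp in future_logps]
    if clip_negative:
        # Assumed here: thoughts that hurt the prediction get no weight.
        rewards = [max(r, 0.0) for r in rewards]
    return rewards


def reinforce_loss(thought_logps, rewards):
    """Surrogate loss -sum_j reward_j * log p(thought_j).

    Rewards are treated as constants; differentiating this with respect
    to the model parameters gives the REINFORCE gradient estimate.
    """
    return -sum(r * lp for r, lp in zip(rewards, thought_logps))


# Three sampled thoughts at one position (toy numbers).
future_logps = [-2.0, -1.0, -3.0]   # how well each thought predicts the future
thought_logps = [-5.0, -4.0, -6.0]  # log-probability of sampling each thought

rewards = reinforce_weights(future_logps)
loss = reinforce_loss(thought_logps, rewards)
print(rewards, loss)
```

In this toy case only the second thought beats the average, so it alone contributes to the loss, and minimizing that loss makes the helpful thought more likely to be sampled again.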

“By iteratively optimizing these parameters, Quiet-STaR trains the model to generate more useful rationales throughout training,” the researchers write. 

Because their goal was generalist reasoning, they used a zero-shot prompt (“Let’s think step by step”) without in-context examples. Quiet-STaR was applied to Mistral 7B using the web text datasets OpenWebMath and Colossal Clean Crawled Corpus. 

“Quiet-STaR… allows a model to think quietly at every token, with a distribution trained to be useful,” researchers write. 

They add that, “by training on the rich spectrum of reasoning tasks implicit in diverse web text, rather than narrowly specializing for particular datasets, Quiet-STaR points the way to more robust and adaptable language models.”

Closing the gap between model and human reasoning capabilities

Notably, the researchers created a parallel sampling algorithm that generates rationales from all tokens in a string. This allowed the tokens to “pay attention to themselves,” to all preceding tokens within the same thought and to the preceding text. This allows for “continuations of all of the thoughts in parallel,” and each inference call generates one additional thought token for every text token. 
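The attention pattern described above can be illustrated with a small mask builder. This is a simplified sketch under assumptions: tokens are labeled with hypothetical tuples, every text position gets a thought of the same fixed length, and the mask is built densely for readability rather than efficiency.

```python
def can_attend(query, key):
    """Return True if `query` may attend to `key`.

    Tokens are ("text", i) for the i-th text token, or ("thought", i, k)
    for the k-th token of the thought started after text position i.
    (This labeling scheme is an assumption made for the sketch.)
    """
    if query[0] == "text":
        # Ordinary causal attention over the text.
        return key[0] == "text" and key[1] <= query[1]
    _, i, k = query
    if key[0] == "text":
        # A thought sees the text up to the position where it started.
        return key[1] <= i
    # A thought token sees itself and earlier tokens of the same thought only.
    return key[1] == i and key[2] <= k


def build_mask(n_text, thought_len):
    """Build a dense boolean attention mask over text and thought tokens."""
    tokens = [("text", i) for i in range(n_text)]
    tokens += [("thought", i, k) for i in range(n_text) for k in range(thought_len)]
    mask = [[can_attend(q, key) for key in tokens] for q in tokens]
    return tokens, mask


tokens, mask = build_mask(n_text=3, thought_len=2)
row = mask[tokens.index(("thought", 1, 1))]
visible = [tok for tok, allowed in zip(tokens, row) if allowed]
print(visible)
```

For the second token of the thought at text position 1, the visible set is the two text tokens up to that position plus the two tokens of its own thought. Thoughts started at other positions stay hidden, which is what lets all thoughts be extended in a single batched call.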

The researchers introduced custom meta-tokens at the beginning and the end of each thought. The <|startofthought|> and <|endofthought|> tokens were initialized with the embedding of the em dash, which is often used to denote a pause. 
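As a rough illustration of this initialization, the sketch below appends the two meta-tokens to a toy vocabulary and copies the em dash's embedding row for each. The three-word vocabulary, the two-dimensional embeddings and the helper name `add_thought_tokens` are assumptions made for the example; a real model would resize its embedding matrix instead.

```python
EM_DASH = "\u2014"  # the em dash character


def add_thought_tokens(vocab, embeddings, init_token=EM_DASH):
    """Append the two meta-tokens, each starting from `init_token`'s embedding."""
    source = embeddings[vocab[init_token]]
    for name in ("<|startofthought|>", "<|endofthought|>"):
        vocab[name] = len(embeddings)
        # Copy the row so each new embedding can diverge during training.
        embeddings.append(list(source))
    return vocab, embeddings


# Toy vocabulary and embedding table (hypothetical values).
vocab = {"the": 0, EM_DASH: 1, "cat": 2}
embeddings = [[0.1, 0.2], [0.5, -0.3], [0.9, 0.4]]

add_thought_tokens(vocab, embeddings)
print(vocab["<|startofthought|>"], embeddings[vocab["<|startofthought|>"]])
```

Starting from an existing embedding gives the new tokens a sensible initial meaning, here "pause," rather than random noise.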

“Intuitively, the start thought tokens can be understood as putting the model into a ‘thinking mode,’” the researchers explain, “and the end thought token can be understood as telling the model when it’s done thinking.”

The next step incorporated what’s known as a “mixing head,” a “shallow” multilayer perceptron. It retrospectively determines how much of the next-token prediction made after a given thought should be incorporated into the current next-token prediction.
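One way to picture the mixing head is as a small network that outputs a weight between 0 and 1, which then interpolates between the prediction made without a thought and the prediction made after it. The sketch below assumes a single hidden layer, a sigmoid output and interpolation over logits; the weights and hidden states are hand-picked toy values, not learned ones.

```python
import math


def mixing_weight(h_base, h_thought, w1, b1, w2, b2):
    """Shallow MLP: concatenated hidden states -> ReLU layer -> sigmoid weight."""
    x = h_base + h_thought  # list concatenation of the two hidden states
    hidden = [
        max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bias)
        for row, bias in zip(w1, b1)
    ]
    z = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-z))


def mix_logits(base_logits, thought_logits, w):
    """Interpolate: w = 0 ignores the thought, w = 1 relies on it fully."""
    return [w * t + (1.0 - w) * b for b, t in zip(base_logits, thought_logits)]


# Toy hidden states and hand-picked parameters (assumptions for illustration).
h_base, h_thought = [1.0, 0.0], [0.0, 1.0]
w1 = [[0.5, 0.0, 0.0, 0.5], [0.0, -1.0, 1.0, 0.0]]
b1 = [0.0, 0.0]
w2, b2 = [1.0, 1.0], -1.0

w = mixing_weight(h_base, h_thought, w1, b1, w2, b2)
mixed = mix_logits([2.0, 0.0], [0.0, 4.0], w)
print(w, mixed)
```

A learned weight near zero lets the model fall back on its original prediction when a thought is unhelpful, which limits the damage early in training when most thoughts are still poor.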

Finally, the researchers optimized the rationale-generation parameters to increase the likelihood of rationales that make future text more probable. Reinforcement techniques provide a “learning signal” to rationales based on their impact on future predictions. To help reduce variance, the researchers also introduced a “teacher forcing” trick, which ensures that neural networks stay as close as possible to ground truth sequences. 

Ultimately, “Quiet-STaR represents a step towards language models that can learn to reason in a general and scalable way,” the researchers conclude. “Future work can build on these insights to further close the gap between language model and human-like reasoning capabilities.”