Quiet-STaR teaches language models to think before they speak

March 22, 2024
  • Researchers from Stanford University were able to train an LM to think before generating outputs
  • Quiet-STaR helps the model generate and evaluate rationales to improve next token prediction
  • The technique delivers improvements in perplexity, as well as in zero-shot math and reasoning benchmarks

Researchers from Stanford University and Notbad AI developed Quiet-STaR, a technique that trains a language model (LM) to reason internally before generating an output.

When humans speak, we normally have an inner dialogue that shapes the words we eventually verbalize. The more we think before speaking, the better the quality of our spoken words.

In their paper, the researchers describe how they trained an LM (Mistral-7B) to imitate this process in a generalized way. Quiet-STaR builds on an earlier technique called STaR, or Self-Taught Reasoner.

STaR is a method of training a model with a few examples of questions with explanations (rationales) for the answers. The model uses these chain-of-thought examples to try answering questions on its own, figuring out the rationales itself.

STaR evaluates whether or not the rationales it comes up with result in correct answers and refines its rationales.

As impressive as STaR is, its ability to reason is limited to the question-answering (QA) contexts it sees during training. The goal of Quiet-STaR is to give an LM a generalized ability to reason, or develop rationales, across a broader range of texts, not just QA datasets.

How does Quiet-STaR work?

One of the key innovations in Quiet-STaR is that it generates rationales, or thoughts, in parallel, after every token in the text it is processing. It doesn’t output this chain-of-thought reasoning, hence the “Quiet” part of the algorithm’s name.

The algorithm then uses a “mixing head” to decide how much weight the prediction made after a rationale should get relative to the prediction made without it. Each rationale is evaluated on whether it made the next-token prediction more accurate than the base model’s prediction.
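To make the mixing step concrete, here is a minimal sketch of blending two sets of next-token scores with a single weight. It assumes the weight is simply handed in as a number between 0 and 1; in Quiet-STaR that weight comes from a learned component, which is not modeled here. The function names and example scores are illustrative, not taken from the paper's code.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_logits(base_logits, thought_logits, weight):
    """Blend base and post-thought scores.

    weight = 0 keeps the base prediction unchanged,
    weight = 1 uses only the post-thought prediction.
    (Hypothetical helper for illustration.)
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return [(1.0 - weight) * b + weight * t
            for b, t in zip(base_logits, thought_logits)]


# Made-up scores over a three-token vocabulary.
base = [2.0, 1.0, 0.0]      # prediction without a rationale
thought = [0.0, 3.0, 0.0]   # prediction after a rationale

mixed = mix_logits(base, thought, 0.5)
mixed_probs = softmax(mixed)
```

With a weight of 0.5, the blended scores favor the second token, whereas the base scores alone favor the first.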

If the base model (without Quiet-STaR) delivers a better prediction, then the rationale wasn’t a good one. If the rationale results in a more accurate next-token prediction, then the algorithm knows it’s on to a good thing.
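The comparison described above can be sketched as a simple score. This is a minimal illustration, assuming the score is the gain in log-probability that the model assigns to the true next token once the rationale is included. The function name `rationale_reward` and the example probabilities are hypothetical, and the paper's exact formulation may differ.

```python
import math


def rationale_reward(p_true_with_thought, p_true_base):
    """Score a rationale by how much it raised the log-probability
    of the token that actually came next.

    Positive: the rationale helped. Negative: it hurt.
    (Illustrative helper, not the paper's exact reward.)
    """
    if p_true_with_thought <= 0.0 or p_true_base <= 0.0:
        raise ValueError("probabilities must be positive")
    return math.log(p_true_with_thought) - math.log(p_true_base)


# Made-up probabilities assigned to the correct next token.
helpful = rationale_reward(0.6, 0.3)   # rationale doubled the probability
harmful = rationale_reward(0.1, 0.3)   # rationale made the prediction worse
```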

It then uses a reinforcement learning algorithm (REINFORCE) to learn which rationales help and which ones hinder the model’s performance. The result is that the model learns a generalized ability to think before predicting the next token.
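The idea behind REINFORCE can be shown on a toy problem. The sketch below is not the paper's training loop: it assumes just two candidate rationales with fixed, made-up rewards (+1 for the helpful one, -1 for the unhelpful one) and a simple softmax policy over two scores. All names and hyperparameters are illustrative.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr):
    """One REINFORCE update: move the policy scores along
    reward * gradient of log pi(action)."""
    probs = softmax(logits)
    # d/d logit_k of log pi(action) = 1[k == action] - probs[k]
    return [
        logit + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]


def train(rewards, steps=500, lr=0.1, seed=0):
    """Sample a rationale, observe its reward, update, repeat."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        logits = reinforce_step(logits, action, rewards[action], lr)
    return softmax(logits)


# Rationale 0 helps (+1), rationale 1 hurts (-1): assumed toy rewards.
final_probs = train([1.0, -1.0])
```

After training, the policy puts most of its probability on the rationale that earned the positive reward, which is the behavior the article describes: helpful thoughts become more likely, unhelpful ones less so.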

Quiet-STaR results

The researchers tested the Quiet-STaR-trained Mistral-7B model on the GSM8K math and CommonsenseQA common-sense reasoning benchmarks. They found that Quiet-STaR improved perplexity as well as zero-shot direct reasoning accuracy on both CommonsenseQA (36.3% to 47.2%) and GSM8K (5.9% to 10.9%).

Quiet-STaR results on the GSM8K grade school math and CommonsenseQA common-sense reasoning benchmarks. Each line represents an iteration of Quiet-STaR with varying thought token lengths, and how many tokens ahead it reasoned. The baseline is Mistral-7B without Quiet-STaR. Source: arXiv

While Mistral-7B’s math reasoning still isn’t great, Quiet-STaR delivered a relative improvement of almost 85% over the base model on GSM8K, and this was without any dataset-specific fine-tuning.
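The "almost 85%" figure is a relative gain, not a jump in percentage points. The short calculation below reproduces it from the reported GSM8K scores and applies the same formula to the CommonsenseQA scores. The helper name is illustrative.

```python
def relative_improvement(before, after):
    """Percentage change of `after` relative to `before`."""
    return (after - before) / before * 100.0


# Scores reported in the article (percent accuracy).
gsm8k_gain = relative_improvement(5.9, 10.9)    # roughly 84.7
csqa_gain = relative_improvement(36.3, 47.2)    # roughly 30.0
```

In absolute terms the GSM8K score rose by 5 percentage points and the CommonsenseQA score by 10.9 points.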

Test results also showed that improvements in performance were directly related to how many tokens were allocated to the model’s internal thoughts. The more it thought before answering, the better the answer.

These improvements come at the cost of a substantial computing overhead. The inner monologue the model engages in during the thought process generates a lot of tokens.
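A back-of-the-envelope sketch shows why the token count grows so quickly. It assumes one rationale of fixed length is generated after every input token, and it ignores any special marker tokens or look-ahead tokens the real method may use. The function name and the numbers are illustrative only and are not taken from the paper.

```python
def extra_thought_tokens(sequence_length, thought_length):
    """Tokens generated purely as internal thoughts, assuming one
    fixed-length rationale after every token of the input."""
    if sequence_length < 0 or thought_length < 0:
        raise ValueError("lengths must be non-negative")
    return sequence_length * thought_length


# Hypothetical example: a 256-token passage with 12-token thoughts.
overhead = extra_thought_tokens(256, 12)   # 3072 thought tokens
```

Under these assumptions, the model generates twelve thought tokens for every token of text it actually processes.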

Improvements in hardware will eventually make the additional overhead that comes with techniques like these less consequential.

The researchers conclude that future work on optimizing Quiet-STaR could help too. Dynamically predicting if a thought process is required, or how long it should be, could cut down on unnecessary thought tokens.

The results from training a small model like Mistral-7B with Quiet-STaR are promising. The researchers believe that “the same techniques applied to a better model would likely yield disproportionately better results.”

Ethical questions

Making a language model reason more like a human comes with some interesting issues and ethical questions.

The researchers note that “it is impossible to know that the reasoning expressed by the model in language accurately represents the internal processing of the model.” The rationales the model generates are natural language representations of its inner reasoning. Are they an accurate reflection?

They further note that “there are no safeguards against harmful or biased reasoning patterns if the model finds them useful.”

We may be happy with an AI model’s answer, but we might not like, or even understand, the thinking process that delivered it.

One of the paper’s lead authors, Eric Zelikman, joined Elon Musk’s xAI this week. He may find that Grok is less concerned with these ethical questions and more excited by the prospect of AI advancement.



Eugene van der Watt

Eugene comes from an electronic engineering background and loves all things tech. When he takes a break from consuming AI news you'll find him at the snooker table.

