Graham Neubig
Reasoning model: a model trained (usually using RL) to leverage long chains of thought to perform better on tasks
Early work on training LLMs to reason
RL-style policy gradient objective:
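A sketch of this objective in the spirit of the STaR formulation (notation reconstructed here, not copied from the slide): for each input $x_i$, sample a rationale $\hat{r}_i$ and answer $\hat{y}_i$, and use an indicator on answer correctness as the reward, so only correct chains contribute gradient.

$$J(M, X, Y) = \sum_i \mathbb{E}_{\hat{r}_i,\, \hat{y}_i \sim p_M(\cdot \mid x_i)}\!\left[\, \mathbb{1}\{\hat{y}_i = y_i\} \,\right]$$

$$\nabla J(M, X, Y) = \sum_i \mathbb{E}_{\hat{r}_i,\, \hat{y}_i \sim p_M(\cdot \mid x_i)}\!\left[\, \mathbb{1}\{\hat{y}_i = y_i\}\; \nabla \log p_M(\hat{y}_i, \hat{r}_i \mid x_i) \,\right]$$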
Key insight: Filter discards gradients for incorrect reasoning chains
Setup: GPT-J (6B) on arithmetic, CommonsenseQA, GSM8K
Input:
6 2 4 + 2 5 9
Target:
<scratch>
6 2 4 + 2 5 9 , C: 0
6 2 + 2 5 , 3 C: 1
6 + 2 , 8 3 C: 0
, 8 8 3 C: 0
0 8 8 3
</scratch>
8 8 3
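A small helper that generates targets in this scratchpad format for arbitrary addition problems (an illustrative sketch: the function name and exact spacing are ours, not the original data pipeline).

```python
def addition_scratchpad(a: int, b: int) -> str:
    """Render a + b as a digit-by-digit scratchpad target in the style shown above."""
    xs, ys = [int(d) for d in str(a)], [int(d) for d in str(b)]
    spaced = lambda ds: " ".join(str(d) for d in ds)
    lines = ["<scratch>", f"{spaced(xs)} + {spaced(ys)} , C: 0"]
    result, carry = [], 0
    while xs or ys:
        s = (xs.pop() if xs else 0) + (ys.pop() if ys else 0) + carry
        carry, digit = divmod(s, 10)          # add the last digits plus the running carry
        result.insert(0, digit)
        head = f"{spaced(xs)} + {spaced(ys)} " if (xs or ys) else ""
        lines.append(f"{head}, {spaced(result)} C: {carry}")
    lines.append(f"{carry} {spaced(result)}")  # final carry prepended, as in the example
    lines.append("</scratch>")
    final = result if carry == 0 else [carry] + result
    lines.append(spaced(final))
    return "\n".join(lines)

print(addition_scratchpad(624, 259))  # reproduces the example above
```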
Training: Iterative filtering and fine-tuning
Without Rationalization
With Rationalization
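A sketch of one outer iteration of this loop, with the rationalization variant that re-prompts the model with the gold answer as a hint when its own attempt fails. Helper names (`generate_rationale`, `is_correct`, `finetune`) are hypothetical stand-ins, not functions from the original codebase.

```python
def star_iteration(base_model, current_model, dataset, use_rationalization=True):
    """One outer iteration of STaR-style iterative filtering and fine-tuning (sketch)."""
    finetune_set = []
    for question, answer in dataset:
        rationale, prediction = generate_rationale(current_model, question)
        if is_correct(prediction, answer):
            # Filtering: keep only chains whose final answer is correct.
            finetune_set.append((question, rationale, answer))
        elif use_rationalization:
            # Rationalization: retry with the gold answer given as a hint, and keep
            # the resulting rationale (the hint itself is dropped from the input).
            rationale, prediction = generate_rationale(current_model, question, hint=answer)
            if is_correct(prediction, answer):
                finetune_set.append((question, rationale, answer))
    # Each iteration fine-tunes from the original pre-trained model, not the latest checkpoint.
    return finetune(base_model, finetune_set)
```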
Generate outputs in a group:
Use group statistics to compute advantage:
Calculate loss:
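A sketch of these three steps in GRPO-style training (notation reconstructed from the standard formulation; details such as where the KL penalty enters vary across implementations). For a prompt $q$, sample a group of $G$ outputs $o_1, \dots, o_G$ and score each with a reward $r_i$; the group mean and standard deviation give a baseline-free advantage, which feeds a clipped surrogate loss:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}$$

$$\mathcal{L}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Big[ \min\!\big(\rho_{i,t}\,\hat{A}_i,\; \operatorname{clip}(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_i\big) - \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) \Big], \qquad \rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}$$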
Training template:
| Template |
| --- |
| A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively. User: prompt. Assistant: |
Design principles: Minimal structural constraints, no content bias
AIME 2024 performance: 15.6% → 71.0% (pass@1), 86.7% (majority voting)
Thinking time naturally increases from hundreds to thousands of tokens
Core question: Why do some models improve with RL while others plateau?
Qwen naturally exhibits these reasoning behaviors (e.g., verification and backtracking); Llama initially lacks them
Priming enables improvement: Llama matches Qwen when primed with reasoning behaviors
Cosine reward to stabilize length
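One plausible instantiation of a cosine-schedule length reward (a sketch: endpoint values and the exact formulation are illustrative, not the paper's hyperparameters). Correct answers earn somewhat more when shorter, and wrong answers are penalized somewhat less when longer, nudging the model to keep thinking when it has not yet solved the problem without padding chains it has already solved.

```python
import math

def cos_interp(t: float, T: float, v0: float, vT: float) -> float:
    """Cosine interpolation from v0 (at t = 0) to vT (at t = T)."""
    return vT + 0.5 * (v0 - vT) * (1.0 + math.cos(math.pi * t / T))

def cosine_length_reward(correct: bool, length: int, max_len: int) -> float:
    """Length-aware reward with a cosine schedule (illustrative sketch)."""
    t = min(length, max_len)
    if correct:
        return cos_interp(t, max_len, v0=1.0, vT=0.5)    # shorter correct chains score higher
    return cos_interp(t, max_len, v0=-1.0, vT=-0.5)      # longer wrong chains are penalized less
```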
Results
Key insight: Emergence requires careful alignment of multiple factors
Do improved math reasoning abilities transfer to general LLM capabilities?
Key question: Do gains in solving math problems transfer to other reasoning tasks and to non-reasoning tasks?
Motivation: Math has become the poster child of LLM progress, but real-world tasks extend far beyond math
Controlled experiment on Qwen3-14B:
Evaluation: 20+ open-weight reasoning models across diverse tasks
RL models consistently outperform SFT models in transferability across all model sizes and architectures
Analysis: SFT induces significant drift in token distributions, while RL preserves general-domain structure
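One simple way to quantify this kind of drift is the KL divergence between the base and fine-tuned models' next-token distributions over the same text (an illustrative metric sketch, assuming HuggingFace-style causal LMs that return `.logits`; not necessarily the exact analysis used in the paper).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def token_distribution_drift(base_model, tuned_model, input_ids) -> float:
    """Mean per-token KL(base || tuned), a rough proxy for how far fine-tuning has
    moved the next-token distribution away from the base model's."""
    p = F.log_softmax(base_model(input_ids).logits, dim=-1)   # base log-probs
    q = F.log_softmax(tuned_model(input_ids).logits, dim=-1)  # fine-tuned log-probs
    kl = (p.exp() * (p - q)).sum(dim=-1)                      # KL at each token position
    return kl.mean().item()
```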
Performance highlights:
Core idea: Control reasoning length with simple prompts like “Think for N tokens”
Key insight: Explicit length control improves reasoning quality and efficiency
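A minimal sketch of a length-controlled reward of this flavor: correctness minus a penalty proportional to how far the generated length is from the prompted token budget. The coefficient `alpha` and the exact functional form are illustrative assumptions, not the paper's settings.

```python
def length_controlled_reward(correct: bool, used_tokens: int,
                             target_tokens: int, alpha: float = 3e-4) -> float:
    """Reward for "think for N tokens"-style control (illustrative sketch)."""
    return float(correct) - alpha * abs(target_tokens - used_tokens)
```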
Stream of Search: Represent search process as flattened language strings
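A sketch of how a search trace, including dead ends and backtracking, can be serialized into one flat string that a language model can be trained on. Here `is_goal` and `successors` are hypothetical callables standing in for the search environment; the paper's own environment and string format differ in detail.

```python
def stream_of_search(state, is_goal, successors, log=None, max_depth=10):
    """Run a depth-first search and return (found, flattened_trace_string)."""
    log = [] if log is None else log
    log.append(f"Current state: {state}")
    if is_goal(state):
        log.append("Goal reached.")
        return True, "\n".join(log)
    if max_depth == 0:
        log.append("Depth limit hit, backtracking.")
        return False, "\n".join(log)
    for nxt in successors(state):
        log.append(f"Trying: {nxt}")
        found, _ = stream_of_search(nxt, is_goal, successors, log, max_depth - 1)
        if found:
            return True, "\n".join(log)
        log.append(f"Backtracking from: {nxt}")   # failed branches stay in the trace
    return False, "\n".join(log)
```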
spawn(msgs): Create parallel child threads
join(msg): Return results to parent thread
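A hedged sketch of how an interpreter for these two operations might look. `model.decode_until_op` is a hypothetical helper that decodes until the model emits a spawn or join call; children are run sequentially here for clarity, though the point of spawn/join is that child threads can be decoded in parallel.

```python
def run_reasoning_thread(model, context):
    """Interpret spawn/join operations emitted by a reasoning model (sketch)."""
    while True:
        op, args, context = model.decode_until_op(context)
        if op == "join":                          # join(msg): return result to the parent thread
            return args
        assert op == "spawn"                      # spawn(msgs): create one child thread per message
        child_results = [run_reasoning_thread(model, context + [msg]) for msg in args]
        context = context + [child_results]       # parent resumes, conditioning on children's results
```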
Key themes in reasoning models:
Future directions: Better understanding, more efficient methods, broader applications