Best-First Search for Language Models
Graham Neubig
Overview: Greedy Search and Beam Search
Greedy Search:
- Select highest probability token at each step
- Fast but myopic
- No backtracking possible
Beam Search:
- Maintain top-k hypotheses
- Explores multiple paths simultaneously
- Better quality than greedy
- Still limited exploration
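The contrast between the two methods can be sketched on a tiny example. The bigram table `TOY_LM`, its probabilities, and both function names are hypothetical, chosen so that the greedy first step leads to a worse overall sequence.

```python
import math

# Hypothetical toy "language model": P(next token | last token).
TOY_LM = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"</s>": 0.4, "a": 0.3, "b": 0.3},
    "b": {"</s>": 0.9, "a": 0.05, "b": 0.05},
}

def greedy_search(max_len=10):
    """Pick the single most probable token at every step."""
    seq, score = ["<s>"], 0.0
    while seq[-1] != "</s>" and len(seq) <= max_len:
        tok, p = max(TOY_LM[seq[-1]].items(), key=lambda kv: kv[1])
        seq.append(tok)
        score += math.log(p)
    return seq, score

def beam_search(k=2, max_len=10):
    """Keep the k best partial hypotheses at every step."""
    beam = [(0.0, ["<s>"])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beam:
            if seq[-1] == "</s>":
                candidates.append((score, seq))  # finished hypotheses carry over
                continue
            for tok, p in TOY_LM[seq[-1]].items():
                candidates.append((score + math.log(p), seq + [tok]))
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(seq[-1] == "</s>" for _, seq in beam):
            break
    best_score, best_seq = beam[0]
    return best_seq, best_score

greedy_seq, greedy_score = greedy_search()
beam_seq, beam_score = beam_search(k=2)
```

In this toy setting greedy search commits to "a" (probability 0.6) and ends with total probability 0.24, while a beam of size 2 keeps "b" alive and finds the sequence with probability 0.36.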
Search Spaces as Graphs
We can represent a search space as a weighted finite-state automaton \((S, \Sigma, \delta, s_0, F, w)\) where:
- \(S\) is a finite set of states
- \(\Sigma\) is a finite alphabet
- \(\delta \subseteq S \times \Sigma \times S\) is the transition relation
- \(s_0 \in S\) is the initial state
- \(F \subseteq S\) is the set of final states
- \(w: \delta \rightarrow \mathbb{R}\) assigns weights to transitions
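One possible way to hold this tuple in code is sketched below. The class name `WFSA`, its field names, and the example states and weights are illustrative assumptions, not part of the formal definition; the transition relation and the weight function are stored together in one dictionary.

```python
from dataclasses import dataclass, field

@dataclass
class WFSA:
    states: set      # S
    alphabet: set    # Sigma
    initial: str     # s_0
    finals: set      # F
    # delta and w combined: (source, symbol, target) -> weight
    weights: dict = field(default_factory=dict)

    def add_arc(self, src, symbol, dst, weight):
        if src not in self.states or dst not in self.states:
            raise ValueError("unknown state")
        if symbol not in self.alphabet:
            raise ValueError("unknown symbol")
        self.weights[(src, symbol, dst)] = weight

    def arcs_from(self, state):
        """All outgoing transitions of a state as (symbol, target, weight)."""
        return [(sym, dst, w)
                for (src, sym, dst), w in self.weights.items()
                if src == state]

# Hypothetical three-state example.
fsa = WFSA(states={"q0", "q1", "q2"}, alphabet={"a", "b"},
           initial="q0", finals={"q2"})
fsa.add_arc("q0", "a", "q1", 0.7)
fsa.add_arc("q0", "b", "q2", 1.2)
fsa.add_arc("q1", "b", "q2", 0.4)
```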
How to Express Edge Weights?
1. Probabilities
- Range: [0, 1]
- Path scoring: Multiply weights
- Best path: arg max (highest product)
- Example: P(path) = 0.5 × 0.3 × 0.2 = 0.03
2. Log Probabilities
- Range: (-∞, 0]
- Path scoring: Add weights
- Best path: arg max (highest sum)
- Example: log P(path) = log(0.5) + log(0.3) + log(0.2) ≈ -3.51 (natural log)
3. Negative Log Probabilities
- Range: [0, ∞)
- Path scoring: Add weights
- Best path: argmin (lowest sum)
- Example: -log P(path) = -log(0.5) - log(0.3) - log(0.2) ≈ 3.51 (natural log)
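The three conventions rank paths identically, which a short check can illustrate. The second path "B" and the helper names are made up for the comparison; the last lines show one practical reason to prefer log space, since a long product of probabilities underflows in floating point.

```python
import math

# Two hypothetical paths, each given as a list of edge probabilities.
paths = {"A": [0.5, 0.3, 0.2], "B": [0.4, 0.4, 0.1]}

def prob(ps):
    return math.prod(ps)                  # multiply weights

def logprob(ps):
    return sum(math.log(p) for p in ps)   # add weights

def cost(ps):
    return sum(-math.log(p) for p in ps)  # add weights

best_by_prob = max(paths, key=lambda name: prob(paths[name]))
best_by_logprob = max(paths, key=lambda name: logprob(paths[name]))
best_by_cost = min(paths, key=lambda name: cost(paths[name]))

# Long sequences: the raw product underflows, the log-space sum does not.
long_path = [0.1] * 400
underflowed = prob(long_path)       # 0.0 in double precision
still_usable = logprob(long_path)   # about -921.03
```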
Example Graph w/ Negative Log Probabilities
Depth-First Search: Optimal but Inefficient
Depth-First Search (DFS):
- Systematically explores all paths
- Guaranteed to find optimal solution
- Explores paths to completion before backtracking
Optimality: DFS will eventually find the path with minimum cost
Why optimal?
- Exhaustive search of the entire space
- No pruning of potentially optimal paths
- Complete exploration guarantees finding global optimum
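A minimal sketch of exhaustive depth-first search over negative log probability weights follows. The graph `GRAPH` and its weights are hypothetical, and the sketch assumes an acyclic graph so that the recursion terminates.

```python
# Hypothetical acyclic graph: node -> list of (next node, edge cost).
GRAPH = {
    "s": [("a", 0.5), ("b", 1.0)],
    "a": [("goal", 2.0)],
    "b": [("goal", 0.7)],
    "goal": [],
}

def dfs_best_path(node, goal, cost=0.0, path=None):
    """Follow every path to completion and return the cheapest (cost, path)."""
    path = (path or []) + [node]
    if node == goal:
        return cost, path
    best = (float("inf"), None)
    for nxt, w in GRAPH[node]:
        candidate = dfs_best_path(nxt, goal, cost + w, path)
        if candidate[0] < best[0]:
            best = candidate
    return best

best_cost, best_path = dfs_best_path("s", "goal")
```

The cheapest first edge (s to a, cost 0.5) lies on the more expensive path overall; the exhaustive search still returns s, b, goal because nothing is pruned.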
A* Search and Admissible Heuristics
A* Search Algorithm:
- f(n) = g(n) + h(n)
- g(n): actual cost from start to n
- h(n): heuristic cost from n to goal
- Optimal when heuristic is admissible
Admissible Heuristic h(n):
- Never overestimates the true cost to goal
- h(n) ≤ h*(n) where h*(n) is true cost to goal
- Provides lower bound on remaining cost
Key Property: A* finds optimal solution efficiently by using heuristic to guide search toward goal
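The f(n) = g(n) + h(n) ordering can be sketched with a priority queue. The graph and the heuristic table `H` below are hypothetical; `H` was chosen by hand so that it never exceeds the true remaining cost, which makes it admissible for this example.

```python
import heapq

# Hypothetical graph: node -> list of (next node, edge cost).
GRAPH = {
    "s": [("a", 1.0), ("b", 4.0)],
    "a": [("b", 2.0), ("goal", 5.0)],
    "b": [("goal", 1.0)],
    "goal": [],
}
# Hand-picked admissible heuristic (true remaining costs are 4, 3, 1, 0).
H = {"s": 3.0, "a": 2.0, "b": 1.0, "goal": 0.0}

def a_star(start, goal):
    frontier = [(H[start], 0.0, start, [start])]  # entries are (f, g, node, path)
    best_g = {}                                   # cheapest g at which a node was expanded
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if node in best_g and best_g[node] <= g:
            continue                              # already expanded at least as cheaply
        best_g[node] = g
        for nxt, w in GRAPH[node]:
            heapq.heappush(frontier, (g + w + H[nxt], g + w, nxt, path + [nxt]))
    return None

result = a_star("s", "goal")
```

Here the goal is reached with cost 4.0 via s, a, b, goal, and the direct edge from s to b is never expanded because its f value is higher.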
Challenges for Transformer Language Models
Two Major Problems
1. Ever-Expanding Search Graph
- No hypothesis recombination possible
- Each partial sequence has unique hidden state
- Identical recent tokens (e.g. the same last n words) ≠ identical states
- Exponential growth in hypotheses
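2. No Useful Admissible Heuristic
- The only guaranteed bound is trivial: remaining log probability ≤ 0, i.e. h(n) = 0
- Future tokens can have arbitrarily low probability, so no tighter bound on completion cost is known
- No informative h(n) that still guarantees h(n) ≤ h*(n)
- A* either degrades to uninformed best-first search or loses its optimality guarantee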
Result: Classical optimal search algorithms don’t directly apply to transformer-based language generation
Hypothesis Recombination
Key Insight: Group similar hidden states for approximate recombination
Algorithm: Cluster hypotheses with similar hidden representations, keep best from each cluster
Benefits:
- Reduces exponential growth
- Maintains search quality
- Enables practical application of search algorithms
Hypothesis Recombination: Example
Hypothesis Recombination Criteria
1. n-gram Based Clustering
Criterion: Shared recent word contexts
\[\text{Cluster if: } \mathbf{y}_{t-n+1:t} = \mathbf{y}'_{t-n+1:t}\]
Properties:
- Surface form similarity
- Easy to implement and cache
- Works with beam search decoders
- Truncation length controls precision
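A minimal sketch of this criterion: hypotheses that share their last n tokens fall into one cluster, and only the best-scoring member survives. The function name and the example hypotheses are illustrative assumptions.

```python
def recombine_by_ngram(hypotheses, n):
    """hypotheses: list of (log_prob, tokens). Keep the best hypothesis
    for each distinct sequence of the last n tokens."""
    best = {}
    for score, tokens in hypotheses:
        key = tuple(tokens[-n:])
        if key not in best or score > best[key][0]:
            best[key] = (score, tokens)
    return sorted(best.values(), reverse=True)

# Hypothetical partial hypotheses with their log probabilities.
hyps = [
    (-1.2, ["the", "cat", "sat", "on"]),
    (-1.5, ["a", "cat", "sat", "on"]),
    (-2.0, ["the", "dog", "lay", "on"]),
]

merged_bigram = recombine_by_ngram(hyps, n=2)   # two clusters remain
merged_unigram = recombine_by_ngram(hyps, n=1)  # everything merges into one
```

A smaller n merges more aggressively (fewer hypotheses, coarser approximation), while a larger n keeps more distinct histories.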
2. Hidden Vector Distance
Criterion: Similarity in neural representations
\[\text{Cluster if: } D(\mathbf{h}_t, \mathbf{h}'_t) \leq \gamma\]
Distance Measures:
- Euclidean: \(D = \sqrt{\sum_k (h_{t,k} - h'_{t,k})^2}\)
- Cosine distance (1 - cosine similarity)
- KL divergence between the predicted next-token distributions
Properties:
- Captures semantic similarity
- Tunable precision via threshold \(\gamma\)
- More computationally expensive
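One simple way to apply the threshold test is a greedy pass over hypotheses in order of score, sketched below with Euclidean distance. The greedy assignment strategy, the function name, and the two-dimensional "hidden vectors" are assumptions made for illustration; real hidden states have hundreds or thousands of dimensions.

```python
import math

def recombine_by_hidden(hypotheses, gamma):
    """hypotheses: list of (log_prob, tokens, hidden_vector).
    Visit hypotheses from best to worst and keep one only if it is farther
    than gamma from every hypothesis already kept."""
    kept = []
    for hyp in sorted(hypotheses, key=lambda h: h[0], reverse=True):
        if all(math.dist(hyp[2], rep[2]) > gamma for rep in kept):
            kept.append(hyp)
    return kept

# Hypothetical hypotheses with toy 2-dimensional hidden vectors.
hyps = [
    (-1.0, ["the", "cat"], [0.9, 0.1]),
    (-1.3, ["a", "cat"], [0.85, 0.15]),
    (-1.8, ["the", "dog"], [0.1, 0.9]),
]

loose = recombine_by_hidden(hyps, gamma=0.1)    # first two merge
strict = recombine_by_hidden(hyps, gamma=0.01)  # nothing merges
```

Each kept hypothesis is compared against all earlier representatives, which is where the extra computational cost relative to n-gram matching comes from.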
Reference: Liu et al. (2014) “Efficient Lattice Rescoring Using Recurrent Neural Network Language Models”
Future Cost: Inadmissible but Useful Heuristic
Paper: Modeling Future Cost for Neural Machine Translation (Duan et al., 2020)
Key Idea: Learn to predict the cost of completing a partial sequence
Future Cost Prediction:
\[h(\mathbf{y}_{\leq t}, \mathbf{x}) \approx -\log p_\theta(\text{completion}|\mathbf{y}_{\leq t}, \mathbf{x})\]
Search Integration:
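\[\text{score}(\mathbf{y}_{\leq t}) = \log p_\theta(\mathbf{y}_{\leq t}|\mathbf{x}) - \lambda \cdot h(\mathbf{y}_{\leq t}, \mathbf{x})\]
(\(h\) estimates a cost, i.e. a negative log probability, so it is subtracted from the prefix log probability.)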
Training: Auxiliary loss to predict future completion difficulty
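How a future cost estimate changes the ranking of partial hypotheses can be sketched as follows. This treats h as a cost (an estimate of the negative log probability of the completion), so it is subtracted from the prefix log probability. The lookup table `TOY_FUTURE_COST` is a hypothetical stand-in for a learned predictor, and the function name is an assumption.

```python
# Hypothetical stand-in for a learned predictor:
# estimated -log p(completion | prefix) for two prefixes.
TOY_FUTURE_COST = {
    ("the", "of"): 2.5,   # awkward prefix, expensive to complete
    ("the", "cat"): 0.4,  # natural prefix, cheap to complete
}

def rerank_with_future_cost(hypotheses, lam=1.0):
    """hypotheses: list of (prefix_log_prob, tokens).
    Returns (combined score, tokens), best first."""
    scored = [(lp - lam * TOY_FUTURE_COST[toks], toks) for lp, toks in hypotheses]
    return sorted(scored, reverse=True)

hyps = [(-1.0, ("the", "of")), (-1.2, ("the", "cat"))]

without_future = rerank_with_future_cost(hyps, lam=0.0)
with_future = rerank_with_future_cost(hyps, lam=1.0)
```

With lam set to 0 the ranking follows the prefix log probability alone; with lam set to 1 the prefix that is estimated to be cheaper to complete moves to the top.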
Future Cost: Training Approaches
Two Ways to Train Future Cost Prediction:
1. Auxiliary Loss During Training
- Joint training: Main translation loss + future cost loss
- Target: Predict remaining cost from current state
- Advantage: Integrated learning, shared representations
- Challenge: Requires modification of training pipeline
2. Post-hoc Training
- Separate model: Train future cost predictor after main model
- Data: Use completed sequences to learn cost patterns
- Advantage: Can be applied to existing models
- Challenge: May not capture model-specific patterns as well
Key Insight: Both approaches learn to estimate \(h(\mathbf{y}_{\leq t}, \mathbf{x}) \approx -\log p_\theta(\text{completion}|\mathbf{y}_{\leq t}, \mathbf{x})\)
Future Cost: Properties and Results
Core Properties:
- Not admissible: Can overestimate true completion cost
- Empirically effective: Guides search toward better completions
- Learnable: Adapts to specific task patterns through training
Experimental Results:
- +1.5-2.0 BLEU improvement over baseline models
- Better long sequence coherence
- Reduced exposure bias effects
Best-First Beam Search: Core Insight
Paper: Best-First Beam Search (Meister et al., 2020)
Problem with Standard Beam Search:
- Must analyze all \(k\) hypotheses of a given length before considering longer ones
- Length-based prioritization is not necessary for finding the \(k\)-optimal hypothesis
- Computational inefficiency from rigid breadth-first expansion
Key Insight: Use score-based prioritization like A* while maintaining beam constraints
Main Contribution: Up to 10x speedup over standard beam search with identical results
Best-First Beam Search: Algorithm
Monotonicity Property: Extending a hypothesis can never increase its score
\[\text{score}(\mathbf{x}, \mathbf{y}_{\leq t}) \geq \text{score}(\mathbf{x}, \mathbf{y}_{\leq t} \circ y_{t+1})\]
Two Key Optimizations:
- Early Pruning: Remove hypotheses guaranteed to fall off beam
- Early Termination: Stop as soon as a complete hypothesis is popped from the queue; by monotonicity, no remaining hypothesis can score higher
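A simplified sketch of the idea, returning only the single best hypothesis: a priority queue orders hypotheses by score, a per-length counter enforces the beam constraint, and the search stops at the first complete hypothesis popped. The toy model `TOY_LM` and the function name are hypothetical, and the code is an illustration under these assumptions rather than the paper's reference implementation.

```python
import heapq
import math
from collections import Counter

# Hypothetical toy "language model": P(next token | last token).
TOY_LM = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"</s>": 0.4, "a": 0.3, "b": 0.3},
    "b": {"</s>": 0.9, "a": 0.05, "b": 0.05},
}

def best_first_beam_search(k=2, max_len=10):
    # heapq is a min-heap, so hypotheses are keyed by negative log probability.
    queue = [(0.0, ["<s>"])]
    pops = Counter()  # number of hypotheses already popped at each length
    while queue:
        neg_logp, seq = heapq.heappop(queue)
        if pops[len(seq)] >= k or len(seq) > max_len:
            continue                  # early pruning: the beam at this length is full
        pops[len(seq)] += 1
        if seq[-1] == "</s>":
            return seq, -neg_logp     # early termination
        for tok, p in TOY_LM[seq[-1]].items():
            # Log probabilities are <= 0, so scores only decrease: monotonicity holds.
            heapq.heappush(queue, (neg_logp - math.log(p), seq + [tok]))
    return None

best_seq, best_logp = best_first_beam_search(k=2)
```

On this toy model the search returns the sequence with probability 0.36, the same answer a standard beam search of size 2 would give, and the lower-scoring extensions of "a" are never popped.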
Search Methods Comparison
Best-First Beam Search: Results
Performance Improvements:
- 2-10x speedup on neural machine translation
- Identical BLEU scores to standard beam search
- Greater speedups for larger beam sizes
When It Helps Most:
- Large beam sizes (\(k \geq 10\))
- Long sequences
- High-quality models with good score separation
Trade-offs:
- Memory overhead: \(O(k \cdot n_{\max})\) vs \(O(k)\)
- Implementation complexity: More sophisticated data structures
A*esque Decoding Formulation
Paper: NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics (Lu et al., 2021)
- States: Partial prefixes \(\mathbf{y}_{\leq t}\)
- Actions: Tokens \(y_{t+1} \in \mathcal{V}\) (vocabulary)
- Transitions: \(\mathbf{y}_{\leq t} \circ y_{t+1}\) (append token to prefix)
Objective: Find optimal sequence maximizing
\[\mathbf{y}_* = \arg \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{y})\]
where \(f(\mathbf{y}) = s(\mathbf{y}) + h(\mathbf{y})\)
- \(s(\mathbf{y}) = \log p_\theta(\mathbf{y})\) (log probability)
- \(h(\mathbf{y})\) = constraint satisfaction score
A*-Inspired Lookahead Framework
Core Insight: Incorporate future estimates into candidate selection
Connection to Future Cost: Both use \(\text{score} = \text{current} + \lambda \cdot \text{heuristic}\), with the heuristic expressed as a log probability (a negated cost)
Future Cost Approach
- Heuristic: Learned predictor \(h(\mathbf{y}, \mathbf{x})\)
- Training: Auxiliary loss to learn completion cost
- Computation: Single forward pass
- Generalization: Requires retraining for new domains
NeuroLogic A*esque Approach
- Heuristic: Generated continuations \(h(\mathbf{y}_{\leq t+\ell})\)
- Training: No additional training needed
- Computation: Multiple forward passes for lookahead
- Generalization: Works with any pre-trained model
A*esque Formulation:
\[Y_t = \underset{\mathbf{y}_{\leq t}}{\operatorname{arg\,top\text{-}k}} \left\{s(\mathbf{y}_{\leq t}) + h(\mathbf{y}_{\leq t+\ell})\right\}\]
where
\(h(\mathbf{y}_{\leq t+\ell}) = \lambda \log p_\theta(\mathbf{y}_{t+1:t+\ell}|\mathbf{y}_{\leq t}, \mathbf{x})\)
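The selection rule can be sketched with a greedy lookahead, one of several possible ways to generate the continuation. The toy model `TOY_LM` and both function names are hypothetical; the lookahead here is a single greedy rollout of up to \(\ell\) tokens.

```python
import math

# Hypothetical toy "language model": P(next token | last token).
TOY_LM = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"</s>": 0.4, "a": 0.3, "b": 0.3},
    "b": {"</s>": 0.9, "a": 0.05, "b": 0.05},
}

def greedy_lookahead(seq, ell):
    """Log probability of a greedy continuation of up to ell tokens."""
    last, logp = seq[-1], 0.0
    for _ in range(ell):
        if last == "</s>":
            break
        last, p = max(TOY_LM[last].items(), key=lambda kv: kv[1])
        logp += math.log(p)
    return logp

def select_top_k(candidates, k, ell, lam=1.0):
    """candidates: list of (s, seq) with s = log p(seq).
    Rank by s + lam * lookahead log probability and keep the top k."""
    scored = [(s + lam * greedy_lookahead(seq, ell), s, seq) for s, seq in candidates]
    return [(s, seq) for _, s, seq in sorted(scored, reverse=True)[:k]]

candidates = [(math.log(0.6), ["<s>", "a"]), (math.log(0.4), ["<s>", "b"])]
no_lookahead = select_top_k(candidates, k=1, ell=0)
one_step = select_top_k(candidates, k=1, ell=1)
```

Without lookahead the prefix ending in "a" wins on current score; with one step of lookahead the prefix ending in "b" is selected because its continuation is far more probable. Each candidate costs extra model calls, which matches the "multiple forward passes" trade-off listed above.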
NeuroLogic A*esque: Constrained Generation
Key Innovation: Future constraint satisfaction estimation
Constraint Format: Include/exclude specific phrases
- Example: Include “cat” and “fish” in generated text
Enhanced Scoring:
\[f(\mathbf{y}_{\leq t+\ell}) = \text{probability} + \text{current constraints} + \text{future constraints}\]
Future Constraint Heuristic: Estimate probability of satisfying constraints in lookahead continuations
Benefits: Early detection of constraint violations, better planning toward satisfiable paths
NeuroLogic A*esque: Algorithm & Results
Algorithm Steps:
- Expand: Generate candidate continuations
- Lookahead: For each candidate, generate length-\(\ell\) continuations
- Score: \(f(\mathbf{y}_{\leq t+\ell}) = s(\mathbf{y}_{\leq t}) + h(\mathbf{y}_{\leq t+\ell})\)
- Select: Top-k candidates based on combined score
- Prune: Remove constraint-violating candidates
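The five steps can be sketched as a single decoding step. Everything specific here is an assumption made for illustration: the toy model `TOY_LM`, the single include word and single exclude word, the greedy rollout used as lookahead, and the flat bonus `alpha` for a satisfied include constraint. The paper's constraint scoring is more elaborate than this.

```python
import math

# Hypothetical toy "language model": P(next token | last token).
TOY_LM = {
    "<s>": {"dogs": 0.5, "cats": 0.4, "rats": 0.1},
    "dogs": {"bark": 0.9, "eat": 0.1},
    "cats": {"eat": 0.8, "purr": 0.2},
    "rats": {"eat": 1.0},
    "eat": {"fish": 0.7, "</s>": 0.3},
    "bark": {"</s>": 1.0},
    "purr": {"</s>": 1.0},
    "fish": {"</s>": 1.0},
}
INCLUDE, EXCLUDE = "fish", "rats"  # hypothetical constraints

def rollout(seq, ell):
    """Greedy lookahead of up to ell tokens: (continuation, log probability)."""
    toks, logp, last = [], 0.0, seq[-1]
    for _ in range(ell):
        if last == "</s>":
            break
        last, p = max(TOY_LM[last].items(), key=lambda kv: kv[1])
        toks.append(last)
        logp += math.log(p)
    return toks, logp

def astar_esque_step(beam, k, ell, lam=1.0, alpha=1.0):
    """beam: list of (log_prob, seq) for unfinished hypotheses."""
    scored = []
    for s, seq in beam:
        for tok, p in TOY_LM[seq[-1]].items():                 # Expand
            cand, cand_s = seq + [tok], s + math.log(p)
            future, future_logp = rollout(cand, ell)           # Lookahead
            bonus = alpha if INCLUDE in cand + future else 0.0
            f = cand_s + lam * future_logp + bonus             # Score
            scored.append((f, cand_s, cand))
    selected = sorted(scored, reverse=True)[:k]                # Select
    return [(s, seq) for _, s, seq in selected if EXCLUDE not in seq]  # Prune

start = [(0.0, ["<s>"])]
with_lookahead = astar_esque_step(start, k=2, ell=2)
myopic = astar_esque_step(start, k=1, ell=0)
```

With a two-token lookahead the candidate "cats" ranks first because its rollout reaches the required word "fish", whereas the myopic step with no lookahead prefers "dogs" on current probability alone.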
Key Results:
- +1.2-2.1 BLEU improvement over beam search
- 95%+ constraint satisfaction vs. 60-70% for baselines
- Better coherence in long-form generation
- Reduced myopia: Considers future implications vs. greedy decisions
Summary
Key Takeaways:
- Classical search algorithms face challenges with transformer LMs
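- Approximate methods (recombination, learned or lookahead heuristics) make search practical, usually without optimality guarantees
- Heuristic design is crucial for search quality
- Best-first approaches can match beam search results faster, or improve on them with lookahead heuristics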
- Constraint satisfaction can be integrated into search
Next Class: Other controlled generation methods and their search implications