Best-First Search for Language Models

Graham Neubig

Carnegie Mellon University Language Technologies Institute

Overview: Greedy Search and Beam Search

Greedy Search:

  • Select highest probability token at each step
  • Fast but myopic
  • No backtracking possible

Beam Search:

  • Maintain top-k hypotheses
  • Explores multiple paths simultaneously
  • Better quality than greedy
  • Still limited exploration
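
A minimal sketch of the two strategies above, assuming a hypothetical toy model (`TOY_MODEL`, a hand-written table of next-token probabilities, not a real LM). Greedy search is treated here as beam search with k = 1.

```python
import math

# Hypothetical toy model: next-token probabilities keyed by prefix (illustration only).
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}
MAX_LEN = 2

def beam_search(k):
    """Return the k best length-MAX_LEN hypotheses as (log_prob, tokens); k=1 is greedy."""
    beam = [(0.0, ())]
    for _ in range(MAX_LEN):
        candidates = [
            (score + math.log(p), prefix + (tok,))
            for score, prefix in beam
            for tok, p in TOY_MODEL[prefix].items()
        ]
        # Keep only the k highest-scoring extensions.
        beam = sorted(candidates, reverse=True)[:k]
    return beam

greedy_best = beam_search(k=1)[0]  # commits to "a" first and cannot backtrack
beam_best = beam_search(k=2)[0]    # also keeps "b" alive, which leads to a better path
```

In this toy table the greedy path has probability 0.6 × 0.55 = 0.33, while the beam of size 2 recovers the path through "b" with probability 0.4 × 0.9 = 0.36.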

Search Spaces as Graphs

We can represent a search space as a weighted finite-state automaton \((S, \Sigma, \delta, s_0, F, w)\) where:

  • \(S\) is a finite set of states
  • \(\Sigma\) is a finite alphabet
  • \(\delta \subseteq S \times \Sigma \times S\) is the transition relation
  • \(s_0 \in S\) is the initial state
  • \(F \subseteq S\) is the set of final states
  • \(w: \delta \rightarrow \mathbb{R}\) assigns weights to transitions
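
One possible way to hold this tuple in code, as a sketch only. The class name `WFSA`, its field names, and the toy automaton are assumptions for illustration; the transition relation and the weight function are stored together in one dictionary.

```python
from dataclasses import dataclass

@dataclass
class WFSA:
    states: set
    alphabet: set
    transitions: dict  # maps each (source, symbol, target) in delta to its weight w
    initial: str
    finals: set

    def successors(self, state):
        """Return (symbol, target, weight) for every transition leaving `state`."""
        return [
            (sym, dst, w)
            for (src, sym, dst), w in self.transitions.items()
            if src == state
        ]

# Hypothetical two-step automaton (weights chosen arbitrarily).
toy = WFSA(
    states={"q0", "q1", "q2"},
    alphabet={"a", "b"},
    transitions={("q0", "a", "q1"): 0.5, ("q0", "b", "q1"): 1.0, ("q1", "a", "q2"): 0.25},
    initial="q0",
    finals={"q2"},
)
```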

An Example Graph

How to Express Edge Weights?

1. Probabilities

  • Range: [0, 1]
  • Path scoring: Multiply weights
  • Best path: arg max (highest product)
  • Example: P(path) = 0.5 × 0.3 × 0.2 = 0.03

2. Log Probabilities

  • Range: (-∞, 0]
  • Path scoring: Add weights
  • Best path: arg max (highest sum)
  • Example: log P(path) = log(0.5) + log(0.3) + log(0.2) = -3.51

3. Negative Log Probabilities

  • Range: [0, ∞)
  • Path scoring: Add weights
  • Best path: arg min (lowest sum)
  • Example: -log P(path) = -log(0.5) - log(0.3) - log(0.2) = 3.51
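
The three conventions can be checked numerically on the running example. This sketch assumes natural logarithms, which is what the values ±3.51 correspond to.

```python
import math

probs = [0.5, 0.3, 0.2]

# 1. Probabilities: multiply, pick the highest product.
path_prob = math.prod(probs)

# 2. Log probabilities: add, pick the highest sum.
log_prob = sum(math.log(p) for p in probs)

# 3. Negative log probabilities: add, pick the lowest sum.
neg_log_prob = sum(-math.log(p) for p in probs)
```

All three rank paths identically; the additive forms avoid numerical underflow on long sequences.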

Example Graph w/ Negative Log Probabilities

Greedy Search Example

Beam Search Example

Depth-First Search: Optimal but Inefficient

Depth-First Search (DFS):

  • Systematically explores all paths
  • Guaranteed to find optimal solution
  • Explores paths to completion before backtracking

Optimality: DFS will eventually find the path with minimum cost, provided the search space is finite (for example, when a maximum sequence length is imposed)

Why optimal?

  • Exhaustive search of the entire space
  • No pruning of potentially optimal paths
  • Complete exploration guarantees finding global optimum
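
A small sketch of exhaustive DFS over a hypothetical acyclic graph (`GRAPH` and its edge costs are made up for illustration; costs play the role of negative log probabilities, so lower is better).

```python
# Hypothetical acyclic graph: node -> list of (successor, edge cost).
GRAPH = {
    "s": [("a", 0.5), ("b", 1.0)],
    "a": [("goal", 2.0)],
    "b": [("goal", 0.5)],
    "goal": [],
}

def dfs_best_path(node, goal, cost=0.0, path=None):
    """Explore every path from `node` and return (total cost, path) of the cheapest."""
    path = (path or []) + [node]
    if node == goal:
        return cost, path
    best = (float("inf"), None)
    for nxt, w in GRAPH[node]:
        # Follow each branch to completion before trying the next one.
        result = dfs_best_path(nxt, goal, cost + w, path)
        best = min(best, result, key=lambda r: r[0])
    return best

best_cost, best_path = dfs_best_path("s", "goal")
```

The cheapest first step (s to a) leads to the more expensive complete path (2.5), so only the exhaustive sweep finds the optimum (1.5). The price is visiting every path.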

DFS Search Example

A* Search and Admissible Heuristics

A* Search Algorithm:

  • f(n) = g(n) + h(n)
  • g(n): actual cost from start to n
  • h(n): heuristic cost from n to goal
  • Optimal when heuristic is admissible

Admissible Heuristic h(n):

  • Never overestimates the true cost to goal
  • h(n) ≤ h*(n) where h*(n) is true cost to goal
  • Provides lower bound on remaining cost

Key Property: A* finds optimal solution efficiently by using heuristic to guide search toward goal
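
A compact A* sketch with a priority queue, under the same kind of toy assumptions: `GRAPH` and the heuristic table `H` are hypothetical, and `H` was chosen by hand so that it never exceeds the true remaining cost.

```python
import heapq

# Hypothetical graph: node -> list of (successor, edge cost).
GRAPH = {
    "s": [("a", 0.5), ("b", 1.0)],
    "a": [("goal", 2.0)],
    "b": [("goal", 0.5)],
    "goal": [],
}
# Hand-picked admissible heuristic: each value is <= the true remaining cost.
H = {"s": 1.0, "a": 2.0, "b": 0.5, "goal": 0.0}

def a_star(start, goal):
    """Return (cost, path, expanded nodes) for the cheapest path, or None."""
    frontier = [(H[start], 0.0, [start])]  # entries are (f = g + h, g, path)
    expanded = []
    while frontier:
        f, g, path = heapq.heappop(frontier)
        node = path[-1]
        if node == goal:
            return g, path, expanded
        expanded.append(node)
        for nxt, w in GRAPH[node]:
            heapq.heappush(frontier, (g + w + H[nxt], g + w, path + [nxt]))
    return None

cost, path, expanded = a_star("s", "goal")
```

Here the heuristic steers the search past node "a" entirely: it is pushed onto the queue but never expanded.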

A* Search Example

Challenges for Transformer Language Models

Two Major Problems

1. Ever-Expanding Search Graph

  • No hypothesis recombination possible
  • Each partial sequence has unique hidden state
  • Identical recent tokens ≠ identical states (the state depends on the entire prefix)
  • Exponential growth in hypotheses

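2. No Useful Admissible Heuristic

  • Only the trivial bound is guaranteed: remaining log probability ≤ 0, i.e. h(n) = 0
  • Future tokens can have arbitrarily low probability, so the true remaining cost is unbounded
  • No tighter estimate can be guaranteed to satisfy h(n) ≤ h*(n)
  • A* optimality guarantees are lost once an informative (learned) heuristic is used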

Result: Classical optimal search algorithms don’t directly apply to transformer-based language generation

Hypothesis Recombination

Key Insight: Group similar hidden states for approximate recombination

Algorithm: Cluster hypotheses with similar hidden representations, keep best from each cluster

Benefits:

  • Reduces exponential growth
  • Maintains search quality
  • Enables practical application of search algorithms

Hypothesis Recombination: Example

Hypothesis Recombination Criteria

1. n-gram Based Clustering

Criterion: Shared recent word contexts

\[\text{Cluster if: } \mathbf{y}_{t-n+1:t} = \mathbf{y}'_{t-n+1:t}\]

Properties:

  • Surface form similarity
  • Easy to implement and cache
  • Works with beam search decoders
  • Truncation length controls precision
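
The n-gram criterion can be sketched as a dictionary keyed by the last n tokens. The function name `recombine_by_ngram` and the example hypotheses are assumptions for illustration; all hypotheses are assumed to be at the same time step t.

```python
def recombine_by_ngram(hypotheses, n):
    """Keep the best-scoring hypothesis for each distinct last-n-token context.

    hypotheses: list of (log_prob, tokens), all of the same length.
    """
    best = {}
    for score, tokens in hypotheses:
        key = tuple(tokens[-n:])  # the shared recent context y_{t-n+1:t}
        if key not in best or score > best[key][0]:
            best[key] = (score, tokens)
    return sorted(best.values(), reverse=True)

hyps = [
    (-1.2, ["the", "cat", "sat", "on"]),
    (-1.5, ["a", "cat", "sat", "on"]),
    (-2.0, ["the", "dog", "sat", "on"]),
    (-1.1, ["the", "cat", "lay", "on"]),
]
merged_bigram = recombine_by_ngram(hyps, n=2)   # coarse: "sat on" hypotheses collapse
merged_trigram = recombine_by_ngram(hyps, n=3)  # finer: "cat sat on" vs "dog sat on"
```

A larger n merges fewer hypotheses, which is the precision trade-off the truncation length controls.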

2. Hidden Vector Distance

Criterion: Similarity in neural representations

\[\text{Cluster if: } D(\mathbf{h}_t, \mathbf{h}'_t) \leq \gamma\]

Distance Measures:

  • Euclidean: \(D = \sqrt{\sum_k (h_{t,k} - h'_{t,k})^2}\)
  • Cosine similarity
  • KL divergence between distributions

Properties:

  • Captures semantic similarity
  • Tunable precision via threshold \(\gamma\)
  • More computationally expensive

Reference: Liu et al. (2014) “Efficient Lattice Rescoring Using Recurrent Neural Network Language Models”
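
A sketch of the distance criterion with Euclidean distance and a simple greedy clustering pass. The clustering strategy, the function name `recombine_by_distance`, and the 2-dimensional "hidden states" are assumptions for illustration, not the procedure of the cited paper.

```python
import math

def recombine_by_distance(hypotheses, gamma):
    """Greedy clustering: keep a hypothesis only if no better one lies within gamma.

    hypotheses: list of (log_prob, hidden_vector).
    """
    kept = []
    # Visit hypotheses from best to worst so each cluster is represented by its best member.
    for score, h in sorted(hypotheses, key=lambda x: x[0], reverse=True):
        if all(math.dist(h, rep) > gamma for _, rep in kept):
            kept.append((score, h))
    return kept

hyps = [
    (-1.0, (0.0, 0.0)),
    (-1.3, (0.1, 0.0)),  # very close to the first hypothesis
    (-2.0, (3.0, 4.0)),  # far from both
]
loose = recombine_by_distance(hyps, gamma=0.5)
strict = recombine_by_distance(hyps, gamma=0.05)
```

Raising the threshold merges more aggressively; lowering it keeps nearly everything distinct.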

Future Cost: Inadmissible but Useful Heuristic

Paper: Modeling Future Cost for Neural Machine Translation (Duan et al., 2020)

Key Idea: Learn to predict the cost of completing a partial sequence

Future Cost Prediction:

\[h(\mathbf{y}_{\leq t}, \mathbf{x}) \approx -\log p_\theta(\text{completion}|\mathbf{y}_{\leq t}, \mathbf{x})\]

Search Integration:

\[\text{score}(\mathbf{y}_{\leq t}) = \log p_\theta(\mathbf{y}_{\leq t}|\mathbf{x}) - \lambda \cdot h(\mathbf{y}_{\leq t}, \mathbf{x})\]

Training: Auxiliary loss to predict future completion difficulty
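
A sketch of how a predicted future cost could change candidate ranking. It assumes h is a cost (a negative log probability, as in the prediction formula), so it is subtracted from the prefix log probability; the function name `rescore` and the numbers are hypothetical and stand in for the output of a learned predictor.

```python
def rescore(candidates, lam):
    """Rank candidates by prefix log probability minus lam times predicted future cost.

    candidates: list of (tokens, prefix_log_prob, predicted_future_cost).
    """
    return sorted(candidates, key=lambda c: c[1] - lam * c[2], reverse=True)

# Hypothetical values: the first prefix looks better now but is predicted
# to be expensive to complete.
candidates = [
    (("the", "cat"), -1.0, 6.0),
    (("a", "cat"), -1.5, 2.0),
]
without_future_cost = rescore(candidates, lam=0.0)
with_future_cost = rescore(candidates, lam=0.5)
```

With lam = 0 the ranking reduces to ordinary beam scoring; with lam = 0.5 the second prefix wins (-2.5 versus -4.0).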

Future Cost: Training Approaches

Two Ways to Train Future Cost Prediction:

1. Auxiliary Loss During Training

  • Joint training: Main translation loss + future cost loss
  • Target: Predict remaining cost from current state
  • Advantage: Integrated learning, shared representations
  • Challenge: Requires modification of training pipeline

2. Post-hoc Training

  • Separate model: Train future cost predictor after main model
  • Data: Use completed sequences to learn cost patterns
  • Advantage: Can be applied to existing models
  • Challenge: May not capture model-specific patterns as well

Key Insight: Both approaches learn to estimate \(h(\mathbf{y}_{\leq t}, \mathbf{x}) \approx -\log p_\theta(\text{completion}|\mathbf{y}_{\leq t}, \mathbf{x})\)

Future Cost: Properties and Results

Core Properties:

  • Not admissible: Can overestimate true completion cost
  • Empirically effective: Guides search toward better completions
  • Learnable: Adapts to specific task patterns through training

Experimental Results:

  • +1.5-2.0 BLEU improvement over baseline models
  • Better long sequence coherence
  • Reduced exposure bias effects

Best-First Beam Search: Core Insight

Paper: Best-first beam search (Meister et al., 2020)

Problem with Standard Beam Search:

  • Must analyze \(k\) hypotheses of given length before considering longer ones
  • Length-based prioritization is not necessary for finding \(k\)-optimal hypothesis
  • Computational inefficiency from rigid breadth-first expansion

Key Insight: Use score-based prioritization like A* while maintaining beam constraints

Main Contribution: Up to 10x speedup over standard beam search with identical results

Best-First Beam Search: Algorithm

Monotonicity Property: Scores can only decrease when extended

\[\text{score}(\mathbf{x}, \mathbf{y}_{\leq t}) \geq \text{score}(\mathbf{x}, \mathbf{y}_{\leq t} \circ y_{t+1})\]

Two Key Optimizations:

  1. Early Pruning: Remove hypotheses guaranteed to fall off beam
  2. Early Termination: Stop when the first complete hypothesis is popped from the queue; by monotonicity, no remaining hypothesis can score higher

Best-First Beam Search Algorithm
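
A sketch of the idea on a hypothetical bigram table (`BIGRAM`, not a real model). Scores are log probabilities, so they satisfy the monotonicity property. This version returns k complete hypotheses, so it stops after the k-th one is popped; with k = 1 that is the first.

```python
import heapq
import math

EOS = "</s>"
# Hypothetical bigram model: previous token -> next-token probabilities.
BIGRAM = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"b": 0.4, EOS: 0.6},
    "b": {"a": 0.2, EOS: 0.8},
}

def best_first_beam_search(k, max_len=10):
    """Return up to k complete hypotheses as (log_prob, tokens), best first."""
    queue = [(0.0, ("<s>",))]  # (negative log prob, tokens); heapq pops the lowest cost
    popped_at_length = {}
    finished = []
    while queue and len(finished) < k:  # early termination
        cost, tokens = heapq.heappop(queue)
        t = len(tokens)
        # Early pruning: only the first k hypotheses popped at each length stay on the beam.
        if popped_at_length.get(t, 0) >= k:
            continue
        popped_at_length[t] = popped_at_length.get(t, 0) + 1
        if tokens[-1] == EOS:
            finished.append((-cost, tokens))
            continue
        if t >= max_len:
            continue
        for tok, p in BIGRAM[tokens[-1]].items():
            heapq.heappush(queue, (cost - math.log(p), tokens + (tok,)))
    return finished

top1 = best_first_beam_search(k=1)
top2 = best_first_beam_search(k=2)
```

Because extending a hypothesis can only lower its score, hypotheses leave the queue in decreasing score order, which is why longer hypotheses can be expanded before shorter, lower-scoring ones without changing the result.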

Search Methods Comparison

Comparison of Search Methods

Best-First Beam Search: Results

Performance Improvements:

  • 2-10x speedup on neural machine translation
  • Identical BLEU scores to standard beam search
  • Greater speedups for larger beam sizes

When It Helps Most:

  • Large beam sizes (\(k \geq 10\))
  • Long sequences
  • High-quality models with good score separation

Trade-offs:

  • Memory overhead: \(O(k \cdot n_{\max})\) vs \(O(k)\)
  • Implementation complexity: More sophisticated data structures

A*esque Decoding Formulation

Paper: NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics (Lu et al., 2021)

  • States: Partial prefixes \(\mathbf{y}_{\leq t}\)
  • Actions: Tokens \(y_{t+1} \in \mathcal{V}\) (vocabulary)
  • Transitions: \(\mathbf{y}_{\leq t} \circ y_{t+1}\) (append token to prefix)

Objective: Find optimal sequence maximizing

\[\mathbf{y}_* = \arg \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{y})\]

where \(f(\mathbf{y}) = s(\mathbf{y}) + h(\mathbf{y})\)

  • \(s(\mathbf{y}) = \log p_\theta(\mathbf{y})\) (log probability)
  • \(h(\mathbf{y})\) = constraint satisfaction score

A*-Inspired Lookahead Framework

Core Insight: Incorporate future estimates into candidate selection

Connection to Future Cost: Both use \(\text{score} = \text{current} + \lambda \cdot \text{heuristic}\)

Future Cost Approach

  • Heuristic: Learned predictor \(h(\mathbf{y}, \mathbf{x})\)
  • Training: Auxiliary loss to learn completion cost
  • Computation: Single forward pass
  • Generalization: Requires retraining for new domains

NeuroLogic A*esque Approach

  • Heuristic: Generated continuations \(h(\mathbf{y}_{\leq t+\ell})\)
  • Training: No additional training needed
  • Computation: Multiple forward passes for lookahead
  • Generalization: Works with any pre-trained model

A*esque Formulation:

\[Y_t = \underset{\mathbf{y}_{\leq t}}{\operatorname{arg\,topk}} \left\{s(\mathbf{y}_{\leq t}) + h(\mathbf{y}_{\leq t+\ell})\right\}\]

where \(h(\mathbf{y}_{\leq t+\ell}) = \lambda \log p_\theta(\mathbf{y}_{t+1:t+\ell}|\mathbf{y}_{\leq t}, \mathbf{x})\) and arg topk selects the \(k\) highest-scoring candidates
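
A sketch of lookahead scoring using a greedy rollout as the continuation, which is one possible choice of lookahead strategy. The table `BIGRAM` and the function names are hypothetical; the heuristic is lam times the log probability of the length-ell rollout.

```python
import math

# Hypothetical bigram model: previous token -> next-token probabilities.
BIGRAM = {
    "a": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.9, "y": 0.1},
}

def greedy_lookahead_logp(last_token, ell):
    """Log probability of a greedy continuation of up to ell tokens."""
    total = 0.0
    for _ in range(ell):
        if last_token not in BIGRAM:  # no known continuation, stop the rollout
            break
        last_token, p = max(BIGRAM[last_token].items(), key=lambda kv: kv[1])
        total += math.log(p)
    return total

def select_top_k(candidates, k, ell, lam):
    """candidates: list of (log_prob, tokens). Rank by s + lam * lookahead log prob."""
    scored = [
        (s + lam * greedy_lookahead_logp(toks[-1], ell), s, toks)
        for s, toks in candidates
    ]
    return [(s, toks) for _, s, toks in sorted(scored, reverse=True)[:k]]

candidates = [(math.log(0.6), ("a",)), (math.log(0.4), ("b",))]
no_lookahead = select_top_k(candidates, k=1, ell=0, lam=1.0)
one_step = select_top_k(candidates, k=1, ell=1, lam=1.0)
```

Without lookahead, "a" wins on current score alone; with one step of lookahead, "b" wins because its best continuation is far more likely (0.4 × 0.9 versus 0.6 × 0.5).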

NeuroLogic A*esque: Constrained Generation

Key Innovation: Future constraint satisfaction estimation

Constraint Format: Include/exclude specific phrases

  • Example: Include “cat” and “fish” in generated text

Enhanced Scoring:

\[f(\mathbf{y}_{\leq t+\ell}) = \text{probability} + \text{current constraints} + \text{future constraints}\]

Future Constraint Heuristic: Estimate probability of satisfying constraints in lookahead continuations

Benefits: Early detection of constraint violations, better planning toward satisfiable paths
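
A deliberately simplified illustration of the three-term score, not the paper's exact scoring function. Constraints are counted as satisfied or not; the weights `alpha` and `lam` and the function name `constraint_score` are assumptions.

```python
def constraint_score(log_prob, prefix, lookahead, include, alpha=1.0, lam=0.5):
    """Combine log probability with current and estimated future constraint satisfaction."""
    # Constraints already satisfied by the generated prefix.
    current = sum(word in prefix for word in include)
    # Constraints not yet satisfied that appear in the lookahead continuation.
    future = sum(word in lookahead and word not in prefix for word in include)
    return log_prob + alpha * current + lam * future

include = {"cat", "fish"}
score_a = constraint_score(-2.0, ["the", "cat"], ["eats", "fish"], include)
score_b = constraint_score(-1.5, ["the", "dog"], ["runs", "fast"], include)
```

The first candidate is less probable but already contains "cat" and is on track to produce "fish", so it outranks the second once constraint terms are added.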

NeuroLogic A*esque: Algorithm & Results

Algorithm Steps:

  1. Expand: Generate candidate continuations
  2. Lookahead: For each candidate, generate length-\(\ell\) continuations
  3. Score: \(f(\mathbf{y}_{\leq t+\ell}) = s(\mathbf{y}_{\leq t}) + h(\mathbf{y}_{\leq t+\ell})\)
  4. Select: Top-k candidates based on combined score
  5. Prune: Remove constraint-violating candidates

Key Results:

  • +1.2-2.1 BLEU improvement over beam search
  • 95%+ constraint satisfaction vs. 60-70% for baselines
  • Better coherence in long-form generation
  • Reduced myopia: Considers future implications vs. greedy decisions

Summary

Key Takeaways:

  1. Classical search algorithms face challenges with transformer LMs
  2. Approximate methods make near-optimal search practical
  3. Heuristic design is crucial for search quality
  4. Best-first approaches can match standard beam search results at lower computational cost
  5. Constraint satisfaction can be integrated into search

Next Class: Other controlled generation methods and their search implications

Questions + Discussion