Schedule
Introduction to Language Models and Inference (Aug 26)
Content:
- What is a language model?
- What is an inference algorithm?
- What will we not cover?
- What are transformers?
- How do modern LMs work?
- Modeling errors and search errors
- Prompting as a means of model control
- Instruction following behavior
Code: Code
Reading Material
- Reference: Sections 1 and 2 of From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models (arXiv)
Probability Review (Aug 28)
Content:
- Probability review
- Transformer implementation
- Generation and evaluation
- Meta-generation
Code: Code
Reading Material
None
Common Sampling Methods for Modern NLP (Sep 2)
Content:
- Common sampling methods for modern NLP
- Diversity-quality tradeoffs
Slides: Google Slides
Code: n/a
Reading Material
- Reference: A Thorough Examination of Decoding Methods in the Era of LLMs
- Reference: Trading Off Diversity and Quality in Natural Language Generation
- Optional: Calibration of Pre-trained Transformers
- Optional: Locally Typical Sampling
- Optional: Forking Paths in Neural Text Generation
- Optional: Truncation Sampling as Language Model Desmoothing
- Optional: Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity
- Optional: The Curious Case of Neural Text Degeneration
- Optional: Calibrated Language Models Must Hallucinate
- Optional: An Empirical Investigation of Global and Local Normalization for Recurrent Neural Sequence Models Using a Continuous Relaxation to Beam Search
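A minimal sketch of two of the truncation samplers this lecture covers (top-k and nucleus/top-p), written in plain Python over a toy list of next-token logits. The function names and the toy logits are illustrative assumptions, not the course's reference code.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, with optional temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return {i: probs[i] / cum for i in kept}

def sample(dist, rng=random):
    """Draw one token id from a {token_id: probability} dict."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # toy next-token logits (assumed values)
probs = softmax(logits)
nucleus = top_p_filter(probs, p=0.9)  # drops the low-probability tail token
token = sample(nucleus, random.Random(0))
```

Lowering `p` or `k` trades diversity for quality by cutting more of the tail; raising `temperature` does the opposite by flattening the distribution before truncation.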
Beam Search and Variants (Sep 4)
Content:
- Beam search and variants
- Inadequacies of the mode
Slides: Google Slides
Code: TBA
Reading Material
- Reference: Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
- Reference: Stochastic Beams and Where to Find Them: The Gumbel-Top-k Trick for Sampling Sequences Without Replacement
- Optional: Breaking the Beam Search Curse: A Study of (Re-)Scoring Methods and Stopping Criteria for Neural Machine Translation
- Optional: If beam search is the answer, what was the question?
- Optional: Gumbel-max trick and weighted reservoir sampling
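A small sketch of standard beam search, run against a hand-built toy model so that the difference from greedy decoding is visible. The `toy_model` table, its probabilities, and the `</s>` end-of-sequence symbol are assumptions made for illustration only.

```python
import math

def toy_model(prefix):
    """Hypothetical next-token model: returns {token: log-probability}."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "</s>": 0.1},
    }
    probs = table.get(prefix, {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def beam_search(next_logprobs, beam_size=2, max_len=4, eos="</s>"):
    """Return the highest-scoring (sequence, log-probability) found."""
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = [
            (prefix + (tok,), score + lp)
            for prefix, score in beams
            for tok, lp in next_logprobs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        # Keep only the top beam_size; completed ones leave the beam.
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

best_seq, best_score = beam_search(toy_model, beam_size=2)
greedy_seq, greedy_score = beam_search(toy_model, beam_size=1)
```

With `beam_size=1` the search commits to the locally best first token `a` and ends with total probability 0.3, while `beam_size=2` keeps `b` alive and finds the sequence with probability 0.36. This sketch compares raw log-probabilities, so it does not apply any length normalization.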
Intro to A* and Best-First Search (Sep 9)
Content:
- Introduction to A* and best-first search
- A* methods for controlled generation
Reading Material
- Reference: Efficient Lattice Rescoring Using Recurrent Neural Network Language Models (PDF)
- Reference: Modeling Future Cost for Neural Machine Translation (arXiv)
- Reference: Best-First Beam Search (arXiv)
- Reference: NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics (arXiv)
Assignments
Other Controlled Generation Methods (Sep 11)
Content:
- Other controlled generation methods
- Decoding-time distributional modifiers
Slides: Google Slides
Code: TBA
Reading Material
- Reference: Llama.cpp README on formal-grammar-based constraints
- Reference: FUDGE: Controlled Text Generation With Future Discriminators
- Reference: Controlled Decoding from Language Models
- Optional: Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
- Optional: Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning
Chain of Thought and Intermediate Steps (Sep 16)
Content:
- Chain of thought / scratchpad, intermediate steps
- Why does chain of thought work?
- Self-consistency and variants
Reading Material
Core Papers:
- Reference: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (arXiv)
- Reference: Large Language Models are Zero-Shot Reasoners (arXiv)
- Reference: Self-Consistency Improves Chain of Thought Reasoning in Language Models (arXiv)
Additional Research:
- Reference: Adaptive Computation Time for Recurrent Neural Networks (arXiv)
- Reference: PonderNet: Learning to Ponder (arXiv)
- Reference: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models (arXiv)
- Reference: Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws (arXiv)
- Reference: Adaptive-Consistency: A Cost-Efficient, Model-Agnostic Technique (arXiv)
- Reference: To CoT or not to CoT? Chain-of-thought helps mainly on math and logic (arXiv)
- Reference: Language Models Don’t Always Say What They Think (arXiv)
- Reference: Complexity-Based Prompting for Multi-step Reasoning (arXiv)
- Reference: Multimodal Chain-of-Thought Reasoning in Language Models (arXiv)
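A minimal sketch of the voting step in self-consistency, assuming the final answers have already been extracted from several sampled chains of thought. The sample answers are made up, and `None` stands for a chain whose answer could not be parsed.

```python
from collections import Counter

def self_consistency(answers):
    """Return the most common final answer and its share of the valid votes."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None, 0.0
    answer, votes = counts.most_common(1)[0]
    return answer, votes / sum(counts.values())

# Hypothetical final answers parsed from five sampled reasoning chains.
sampled_answers = ["18", "18", "20", None, "18"]
answer, share = self_consistency(sampled_answers)
```

The vote share can double as a rough confidence signal, for example to decide whether drawing more samples is worthwhile.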
Paper Presentations
Self-Refine and Self-Correction Methods (Sep 18)
Content:
- Self-refine and iterative refinement with self-feedback
- Learning to self debug for code generation
- Reflexion: verbal reinforcement learning for agents
- Limitations and challenges of self-correction
- Tool-interactive critiquing and external feedback
Code: TBA
Reading Material
- Primary: Self-Refine: Iterative Refinement with Self-Feedback (arXiv)
- Primary: Learning to Self Debug (arXiv)
- Reference: Reflexion: Language Agents with Verbal Reinforcement Learning (arXiv)
- Reference: Large Language Models Cannot Self-Correct Reasoning Yet (arXiv)
- Reference: CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing (arXiv)
- Reference: SCoRe: Self-Correction via Reinforcement Learning (arXiv)
Student Paper Presentations
Reasoning Models (Sep 23)
Content:
- What is a reasoning model?
- Training reasoning models with reinforcement learning
- STaR: Self-Taught Reasoner
- DeepSeek R1 and GRPO
- Understanding long chain-of-thought reasoning
- Reasoning transfer across domains
- Advanced reasoning algorithms (S1, L1, Stream of Search, LAPS)
Reading Material
- Reference: STaR: Bootstrapping Reasoning With Reasoning (arXiv)
- Reference: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (arXiv)
- Reference: Demystifying Long Chain-of-Thought Reasoning: An Empirical Study (arXiv)
- Reference: SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild (arXiv)
- Reference: Does learning math help language models reason better? (arXiv)
- Reference: s1: Simple test-time scaling (arXiv)
- Reference: L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning (arXiv)
- Reference: Stream of Search (SoS): Learning to Search in Language (arXiv)
- Reference: Learning Adaptive Parallel Search for Reasoning (arXiv)
Incorporating Tools (Sep 25)
Content:
- What are tools? Definition and taxonomy
- Basic tool use paradigm
- Key approaches: PAL, Toolformer, Gorilla, WebGPT
- Tool creation: TroVE and Large Language Models as Tool Makers
- Tool robustness: Benchmarking failures in tool-augmented language models
- Standardized function calling (JSON Schema)
- Parallel function calling
- Model Context Protocol (MCP) and MCP registries
- FastMCP framework for rapid MCP development
- Sandboxed code execution for secure tool use
- Tool use scenarios and trade-offs
- Evaluation challenges and best practices
Reading Material
Main Survey:
- Wang et al., “What Are Tools Anyway? A Survey from the Language Model Perspective” (2024)
Key Papers:
- Gao et al., “PAL: Program-aided Language Models” (2022)
- Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools” (2023)
- Patil et al., “Gorilla: Large Language Model Connected with Massive APIs” (2023)
- Nakano et al., “WebGPT: Browser-assisted question-answering with human feedback” (2021)
- Cai et al., “Large Language Models as Tool Makers” (2023)
- Wang et al., “TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks” (2024)
- Treviño et al., “Benchmarking Failures in Tool-Augmented Language Models” (2025)
Agents and Multi-Agent Communication (Sep 30)
Reward Models and Best-of-N (Oct 2)
Content:
- Reward models, best-of-n theory and practice
- Monte Carlo Tree Search
Slides: TBA
Code: TBA
Reading Material
- Reference: Why reward models are key for alignment (by Nathan Lambert)
- Reference: Theoretical guarantees on the best-of-n alignment policy (arXiv)
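A minimal sketch of best-of-n selection: draw n candidates, score each with a reward model, and return the highest-scoring one. Here `toy_generate` and `toy_reward` are hypothetical stand-ins for an LM sampler and a learned reward model.

```python
import itertools

def best_of_n(prompt, generate, reward, n=4):
    """Sample n candidates and return the one with the highest reward."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]

# Hypothetical sampler: cycles through canned responses instead of calling an LM.
_canned = itertools.cycle(["4", "The answer is 4.", "I think 5", "2 + 2 = 4"])

def toy_generate(prompt):
    return next(_canned)

# Hypothetical reward: prefers responses containing "4", with a length penalty.
def toy_reward(prompt, response):
    return (1.0 if "4" in response else 0.0) - 0.01 * len(response)

best, score = best_of_n("What is 2 + 2?", toy_generate, toy_reward, n=4)
```

The cost is n generation calls plus n reward calls per prompt, and the result is only as good as the reward model: a larger n gives more opportunity to exploit its flaws.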
Assignments
Minimum Bayes Risk and Multi-Sample Strategies (Oct 7)
Content:
- What do we get when we sample more?
- Minimum Bayes Risk and similar methods
Slides: TBA
Code: TBA
Reading Material
- Reference: It’s MBR All the Way Down: Modern Generation Techniques Through the Lens of Minimum Bayes Risk (arXiv)
- Reference: Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation (arXiv)
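A minimal sketch of sampling-based Minimum Bayes Risk decoding: each sample is scored by its average utility against the other samples, and the one with the highest expected utility is returned. The unigram F1 utility and the example sentences are assumptions chosen to keep the sketch dependency-free; in practice a learned or task-specific metric would usually take its place.

```python
from collections import Counter

def unigram_f1(hyp, ref):
    """Toy utility: F1 over whitespace-token overlap between two strings."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    overlap = sum((Counter(h) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def mbr_decode(samples, utility=unigram_f1):
    """Return the sample with the highest mean utility against the others."""
    if len(samples) < 2:
        return samples[0]

    def expected_utility(i):
        others = [s for j, s in enumerate(samples) if j != i]
        return sum(utility(samples[i], o) for o in others) / len(others)

    best = max(range(len(samples)), key=expected_utility)
    return samples[best]

# Hypothetical samples drawn from a model for one input.
samples = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "the cat is on the mat",
    "dogs bark loudly",
]
choice = mbr_decode(samples)
```

The selected output is the "consensus" sample rather than the most probable one, and the pairwise comparison costs a number of utility calls that grows quadratically with the number of samples.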
Assignments
- Homework 3 out: build an LLM system with a code interpreter and a small reward model, visualize the system, and benchmark a set of variants of this method on the shared tasks
Systems not Models (Oct 9)
Content:
- Parallels to older “pipeline NLP”
- Ensembling
- Visualizing and evaluating systems
- Human-in-the-loop decoding
- Brief discussion of HCI perspectives
Slides: TBA
Code: TBA
Reading Material
No Class - Fall Break (Oct 14)
No Class - Fall Break (Oct 16)
Inference Scaling vs Model Size (Oct 21)
Content:
- Inference scaling versus scaling model size
- Differences in cost and latency considerations
- Modeling scaling behavior
Slides: TBA
Code: TBA
Reading Material
- Reference: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (arXiv)
Token Budgets and Training-Time Distillation (Oct 23)
Content:
- Token budgets
- Training-time distillation of inference algorithms
- Draft CoT
- Early exit voting
Slides: TBA
Code: TBA
Reading Material
- Reference: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (arXiv)
- Reference: Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models (arXiv)
- Reference: Direct Preference Optimization for Neural Machine Translation with Minimum Bayes Risk Decoding (arXiv)
- Reference: MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods (arXiv)
Diffusion Models (Oct 28)
Content:
- Introduction to diffusion models
- Denoising diffusion probabilistic models (DDPM)
- Score-based generative models
- Diffusion models for text generation
- Comparison with autoregressive models
- Inference techniques for diffusion models
- Applications in multimodal generation
Slides: TBA
Code: TBA
Reading Material
Defining Efficiency (Oct 30)
Content:
- How do we define efficiency?
- Different places where a method can be efficient (e.g. memory, latency, token cost for APIs)
- Brief review of hardware for inference
Slides: TBA
Code: TBA
Reading Material
No Class - Democracy Day (Nov 4)
Inference and Hardware (Nov 6)
Content:
- Overview of hardware relevant to LLM inference (GPUs, TPUs, accelerators)
- Memory bandwidth, compute, and latency considerations
- Parallelism strategies and deployment tradeoffs
Slides: TBA
Code: TBA
Reading Material
Library Implementations and Optimizations (Nov 11)
Content:
- Library implementations
- Lazy softmax
- Flash attention
- How do vLLM/SGLang/similar speed up generation?
Slides: TBA
Code: TBA
Reading Material
- Reference: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (arXiv)
- Reference: Self-attention Does Not Need O(n^2) Memory (arXiv)
Assignments
Prefix Sharing and KV Cache Optimizations (Nov 13)
Content:
- Prefix sharing
- KV cache reuse
- Key-value cache compression
- Model compression
- Brief quantization overview
Slides: TBA
Code: TBA
Reading Material
Draft Models and Speculative Decoding (Nov 18)
Content:
- Draft models
- Speculative decoding
- Other latency-improving methods
Slides: TBA
Code: TBA
Reading Material
Linearizing Attention and Sparse Models (Nov 20)
Transformer Alternatives (Nov 25)
Content:
- Transformer alternatives
Slides: TBA
Code: TBA
Reading Material
- Reference: The Annotated S4