Schedule
Introduction to Language Models and Inference (Aug 26)
Content:
- What is a language model?
- What is an inference algorithm?
- What will we not cover?
- What are transformers?
- How do modern LMs work?
- Modeling errors and search errors
- Prompting as a means of model control
- Instruction following behavior
Code: Code
Reading Material
- Reference: Sections 1+2 from From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models (arXiv)
Probability Review (Aug 28)
Content:
- Probability review
- Transformer implementation
- Generation and evaluation
- Meta-generation
Code: Code
Reading Material
None
Common Sampling Methods for Modern NLP (Sep 2)
Content:
- Common sampling methods for modern NLP
- Diversity-quality tradeoffs
Slides: Google slides
Code: n/a
Reading Material
- Reference: A Thorough Examination of Decoding Methods in the Era of LLMs
- Reference: Trading Off Diversity and Quality in Natural Language Generation
- Optional: Calibration of Pre-trained Transformers
- Optional: Locally Typical Sampling
- Optional: Forking Paths in Neural Text Generation
- Optional: Truncation Sampling as Language Model Desmoothing
- Optional: Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity
- Optional: The Curious Case of Neural Text Degeneration
- Optional: Calibrated Language Models Must Hallucinate
- Optional: An Empirical Investigation of Global and Local Normalization for Recurrent Neural Sequence Models Using a Continuous Relaxation to Beam Search
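The sampling methods covered in this session are simple transformations of the next-token distribution. Below is a minimal sketch, assuming a toy logits vector rather than a real LM, of temperature scaling plus top-k and top-p (nucleus) truncation; all names and numbers are illustrative.

```python
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample a token index after temperature scaling and optional
    top-k / top-p (nucleus) truncation of the next-token distribution."""
    # Temperature scaling: divide logits before the softmax.
    probs = softmax([x / temperature for x in logits])
    # Sort token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cum = [], 0.0
        for i in order:               # keep the smallest prefix covering mass >= top_p
            kept.append(i)
            cum += probs[i]
            if cum >= top_p:
                break
        order = kept
    # Renormalize over the surviving tokens and sample.
    mass = sum(probs[i] for i in order)
    r, acc = random.random() * mass, 0.0
    for i in order:
        acc += probs[i]
        if acc >= r:
            return i
    return order[-1]

# Toy next-token logits over a 6-token vocabulary (illustrative only).
logits = [2.0, 1.5, 0.3, 0.1, -1.0, -2.0]
print(sample(logits, temperature=0.7, top_p=0.9))
```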
Beam Search and Variants (Sep 4)
Content:
- Beam search and variants
- Inadequacies of the mode
Slides: Google slides
Code: TBA
Reading Material
- Reference: Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
- Reference: Stochastic Beams and Where to Find Them: The Gumbel-Top-k Trick for Sampling Sequences Without Replacement
- Optional: Breaking the Beam Search Curse: A Study of (Re-)Scoring Methods and Stopping Criteria for Neural Machine Translation
- Optional: If beam search is the answer, what was the question?
- Optional: Gumbel-max trick and weighted reservoir sampling
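As a companion to this session, a minimal sketch of plain beam search over a toy next-token model; `next_logprobs` is a hypothetical stand-in for an LM that returns a log-probability per token, and the scoring omits length normalization and the other variants discussed in the readings.

```python
import math

def beam_search(next_logprobs, beam_size=3, max_len=5, eos=0):
    """Minimal beam search: keep the top-scoring partial hypotheses each step.
    `next_logprobs(prefix)` returns {token: log-prob} for the next token."""
    beams = [([], 0.0)]          # (token prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in next_logprobs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        # Keep the `beam_size` best continuations by cumulative score.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])

# Toy "model": prefers token 1 and becomes more likely to emit eos (0) as length grows.
def toy_model(prefix):
    p_eos = min(0.1 + 0.2 * len(prefix), 0.9)
    rest = (1 - p_eos) / 2
    return {0: math.log(p_eos), 1: math.log(rest), 2: math.log(rest)}

print(beam_search(toy_model, beam_size=2))
```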
Intro to A* and Best First Search (Sep 9)
Content:
- Introduction to A* and best first search
- A* methods for controlled generation
Reading Material
- Reference: Efficient Lattice Rescoring Using Recurrent Neural Network Language Models (PDF)
- Reference: Modeling Future Cost for Neural Machine Translation (arXiv)
- Reference: Best-First Beam Search (arXiv)
- Reference: NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics (arXiv)
Assignments
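To make the best-first framing concrete, here is a small sketch of A*-style decoding over partial sequences: hypotheses are ranked by f = g + h, where g is accumulated log-probability and h is a future-score heuristic. The toy model and the zero heuristic are illustrative assumptions, not any particular paper's method.

```python
import heapq, math

def a_star_decode(next_logprobs, heuristic, eos=0, max_expansions=1000):
    """Best-first search over partial sequences, ranked by f = g + h:
    g is the accumulated log-probability, h an estimate of the best
    achievable future log-probability."""
    # heapq is a min-heap, so we push the negated priority.
    frontier = [(-heuristic([]), 0.0, [])]
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_f, g, prefix = heapq.heappop(frontier)
        if prefix and prefix[-1] == eos:
            return prefix, g                      # first completed hypothesis popped
        for tok, lp in next_logprobs(prefix).items():
            new_prefix, new_g = prefix + [tok], g + lp
            f = new_g + heuristic(new_prefix)
            heapq.heappush(frontier, (-f, new_g, new_prefix))
    return None

# Toy model; a zero heuristic reduces this to plain best-first search.
def toy_model(prefix):
    p_eos = min(0.2 + 0.2 * len(prefix), 0.9)
    rest = (1 - p_eos) / 2
    return {0: math.log(p_eos), 1: math.log(rest), 2: math.log(rest)}

print(a_star_decode(toy_model, heuristic=lambda prefix: 0.0))
```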
Other Controlled Generation Methods (Sep 11)
Content:
- Other controlled generation methods
- Decoding-time distributional modifiers
Slides: Google Slides
Code: TBA
Reading Material
- Reference: Llama.cpp README on formal-grammar-based constraints
- Reference: FUDGE: Controlled Text Generation With Future Discriminators
- Reference: Controlled Decoding from Language Models
- Optional: Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
- Optional: Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning
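Decoding-time distributional modifiers of the FUDGE flavor reweight each candidate next token by an estimate that the continuation will satisfy a target attribute. The sketch below shows only that reweighting; `lm_logprobs` and `attribute_logprob` are hypothetical placeholders, not an actual FUDGE implementation.

```python
import math, random

def modified_next_token(lm_logprobs, attribute_logprob, prefix, weight=1.0):
    """Combine LM next-token log-probs with an attribute scorer:
    score(x) = log p_LM(x | prefix) + weight * log p(attribute | prefix + x),
    then renormalize and sample."""
    scores = {}
    for tok, lp in lm_logprobs(prefix).items():
        scores[tok] = lp + weight * attribute_logprob(prefix + [tok])
    # Softmax-normalize the combined scores and sample a token.
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    r, acc = random.random(), 0.0
    for tok, s in scores.items():
        acc += math.exp(s - m) / z
        if acc >= r:
            return tok
    return tok

# Hypothetical placeholders: a 3-token LM and a scorer that prefers token 2.
lm = lambda prefix: {0: math.log(0.5), 1: math.log(0.3), 2: math.log(0.2)}
attr = lambda seq: 0.0 if seq[-1] == 2 else -2.0
print(modified_next_token(lm, attr, prefix=[], weight=1.0))
```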
Chain of Thought and Intermediate Steps (Sep 16)
Content:
- Chain of thought / scratchpad, intermediate steps
- Why does chain of thought work?
- Self-consistency and variants
Reading Material
Core Papers:
- Reference: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (arXiv)
- Reference: Large Language Models are Zero-Shot Reasoners (arXiv)
- Reference: Self-Consistency Improves Chain of Thought Reasoning in Language Models (arXiv)
Additional Research:
- Reference: Adaptive Computation Time for Recurrent Neural Networks (arXiv)
- Reference: PonderNet: Learning to Ponder (arXiv)
- Reference: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models (arXiv)
- Reference: Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws (arXiv)
- Reference: Adaptive-Consistency: A Cost-Efficient, Model-Agnostic Technique (arXiv)
- Reference: To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning (arXiv)
- Reference: Language Models Don’t Always Say What They Think (arXiv)
- Reference: Complexity-Based Prompting for Multi-step Reasoning (arXiv)
- Reference: Multimodal Chain-of-Thought Reasoning in Language Models (arXiv)
Paper Presentations
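Self-consistency reduces to a simple control flow: sample several chains of thought, extract each final answer, and take the majority vote. A minimal sketch under that reading, with a hypothetical `generate_cot` stand-in for the sampled model call:

```python
import random
from collections import Counter

def self_consistency(generate_cot, extract_answer, question, n_samples=10):
    """Sample n chains of thought and majority-vote over the extracted answers."""
    answers = []
    for _ in range(n_samples):
        chain = generate_cot(question)        # hypothetical sampled CoT string
        answers.append(extract_answer(chain)) # e.g. parse the text after "Answer:"
    return Counter(answers).most_common(1)[0][0]

# Stand-in "model": a noisy solver that is right about 70% of the time.
def fake_cot(question):
    return "... Answer: 42" if random.random() < 0.7 else "... Answer: 41"

extract = lambda chain: chain.split("Answer:")[-1].strip()
print(self_consistency(fake_cot, extract, "What is 6 * 7?"))
```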
Self-Refine and Self-Correction Methods (Sep 18)
Content:
- Self-refine and iterative refinement with self-feedback
- Learning to self-debug for code generation
- Reflexion: verbal reinforcement learning for agents
- Limitations and challenges of self-correction
- Tool-interactive critiquing and external feedback
Code: TBA
Reading Material
- Primary: Self-Refine: Iterative Refinement with Self-Feedback (arXiv)
- Primary: Teaching Large Language Models to Self-Debug (arXiv)
- Reference: Reflexion: Language Agents with Verbal Reinforcement Learning (arXiv)
- Reference: Large Language Models Cannot Self-Correct Reasoning Yet (arXiv)
- Reference: CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing (arXiv)
- Reference: SCoRe: Training Language Models to Self-Correct via Reinforcement Learning (arXiv)
Student Paper Presentations
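The iterative refinement loop in Self-Refine can be summarized as draft, critique, revise, repeat until the critique is satisfied. A minimal sketch of that control flow; `generate`, `feedback`, and `refine` are hypothetical stand-ins for prompted calls to the same model, and the stopping heuristic is a simplification.

```python
def self_refine(generate, feedback, refine, task, max_iters=3):
    """Draft, critique, and revise until the feedback reports no issues
    or the iteration budget is exhausted."""
    draft = generate(task)
    for _ in range(max_iters):
        critique = feedback(task, draft)
        if "no issues" in critique.lower():   # simple stopping heuristic
            break
        draft = refine(task, draft, critique)
    return draft

# Toy stand-ins so the control flow runs end to end.
generate = lambda task: "def add(a, b): return a - b"
feedback = lambda task, d: "no issues" if "a + b" in d else "uses subtraction instead of addition"
refine   = lambda task, d, c: d.replace("a - b", "a + b")
print(self_refine(generate, feedback, refine, "write add(a, b)"))
```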
Reasoning Models (Sep 23)
Content:
- What is a reasoning model?
- Training reasoning models with reinforcement learning
- STaR: Self-Taught Reasoner
- DeepSeek R1 and GRPO
- Understanding long chain-of-thought reasoning
- Reasoning transfer across domains
- Advanced reasoning algorithms (S1, L1, Stream of Search, LAPS)
Reading Material
- Reference: STaR: Bootstrapping Reasoning With Reasoning (arXiv)
- Reference: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (arXiv)
- Reference: Demystifying Long Chain-of-Thought Reasoning: An Empirical Study (arXiv)
- Reference: SimpleRL-Zoo: Evaluating Reinforcement Learning on Simple Reasoning Tasks (arXiv)
- Reference: Does learning math help language models reason better? (arXiv)
- Reference: s1: Simple Test-Time Scaling (arXiv)
- Reference: L1: Controlling How Long a Reasoning Model Thinks with Reinforcement Learning (arXiv)
- Reference: Stream of Search (SoS): Learning to Search in Language (arXiv)
- Reference: Learning Adaptive Parallel Search for Reasoning (arXiv)
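One concrete piece of the GRPO recipe used for DeepSeek-R1 is its group-relative advantage: rewards for a group of sampled responses to the same prompt are standardized within the group, replacing a learned value function. A minimal sketch of just that normalization (the surrounding policy-gradient machinery is omitted):

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within a group of responses to the same prompt:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 sampled responses, only the last two judged correct.
print(group_relative_advantages([0.0, 0.0, 1.0, 1.0]))  # negative for wrong, positive for right
```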
Incorporating Tools (Sep 25)
Content:
- What are tools? Definition and taxonomy
- Basic tool use paradigm
- Key approaches: PAL, Toolformer, Gorilla, WebGPT
- Tool creation: TroVE and Large Language Models as Tool Makers
- Tool robustness: Benchmarking failures in tool-augmented language models
- Standardized function calling (JSON Schema)
- Parallel function calling
- Model Context Protocol (MCP) and MCP registries
- FastMCP framework for rapid MCP development
- Sandboxed code execution for secure tool use
- Tool use scenarios and trade-offs
- Evaluation challenges and best practices
Reading Material
Main Survey:
- Wang et al., “What Are Tools Anyway? A Survey from the Language Model Perspective” (2024)
Key Papers:
- Gao et al., “PAL: Program-aided Language Models” (2022)
- Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools” (2023)
- Patil et al., “Gorilla: Large Language Model Connected with Massive APIs” (2023)
- Nakano et al., “WebGPT: Browser-assisted question-answering with human feedback” (2021)
- Cai et al., “Large Language Models as Tool Makers” (2023)
- Wang et al., “TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks” (2024)
- Treviño et al., “Benchmarking Failures in Tool-Augmented Language Models” (2025)
Practical Resources:
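Standardized function calling usually means the model emits a JSON call against a tool described in JSON-Schema style, and the runtime parses and dispatches it. The sketch below is illustrative only: the tool name, schema layout, and dispatcher are hypothetical and not tied to any specific provider's API or to MCP.

```python
import json

# A tool described in the JSON-Schema style used by common function-calling APIs
# (names here are illustrative, not a specific provider's format).
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Look up the current temperature for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> str:
    return f"{city}: 18°C"        # stub in place of a real API call

REGISTRY = {"get_weather": get_weather}

def dispatch(model_output: str) -> str:
    """Parse a model's JSON function call and execute the named tool."""
    call = json.loads(model_output)
    fn = REGISTRY[call["name"]]
    return fn(**call["arguments"])

# The model would emit something like this when it decides to call the tool.
print(dispatch('{"name": "get_weather", "arguments": {"city": "Pittsburgh"}}'))
```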
Agents and Multi-Agent Communication (Sep 30)
Content:
- Basic agent concepts and definitions
- Agent architectures and environments
- Efficiency optimizations (context management, caching)
- Safety challenges and solutions
- Multi-agent systems
Code: TBA
Reading Material
Basic Concepts and Foundations:
- Reference: ReAct: Synergizing Reasoning and Acting in Language Models (arXiv)
- Reference: Executable Code Actions Elicit Better LLM Agents (arXiv)
Agent Architectures and Environments:
- Reference: WebArena: A Realistic Web Environment for Building Autonomous Agents (arXiv)
- Reference: VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (arXiv)
Efficiency Optimizations:
- Reference: OpenHands Context Condensation for More Efficient AI Agents (All Hands AI)
- Reference: Anthropic Prompt Caching (Anthropic)
- Reference: Effectively use prompt caching on Amazon Bedrock (AWS)
Evaluation and Benchmarks:
- Reference: SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (arXiv)
- Reference: GAIA: a benchmark for General AI Assistants (arXiv)
- Reference: Training Software Engineering Agents and Verifiers with SWE-Gym (arXiv)
Multi-agent Systems:
- Reference: AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (arXiv)
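The ReAct pattern from the readings alternates a reasoning step, a tool call, and an observation until the model commits to an answer. A minimal control-flow sketch with hypothetical stand-ins for the model and tools (a real agent would parse model output far more carefully and sandbox tool execution):

```python
def react_loop(llm, tools, task, max_steps=5):
    """Skeleton of a ReAct-style agent: the model alternates "Thought" and
    "Action" steps, and tool observations are appended back to the context
    until it emits a final answer."""
    context = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(context)                      # hypothetical model call
        context += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if step.startswith("Action:"):
            name, _, arg = step.removeprefix("Action:").strip().partition(" ")
            observation = tools[name](arg)       # execute the named tool
            context += f"Observation: {observation}\n"
    return None

# Toy stand-ins so the loop runs: one calculator tool and a scripted "model".
tools = {"calc": lambda expr: str(eval(expr))}   # toy only; never eval untrusted input
script = iter(["Thought: I should compute this.",
               "Action: calc 6*7",
               "Final Answer: 42"])
print(react_loop(lambda ctx: next(script), tools, "What is 6 * 7?"))
```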
Reward Models and Best-of-N (Oct 2)
Content:
- Reward models, best-of-n theory and practice
- Monte Carlo Tree Search
Slides: Google slides
Code: n/a
Reading Material
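Best-of-n is the simplest reward-model-guided meta-generation strategy: draw n candidates from the policy and keep the one the reward model scores highest. A minimal sketch with hypothetical `generate` and `reward_model` stand-ins (MCTS, also covered in this session, is not shown):

```python
import random

def best_of_n(generate, reward_model, prompt, n=8):
    """Sample n candidates and return the one with the highest reward."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(prompt, c))

# Toy stand-ins: random-length answers, a "reward model" that prefers brevity.
generate = lambda p: "word " * random.randint(1, 10)
reward_model = lambda p, c: -len(c.split())
print(best_of_n(generate, reward_model, "Explain beam search briefly."))
```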
Systems not Models (Oct 7)
Content:
- Parallels to older “pipeline NLP”
- Visualizing and evaluating systems
- DSPy and system-level design
Slides: PDF
Reading Material (all optional)
NLP multi-step pipelines and agents:
- Modular Approach to Error Analysis and Evaluation for Multilingual Question Answering ([LREC’06](https://aclanthology.org/L06-1489/))
- Multi-hop Reading Comprehension through Question Decomposition and Rescoring ([ACL’19](https://aclanthology.org/P19-1613/))
- Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval ([NeurIPS’21](https://proceedings.neurips.cc/paper/2021/hash/e8b1cbd05f6e6a358a81dee52493dd06-Abstract.html))
- STORM: Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models ([NAACL’24](https://aclanthology.org/2024.naacl-long.347/))
- ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning ([NeurIPS’25](https://arxiv.org/abs/2503.19470))
- Self-Steering Language Models ([COLM’25](https://openreview.net/forum?id=XvCBtm5PgF))
Abstractions & Learning:
- Structured Programming with go to Statements ([1974](https://dl.acm.org/doi/10.1145/356635.356640))
- Neural Module Networks ([CVPR’16](https://openaccess.thecvf.com/content_cvpr_2016/html/Andreas_Neural_Module_Networks_CVPR_2016_paper.html))
- The Bitter Lesson ([2019](http://www.incompleteideas.net/IncIdeas/BitterLesson.html))
- Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP ([2022](https://arxiv.org/abs/2212.14024))
- Prompting Is Programming: A Query Language for Large Language Models ([PLDI’23](https://dl.acm.org/doi/abs/10.1145/3591300))
- DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines ([ICLR’24](https://openreview.net/forum?id=sY5N0zY5Od))
- LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data ([2024](https://arxiv.org/abs/2407.11418))
- Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together ([EMNLP’24](https://aclanthology.org/2024.emnlp-main.597/))
- TextGrad: Automatic “Differentiation” via Text ([Nature’25](https://arxiv.org/abs/2406.07496))
- GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning ([2025](https://arxiv.org/abs/2507.19457))
Minimum Bayes Risk and Multi-Sample Strategies (Oct 9)
Content:
- Minimum Bayes Risk
- Efficient MBR variants
- Post-ensemble
- Self-consistency and variants
Slides: Google slides
Code: TBA
Reading Material
- Reference: It’s MBR All the Way Down: Modern Generation Techniques Through the Lens of Minimum Bayes Risk (arXiv)
- Reference: Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation (arXiv)
- Reference: High Quality Rather than High Model Probability: Minimum Bayes Risk Decoding with Neural Metrics (arXiv)
- Reference: Faster Minimum Bayes Risk Decoding with Confidence-based Pruning (arXiv)
- Reference: Sampling-Based Approximations to Minimum Bayes Risk Decoding for Neural Machine Translation (arXiv)
- Reference: Efficient Minimum Bayes Risk Decoding using Low-Rank Matrix Completion Algorithms (arXiv)
- Reference: Frustratingly Easy Model Ensemble for Abstractive Summarization (ACL Anthology)
- Reference: Self-Consistency Improves Chain of Thought Reasoning in Language Models (arXiv)
- Reference: Universal Self-Consistency for Large Language Model Generation (arXiv)
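Sampling-based MBR treats the same set of samples as both hypotheses and pseudo-references, and picks the hypothesis with the highest expected utility against the rest. A minimal sketch with a toy unigram-overlap utility (a real system would use BLEU, COMET, or another metric from the readings):

```python
def mbr_decode(candidates, utility):
    """Monte Carlo MBR: return the candidate with the highest average
    utility against all other sampled candidates."""
    best, best_score = None, float("-inf")
    for i, h in enumerate(candidates):
        others = [r for j, r in enumerate(candidates) if j != i]
        score = sum(utility(h, r) for r in others) / len(others)
        if score > best_score:
            best, best_score = h, score
    return best

def overlap_f1(h, r):
    """Toy utility: unigram F1 overlap between hypothesis and reference."""
    hs, rs = set(h.split()), set(r.split())
    if not hs or not rs:
        return 0.0
    p = len(hs & rs) / len(hs)
    rec = len(hs & rs) / len(rs)
    return 0.0 if p + rec == 0 else 2 * p * rec / (p + rec)

samples = ["the cat sat", "the cat sat down", "a dog ran", "the cat is sitting"]
print(mbr_decode(samples, overlap_f1))
```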
No Class - Fall Break (Oct 14)
No Class - Fall Break (Oct 16)
Inference Scaling vs Model Size (Oct 21)
Content:
- Inference scaling versus scaling model size
- Differences in cost and latency considerations
- Modeling scaling behavior
Slides: TBA
Code: TBA
Reading Material
- Reference: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (arXiv)
Token Budgets and Training-Time Distillation (Oct 23)
Content:
- Token budgets
- Training-time distillation of inference algorithms
- Draft CoT
- Early exit voting
Slides: TBA
Code: TBA
Reading Material
- Reference: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (arXiv)
- Reference: Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models (arXiv)
- Reference: Direct Preference Optimization for Neural Machine Translation with Minimum Bayes Risk Decoding (arXiv)
- Reference: MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods (arXiv)
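Early-exit voting saves token budget by stopping once further samples are unlikely to change the majority answer. The sketch below uses a simple vote-margin heuristic in the spirit of Adaptive-Consistency, not that paper's exact statistical stopping rule; `sample_answer` is a hypothetical stand-in for a sampled model call.

```python
import random
from collections import Counter

def early_exit_vote(sample_answer, prompt, max_samples=16, margin=3):
    """Self-consistency under a token budget: stop sampling once the leading
    answer is `margin` votes ahead of the runner-up."""
    counts = Counter()
    for i in range(max_samples):
        counts[sample_answer(prompt)] += 1
        top = counts.most_common(2)
        lead = top[0][1] - (top[1][1] if len(top) > 1 else 0)
        if lead >= margin:
            break
    return counts.most_common(1)[0][0], i + 1   # answer and samples actually spent

# Toy stand-in sampler: right about 80% of the time.
sampler = lambda p: "42" if random.random() < 0.8 else "41"
print(early_exit_vote(sampler, "What is 6 * 7?"))
```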
Diffusion Models (Oct 28)
Content:
- Introduction to diffusion models
- Denoising diffusion probabilistic models (DDPM)
- Score-based generative models
- Diffusion models for text generation
- Comparison with autoregressive models
- Inference techniques for diffusion models
- Applications in multimodal generation
Slides: TBA
Code: TBA
Reading Material
Defining Efficiency (Oct 30)
Content:
- How do we define efficiency?
- Different places where a method can be efficient (e.g. memory, latency, token cost for APIs)
- Brief review of hardware for inference
Slides: TBA
Code: TBA
Reading Material
No Class - Democracy Day (Nov 4)
Inference and Hardware (Nov 6)
Content:
- Overview of hardware relevant to LLM inference (GPUs, TPUs, accelerators)
- Memory bandwidth, compute, and latency considerations
- Parallelism strategies and deployment tradeoffs
Slides: TBA
Code: TBA
Reading Material
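A useful back-of-envelope calculation for this session: single-stream decoding is typically memory-bandwidth bound, because the model weights must be streamed through memory for every generated token. The numbers below are illustrative assumptions, not measurements of any particular GPU or model.

```python
# Back-of-envelope decode throughput for a memory-bandwidth-bound setting.
# All numbers are illustrative assumptions, not measurements.
params = 7e9                 # 7B-parameter model
bytes_per_param = 2          # fp16/bf16 weights
hbm_bandwidth = 2.0e12       # 2 TB/s of memory bandwidth (hypothetical accelerator)

bytes_per_token = params * bytes_per_param   # weights re-read for every decoded token
tokens_per_second = hbm_bandwidth / bytes_per_token
print(f"~{tokens_per_second:.0f} tokens/s upper bound at batch size 1")
# Larger batches amortize the same weight traffic over more tokens, which is
# why throughput-oriented serving increases batch size until compute binds.
```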
Library Implementations and Optimizations (Nov 11)
Content:
- Library implementations
- Lazy softmax
- Flash attention
- How do vLLM/SGLang/similar speed up generation?
Slides: TBA
Code: TBA
Reading Material
- Reference: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (arXiv)
- Reference: Self-attention Does Not Need O(n²) Memory (arXiv)
Assignments
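The "lazy" online softmax underlying FlashAttention and the O(n²)-free attention paper computes softmax-weighted sums in one pass, carrying a running max and normalizer so the full score vector is never materialized. A minimal numpy sketch for a single query row, checked against the naive computation (toy sizes, no tiling across queries or kernels):

```python
import numpy as np

def online_softmax_attention(q, K, V, block=4):
    """One query row of attention computed blockwise with a running max `m`
    and normalizer `l`, rescaling partial sums whenever the max changes."""
    m, l = -np.inf, 0.0
    acc = np.zeros(V.shape[1])
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q              # scores for this key block
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)              # rescale previous partial sums
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ V[start:start + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
s = K @ q
reference = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(np.allclose(online_softmax_attention(q, K, V), reference))  # True
```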
Prefix Sharing and KV Cache Optimizations (Nov 13)
Content:
- Prefix sharing
- KV cache reuse
- Key-value cache compression
- Model compression
- Brief quantization overview
Slides: TBA
Code: TBA
Reading Material
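The core idea of KV caching and prefix sharing: keys and values for already-processed tokens are stored once, so each new token attends over cached entries, and continuations that share a prompt can start from the same cache instead of re-encoding the prefix. A toy single-head numpy sketch (real systems share cache blocks by reference, e.g. paged attention, rather than deep-copying):

```python
import copy
import numpy as np

class KVCache:
    """Toy single-head KV cache: keys/values for processed tokens are stored
    so a new query only needs one pass over the cached entries."""
    def __init__(self, d):
        self.K = np.zeros((0, d))
        self.V = np.zeros((0, d))

    def append(self, k, v):
        self.K = np.vstack([self.K, k[None]])
        self.V = np.vstack([self.V, v[None]])

    def attend(self, q):
        s = self.K @ q
        w = np.exp(s - s.max())
        w /= w.sum()
        return w @ self.V

rng = np.random.default_rng(0)
d = 8
prefix_cache = KVCache(d)
for _ in range(5):                      # encode a shared 5-token prompt once
    prefix_cache.append(rng.normal(size=d), rng.normal(size=d))

# Prefix sharing: both continuations start from the same cached prompt,
# so its keys/values are never recomputed.
branch_a, branch_b = copy.deepcopy(prefix_cache), copy.deepcopy(prefix_cache)
branch_a.append(rng.normal(size=d), rng.normal(size=d))
print(branch_a.attend(rng.normal(size=d)).shape)   # (8,)
```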
Draft Models and Speculative Decoding (Nov 18)
Content:
- Draft models
- Speculative decoding
- Other latency improving methods
Slides: TBA
Code: TBA
Reading Material
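The verification step at the heart of speculative decoding: each token proposed by the draft model is accepted with probability min(1, p(x)/q(x)) under the target and draft distributions, and on the first rejection a replacement is resampled from the residual max(0, p − q), which preserves the target model's output distribution. A toy sketch over explicit distributions (the extra bonus token sampled after a fully accepted block is omitted):

```python
import random

def speculative_step(p_target, q_draft, draft_tokens, vocab):
    """Verify a block of draft tokens against the target model's distributions,
    accepting with probability min(1, p/q) and resampling from max(0, p - q)
    at the first rejection."""
    accepted = []
    for i, x in enumerate(draft_tokens):
        p, q = p_target[i], q_draft[i]
        if random.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
        else:
            residual = {t: max(0.0, p[t] - q[t]) for t in vocab}
            z = sum(residual.values())
            r, acc = random.random() * z, 0.0
            for t, w in residual.items():
                acc += w
                if acc >= r:
                    accepted.append(t)
                    break
            break
    return accepted

# Toy distributions over a 3-token vocabulary; the draft proposes [0, 0].
vocab = [0, 1, 2]
p_target = [{0: 0.6, 1: 0.3, 2: 0.1}, {0: 0.2, 1: 0.5, 2: 0.3}]
q_draft  = [{0: 0.5, 1: 0.4, 2: 0.1}, {0: 0.3, 1: 0.4, 2: 0.3}]
print(speculative_step(p_target, q_draft, [0, 0], vocab))
```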
Linearizing Attention and Sparse Models (Nov 20)
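As a preview of what "linearizing attention" means: replacing softmax(QKᵀ)V with a feature map φ lets the product be regrouped as φ(Q)(φ(K)ᵀV), so cost is linear rather than quadratic in sequence length. A minimal numpy sketch using an elu(x)+1 feature map in the style of "Transformers are RNNs" (toy sizes, no masking):

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1, a common positive feature map for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Compute (phi(Q) phi(K)^T) V without forming the n x n attention matrix:
    associativity lets us build the d x d summary phi(K)^T V once."""
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = Kf.T @ V                      # (d, d_v) summary of keys and values
    z = Kf.sum(axis=0)                 # per-feature normalizer
    return (Qf @ kv) / (Qf @ z)[:, None]

rng = np.random.default_rng(0)
n, d = 16, 8
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
print(linear_attention(Q, K, V).shape)   # (16, 8): same shape as softmax attention output
```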
Transformer Alternatives (Nov 25)
Content:
- Transformer alternatives
Slides: TBA
Code: TBA
Reading Material
- Reference: The Annotated S4