Schedule
Introduction to Language Models and Inference (Aug 26)
Content:
- What is a language model?
- What is an inference algorithm?
- What will we not cover?
- What are transformers?
- How do modern LMs work?
- Modeling errors and search errors
- Prompting as a means of model control
- Instruction following behavior
Code: Code
Reading Material
- Reference: Sections 1+2 from From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models (arXiv)
Probability Review (Aug 28)
Content:
- Probability review
- Transformer implementation
- Generation and evaluation
- Meta-generation
Code: Code
Reading Material
None
Common Sampling Methods for Modern NLP (Sep 2)
Content:
- Common sampling methods for modern NLP
- Diversity-quality tradeoffs
Slides: Google slides
Code: n/a
Reading Material
- Reference: A Thorough Examination of Decoding Methods in the Era of LLMs
- Reference: Trading Off Diversity and Quality in Natural Language Generation
- Optional: Calibration of Pre-trained Transformers
- Optional: Locally Typical Sampling
- Optional: Forking Paths in Neural Text Generation
- Optional: Truncation Sampling as Language Model Desmoothing
- Optional: Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity
- Optional: The Curious Case of Neural Text Degeneration
- Optional: Calibrated Language Models Must Hallucinate
- Optional: An Empirical Investigation of Global and Local Normalization for Recurrent Neural Sequence Models Using a Continuous Relaxation to Beam Search
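The sampling methods covered in this session are simple transformations of the next-token distribution. Below is a minimal sketch, assuming a toy logits vector rather than a real LM, of temperature scaling plus top-k and top-p (nucleus) truncation; all names and numbers are illustrative.

```python
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample a token index after temperature scaling and optional
    top-k / top-p (nucleus) truncation of the next-token distribution."""
    # Temperature scaling: divide logits before the softmax.
    probs = softmax([x / temperature for x in logits])
    # Sort token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cum = [], 0.0
        for i in order:               # keep the smallest prefix covering mass >= top_p
            kept.append(i)
            cum += probs[i]
            if cum >= top_p:
                break
        order = kept
    # Renormalize over the surviving tokens and sample.
    mass = sum(probs[i] for i in order)
    r, acc = random.random() * mass, 0.0
    for i in order:
        acc += probs[i]
        if acc >= r:
            return i
    return order[-1]

# Toy next-token logits over a 6-token vocabulary (illustrative only).
logits = [2.0, 1.5, 0.3, 0.1, -1.0, -2.0]
print(sample(logits, temperature=0.7, top_p=0.9))
```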
Beam Search and Variants (Sep 4)
Content:
- Beam search and variants
- Inadequacies of the mode
Slides: Google slides
Code: TBA
Reading Material
- Reference: Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
- Reference: Stochastic Beams and Where to Find Them: The Gumbel-Top-k Trick for Sampling Sequences Without Replacement
- Optional: Breaking the Beam Search Curse: A Study of (Re-)Scoring Methods and Stopping Criteria for Neural Machine Translation
- Optional: If beam search is the answer, what was the question?
- Optional: Gumbel-max trick and weighted reservoir sampling
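As a companion to this session, a minimal sketch of plain beam search over a toy next-token model; `next_logprobs` is a hypothetical stand-in for an LM that returns a log-probability per token, and the scoring omits length normalization and the other variants discussed in the readings.

```python
import math

def beam_search(next_logprobs, beam_size=3, max_len=5, eos=0):
    """Minimal beam search: keep the top-scoring partial hypotheses each step.
    `next_logprobs(prefix)` returns {token: log-prob} for the next token."""
    beams = [([], 0.0)]          # (token prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in next_logprobs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        # Keep the `beam_size` best continuations by cumulative score.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])

# Toy "model": prefers token 1 and becomes more likely to emit eos (0) as length grows.
def toy_model(prefix):
    p_eos = min(0.1 + 0.2 * len(prefix), 0.9)
    rest = (1 - p_eos) / 2
    return {0: math.log(p_eos), 1: math.log(rest), 2: math.log(rest)}

print(beam_search(toy_model, beam_size=2))
```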
Intro to A* and Best First Search (Sep 9)
Content:
- Introduction to A* and best first search
- A* methods for controlled generation
Reading Material
- Reference: Efficient Lattice Rescoring Using Recurrent Neural Network Language Models (PDF)
- Reference: Modeling Future Cost for Neural Machine Translation (arXiv)
- Reference: Best-First Beam Search (arXiv)
- Reference: NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics (arXiv)
Assignments
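To make the best-first framing concrete, here is a small sketch of A*-style decoding over partial sequences: hypotheses are ranked by f = g + h, where g is accumulated log-probability and h is a future-score heuristic. The toy model and the zero heuristic are illustrative assumptions, not any particular paper's method.

```python
import heapq, math

def a_star_decode(next_logprobs, heuristic, eos=0, max_expansions=1000):
    """Best-first search over partial sequences, ranked by f = g + h:
    g is the accumulated log-probability, h an estimate of the best
    achievable future log-probability."""
    # heapq is a min-heap, so we push the negated priority.
    frontier = [(-heuristic([]), 0.0, [])]
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_f, g, prefix = heapq.heappop(frontier)
        if prefix and prefix[-1] == eos:
            return prefix, g                      # first completed hypothesis popped
        for tok, lp in next_logprobs(prefix).items():
            new_prefix, new_g = prefix + [tok], g + lp
            f = new_g + heuristic(new_prefix)
            heapq.heappush(frontier, (-f, new_g, new_prefix))
    return None

# Toy model; a zero heuristic reduces this to plain best-first search.
def toy_model(prefix):
    p_eos = min(0.2 + 0.2 * len(prefix), 0.9)
    rest = (1 - p_eos) / 2
    return {0: math.log(p_eos), 1: math.log(rest), 2: math.log(rest)}

print(a_star_decode(toy_model, heuristic=lambda prefix: 0.0))
```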
Other Controlled Generation Methods (Sep 11)
Content:
- Other controlled generation methods
- Decoding-time distributional modifiers
Slides: Google Slides
Code: TBA
Reading Material
- Reference: Llama.cpp README on formal-grammar-based constraints
- Reference: FUDGE: Controlled Text Generation With Future Discriminators
- Reference: Controlled Decoding from Language Models
- Optional: Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
- Optional: Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning
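Decoding-time distributional modifiers of the FUDGE flavor reweight each candidate next token by an estimate that the continuation will satisfy a target attribute. The sketch below shows only that reweighting; `lm_logprobs` and `attribute_logprob` are hypothetical placeholders, not an actual FUDGE implementation.

```python
import math, random

def modified_next_token(lm_logprobs, attribute_logprob, prefix, weight=1.0):
    """Combine LM next-token log-probs with an attribute scorer:
    score(x) = log p_LM(x | prefix) + weight * log p(attribute | prefix + x),
    then renormalize and sample."""
    scores = {}
    for tok, lp in lm_logprobs(prefix).items():
        scores[tok] = lp + weight * attribute_logprob(prefix + [tok])
    # Softmax-normalize the combined scores and sample a token.
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    r, acc = random.random(), 0.0
    for tok, s in scores.items():
        acc += math.exp(s - m) / z
        if acc >= r:
            return tok
    return tok

# Hypothetical placeholders: a 3-token LM and a scorer that prefers token 2.
lm = lambda prefix: {0: math.log(0.5), 1: math.log(0.3), 2: math.log(0.2)}
attr = lambda seq: 0.0 if seq[-1] == 2 else -2.0
print(modified_next_token(lm, attr, prefix=[], weight=1.0))
```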
Chain of Thought and Intermediate Steps (Sep 16)
Content:
- Chain of thought / scratchpad, intermediate steps
- Why does chain of thought work?
- Self-consistency and variants
Reading Material
Core Papers:
- Reference: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (arXiv)
- Reference: Large Language Models are Zero-Shot Reasoners (arXiv)
- Reference: Self-Consistency Improves Chain of Thought Reasoning in Language Models (arXiv)
Additional Research:
- Reference: Adaptive Computation Time for Recurrent Neural Networks (arXiv)
- Reference: PonderNet: Learning to Ponder (arXiv)
- Reference: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models (arXiv)
- Reference: Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws (arXiv)
- Reference: Adaptive-Consistency: A Cost-Efficient, Model-Agnostic Technique (arXiv)
- Reference: To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning (arXiv)
- Reference: Language Models Don’t Always Say What They Think (arXiv)
- Reference: Complexity-Based Prompting for Multi-step Reasoning (arXiv)
- Reference: Multimodal Chain-of-Thought Reasoning in Language Models (arXiv)
Paper Presentations
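Self-consistency reduces to a simple control flow: sample several chains of thought, extract each final answer, and take the majority vote. A minimal sketch under that reading, with a hypothetical `generate_cot` stand-in for the sampled model call:

```python
import random
from collections import Counter

def self_consistency(generate_cot, extract_answer, question, n_samples=10):
    """Sample n chains of thought and majority-vote over the extracted answers."""
    answers = []
    for _ in range(n_samples):
        chain = generate_cot(question)        # hypothetical sampled CoT string
        answers.append(extract_answer(chain)) # e.g. parse the text after "Answer:"
    return Counter(answers).most_common(1)[0][0]

# Stand-in "model": a noisy solver that is right about 70% of the time.
def fake_cot(question):
    return "... Answer: 42" if random.random() < 0.7 else "... Answer: 41"

extract = lambda chain: chain.split("Answer:")[-1].strip()
print(self_consistency(fake_cot, extract, "What is 6 * 7?"))
```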
Self-Refine and Self-Correction Methods (Sep 18)
Content:
- Self-refine and iterative refinement with self-feedback
- Learning to self-debug for code generation
- Reflexion: verbal reinforcement learning for agents
- Limitations and challenges of self-correction
- Tool-interactive critiquing and external feedback
Code: TBA
Reading Material
- Primary: Self-Refine: Iterative Refinement with Self-Feedback (arXiv)
- Primary: Teaching Large Language Models to Self-Debug (arXiv)
- Reference: Reflexion: Language Agents with Verbal Reinforcement Learning (arXiv)
- Reference: Large Language Models Cannot Self-Correct Reasoning Yet (arXiv)
- Reference: CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing (arXiv)
- Reference: SCoRe: Training Language Models to Self-Correct via Reinforcement Learning (arXiv)
Student Paper Presentations
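The iterative refinement loop in Self-Refine can be summarized as draft, critique, revise, repeat until the critique is satisfied. A minimal sketch of that control flow; `generate`, `feedback`, and `refine` are hypothetical stand-ins for prompted calls to the same model, and the stopping heuristic is a simplification.

```python
def self_refine(generate, feedback, refine, task, max_iters=3):
    """Draft, critique, and revise until the feedback reports no issues
    or the iteration budget is exhausted."""
    draft = generate(task)
    for _ in range(max_iters):
        critique = feedback(task, draft)
        if "no issues" in critique.lower():   # simple stopping heuristic
            break
        draft = refine(task, draft, critique)
    return draft

# Toy stand-ins so the control flow runs end to end.
generate = lambda task: "def add(a, b): return a - b"
feedback = lambda task, d: "no issues" if "a + b" in d else "uses subtraction instead of addition"
refine   = lambda task, d, c: d.replace("a - b", "a + b")
print(self_refine(generate, feedback, refine, "write add(a, b)"))
```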
Reasoning Models (Sep 23)
Content:
- What is a reasoning model?
- Training reasoning models with reinforcement learning
- STaR: Self-Taught Reasoner
- DeepSeek R1 and GRPO
- Understanding long chain-of-thought reasoning
- Reasoning transfer across domains
- Advanced reasoning algorithms (S1, L1, Stream of Search, LAPS)
Reading Material
- Reference: STaR: Bootstrapping Reasoning With Reasoning (arXiv)
- Reference: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (arXiv)
- Reference: Demystifying Long Chain-of-Thought Reasoning: An Empirical Study (arXiv)
- Reference: SimpleRL-Zoo: Evaluating Reinforcement Learning on Simple Reasoning Tasks (arXiv)
- Reference: Does learning math help language models reason better? (arXiv)
- Reference: s1: Simple Test-Time Scaling (arXiv)
- Reference: L1: Controlling How Long a Reasoning Model Thinks with Reinforcement Learning (arXiv)
- Reference: Stream of Search (SoS): Learning to Search in Language (arXiv)
- Reference: Learning Adaptive Parallel Search for Reasoning (arXiv)
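One concrete piece of the GRPO recipe used for DeepSeek-R1 is its group-relative advantage: rewards for a group of sampled responses to the same prompt are standardized within the group, replacing a learned value function. A minimal sketch of just that normalization (the surrounding policy-gradient machinery is omitted):

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within a group of responses to the same prompt:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 sampled responses, only the last two judged correct.
print(group_relative_advantages([0.0, 0.0, 1.0, 1.0]))  # negative for wrong, positive for right
```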
Incorporating Tools (Sep 25)
Content:
- What are tools? Definition and taxonomy
- Basic tool use paradigm
- Key approaches: PAL, Toolformer, Gorilla, WebGPT
- Tool creation: TroVE and Large Language Models as Tool Makers
- Tool robustness: Benchmarking failures in tool-augmented language models
- Standardized function calling (JSON Schema)
- Parallel function calling
- Model Context Protocol (MCP) and MCP registries
- FastMCP framework for rapid MCP development
- Sandboxed code execution for secure tool use
- Tool use scenarios and trade-offs
- Evaluation challenges and best practices
Reading Material
Main Survey:
- Wang et al., “What Are Tools Anyway? A Survey from the Language Model Perspective” (2024)
Key Papers:
- Gao et al., “PAL: Program-aided Language Models” (2022)
- Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools” (2023)
- Patil et al., “Gorilla: Large Language Model Connected with Massive APIs” (2023)
- Nakano et al., “WebGPT: Browser-assisted question-answering with human feedback” (2021)
- Cai et al., “Large Language Models as Tool Makers” (2023)
- Wang et al., “TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks” (2024)
- Treviño et al., “Benchmarking Failures in Tool-Augmented Language Models” (2025)
Practical Resources:
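Standardized function calling usually means the model emits a JSON call against a tool described in JSON-Schema style, and the runtime parses and dispatches it. The sketch below is illustrative only: the tool name, schema layout, and dispatcher are hypothetical and not tied to any specific provider's API or to MCP.

```python
import json

# A tool described in the JSON-Schema style used by common function-calling APIs
# (names here are illustrative, not a specific provider's format).
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Look up the current temperature for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> str:
    return f"{city}: 18°C"        # stub in place of a real API call

REGISTRY = {"get_weather": get_weather}

def dispatch(model_output: str) -> str:
    """Parse a model's JSON function call and execute the named tool."""
    call = json.loads(model_output)
    fn = REGISTRY[call["name"]]
    return fn(**call["arguments"])

# The model would emit something like this when it decides to call the tool.
print(dispatch('{"name": "get_weather", "arguments": {"city": "Pittsburgh"}}'))
```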
Agents and Multi-Agent Communication (Sep 30)
Content:
- Basic agent concepts and definitions
- Agent architectures and environments
- Efficiency optimizations (context management, caching)
- Safety challenges and solutions
- Multi-agent systems
Code: TBA
Reading Material
Basic Concepts and Foundations:
- Reference: ReAct: Synergizing Reasoning and Acting in Language Models (arXiv)
- Reference: Executable Code Actions Elicit Better LLM Agents (arXiv)
Agent Architectures and Environments:
- Reference: WebArena: A Realistic Web Environment for Building Autonomous Agents (arXiv)
- Reference: VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (arXiv)
Efficiency Optimizations:
- Reference: OpenHands Context Condensation for More Efficient AI Agents (All Hands AI)
- Reference: Anthropic Prompt Caching (Anthropic)
- Reference: Effectively use prompt caching on Amazon Bedrock (AWS)
Evaluation and Benchmarks:
- Reference: SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (arXiv)
- Reference: GAIA: a benchmark for General AI Assistants (arXiv)
- Reference: Training Software Engineering Agents and Verifiers with SWE-Gym (arXiv)
Multi-agent Systems:
- Reference: AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (arXiv)
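The ReAct pattern from the readings alternates a reasoning step, a tool call, and an observation until the model commits to an answer. A minimal control-flow sketch with hypothetical stand-ins for the model and tools (a real agent would parse model output far more carefully and sandbox tool execution):

```python
def react_loop(llm, tools, task, max_steps=5):
    """Skeleton of a ReAct-style agent: the model alternates "Thought" and
    "Action" steps, and tool observations are appended back to the context
    until it emits a final answer."""
    context = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(context)                      # hypothetical model call
        context += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if step.startswith("Action:"):
            name, _, arg = step.removeprefix("Action:").strip().partition(" ")
            observation = tools[name](arg)       # execute the named tool
            context += f"Observation: {observation}\n"
    return None

# Toy stand-ins so the loop runs: one calculator tool and a scripted "model".
tools = {"calc": lambda expr: str(eval(expr))}   # toy only; never eval untrusted input
script = iter(["Thought: I should compute this.",
               "Action: calc 6*7",
               "Final Answer: 42"])
print(react_loop(lambda ctx: next(script), tools, "What is 6 * 7?"))
```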
Reward Models and Best-of-N (Oct 2)
Content:
- Reward models, best-of-n theory and practice
- Monte Carlo Tree Search
Slides: Google slides
Code: n/a
Reading Material
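Best-of-n is the simplest reward-model-guided meta-generation strategy: draw n candidates from the policy and keep the one the reward model scores highest. A minimal sketch with hypothetical `generate` and `reward_model` stand-ins (MCTS, also covered in this session, is not shown):

```python
import random

def best_of_n(generate, reward_model, prompt, n=8):
    """Sample n candidates and return the one with the highest reward."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(prompt, c))

# Toy stand-ins: random-length answers, a "reward model" that prefers brevity.
generate = lambda p: "word " * random.randint(1, 10)
reward_model = lambda p, c: -len(c.split())
print(best_of_n(generate, reward_model, "Explain beam search briefly."))
```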
Systems not Models (Oct 7)
Content:
- Parallels to older “pipeline NLP”
- Visualizing and evaluating systems
- DSPy and system-level design
Slides: PDF
Reading Material (all optional)
NLP multi-step pipelines and agents:
- Modular Approach to Error Analysis and Evaluation for Multilingual Question Answering ([LREC’06](https://aclanthology.org/L06-1489/))
- Multi-hop Reading Comprehension through Question Decomposition and Rescoring ([ACL’19](https://aclanthology.org/P19-1613/))
- Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval ([NeurIPS’21](https://proceedings.neurips.cc/paper/2021/hash/e8b1cbd05f6e6a358a81dee52493dd06-Abstract.html))
- STORM: Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models ([NAACL’24](https://aclanthology.org/2024.naacl-long.347/))
- ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning ([NeurIPS’25](https://arxiv.org/abs/2503.19470))
- Self-Steering Language Models ([COLM’25](https://openreview.net/forum?id=XvCBtm5PgF))
Abstractions & Learning:
- Structured Programming with go to Statements ([1974](https://dl.acm.org/doi/10.1145/356635.356640))
- Neural Module Networks ([CVPR’16](https://openaccess.thecvf.com/content_cvpr_2016/html/Andreas_Neural_Module_Networks_CVPR_2016_paper.html))
- The Bitter Lesson ([2019](http://www.incompleteideas.net/IncIdeas/BitterLesson.html))
- Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP ([2022](https://arxiv.org/abs/2212.14024))
- Prompting Is Programming: A Query Language for Large Language Models ([PLDI’23](https://dl.acm.org/doi/abs/10.1145/3591300))
- DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines ([ICLR’24](https://openreview.net/forum?id=sY5N0zY5Od))
- LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data ([2024](https://arxiv.org/abs/2407.11418))
- Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together ([EMNLP’24](https://aclanthology.org/2024.emnlp-main.597/))
- TextGrad: Automatic “Differentiation” via Text ([Nature’25](https://arxiv.org/abs/2406.07496))
- GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning ([2025](https://arxiv.org/abs/2507.19457))
Minimum Bayes Risk and Multi-Sample Strategies (Oct 9)
Content:
- Minimum Bayes Risk
- Efficient MBR variants
- Post-ensemble
- Self-consistency and variants
Slides: Google slides
Code: TBA
Reading Material
- Reference: It’s MBR All the Way Down: Modern Generation Techniques Through the Lens of Minimum Bayes Risk (arXiv)
- Reference: Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation (arXiv)
- Reference: High Quality Rather than High Model Probability: Minimum Bayes Risk Decoding with Neural Metrics (arXiv)
- Reference: Faster Minimum Bayes Risk Decoding with Confidence-based Pruning (arXiv)
- Reference: Sampling-Based Approximations to Minimum Bayes Risk Decoding for Neural Machine Translation (arXiv)
- Reference: Efficient Minimum Bayes Risk Decoding using Low-Rank Matrix Completion Algorithms (arXiv)
- Reference: Frustratingly Easy Model Ensemble for Abstractive Summarization (ACL Anthology)
- Reference: Self-Consistency Improves Chain of Thought Reasoning in Language Models (arXiv)
- Reference: Universal Self-Consistency for Large Language Model Generation (arXiv)
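Sampling-based MBR treats the same set of samples as both hypotheses and pseudo-references, and picks the hypothesis with the highest expected utility against the rest. A minimal sketch with a toy unigram-overlap utility (a real system would use BLEU, COMET, or another metric from the readings):

```python
def mbr_decode(candidates, utility):
    """Monte Carlo MBR: return the candidate with the highest average
    utility against all other sampled candidates."""
    best, best_score = None, float("-inf")
    for i, h in enumerate(candidates):
        others = [r for j, r in enumerate(candidates) if j != i]
        score = sum(utility(h, r) for r in others) / len(others)
        if score > best_score:
            best, best_score = h, score
    return best

def overlap_f1(h, r):
    """Toy utility: unigram F1 overlap between hypothesis and reference."""
    hs, rs = set(h.split()), set(r.split())
    if not hs or not rs:
        return 0.0
    p = len(hs & rs) / len(hs)
    rec = len(hs & rs) / len(rs)
    return 0.0 if p + rec == 0 else 2 * p * rec / (p + rec)

samples = ["the cat sat", "the cat sat down", "a dog ran", "the cat is sitting"]
print(mbr_decode(samples, overlap_f1))
```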
No Class - Fall Break (Oct 14)
No Class - Fall Break (Oct 16)
Inference Scaling vs Model Size (Oct 21)
Content:
- Inference scaling versus scaling model size
- Differences in cost and latency considerations
- Modeling scaling behavior
Slides: TBA
Code: TBA
Reading Material
- Reference: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (arXiv)
Token Budgets and Training-Time Distillation (Oct 23)
Content:
- Token budgets
- Training-time distillation of inference algorithms
- Draft CoT
- Early exit voting
Slides: TBA
Code: TBA
Reading Material
- Reference: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (arXiv)
- Reference: Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models (arXiv)
- Reference: Direct Preference Optimization for Neural Machine Translation with Minimum Bayes Risk Decoding (arXiv)
- Reference: MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods (arXiv)
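Early-exit voting saves token budget by stopping once further samples are unlikely to change the majority answer. The sketch below uses a simple vote-margin heuristic in the spirit of Adaptive-Consistency, not that paper's exact statistical stopping rule; `sample_answer` is a hypothetical stand-in for a sampled model call.

```python
import random
from collections import Counter

def early_exit_vote(sample_answer, prompt, max_samples=16, margin=3):
    """Self-consistency under a token budget: stop sampling once the leading
    answer is `margin` votes ahead of the runner-up."""
    counts = Counter()
    for i in range(max_samples):
        counts[sample_answer(prompt)] += 1
        top = counts.most_common(2)
        lead = top[0][1] - (top[1][1] if len(top) > 1 else 0)
        if lead >= margin:
            break
    return counts.most_common(1)[0][0], i + 1   # answer and samples actually spent

# Toy stand-in sampler: right about 80% of the time.
sampler = lambda p: "42" if random.random() < 0.8 else "41"
print(early_exit_vote(sampler, "What is 6 * 7?"))
```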
Diffusion Models (Oct 28)
Content:
- Introduction to diffusion models
- Denoising diffusion probabilistic models (DDPM)
- Score-based generative models
- Diffusion models for text generation
- Comparison with autoregressive models
- Inference techniques for diffusion models
- Applications in multimodal generation
Slides: TBA
Code: TBA
Reading Material
Defining Efficiency (Oct 30)
Content:
- How do we define efficiency?
- Different places where a method can be efficient (e.g. memory, latency, token cost for APIs)
- Brief review of hardware for inference
Slides: TBA
Code: TBA
Reading Material
No Class - Democracy Day (Nov 4)
Inference and Hardware (Nov 6)
Content:
- Overview of hardware relevant to LLM inference (GPUs, TPUs, accelerators)
- Memory bandwidth, compute, and latency considerations
- Parallelism strategies and deployment tradeoffs
Slides: TBA
Code: TBA
Reading Material
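A useful back-of-envelope calculation for this session: single-stream decoding is typically memory-bandwidth bound, because the model weights must be streamed through memory for every generated token. The numbers below are illustrative assumptions, not measurements of any particular GPU or model.

```python
# Back-of-envelope decode throughput for a memory-bandwidth-bound setting.
# All numbers are illustrative assumptions, not measurements.
params = 7e9                 # 7B-parameter model
bytes_per_param = 2          # fp16/bf16 weights
hbm_bandwidth = 2.0e12       # 2 TB/s of memory bandwidth (hypothetical accelerator)

bytes_per_token = params * bytes_per_param   # weights re-read for every decoded token
tokens_per_second = hbm_bandwidth / bytes_per_token
print(f"~{tokens_per_second:.0f} tokens/s upper bound at batch size 1")
# Larger batches amortize the same weight traffic over more tokens, which is
# why throughput-oriented serving increases batch size until compute binds.
```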
Library Implementations and Optimizations (Nov 11)
Content:
- Library implementations
- Lazy softmax
- Flash attention
- How do vLLM/SGLang/similar speed up generation?
Slides: TBA
Code: TBA
Reading Material
- Reference: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (arXiv)
- Reference: Self-attention Does Not Need O(n²) Memory (arXiv)
Assignments
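The "lazy" online softmax underlying FlashAttention and the O(n²)-free attention paper computes softmax-weighted sums in one pass, carrying a running max and normalizer so the full score vector is never materialized. A minimal numpy sketch for a single query row, checked against the naive computation (toy sizes, no tiling across queries or kernels):

```python
import numpy as np

def online_softmax_attention(q, K, V, block=4):
    """One query row of attention computed blockwise with a running max `m`
    and normalizer `l`, rescaling partial sums whenever the max changes."""
    m, l = -np.inf, 0.0
    acc = np.zeros(V.shape[1])
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q              # scores for this key block
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)              # rescale previous partial sums
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ V[start:start + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
s = K @ q
reference = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(np.allclose(online_softmax_attention(q, K, V), reference))  # True
```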
Prefix Sharing and KV Cache Optimizations (Nov 13)
Content:
- Prefix sharing
- KV cache reuse
- Key-value cache compression
- Model compression
- Brief quantization overview
Slides: TBA
Code: TBA
Reading Material
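The core idea of KV caching and prefix sharing: keys and values for already-processed tokens are stored once, so each new token attends over cached entries, and continuations that share a prompt can start from the same cache instead of re-encoding the prefix. A toy single-head numpy sketch (real systems share cache blocks by reference, e.g. paged attention, rather than deep-copying):

```python
import copy
import numpy as np

class KVCache:
    """Toy single-head KV cache: keys/values for processed tokens are stored
    so a new query only needs one pass over the cached entries."""
    def __init__(self, d):
        self.K = np.zeros((0, d))
        self.V = np.zeros((0, d))

    def append(self, k, v):
        self.K = np.vstack([self.K, k[None]])
        self.V = np.vstack([self.V, v[None]])

    def attend(self, q):
        s = self.K @ q
        w = np.exp(s - s.max())
        w /= w.sum()
        return w @ self.V

rng = np.random.default_rng(0)
d = 8
prefix_cache = KVCache(d)
for _ in range(5):                      # encode a shared 5-token prompt once
    prefix_cache.append(rng.normal(size=d), rng.normal(size=d))

# Prefix sharing: both continuations start from the same cached prompt,
# so its keys/values are never recomputed.
branch_a, branch_b = copy.deepcopy(prefix_cache), copy.deepcopy(prefix_cache)
branch_a.append(rng.normal(size=d), rng.normal(size=d))
print(branch_a.attend(rng.normal(size=d)).shape)   # (8,)
```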
Draft Models and Speculative Decoding (Nov 18)
Content:
- Draft models
- Speculative decoding
- Other latency improving methods
Slides: TBA
Code: TBA
Reading Material
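The verification step at the heart of speculative decoding: each token proposed by the draft model is accepted with probability min(1, p(x)/q(x)) under the target and draft distributions, and on the first rejection a replacement is resampled from the residual max(0, p − q), which preserves the target model's output distribution. A toy sketch over explicit distributions (the extra bonus token sampled after a fully accepted block is omitted):

```python
import random

def speculative_step(p_target, q_draft, draft_tokens, vocab):
    """Verify a block of draft tokens against the target model's distributions,
    accepting with probability min(1, p/q) and resampling from max(0, p - q)
    at the first rejection."""
    accepted = []
    for i, x in enumerate(draft_tokens):
        p, q = p_target[i], q_draft[i]
        if random.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
        else:
            residual = {t: max(0.0, p[t] - q[t]) for t in vocab}
            z = sum(residual.values())
            r, acc = random.random() * z, 0.0
            for t, w in residual.items():
                acc += w
                if acc >= r:
                    accepted.append(t)
                    break
            break
    return accepted

# Toy distributions over a 3-token vocabulary; the draft proposes [0, 0].
vocab = [0, 1, 2]
p_target = [{0: 0.6, 1: 0.3, 2: 0.1}, {0: 0.2, 1: 0.5, 2: 0.3}]
q_draft  = [{0: 0.5, 1: 0.4, 2: 0.1}, {0: 0.3, 1: 0.4, 2: 0.3}]
print(speculative_step(p_target, q_draft, [0, 0], vocab))
```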
Linearizing Attention and Sparse Models (Nov 20)
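As a preview of what "linearizing attention" means: replacing softmax(QKᵀ)V with a feature map φ lets the product be regrouped as φ(Q)(φ(K)ᵀV), so cost is linear rather than quadratic in sequence length. A minimal numpy sketch using an elu(x)+1 feature map in the style of "Transformers are RNNs" (toy sizes, no masking):

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1, a common positive feature map for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Compute (phi(Q) phi(K)^T) V without forming the n x n attention matrix:
    associativity lets us build the d x d summary phi(K)^T V once."""
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = Kf.T @ V                      # (d, d_v) summary of keys and values
    z = Kf.sum(axis=0)                 # per-feature normalizer
    return (Qf @ kv) / (Qf @ z)[:, None]

rng = np.random.default_rng(0)
n, d = 16, 8
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
print(linear_attention(Q, K, V).shape)   # (16, 8): same shape as softmax attention output
```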
Transformer Alternatives (Nov 25)
Content:
- Transformer alternatives
Slides: TBA
Code: TBA
Reading Material
- Reference: The Annotated S4