Schedule

Introduction to Language Models and Inference (Aug 26)

Content:

  • What is a language model?
  • What is an inference algorithm? (a minimal greedy-decoding sketch follows this list)
  • What will we not cover?
  • What are transformers?
  • How do modern LMs work?
  • Modeling errors and search errors
  • Prompting as a means of model control
  • Instruction following behavior
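
To make "inference algorithm" concrete, here is a minimal greedy-decoding sketch in Python. The vocabulary and the toy_next_token_logits function are made up for illustration and stand in for a real transformer forward pass; greedy decoding is just one of the many algorithms covered in this course.

    import numpy as np

    # Toy vocabulary and a stand-in for a language model forward pass.
    # In a real setting this would be a transformer returning logits over the vocabulary.
    VOCAB = ["<eos>", "the", "cat", "sat"]

    def toy_next_token_logits(prefix: list[str]) -> np.ndarray:
        """Hypothetical scorer: deterministic pseudo-random logits per prefix length."""
        rng = np.random.default_rng(len(prefix))
        return rng.normal(size=len(VOCAB))

    def greedy_decode(max_len: int = 10) -> list[str]:
        """Greedy decoding: at each step keep only the single highest-scoring token.
        Sampling, beam search, and the meta-generation methods later in the course
        are alternative ways of turning the same model into an output."""
        prefix: list[str] = []
        for _ in range(max_len):
            logits = toy_next_token_logits(prefix)
            token = VOCAB[int(np.argmax(logits))]
            if token == "<eos>":
                break
            prefix.append(token)
        return prefix

    print(greedy_decode())
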
Slides: HTML / PDF

Code: Code

Reading Material

  • Reference: Sections 1–2 of From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models (arXiv)

Probability Review (Aug 28)

Content:

  • Probability review (chain rule and sampling sketched after this list)
  • Transformer implementation
  • Generation and evaluation
  • Meta-generation
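
As a minimal sketch of the probability review and generation topics above, the snippet below uses a made-up bigram table in place of a real transformer: the chain rule gives the sequence log-probability, and ancestral sampling draws one token at a time from the same conditionals.

    import math
    import random

    # Hypothetical bigram conditionals p(next | prev). A real LM conditions on the
    # whole prefix, but the chain-rule bookkeeping is identical.
    P = {
        "<bos>": {"the": 0.9, "a": 0.1},
        "the":   {"cat": 0.6, "dog": 0.4},
        "a":     {"cat": 0.5, "dog": 0.5},
        "cat":   {"<eos>": 1.0},
        "dog":   {"<eos>": 1.0},
    }

    def sequence_log_prob(tokens: list[str]) -> float:
        """Chain rule: log p(x) = sum_t log p(x_t | x_{<t})."""
        prev, total = "<bos>", 0.0
        for tok in tokens:
            total += math.log(P[prev][tok])
            prev = tok
        return total

    def ancestral_sample() -> list[str]:
        """Sample left to right from the same conditionals used above."""
        prev, out = "<bos>", []
        while prev != "<eos>":
            prev = random.choices(list(P[prev]), weights=list(P[prev].values()))[0]
            if prev != "<eos>":
                out.append(prev)
        return out

    print(sequence_log_prob(["the", "cat", "<eos>"]))  # log 0.9 + log 0.6 + log 1.0
    print(ancestral_sample())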

Code: Code

Reading Material

None

Common Sampling Methods for Modern NLP (Sep 2)

Beam Search and Variants (Sep 4)

Intro to A* and Best First Search (Sep 9)

Content:

  • Introduction to A* and best first search (a priority-queue sketch follows this list)
  • A* methods for controlled generation
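
A minimal best-first search sketch over partial sequences, using Python's heapq as the priority queue. The toy expansion probabilities are made up, and the priority here is just the accumulated log-probability; the A* variants in the readings additionally add a heuristic estimate of future cost.

    import heapq
    import math

    # Toy search space: every prefix can be extended by these tokens, with made-up
    # conditional probabilities; "<eos>" terminates a hypothesis.
    EXPANSIONS = {"cat": 0.5, "dog": 0.3, "<eos>": 0.2}

    def best_first_search(max_expansions: int = 100) -> tuple[float, list[str]]:
        """Repeatedly pop the highest-scoring partial hypothesis and expand it.
        heapq is a min-heap, so negative log-probabilities are stored; an A*-style
        method would add a heuristic future-cost term to the pushed priority."""
        frontier = [(0.0, [])]  # (negative log-probability, prefix)
        best = frontier[0]
        for _ in range(max_expansions):
            best = heapq.heappop(frontier)
            neg_score, prefix = best
            if prefix and prefix[-1] == "<eos>":
                return -neg_score, prefix          # first completed hypothesis
            for token, prob in EXPANSIONS.items():
                heapq.heappush(frontier, (neg_score - math.log(prob), prefix + [token]))
        return -best[0], best[1]                   # expansion budget exhausted

    print(best_first_search())
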
Slides: HTML / PDF

Reading Material

  • Reference: Efficient Lattice Rescoring Using Recurrent Neural Network Language Models (PDF)
  • Reference: Modeling Future Cost for Neural Machine Translation (arXiv)
  • Reference: Best-First Beam Search (arXiv)
  • Reference: NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics (arXiv)

Assignments

Other Controlled Generation Methods (Sep 11)

Chain of Thought and Intermediate Steps (Sep 16)

Content:

  • Chain of thought / scratchpad, intermediate steps
  • Why does chain of thought work?
  • Self-consistency and variants (a majority-vote sketch follows)
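
A minimal self-consistency sketch: the hypothetical sample_cot_answer function below stands in for sampling one chain of thought from an LM at nonzero temperature and extracting its final answer; the method marginalizes over the sampled reasoning paths by majority vote over the answers.

    import random
    from collections import Counter

    def sample_cot_answer(question: str) -> str:
        """Hypothetical stand-in: sample one chain of thought at temperature > 0,
        then extract the final answer string from it."""
        return random.choice(["42", "42", "42", "41"])  # noisy but usually right

    def self_consistency(question: str, num_samples: int = 20) -> str:
        """Sample several reasoning paths and return the most frequent final answer."""
        answers = [sample_cot_answer(question) for _ in range(num_samples)]
        return Counter(answers).most_common(1)[0][0]

    print(self_consistency("What is 6 * 7?"))
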
Slides: HTML / PDF

Reading Material

Core Papers:

  • Reference: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (arXiv)
  • Reference: Large Language Models are Zero-Shot Reasoners (arXiv)
  • Reference: Self-Consistency Improves Chain of Thought Reasoning in Language Models (arXiv)

Additional Research:

  • Reference: Adaptive Computation Time for Recurrent Neural Networks (arXiv)
  • Reference: PonderNet: Learning to Ponder (arXiv)
  • Reference: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models (arXiv)
  • Reference: Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws (arXiv)
  • Reference: Adaptive-Consistency: A Cost-Efficient, Model-Agnostic Technique (arXiv)
  • Reference: To CoT or not to CoT? Chain-of-Thought Helps Mainly on Math and Symbolic Reasoning (arXiv)
  • Reference: Language Models Don’t Always Say What They Think (arXiv)
  • Reference: Complexity-Based Prompting for Multi-step Reasoning (arXiv)
  • Reference: Multimodal Chain-of-Thought Reasoning in Language Models (arXiv)

Paper Presentations

  • Paper: Least-to-Most Prompting Enables Complex Reasoning in Large Language Models (arXiv)
  • Paper: Chain-of-Thought Reasoning Without Prompting (arXiv)

Self-Refine and Self-Correction Methods (Sep 18)

Content:

  • Self-refine and iterative refinement with self-feedback (the loop is sketched after this list)
  • Learning to self debug for code generation
  • Reflexion: verbal reinforcement learning for agents
  • Limitations and challenges of self-correction
  • Tool-interactive critiquing and external feedback
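
A minimal sketch of the Self-Refine control flow: the generate, critique, and refine functions below are hypothetical stand-ins for LM calls, and the papers above differ mainly in where the feedback comes from (the model itself, external tools, or a trained critic).

    def generate(task: str) -> str:
        """Hypothetical initial LM draft."""
        return f"draft answer for: {task}"

    def critique(task: str, answer: str) -> str:
        """Hypothetical feedback step; returns 'STOP' when the answer looks good.
        In Self-Refine this is the same LM prompted for feedback; in CRITIC the
        feedback can come from external tools instead."""
        return "STOP" if answer.startswith("revised") else "needs more detail"

    def refine(task: str, answer: str, feedback: str) -> str:
        """Hypothetical refinement step conditioned on the feedback."""
        return f"revised ({feedback}): {answer}"

    def self_refine(task: str, max_rounds: int = 3) -> str:
        answer = generate(task)
        for _ in range(max_rounds):
            feedback = critique(task, answer)
            if feedback == "STOP":
                break
            answer = refine(task, answer, feedback)
        return answer

    print(self_refine("summarize the lecture"))
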
Slides: HTML / PDF

Code: TBA

Reading Material

  • Primary: Self-Refine: Iterative Refinement with Self-Feedback (arXiv)
  • Primary: Teaching Large Language Models to Self-Debug (arXiv)
  • Reference: Reflexion: Language Agents with Verbal Reinforcement Learning (arXiv)
  • Reference: Large Language Models Cannot Self-Correct Reasoning Yet (arXiv)
  • Reference: CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing (arXiv)
  • Reference: SCoRe: Self-Correction via Reinforcement Learning (arXiv)

Student Paper Presentations

  • Student Presentation 1: Improving Reasoning in Language Models via Self-Correction (arXiv)
  • Student Presentation 2: Self-Correction in Language Models via Multi-Round Consistency Sampling (arXiv)

Reasoning Models (Sep 23)

Content:

  • What is a reasoning model?
  • Training reasoning models with reinforcement learning
  • STaR: Self-Taught Reasoner
  • DeepSeek R1 and GRPO
  • Understanding long chain-of-thought reasoning
  • Reasoning transfer across domains
  • Advanced reasoning algorithms (S1, L1, Stream of Search, LAPS)
Slides: HTML / PDF

Reading Material

  • Reference: STaR: Bootstrapping Reasoning With Reasoning (arXiv)
  • Reference: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (arXiv)
  • Reference: Demystifying Long Chain-of-Thought Reasoning: An Empirical Study (arXiv)
  • Reference: SimpleRL-Zoo: Evaluating Reinforcement Learning on Simple Reasoning Tasks (arXiv)
  • Reference: Does learning math help language models reason better? (arXiv)
  • Reference: s1: Simple Test-Time Scaling (arXiv)
  • Reference: L1: Controlling How Long a Reasoning Model Thinks with Reinforcement Learning (arXiv)
  • Reference: Stream of Search (SoS): Learning to Search in Language (arXiv)
  • Reference: Learning Adaptive Parallel Search for Reasoning (arXiv)

Incorporating Tools (Sep 25)

Content:

  • What are tools? Definition and taxonomy
  • Basic tool use paradigm
  • Key approaches: PAL, Toolformer, Gorilla, WebGPT
  • Tool creation: TroVE and Large Language Models as Tool Makers
  • Tool robustness: Benchmarking failures in tool-augmented language models
  • Standardized function calling (JSON Schema; sketched after this list)
  • Parallel function calling
  • Model Context Protocol (MCP) and MCP registries
  • FastMCP framework for rapid MCP development
  • Sandboxed code execution for secure tool use
  • Tool use scenarios and trade-offs
  • Evaluation challenges and best practices
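
A minimal sketch of schema-based function calling: a tool is declared as JSON Schema and a dispatcher executes the model's structured call. The tool name, schema, and the exact call format below are illustrative assumptions; real APIs differ in details, and a parallel-function-calling model would emit a list of such calls to dispatch concurrently.

    import json

    # Tool declared to the model as JSON Schema (names here are illustrative).
    WEATHER_TOOL = {
        "name": "get_weather",
        "description": "Get the current temperature for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }

    def get_weather(city: str) -> str:
        """Hypothetical tool implementation (would call a real weather API)."""
        return f"22C in {city}"

    TOOL_REGISTRY = {"get_weather": get_weather}

    def dispatch(model_tool_call: str) -> str:
        """Parse the model's structured call and execute the matching function."""
        call = json.loads(model_tool_call)
        return TOOL_REGISTRY[call["name"]](**call["arguments"])

    # Pretend the model emitted this call after being shown WEATHER_TOOL.
    print(dispatch('{"name": "get_weather", "arguments": {"city": "Pittsburgh"}}'))
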
Slides: HTML / PDF

Reading Material

Main Survey:

Key Papers:

Practical Resources:

Agents and Multi-Agent Communication (Sep 30)

Content:

  • Basic agent concepts and definitions (a minimal agent loop is sketched after this list)
  • Agent architectures and environments
  • Efficiency optimizations (context management, caching)
  • Safety challenges and solutions
  • Multi-agent systems
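
A minimal sketch of the basic agent loop: the llm_decide policy and run_tool environment step below are hypothetical stand-ins, and real agents wrap this same decide-act-observe cycle with context management, caching, and safety checks.

    def llm_decide(history: list[str]) -> tuple[str, str]:
        """Hypothetical policy: return (action, argument). A real agent prompts an
        LM with the interaction history and parses its chosen action."""
        if any("result:" in step for step in history):
            return "finish", "done"
        return "search", "course schedule"

    def run_tool(action: str, argument: str) -> str:
        """Hypothetical environment step (web search, code execution, etc.)."""
        return f"result: top hit for '{argument}'"

    def agent_loop(task: str, max_steps: int = 5) -> list[str]:
        history = [f"task: {task}"]
        for _ in range(max_steps):
            action, argument = llm_decide(history)
            if action == "finish":
                history.append(f"finish: {argument}")
                break
            history.append(f"{action}({argument}) -> {run_tool(action, argument)}")
        return history

    print("\n".join(agent_loop("find today's reading")))
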
Slides: HTML / PDF

Code: TBA

Reading Material

Basic Concepts and Foundations

  • Reference: ReAct: Synergizing Reasoning and Acting in Language Models (arXiv)

  • Reference: Executable Code Actions Elicit Better LLM Agents (arXiv)

Agent Architectures and Environments

  • Reference: WebArena: A Realistic Web Environment for Building Autonomous Agents (arXiv)

  • Reference: VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (arXiv)

Efficiency Optimizations

  • Reference: OpenHands Context Condensation for More Efficient AI Agents (All Hands AI)

  • Reference: Anthropic Prompt Caching (Anthropic)

  • Reference: Effectively use prompt caching on Amazon Bedrock (AWS)

Evaluation and Benchmarks

  • Reference: SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (arXiv)

  • Reference: GAIA: a benchmark for General AI Assistants (arXiv)

  • Reference: Training Software Engineering Agents and Verifiers with SWE-Gym (arXiv)

Multi-agent Systems

  • Reference: AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (arXiv)

Reward Models and Best-of-N (Oct 2)

Content:

  • Reward models, best-of-n theory and practice (best-of-n sketched after this list)
  • Monte Carlo Tree Search
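
A minimal best-of-n sketch: the sample_response and reward_model functions below are hypothetical stand-ins for sampling from the policy and scoring with a learned reward model; the method simply draws n candidates and returns the one the reward model scores highest.

    import random

    def sample_response(prompt: str) -> str:
        """Hypothetical stand-in for sampling one response from the LM."""
        return f"response #{random.randint(0, 999)} to {prompt!r}"

    def reward_model(prompt: str, response: str) -> float:
        """Hypothetical learned reward; higher is better."""
        return random.random()

    def best_of_n(prompt: str, n: int = 16) -> str:
        """Sample n candidates and keep the highest-reward one. Larger n raises the
        expected reward but also the risk of overoptimizing the reward model."""
        candidates = [sample_response(prompt) for _ in range(n)]
        return max(candidates, key=lambda r: reward_model(prompt, r))

    print(best_of_n("Explain best-of-n in one sentence."))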

Slides: Google Slides

Code: n/a

Reading Material

  • Reference: Why reward models are key for alignment (blog)
  • Reference: Scaling Laws for Reward Model Overoptimization (arXiv)
  • Reference: Theoretical guarantees on the best-of-n alignment policy (arXiv)
  • Reference: RewardBench v2: Advancing Reward Model Evaluation (arXiv)

Assignments

Systems not Models (Oct 7)

Content:

  • Parallels to older “pipeline NLP”
  • Visualizing and evaluating systems
  • DSPy and system-level design

Slides: PDF

Reading Material (all optional)

NLP multi-step pipelines and agents:

  • Modular Approach to Error Analysis and Evaluation for Multilingual Question Answering (LREC’06)[https://aclanthology.org/L06-1489/]
  • Multi-hop Reading Comprehension through Question Decomposition and Rescoring (ACL’19)[https://aclanthology.org/P19-1613/]
  • Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval (NeurIPS’21)[https://proceedings.neurips.cc/paper/2021/hash/e8b1cbd05f6e6a358a81dee52493dd06-Abstract.html]
  • STORM: Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models (NAACL’24)[https://aclanthology.org/2024.naacl-long.347/]
  • ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning (NeurIPS’25)[https://arxiv.org/abs/2503.19470]
  • Self-Steering Language Models (COLM’25)[https://openreview.net/forum?id=XvCBtm5PgF]

Abstractions & Learning:

  • Structured Programming with go to Statements (1974)[https://dl.acm.org/doi/10.1145/356635.356640]
  • Neural Module Networks (CVPR’16)[https://openaccess.thecvf.com/content_cvpr_2016/html/Andreas_Neural_Module_Networks_CVPR_2016_paper.html]
  • The Bitter Lesson (2019)[http://www.incompleteideas.net/IncIdeas/BitterLesson.html]
  • Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP (2022)[https://arxiv.org/abs/2212.14024]
  • Prompting Is Programming: A Query Language for Large Language Models (PLDI’23)[https://dl.acm.org/doi/abs/10.1145/3591300]
  • DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines (ICLR’24)[https://openreview.net/forum?id=sY5N0zY5Od]
  • LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data (2024)[https://arxiv.org/abs/2407.11418]
  • Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together (EMNLP’24)[https://aclanthology.org/2024.emnlp-main.597/]
  • TextGrad: Automatic “Differentiation” via Text (Nature’25)[https://arxiv.org/abs/2406.07496]
  • GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning (2025)[https://arxiv.org/abs/2507.19457]

Minimum Bayes Risk and Multi-Sample Strategies (Oct 9)

Content:

  • Minimum Bayes Risk (sketched after this list)
  • Efficient MBR variants
  • Post-ensemble
  • Self-consistency and variants
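
A minimal Minimum Bayes Risk sketch over a small pool of sampled candidates, using token overlap as a stand-in utility (real systems use metrics such as BLEU, chrF, or neural metrics like COMET): MBR returns the candidate with the highest expected utility, where the expectation over the model distribution is approximated by the same pool of samples.

    def utility(hypothesis: str, reference: str) -> float:
        """Stand-in utility: Jaccard overlap of token sets."""
        h, r = set(hypothesis.split()), set(reference.split())
        return len(h & r) / max(len(h | r), 1)

    def mbr_decode(samples: list[str]) -> str:
        """argmax_h (1/N) * sum_y utility(h, y), where y ranges over the samples,
        i.e. a Monte Carlo estimate of expected utility under the model."""
        return max(samples, key=lambda h: sum(utility(h, y) for y in samples))

    samples = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "a dog ran in the park",
    ]
    print(mbr_decode(samples))  # the candidate most similar to the rest wins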

Slides: Google Slides

Code: TBA

Reading Material

  • Reference: It’s MBR All the Way Down: Modern Generation Techniques Through the Lens of Minimum Bayes Risk (arXiv)

  • Reference: Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation (arXiv)

  • Reference: High Quality Rather than High Model Probability: Minimum Bayes Risk Decoding with Neural Metrics (arXiv)

  • Reference: Faster Minimum Bayes Risk Decoding with Confidence-based Pruning (arXiv)

  • Reference: Sampling-Based Approximations to Minimum Bayes Risk Decoding for Neural Machine Translation (arXiv)

  • Reference: Efficient Minimum Bayes Risk Decoding using Low-Rank Matrix Completion Algorithms (arXiv)

  • Reference: Frustratingly Easy Model Ensemble for Abstractive Summarization (ACL Anthology)

  • Reference: Self-Consistency Improves Chain of Thought Reasoning in Language Models (arXiv)

  • Reference: Universal Self-Consistency for Large Language Model Generation (arXiv)

No Class - Fall Break (Oct 14)

No Class - Fall Break

No Class - Fall Break (Oct 16)

No Class - Fall Break

Inference Scaling vs Model Size (Oct 21)

Content:

  • Inference scaling versus scaling model size
  • Differences in cost and latency considerations
  • Modeling scaling behavior

Slides: TBA

Code: TBA

Reading Material

  • Reference: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (arXiv)

Token Budgets and Training-Time Distillation (Oct 23)

Content:

  • Token budgets
  • Training-time distillation of inference algorithms
  • Draft CoT
  • Early exit voting

Slides: TBA

Code: TBA

Reading Material

  • Reference: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (arXiv)

  • Reference: Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models (arXiv)

  • Reference: Direct Preference Optimization for Neural Machine Translation with Minimum Bayes Risk Decoding (arXiv)

  • Reference: MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods (arXiv)

Diffusion Models (Oct 28)

Content:

  • Introduction to diffusion models
  • Denoising diffusion probabilistic models (DDPM; forward process sketched after this list)
  • Score-based generative models
  • Diffusion models for text generation
  • Comparison with autoregressive models
  • Inference techniques for diffusion models
  • Applications in multimodal generation
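
A minimal sketch of the DDPM forward (noising) process in closed form, assuming a linear beta schedule; this is the q(x_t | x_0) used to build training pairs, and the model itself is trained to predict the added noise from (x_t, t).

    import numpy as np

    def forward_diffuse(x0: np.ndarray, t: int, betas: np.ndarray,
                        rng: np.random.Generator) -> np.ndarray:
        """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I),
        where alpha_bar_t is the cumulative product of (1 - beta) up to step t."""
        alpha_bar = np.cumprod(1.0 - betas)[t]
        eps = rng.standard_normal(x0.shape)
        return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

    rng = np.random.default_rng(0)
    betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
    x0 = rng.standard_normal(8)             # toy "clean" data point
    print(forward_diffuse(x0, t=999, betas=betas, rng=rng))  # nearly pure noise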

Slides: TBA

Code: TBA

Reading Material

  • Reference: Denoising Diffusion Probabilistic Models (Ho et al., 2020) (arXiv)
  • Reference: Diffusion Models Beat GANs on Image Synthesis (Dhariwal & Nichol, 2021) (arXiv)

  • Optional: Diffusion Models: A Comprehensive Survey of Methods and Applications (Yang et al., 2022) (arXiv)

Defining Efficiency (Oct 30)

Content:

  • How do we define efficiency?
  • Different places where a method can be efficient (e.g. memory, latency, token cost for APIs)
  • Brief review of hardware for inference

Slides: TBA

Code: TBA

Reading Material

No Class - Democracy Day (Nov 4)

No Class - Democracy Day

Inference and Hardware (Nov 6)

Content:

  • Overview of hardware relevant to LLM inference (GPUs, TPUs, accelerators)
  • Memory bandwidth, compute, and latency considerations
  • Parallelism strategies and deployment tradeoffs

Slides: TBA

Code: TBA

Reading Material

Library Implementations and Optimizations (Nov 11)

Content:

  • Library implementations
  • Lazy softmax (online-softmax sketch after this list)
  • Flash attention
  • How do vLLM, SGLang, and similar libraries speed up generation?
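
A minimal sketch of the online (streaming) softmax identity that underlies lazy softmax and FlashAttention: the running maximum, normalizer, and weighted sum are updated incrementally and rescaled whenever the maximum grows, so the full score vector never has to be materialized at once (FlashAttention applies the same identity block by block in SRAM).

    import numpy as np

    def online_softmax_weighted_sum(scores: np.ndarray, values: np.ndarray) -> np.ndarray:
        """Compute softmax(scores) @ values in one streaming pass over the scores."""
        running_max, normalizer = -np.inf, 0.0
        acc = np.zeros(values.shape[1])
        for s, v in zip(scores, values):
            new_max = max(running_max, s)
            scale = np.exp(running_max - new_max)   # rescale old state if the max grew
            normalizer = normalizer * scale + np.exp(s - new_max)
            acc = acc * scale + np.exp(s - new_max) * v
            running_max = new_max
        return acc / normalizer

    scores = np.array([0.1, 2.0, -1.0, 0.5])
    values = np.arange(8, dtype=float).reshape(4, 2)
    weights = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()
    print(np.allclose(online_softmax_weighted_sum(scores, values), weights @ values))  # True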

Slides: TBA

Code: TBA

Reading Material

  • Reference: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (arXiv)

  • Reference: Self-Attention Does Not Need O(n²) Memory (arXiv)

Assignments

Prefix Sharing and KV Cache Optimizations (Nov 13)

Content:

  • Prefix sharing (sketched after this list)
  • KV cache reuse
  • Key-value cache compression
  • Model compression
  • Brief quantization overview
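
A minimal prefix-sharing sketch with a per-layer KV cache, using made-up shapes and a dictionary keyed by the shared prompt; real servers such as vLLM manage this with paged, block-level caches rather than whole-prefix keys, but the saving is the same: the shared prefix's keys and values are computed once and reused across requests.

    import numpy as np

    D = 16  # toy head dimension
    PREFIX_CACHE: dict[str, tuple[np.ndarray, np.ndarray]] = {}

    def encode_tokens(tokens: list[str]) -> tuple[np.ndarray, np.ndarray]:
        """Hypothetical stand-in for running one attention layer and returning
        per-token keys and values with shape [num_tokens, D]."""
        rng = np.random.default_rng(abs(hash(tuple(tokens))) % (2**32))
        return rng.standard_normal((len(tokens), D)), rng.standard_normal((len(tokens), D))

    def keys_values_with_prefix_sharing(prefix: list[str], suffix: list[str]):
        """Reuse cached K/V for the shared prefix; only the suffix is encoded per request."""
        key = " ".join(prefix)
        if key not in PREFIX_CACHE:
            PREFIX_CACHE[key] = encode_tokens(prefix)        # computed once
        k_pre, v_pre = PREFIX_CACHE[key]
        k_suf, v_suf = encode_tokens(suffix)                 # per-request work
        return np.concatenate([k_pre, k_suf]), np.concatenate([v_pre, v_suf])

    system = ["you", "are", "a", "helpful", "assistant"]
    k1, _ = keys_values_with_prefix_sharing(system, ["hi"])
    k2, _ = keys_values_with_prefix_sharing(system, ["tell", "me", "a", "joke"])
    print(k1.shape, k2.shape)  # the system-prompt K/V was computed only once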

Slides: TBA

Code: TBA

Reading Material

  • Reference: Keep the Cost Down: A Review on Methods to Optimize LLM’s KV-Cache Consumption (arXiv)

  • Reference: Model Compression and Efficient Inference for Large Language Models: A Survey (arXiv)

Draft Models and Speculative Decoding (Nov 18)

Content:

  • Draft models
  • Speculative decoding (accept/reject sketch after this list)
  • Other latency improving methods
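
A minimal sketch of the speculative-decoding accept/reject rule for one drafted token, with made-up draft and target next-token distributions; the full algorithm drafts a block of tokens from the cheap model, verifies them with a single target-model forward pass, and resamples from the adjusted residual distribution at the first rejection, which preserves the target model's output distribution exactly.

    import numpy as np

    rng = np.random.default_rng(0)

    def accept_or_resample(draft_probs: np.ndarray, target_probs: np.ndarray,
                           drafted_token: int) -> int:
        """Accept the drafted token with probability min(1, p_target / p_draft);
        on rejection, resample from the normalized residual max(0, p_target - p_draft)."""
        p, q = target_probs[drafted_token], draft_probs[drafted_token]
        if rng.random() < min(1.0, p / q):
            return drafted_token
        residual = np.maximum(target_probs - draft_probs, 0.0)
        return int(rng.choice(len(target_probs), p=residual / residual.sum()))

    draft_probs  = np.array([0.70, 0.20, 0.10])   # cheap draft model
    target_probs = np.array([0.50, 0.40, 0.10])   # expensive target model
    drafted_token = int(rng.choice(3, p=draft_probs))
    print(accept_or_resample(draft_probs, target_probs, drafted_token))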

Slides: TBA

Code: TBA

Reading Material

Linearizing Attention and Sparse Models (Nov 20)

Content:

  • Linearizing attention (kernelized attention sketch after this list)
  • Sparse models
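
A minimal sketch of kernelized (linear) attention, assuming the elu(x) + 1 feature map used in linear transformers: replacing softmax(QK^T)V with phi(Q)(phi(K)^T V), normalized per query, means the key/value summary is a small d x d matrix built in O(n) time instead of an n x n score matrix.

    import numpy as np

    def feature_map(x: np.ndarray) -> np.ndarray:
        """Assumed positive feature map phi(x) = elu(x) + 1."""
        return np.where(x > 0, x + 1.0, np.exp(x))

    def linear_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
        """Non-causal linear attention: phi(Q) (phi(K)^T V), normalized per query by
        phi(Q) sum_t phi(K_t). No n x n attention matrix is ever formed."""
        phi_q, phi_k = feature_map(Q), feature_map(K)
        kv = phi_k.T @ V                         # (d, d_v) key/value summary
        normalizer = phi_q @ phi_k.sum(axis=0)   # (n,) per-query normalization
        return (phi_q @ kv) / normalizer[:, None]

    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
    print(linear_attention(Q, K, V).shape)       # (6, 4), same shape as softmax attention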

Slides: TBA

Code: TBA

Reading Material

  • TBA

Assignments

Transformer Alternatives (Nov 25)

Content:

  • Transformer alternatives

Slides: TBA

Code: TBA

Reading Material

  • Reference: The Annotated S4

Assignments

No Class - Thanksgiving (Nov 27)

No Class - Thanksgiving

Shared Task Results and Poster Sessions (Dec 1)

Content:

  • Shared task results
  • Poster sessions

Slides: N/A

Code: N/A

Assignments