Schedule

Introduction to Language Models and Inference (Aug 26)

Content:

  • What is a language model?
  • What is an inference algorithm?
  • What will we not cover?
  • What are transformers?
  • How do modern LMs work?
  • Modeling errors and search errors
  • Prompting as a means of model control
  • Instruction following behavior
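
To ground the "What is an inference algorithm?" question above, here is a minimal sketch of ancestral sampling with a temperature knob. The BIGRAMS table is a toy stand-in for a real transformer language model, and the whole snippet is illustrative rather than code from the lecture:

```python
import math, random

random.seed(0)

# Toy "language model": next-token distribution given only the last token.
# A real transformer would condition on the entire prefix.
BIGRAMS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "end": 0.2},
    "a":   {"cat": 0.4, "dog": 0.4, "end": 0.2},
    "cat": {"sat": 0.7, "end": 0.3},
    "dog": {"sat": 0.6, "end": 0.4},
    "sat": {"end": 1.0},
}

def sample_next(token, temperature=1.0):
    """Sample the next token after rescaling log-probabilities by 1/temperature."""
    logits = {w: math.log(p) / temperature for w, p in BIGRAMS[token].items()}
    z = sum(math.exp(v) for v in logits.values())
    r, acc = random.random(), 0.0
    for w, v in logits.items():
        acc += math.exp(v) / z
        if r <= acc:
            return w
    return w  # numerical fallback

def generate(temperature=1.0, max_len=10):
    out, tok = [], "<s>"
    while len(out) < max_len:
        tok = sample_next(tok, temperature)
        if tok == "end":
            break
        out.append(tok)
    return " ".join(out)

print(generate(temperature=0.7))  # lower temperature -> closer to greedy decoding
```
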
Slides: HTML / PDF

Code: Code

Reading Material

  • Reference: Sections 1 and 2 of From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models (arXiv)

Probability Review (Aug 28)

Content:

  • Probability review
  • Transformer implementation
  • Generation and evaluation
  • Meta-generation
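
As a tiny numeric companion to the probability review: an autoregressive model factorizes sequence probability by the chain rule, p(y) = ∏_t p(y_t | y_<t), so both generation and evaluation work in log space. The per-step probabilities below are made up purely for illustration:

```python
import math

# Hypothetical per-step conditional probabilities p(y_t | y_<t) for one sequence.
step_probs = [0.60, 0.35, 0.80, 0.90]

# Chain rule: log p(y) = sum_t log p(y_t | y_<t)
log_prob = sum(math.log(p) for p in step_probs)
prob = math.exp(log_prob)

# Length-normalized score (average log-prob per token), commonly used when
# comparing hypotheses of different lengths during decoding.
avg_log_prob = log_prob / len(step_probs)

print(f"log p(y) = {log_prob:.4f}, p(y) = {prob:.4f}, per-token = {avg_log_prob:.4f}")
```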

Code: Code

Reading Material

None

Common Sampling Methods for Modern NLP (Sep 2)

Beam Search and Variants (Sep 4)

Intro to A* and Best First Search (Sep 9)

Content:

  • Introduction to A* and best first search
  • A* methods for controlled generation
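
To make best-first search concrete, here is a minimal, self-contained sketch: partial hypotheses live in a priority queue ordered by cost-so-far plus a heuristic (A* with a zero heuristic here, i.e. uniform-cost search). The toy LM table and the zero heuristic are illustrative assumptions, not the algorithms from the readings:

```python
import heapq, math

# Toy next-token distributions; "</s>" terminates a hypothesis.
LM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def heuristic(token):
    # Placeholder admissible heuristic: assume zero future cost.
    return 0.0

def best_first_search(max_expansions=100):
    # Heap entries: (priority, cost so far as -log prob, partial sequence)
    frontier = [(0.0, 0.0, ["<s>"])]
    while frontier and max_expansions > 0:
        _, cost, seq = heapq.heappop(frontier)
        max_expansions -= 1
        if seq[-1] == "</s>":
            return seq, math.exp(-cost)           # first completed hypothesis
        for tok, p in LM[seq[-1]].items():
            new_cost = cost - math.log(p)         # cost = -log p(sequence)
            priority = new_cost + heuristic(tok)  # A*: g + h
            heapq.heappush(frontier, (priority, new_cost, seq + [tok]))
    return None, 0.0

seq, prob = best_first_search()
print(seq, prob)  # ['<s>', 'a', 'cat', '</s>'] with probability 0.36
```
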
Slides: HTML / PDF

Reading Material

  • Reference: Efficient Lattice Rescoring Using Recurrent Neural Network Language Models (PDF)
  • Reference: Modeling Future Cost for Neural Machine Translation (arXiv)
  • Reference: Best-First Beam Search (arXiv)
  • Reference: NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics (arXiv)

Other Controlled Generation Methods (Sep 11)

Chain of Thought and Intermediate Steps (Sep 16)

Content:

  • Chain of thought / scratchpad, intermediate steps
  • Why does chain of thought work?
  • Self-consistency and variants
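
A minimal sketch of self-consistency: sample several chain-of-thought completions, keep only the final answers, and take a majority vote. The `sample_cot_answer` stub fakes the answer distribution so the script runs without a model; with a real LM it would be one temperature-sampled generation per call:

```python
import random
from collections import Counter

random.seed(0)

def sample_cot_answer(question: str) -> str:
    """Placeholder for one sampled chain-of-thought completion.

    With a real model this would sample a reasoning trace at temperature > 0
    and parse out the final answer; here we fake the answer distribution.
    """
    return random.choice(["18", "18", "18", "20", "17"])

def self_consistency(question: str, n_samples: int = 20) -> str:
    answers = [sample_cot_answer(question) for _ in range(n_samples)]
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    print(f"votes: {dict(counts)}")
    return answer  # marginalize over reasoning paths by majority vote

print(self_consistency("If I have 3 boxes of 6 eggs, how many eggs?"))
```
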
Slides: HTML / PDF

Reading Material

Core Papers:

  • Reference: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (arXiv)
  • Reference: Large Language Models are Zero-Shot Reasoners (arXiv)
  • Reference: Self-Consistency Improves Chain of Thought Reasoning in Language Models (arXiv)

Additional Research:

  • Reference: Adaptive Computation Time for Recurrent Neural Networks (arXiv)
  • Reference: PonderNet: Learning to Ponder (arXiv)
  • Reference: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models (arXiv)
  • Reference: Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws (arXiv)
  • Reference: Adaptive-Consistency: A Cost-Efficient, Model-Agnostic Technique (arXiv)
  • Reference: To CoT or not to CoT? Chain-of-thought helps mainly on math and logic (arXiv)
  • Reference: Language Models Don’t Always Say What They Think (arXiv)
  • Reference: Complexity-Based Prompting for Multi-step Reasoning (arXiv)
  • Reference: Multimodal Chain-of-Thought Reasoning in Language Models (arXiv)

Paper Presentations

  • Paper: Least-to-Most Prompting Enables Complex Reasoning in Large Language Models (arXiv)
  • Paper: Chain-of-Thought Reasoning Without Prompting (arXiv)

Self-Refine and Self-Correction Methods (Sep 18)

Content:

  • Self-refine and iterative refinement with self-feedback
  • Learning to self debug for code generation
  • Reflexion: verbal reinforcement learning for agents
  • Limitations and challenges of self-correction
  • Tool-interactive critiquing and external feedback
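
The self-refine recipe is essentially a generate → critique → refine loop that stops once the feedback says the draft is acceptable. The three functions below are trivial string-based stand-ins so the sketch runs on its own; in practice each would be an LM call, and CRITIC-style variants would replace the critic with tool feedback (e.g. running unit tests):

```python
def generate(task: str) -> str:
    # Stand-in for the initial LM draft.
    return "def add(a, b): return a - b"

def critique(task: str, draft: str) -> str:
    # Stand-in for self-feedback (or external tool feedback, e.g. test results).
    if "a - b" in draft:
        return "Bug: the function subtracts instead of adding."
    return "OK"

def refine(task: str, draft: str, feedback: str) -> str:
    # Stand-in for the LM revising its own output given the feedback.
    return draft.replace("a - b", "a + b")

def self_refine(task: str, max_rounds: int = 3) -> str:
    draft = generate(task)
    for _ in range(max_rounds):
        feedback = critique(task, draft)
        if feedback == "OK":          # stopping condition
            break
        draft = refine(task, draft, feedback)
    return draft

print(self_refine("Write add(a, b)."))  # -> "def add(a, b): return a + b"
```
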
Slides: HTML / PDF

Code: TBA

Reading Material

  • Primary: Self-Refine: Iterative Refinement with Self-Feedback (arXiv)
  • Primary: Learning to Self Debug (arXiv)
  • Reference: Reflexion: Language Agents with Verbal Reinforcement Learning (arXiv)
  • Reference: Large Language Models Cannot Self-Correct Reasoning Yet (arXiv)
  • Reference: CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing (arXiv)
  • Reference: SCoRe: Self-Correction via Reinforcement Learning (arXiv)

Student Paper Presentations

  • Student Presentation 1: Improving Reasoning in Language Models via Self-Correction (arXiv)
  • Student Presentation 2: Self-Correction in Language Models via Multi-Round Consistency Sampling (arXiv)

Reasoning Models (Sep 23)

Content:

  • What is a reasoning model?
  • Training reasoning models with reinforcement learning
  • STaR: Self-Taught Reasoner
  • DeepSeek R1 and GRPO
  • Understanding long chain-of-thought reasoning
  • Reasoning transfer across domains
  • Advanced reasoning algorithms (S1, L1, Stream of Search, LAPS)
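
One concrete piece of the GRPO discussion: rather than a learned value baseline, each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below computes these group-relative advantages for made-up rewards; the full algorithm would then plug them into a clipped policy-gradient update:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each sample = (reward - group mean) / group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for 6 responses sampled for the same math problem
# (e.g. 1.0 if the final answer is correct, 0.0 otherwise).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.2f}")
# Correct responses get positive advantage, incorrect ones negative,
# with no learned critic needed.
```
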
Slides: HTML / PDF

Reading Material

  • Reference: STaR: Bootstrapping Reasoning With Reasoning (arXiv)
  • Reference: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (arXiv)
  • Reference: Demystifying Long Chain-of-Thought Reasoning: An Empirical Study (arXiv)
  • Reference: SimpleRL-Zoo: Evaluating Reinforcement Learning on Simple Reasoning Tasks (arXiv)
  • Reference: Does learning math help language models reason better? (arXiv)
  • Reference: S1: Simple Scaling Laws for Reasoning (arXiv)
  • Reference: L1: Scaling Test-Time Compute with Simple Sampling (arXiv)
  • Reference: Stream of Search (SoS): Learning to Search in Language (arXiv)
  • Reference: Learning Adaptive Parallel Search for Reasoning (arXiv)

Incorporating Tools (Sep 25)

Content:

  • What are tools? Definition and taxonomy
  • Basic tool use paradigm
  • Key approaches: PAL, Toolformer, Gorilla, WebGPT
  • Tool creation: TroVE and Large Language Models as Tool Makers
  • Tool robustness: Benchmarking failures in tool-augmented language models
  • Standardized function calling (JSON Schema)
  • Parallel function calling
  • Model Context Protocol (MCP) and MCP registries
  • FastMCP framework for rapid MCP development
  • Sandboxed code execution for secure tool use
  • Tool use scenarios and trade-offs
  • Evaluation challenges and best practices
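
To ground the function-calling bullets, here is a minimal sketch of the usual pattern: tools are described with JSON-Schema-style parameter specs, the model emits a JSON call, and the runtime validates and dispatches it. The schema layout, the `get_weather` tool, and the hard-coded "model output" are illustrative and not tied to any particular provider's API or to MCP:

```python
import json

# Tool registry: JSON-Schema-style description plus the actual implementation.
def get_weather(city: str, unit: str = "celsius") -> str:
    return f"22 degrees {unit} in {city}"   # stub; a real tool would call an API

TOOLS = {
    "get_weather": {
        "fn": get_weather,
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }
}

def dispatch(model_output: str) -> str:
    """Parse the model's JSON tool call, check required args, and execute it."""
    call = json.loads(model_output)
    spec = TOOLS[call["name"]]
    args = call["arguments"]
    missing = [k for k in spec["parameters"]["required"] if k not in args]
    if missing:
        return f"error: missing arguments {missing}"   # fed back to the model
    return spec["fn"](**args)

# Pretend the LM emitted this tool call.
model_output = '{"name": "get_weather", "arguments": {"city": "Pittsburgh"}}'
print(dispatch(model_output))
```
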
Slides: HTML / PDF

Reading Material

Agents and Multi-Agent Communication (Sep 30)

Content:

  • Agents and multi-agent communication

Slides: TBA

Code: TBA

Reading Material

  • Reference: DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines (arXiv)

  • Reference: A Survey on LLM-based Multi-Agent Systems: Workflow, Infrastructure, and Applications (Springer)

Reward Models and Best-of-N (Oct 2)

Content:

  • Reward models, best-of-n theory and practice
  • Monte Carlo Tree Search
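
Best-of-n is the simplest reward-model-guided method: draw n candidates, score each with the reward model, and return the argmax. Both the candidate sampler and the reward model below are toy stand-ins so the sketch runs on its own:

```python
import random

random.seed(0)

CANDIDATE_POOL = [
    "The capital of Australia is Sydney.",
    "The capital of Australia is Canberra.",
    "Australia's capital city is Canberra.",
]

def sample_candidate(prompt: str) -> str:
    # Stand-in for one sampled generation from the policy model.
    return random.choice(CANDIDATE_POOL)

def reward_model(prompt: str, response: str) -> float:
    # Stand-in for a learned scalar reward model.
    return 1.0 if "Canberra" in response else 0.0

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = [sample_candidate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(prompt, c))

print(best_of_n("What is the capital of Australia?"))
```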

Slides: TBA

Code: TBA

Reading Material

  • Reference: Why reward models are key for alignment (by Nathan Lambert)

  • Reference: Theoretical guarantees on the best-of-n alignment policy (arXiv)

Minimum Bayes Risk and Multi-Sample Strategies (Oct 7)

Content:

  • What do we get when we sample more?
  • Minimum Bayes Risk and similar methods
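
A minimal sketch of the MBR decision rule: sample a set of hypotheses, treat the same samples as a Monte Carlo stand-in for the reference distribution, and pick the hypothesis with the highest expected utility against the rest. The word-overlap F1 utility and the sample list are deliberately crude, illustrative choices (real systems use BLEU, chrF, COMET, and similar metrics):

```python
def utility(hyp: str, ref: str) -> float:
    """Crude word-overlap F1 as a stand-in for a real utility metric."""
    h, r = set(hyp.split()), set(ref.split())
    overlap = len(h & r)
    if overlap == 0:
        return 0.0
    p, rec = overlap / len(h), overlap / len(r)
    return 2 * p * rec / (p + rec)

def mbr_decode(samples):
    """Pick the sample with the highest average utility against the others."""
    best, best_score = None, float("-inf")
    for i, hyp in enumerate(samples):
        others = [ref for j, ref in enumerate(samples) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = hyp, score
    return best, best_score

# Hypothetical samples from the model for one source sentence.
samples = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "the cat is sitting on a mat",
    "dogs chase cats",
]
print(mbr_decode(samples))
```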

Slides: TBA

Code: TBA

Reading Material

  • Reference: It’s MBR All the Way Down: Modern Generation Techniques Through the Lens of Minimum Bayes Risk (arXiv)

  • Reference: Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation (arXiv)

Assignments

  • Homework 3 out: build an LLM system with a code interpreter and a small reward model, visualize the system, and benchmark a set of variants of this method on the shared tasks

Systems not Models (Oct 9)

Content:

  • Parallels to older “pipeline NLP”
  • Ensembling
  • Visualizing and evaluating systems
  • Human-in-the-loop decoding
  • Brief discussion of HCI perspectives

Slides: TBA

Code: TBA

Reading Material

No Class - Fall Break (Oct 14)

No Class - Fall Break

No Class - Fall Break (Oct 16)

No Class - Fall Break

Inference Scaling vs Model Size (Oct 21)

Content:

  • Inference scaling versus scaling model size
  • Differences in cost and latency considerations
  • Modeling scaling behavior

Slides: TBA

Code: TBA

Reading Material

  • Reference: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (arXiv)

Token Budgets and Training-Time Distillation (Oct 23)

Content:

  • Token budgets
  • Training-time distillation of inference algorithms
  • Draft CoT
  • Early exit voting

Slides: TBA

Code: TBA

Reading Material

  • Reference: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (arXiv)

  • Reference: Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models (arXiv)

  • Reference: Direct Preference Optimization for Neural Machine Translation with Minimum Bayes Risk Decoding (arXiv)

  • Reference: MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods (arXiv)

Diffusion Models (Oct 28)

Content:

  • Introduction to diffusion models
  • Denoising diffusion probabilistic models (DDPM)
  • Score-based generative models
  • Diffusion models for text generation
  • Comparison with autoregressive models
  • Inference techniques for diffusion models
  • Applications in multimodal generation
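
A small numeric sketch of the DDPM forward process: with a variance schedule β_t and ᾱ_t = ∏_s(1−β_s), a noisy sample can be drawn in closed form as x_t = √ᾱ_t · x_0 + √(1−ᾱ_t) · ε. The linear schedule and toy data below are illustrative values only:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear variance schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form (DDPM forward process)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal(8)             # toy "clean" data point
for t in [0, 250, 999]:
    xt = q_sample(x0, t)
    print(f"t={t:4d}  signal weight={np.sqrt(alpha_bar[t]):.3f}  "
          f"|x_t|={np.linalg.norm(xt):.2f}")
# As t grows, alpha_bar -> 0 and x_t approaches pure Gaussian noise;
# the reverse (denoising) model is trained to invert this process.
```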

Slides: TBA

Code: TBA

Reading Material

  • Reference: Denoising Diffusion Probabilistic Models (Ho et al., 2020) (arXiv)
  • Reference: Diffusion Models Beat GANs on Image Synthesis (Dhariwal & Nichol, 2021) (arXiv)

  • Optional: Diffusion Models: A Comprehensive Survey of Methods and Applications (Yang et al., 2022) (arXiv)

Defining Efficiency (Oct 30)

Content:

  • How do we define efficiency?
  • Different places where a method can be efficient (e.g. memory, latency, token cost for APIs)
  • Brief review of hardware for inference
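
As one worked example of "where does the memory go", the sketch below estimates KV-cache size from model shape: 2 tensors (K and V) × layers × KV heads × head dim × sequence length × batch × bytes per element. The 7B-class configuration is a rough assumed shape, not the spec of any particular model:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Memory for the key/value cache: 2 tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Rough 7B-class shape (illustrative): 32 layers, 32 KV heads, head_dim 128, fp16.
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2)

for seq_len, batch in [(2048, 1), (8192, 1), (8192, 16)]:
    gib = kv_cache_bytes(seq_len=seq_len, batch=batch, **cfg) / 2**30
    print(f"seq_len={seq_len:5d}  batch={batch:2d}  KV cache ≈ {gib:6.2f} GiB")
# Latency, throughput, and API token cost are separate axes of "efficiency";
# this accounts for only one memory term.
```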

Slides: TBA

Code: TBA

Reading Material

No Class - Democracy Day (Nov 4)

No Class - Democracy Day

Inference and Hardware (Nov 6)

Content:

  • Overview of hardware relevant to LLM inference (GPUs, TPUs, accelerators)
  • Memory bandwidth, compute, and latency considerations
  • Parallelism strategies and deployment tradeoffs
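
A back-of-the-envelope companion for the bandwidth discussion: at batch size 1, each decoded token has to stream roughly all model weights from memory, so tokens/second is bounded above by bandwidth ÷ model bytes. The hardware and model numbers below are round, assumed figures, not measurements of any specific accelerator:

```python
def decode_tokens_per_sec(params_billions, bytes_per_param, bandwidth_gb_s):
    """Bandwidth-bound upper estimate for batch-1 autoregressive decoding."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# Illustrative accelerator with ~2 TB/s of memory bandwidth.
bandwidth = 2000  # GB/s

for params, bytes_per_param, label in [(7, 2, "7B fp16"), (7, 1, "7B int8"),
                                       (70, 2, "70B fp16")]:
    tps = decode_tokens_per_sec(params, bytes_per_param, bandwidth)
    print(f"{label:9s} -> at most ~{tps:6.0f} tokens/s (bandwidth bound)")
# Batching amortizes the weight reads across requests, which is why serving
# systems push hard on large batches despite the per-request latency cost.
```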

Slides: TBA

Code: TBA

Reading Material

Library Implementations and Optimizations (Nov 11)

Content:

  • Library implementations
  • Lazy softmax
  • Flash attention
  • How do vLLM, SGLang, and similar systems speed up generation?
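
The online ("lazy") softmax trick underlying FlashAttention-style kernels fits in a few lines: process scores chunk by chunk while carrying a running max and a running normalizer, so the full score vector never needs to be materialized at once. The NumPy sketch checks the streaming result against the naive softmax; the chunk size and data are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def online_softmax(scores, chunk=4):
    """Numerically stable softmax computed over chunks of the input."""
    m = -np.inf          # running max
    s = 0.0              # running sum of exp(score - m)
    chunks = [scores[i:i + chunk] for i in range(0, len(scores), chunk)]
    for c in chunks:
        m_new = max(m, float(c.max()))
        s = s * np.exp(m - m_new) + np.exp(c - m_new).sum()  # rescale old sum
        m = m_new
    # Second pass over the chunks to emit the normalized probabilities.
    return np.concatenate([np.exp(c - m) / s for c in chunks])

x = rng.standard_normal(10)
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), reference)
print("online softmax matches the naive computation")
# FlashAttention fuses this running (max, sum) bookkeeping with the
# attention-weighted value accumulation so attention needs only O(n) extra memory.
```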

Slides: TBA

Code: TBA

Reading Material

  • Reference: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (arXiv)

  • Reference: Self-Attention Does Not Need O(n²) Memory (arXiv)

Prefix Sharing and KV Cache Optimizations (Nov 13)

Content:

  • Prefix sharing
  • KV cache reuse
  • Key-value cache compression
  • Model compression
  • Brief quantization overview
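
For the quantization bullet, the simplest possible sketch: symmetric per-tensor int8 quantization, storing 8-bit integers plus a single floating-point scale and dequantizing on the fly. Real systems use per-channel or per-group scales and more careful rounding; this only illustrates the memory/accuracy trade:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: x ≈ scale * q."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = rng.standard_normal((256, 256)).astype(np.float32)   # toy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(f"fp32 size: {w.nbytes / 1024:.0f} KiB, int8 size: {q.nbytes / 1024:.0f} KiB")
print(f"max abs error: {np.abs(w - w_hat).max():.4f}")
# The same idea applies to cached keys/values: storing the KV cache in 8 bits
# (with per-head or per-group scales) roughly halves its memory versus fp16.
```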

Slides: TBA

Code: TBA

Reading Material

  • Reference: Keep the Cost Down: A Review on Methods to Optimize LLM’s KV-Cache Consumption (arXiv)

  • Reference: Model Compression and Efficient Inference for Large Language Models: A Survey (arXiv)

Draft Models and Speculative Decoding (Nov 18)

Content:

  • Draft models
  • Speculative decoding
  • Other latency-improving methods
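
One accept/reject step of speculative decoding on toy distributions: the draft model proposes a token from q, the target accepts it with probability min(1, p/q), and on rejection resamples from the renormalized residual max(p − q, 0). This scheme preserves the target distribution exactly; both distributions below are made up:

```python
import random

random.seed(0)
VOCAB = ["a", "b", "c"]
p = {"a": 0.6, "b": 0.3, "c": 0.1}   # target model's next-token distribution
q = {"a": 0.3, "b": 0.5, "c": 0.2}   # draft model's next-token distribution

def sample(dist):
    r, acc = random.random(), 0.0
    for tok in VOCAB:
        acc += dist[tok]
        if r <= acc:
            return tok
    return VOCAB[-1]

def speculative_step():
    """One draft-then-verify step; returns the token actually emitted."""
    draft_tok = sample(q)
    if random.random() < min(1.0, p[draft_tok] / q[draft_tok]):
        return draft_tok                      # accepted: cheap token from the draft
    residual = {t: max(p[t] - q[t], 0.0) for t in VOCAB}
    z = sum(residual.values())
    residual = {t: v / z for t, v in residual.items()}
    return sample(residual)                   # rejected: resample from residual

counts = {t: 0 for t in VOCAB}
for _ in range(100_000):
    counts[speculative_step()] += 1
print({t: round(c / 100_000, 3) for t, c in counts.items()})  # ≈ p, the target dist
```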

Slides: TBA

Code: TBA

Reading Material

Linearizing Attention and Sparse Models (Nov 20)

Content:

  • Linearizing attention
  • Sparse models
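
A small NumPy sketch of the linearization idea: replace softmax(QKᵀ)V, which is O(n²) in sequence length, with φ(Q)(φ(K)ᵀV), which can be accumulated in O(n). The elu(x)+1 feature map is one choice from the linear-attention literature; the point is the reordering of the matrix products, not this particular φ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                                   # sequence length, head dimension
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

def softmax_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(d)             # (n, n): the quadratic bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    phi = lambda x: np.where(x > 0, x + 1, np.exp(x))   # elu(x) + 1 feature map
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                              # (d, d): accumulated once, O(n d^2)
    z = Kf.sum(axis=0)                         # normalizer term
    return (Qf @ kv) / (Qf @ z)[:, None]

print("softmax attention:\n", softmax_attention(Q, K, V).round(3))
print("linear attention (different weighting, linear cost):\n",
      linear_attention(Q, K, V).round(3))
# The two are not numerically equal; linear attention trades the exact softmax
# kernel for one that factorizes, which is what removes the O(n^2) term.
```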

Slides: TBA

Code: TBA

Reading Material

  • TBA

Transformer Alternatives (Nov 25)

Content:

  • Transformer alternatives

Slides: TBA

Code: TBA

Reading Material

  • Reference: The Annotated S4

No Class - Thanksgiving (Nov 27)

No Class - Thanksgiving

Shared Task Results and Poster Sessions (Dec 2)

Content:

  • Shared task results
  • Poster sessions

Slides: N/A

Code: N/A
