Schedule

Date	Topic	Lecturer
26-Aug	Introduction to Language Models and Inference	Graham
28-Aug	Probability Review and Shared Task Introduction	Graham
02-Sep	Common Sampling Methods for Modern NLP	Amanda
04-Sep	Beam Search and Variants (HW1 Released)	Amanda
09-Sep	Intro to A* and Best First Search	Graham
11-Sep	Other Controlled Generation Methods	Amanda
16-Sep	Chain of Thought and Intermediate Steps	Graham
18-Sep	Self-Refine and Self-Correction Methods	Graham
23-Sep	Reasoning Models	Graham
25-Sep	Incorporating Tools (HW1 Due)	Graham
30-Sep	Agents and Multi-Agent Communication (HW2 Released)	Graham
02-Oct	Reward Models and Best-of-N	Amanda
07-Oct	Systems not Models	Omar Khattab (Guest)
09-Oct	Minimum Bayes Risk and Multi-Sample Strategies	Amanda
14-Oct	No Class - Fall Break
16-Oct	No Class - Fall Break
21-Oct	Inference Scaling vs Model Size	Amanda
23-Oct	Token Budgets and Training-Time Distillation	Amanda
28-Oct	Diffusion Models (HW2 Due, HW3 Released)	Graham
30-Oct	Defining Efficiency	Graham
04-Nov	No Class - Democracy Day
06-Nov	Inference and Hardware	Clara
11-Nov	Prefix Sharing and KV Cache Optimizations	Amanda
13-Nov	Draft Models and Speculative Decoding	Beidi Chen (Guest)
18-Nov	Linearizing Attention and Sparse Models	Amanda
20-Nov	Building MLC-LLM, a Universal LLM Deployment Engine	Tianqi Chen (Guest)
25-Nov	Library Implementation and Optimizations (HW3 Due)	Graham
27-Nov	No Class - Thanksgiving
01-Dec	Shared Task Poster Session	All
04-Dec	No Class (Shared Task Final Submission Due)

Introduction to Language Models and Inference (Aug 26)

Content:

What is a language model?
What is an inference algorithm?
What will we not cover?
What are transformers?
How do modern LMs work?
Modeling errors and search errors
Prompting as a means of model control
Instruction following behavior

Slides: HTML, PDF

Code: Code

Reading Material

Reference: Sections 1+2 from From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models (arXiv)

Probability Review (Aug 28)

Content:

Probability review
Transformer implementation
Generation and evaluation
Meta-generation

Code: Code

Reading Material

None

Common Sampling Methods for Modern NLP (Sep 2)

Content:

Common sampling methods for modern NLP
Diversity-quality tradeoffs
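
As an illustration of one common sampling method in this family, the following is a minimal sketch of temperature scaling followed by top-p (nucleus) truncation. It is not taken from the lecture materials: the function names, the toy probability vector, and the default threshold are assumptions chosen for the example. Lowering p or the temperature shifts the tradeoff toward quality, and raising them shifts it toward diversity.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most-probable tokens whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample(probs, rng):
    """Draw one token index from a probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy next-token distribution over a 4-token vocabulary (hypothetical values).
toy_probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(toy_probs, p=0.7)  # only the two most probable tokens survive
token = sample(nucleus, random.Random(0))
```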

Slides: Google slides

Code: n/a

Reading Material

Reference: A Thorough Examination of Decoding Methods in the Era of LLMs
Reference: Trading Off Diversity and Quality in Natural Language Generation
Optional: Calibration of Pre-trained Transformers
Optional: Locally Typical Sampling
Optional: Forking Paths in Neural Text Generation
Optional: Truncation Sampling as Language Model Desmoothing
Optional: Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity
Optional: The Curious Case of Neural Text Degeneration
Optional: Calibrated Language Models Must Hallucinate
Optional: An Empirical Investigation of Global and Local Normalization for Recurrent Neural Sequence Models Using a Continuous Relaxation to Beam Search

Beam Search and Variants (Sep 4)

Content:

Beam search and variants
Inadequacies of the mode

Slides: Google slides

Code: TBA

Reading Material

Intro to A* and Best First Search (Sep 9)

Content:

Introduction to A* and best first search
A* methods for controlled generation

Slides: HTML, PDF

Reading Material

Reference: Efficient Lattice Rescoring Using Recurrent Neural Network Language Models (PDF)
Reference: Modeling Future Cost for Neural Machine Translation (arXiv)
Reference: Best-First Beam Search (arXiv)
Reference: NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics (arXiv)

Assignments

Other Controlled Generation Methods (Sep 11)

Content:

Other controlled generation methods
Decoding-time distributional modifiers

Slides: Google Slides

Code: TBA

Reading Material

Reference: Llama.cpp README on formal-grammar-based constraints
Reference: FUDGE: Controlled Text Generation With Future Discriminators
Reference: Controlled Decoding from Language Models
Optional: Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
Optional: Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning

Chain of Thought and Intermediate Steps (Sep 16)

Content:

Chain of thought / scratchpad, intermediate steps
Why does chain of thought work?
Self-consistency and variants
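
As a rough illustration of self-consistency, the sketch below takes the final answers extracted from several independently sampled chains of thought and returns the most frequent one. It is an assumption-laden toy, not the lecture's code: the function name is hypothetical, answer extraction is assumed to have happened already, and tie-breaking is left unspecified.

```python
from collections import Counter


def self_consistency(answers):
    """Majority vote over final answers; None marks a sample with no parseable answer."""
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical final answers from five sampled reasoning chains.
sampled_answers = ["18", "18", "20", None, "18"]
final_answer = self_consistency(sampled_answers)
```

Voting over final answers rather than whole reasoning chains is what lets differently worded chains agree with one another.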

Slides: HTML, PDF

Reading Material

Core Papers:

Reference: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (arXiv)
Reference: Large Language Models are Zero-Shot Reasoners (arXiv)
Reference: Self-Consistency Improves Chain of Thought Reasoning in Language Models (arXiv)

Additional Research:

Reference: Adaptive Computation Time for Recurrent Neural Networks (arXiv)
Reference: PonderNet: Learning to Ponder (arXiv)
Reference: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models (arXiv)
Reference: Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws (arXiv)
Reference: Adaptive-Consistency: A Cost-Efficient, Model-Agnostic Technique (arXiv)
Reference: To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning (arXiv)
Reference: Language Models Don’t Always Say What They Think (arXiv)
Reference: Complexity-Based Prompting for Multi-step Reasoning (arXiv)
Reference: Multimodal Chain-of-Thought Reasoning in Language Models (arXiv)

Paper Presentations

Paper: Least-to-Most Prompting Enables Complex Reasoning in Large Language Models (arXiv)
Paper: Chain-of-Thought Reasoning Without Prompting (arXiv)

Self-Refine and Self-Correction Methods (Sep 18)

Content:

Self-refine and iterative refinement with self-feedback
Learning to self debug for code generation
Reflexion: verbal reinforcement learning for agents
Limitations and challenges of self-correction
Tool-interactive critiquing and external feedback

Slides: HTML, PDF

Code: TBA

Reading Material

Primary: Self-Refine: Iterative Refinement with Self-Feedback (arXiv)
Primary: Learning to Self Debug (arXiv)
Reference: Reflexion: Language Agents with Verbal Reinforcement Learning (arXiv)
Reference: Large Language Models Cannot Self-Correct Reasoning Yet (arXiv)
Reference: CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing (arXiv)
Reference: SCoRe: Self-Correction via Reinforcement Learning (arXiv)

Student Paper Presentations

Student Presentation 1: Improving Reasoning in Language Models via Self-Correction (arXiv)
Student Presentation 2: Self-Correction in Language Models via Multi-Round Consistency Sampling (arXiv)

Reasoning Models (Sep 23)

Content:

What is a reasoning model?
Training reasoning models with reinforcement learning
STaR: Self-Taught Reasoner
DeepSeek R1 and GRPO
Understanding long chain-of-thought reasoning
Reasoning transfer across domains
Advanced reasoning algorithms (S1, L1, Stream of Search, LAPS)

Slides: HTML, PDF

Reading Material

Reference: STaR: Bootstrapping Reasoning With Reasoning (arXiv)
Reference: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (arXiv)
Reference: Demystifying Long Chain-of-Thought Reasoning: An Empirical Study (arXiv)
Reference: SimpleRL-Zoo: Evaluating Reinforcement Learning on Simple Reasoning Tasks (arXiv)
Reference: Does learning math help language models reason better? (arXiv)
Reference: s1: Simple test-time scaling (arXiv)
Reference: L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning (arXiv)
Reference: Stream of Search (SoS): Learning to Search in Language (arXiv)
Reference: Learning Adaptive Parallel Search for Reasoning (arXiv)

Incorporating Tools (Sep 25)

Content:

What are tools? Definition and taxonomy
Basic tool use paradigm
Key approaches: PAL, Toolformer, Gorilla, WebGPT
Tool creation: TroVE and Large Language Models as Tool Makers
Tool robustness: Benchmarking failures in tool-augmented language models
Standardized function calling (JSON Schema)
Parallel function calling
Model Context Protocol (MCP) and MCP registries
FastMCP framework for rapid MCP development
Sandboxed code execution for secure tool use
Tool use scenarios and trade-offs
Evaluation challenges and best practices

Slides: HTML, PDF

Reading Material

Main Survey:

Wang et al., “What Are Tools Anyway? A Survey from the Language Model Perspective” (2024)

Key Papers:

Gao et al., “PAL: Program-aided Language Models” (2022)
Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools” (2023)
Patil et al., “Gorilla: Large Language Model Connected with Massive APIs” (2023)
Nakano et al., “WebGPT: Browser-assisted question-answering with human feedback” (2021)
Cai et al., “Large Language Models as Tool Makers” (2023)
Wang et al., “TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks” (2024)
Treviño et al., “Benchmarking Failures in Tool-Augmented Language Models” (2025)

Practical Resources:

Agents and Multi-Agent Communication (Sep 30)

Content:

Basic agent concepts and definitions
Agent architectures and environments
Efficiency optimizations (context management, caching)
Safety challenges and solutions
Multi-agent systems

Slides: HTML, PDF

Code: TBA

Reading Material

Basic Concepts and Foundations

Reference: ReAct: Synergizing Reasoning and Acting in Language Models (arXiv)
Reference: Executable Code Actions Elicit Better LLM Agents (arXiv)

Agent Architectures and Environments

Reference: WebArena: A Realistic Web Environment for Building Autonomous Agents (arXiv)
Reference: VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (arXiv)

Efficiency Optimizations

Reference: OpenHands Context Condensation for More Efficient AI Agents (All Hands AI)
Reference: Anthropic Prompt Caching (Anthropic)
Reference: Effectively use prompt caching on Amazon Bedrock (AWS)

Evaluation and Benchmarks

Reference: SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (arXiv)
Reference: GAIA: a benchmark for General AI Assistants (arXiv)
Reference: Training Software Engineering Agents and Verifiers with SWE-Gym (arXiv)

Multi-agent Systems

Reference: AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (arXiv)

Reward Models and Best-of-N (Oct 2)

Content:

Reward models, best-of-n theory and practice
Monte Carlo Tree Search

Slides: Google slides

Code: n/a

Reading Material

Reference: Why reward models are key for alignment (blog)
Reference: Scaling Laws for Reward Model Overoptimization (arXiv)
Reference: Theoretical guarantees on the best-of-n alignment policy (arXiv)
Reference: RewardBench v2: Advancing Reward Model Evaluation (arXiv)

Assignments

Systems not Models (Oct 7)

Content:

Parallels to older “pipeline NLP”
Visualizing and evaluating systems
DSPy and system-level design

Slides: PDF

Reading Material (all optional)

NLP multi-step pipelines and agents:

Modular Approach to Error Analysis and Evaluation for Multilingual Question Answering (LREC’06)[https://aclanthology.org/L06-1489/]
Multi-hop Reading Comprehension through Question Decomposition and Rescoring (ACL’19)[https://aclanthology.org/P19-1613/]
Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval (NeurIPS’21)[https://proceedings.neurips.cc/paper/2021/hash/e8b1cbd05f6e6a358a81dee52493dd06-Abstract.html]
STORM: Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models (NAACL’24)[https://aclanthology.org/2024.naacl-long.347/]
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning (NeurIPS’25)[https://arxiv.org/abs/2503.19470]
Self-Steering Language Models (COLM’25)[https://openreview.net/forum?id=XvCBtm5PgF]

Abstractions & Learning:

Structured Programming with go to Statements (1974)[https://dl.acm.org/doi/10.1145/356635.356640]
Neural Module Networks (CVPR’16)[https://openaccess.thecvf.com/content_cvpr_2016/html/Andreas_Neural_Module_Networks_CVPR_2016_paper.html]
The Bitter Lesson (2019)[http://www.incompleteideas.net/IncIdeas/BitterLesson.html]
Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP (2022)[https://arxiv.org/abs/2212.14024]
Prompting Is Programming: A Query Language for Large Language Models (PLDI’23)[https://dl.acm.org/doi/abs/10.1145/3591300]
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines (ICLR’24)[https://openreview.net/forum?id=sY5N0zY5Od]
LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data (2024)[https://arxiv.org/abs/2407.11418]
Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together (EMNLP’24)[https://aclanthology.org/2024.emnlp-main.597/]
TextGrad: Automatic “Differentiation” via Text (Nature’25)[https://arxiv.org/abs/2406.07496]
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning (2025)[https://arxiv.org/abs/2507.19457]

Minimum Bayes Risk and Multi-Sample Strategies (Oct 9)

Content:

Minimum Bayes Risk
Efficient MBR variants
Post-ensemble
Self-consistency and variants
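
To make the idea of Minimum Bayes Risk decoding concrete, here is a minimal sketch of the sampling-based form: each candidate is scored by its average utility against the other candidates, which serve as pseudo-references, and the highest-scoring one is returned. The unigram F1 utility, the function names, and the toy candidates are assumptions for illustration only; in practice the utility would be a task metric, often a neural one.

```python
from collections import Counter


def unigram_f1(hyp, ref):
    """Toy utility: F1 over whitespace tokens (a stand-in for a real metric)."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    overlap = sum((Counter(h) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_decode(candidates, utility=unigram_f1):
    """Return the candidate with the highest average utility against all the others."""
    best, best_score = None, float("-inf")
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = hyp, score
    return best


# Hypothetical samples drawn from a model for the same input.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog barked loudly",
]
choice = mbr_decode(samples)
```

This exhaustive version makes a quadratic number of utility calls in the number of candidates, which is the cost that efficient MBR variants aim to reduce.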

Slides: Google slides

Code: TBA

Reading Material

Reference: It’s MBR All the Way Down: Modern Generation Techniques Through the Lens of Minimum Bayes Risk (arXiv)
Reference: Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation (arXiv)
Reference: High Quality Rather than High Model Probability: Minimum Bayes Risk Decoding with Neural Metrics (arXiv)
Reference: Faster Minimum Bayes Risk Decoding with Confidence-based Pruning (arXiv)
Reference: Sampling-Based Approximations to Minimum Bayes Risk Decoding for Neural Machine Translation (arXiv)
Reference: Efficient Minimum Bayes Risk Decoding using Low-Rank Matrix Completion Algorithms (arXiv)
Reference: Frustratingly Easy Model Ensemble for Abstractive Summarization (ACL Anthology)
Reference: Self-Consistency Improves Chain of Thought Reasoning in Language Models (arXiv)
Reference: Universal Self-Consistency for Large Language Model Generation (arXiv)

No Class - Fall Break (Oct 14)

No Class - Fall Break

No Class - Fall Break (Oct 16)

No Class - Fall Break

Inference Scaling vs Model Size (Oct 21)

Content:

Inference scaling versus scaling model size
Differences in cost and latency considerations
Modeling scaling behavior

Slides: Google Slides

Reading Material

Reference: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (arXiv)
Reference: Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models (arXiv)
Reference: Parallel Scaling Law for Language Models (arXiv)
Reference: HARP: Hesitation-Aware Reframing in Transformer Inference Pass (ACL Anthology)
Reference: AI Agents That Matter (arXiv)
Reference: A Theory of Inference Compute Scaling: Reasoning through Directed Stochastic Skill Search (arXiv)
Reference: Scaling Inference-Efficient Language Models (arXiv)
Reference: Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers (arXiv)

Token Budgets and Training-Time Distillation (Oct 23)

Content:

Token budgets
Distillation
Training-time distillation of inference algorithms
More token-efficient reasoning

Slides: Google slides

Code: TBA

Reading Material

Reference: Direct Preference Optimization for Neural Machine Translation with Minimum Bayes Risk Decoding (arXiv)
Reference: MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods (arXiv)
Optional: Better Instruction-Following Through Minimum Bayes Risk (arXiv)
Optional: Knowledge Distillation Using Frontier Open-source LLMs: Generalizability and the Role of Synthetic Data (arXiv)
Optional: A Survey on Knowledge Distillation of Large Language Models (arXiv)
Optional: Efficient Reasoning Models: A Survey (arXiv)
Optional: The Impact of Reasoning Step Length on Large Language Models (arXiv)
Optional: Chain of Draft: Thinking Faster by Writing Less (arXiv)

Diffusion Models (Oct 28)

Content:

Introduction to diffusion models
Denoising diffusion probabilistic models (DDPM)
Score-based generative models
Diffusion models for text generation
Comparison with autoregressive models
Inference techniques for diffusion models
Applications in multimodal generation

Slides: HTML, PDF

Code: TBA

Reading Material

Reference: Generative Modeling by Estimating Gradients of the Data Distribution
Reference: Denoising Diffusion Probabilistic Models (Ho et al., NeurIPS 2020) (arXiv)
Reference: Flow Matching for Generative Modeling (Lipman et al., ICLR 2023) (arXiv)
Reference: Discrete Flow Matching (Gat et al., arXiv 2024) (arXiv)
Reference: Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (Lou et al., arXiv 2023) (arXiv)
Reference: Diffusion-LM Improves Controllable Text Generation (Li et al., NeurIPS 2022) (arXiv)
Reference: DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models (Gong et al., ICLR 2023) (arXiv)
Reference: Simple and Effective Masked Diffusion Language Models (Sahoo et al., NeurIPS 2024) (arXiv)
Optional: Diffusion Models Beat GANs on Image Synthesis (Dhariwal & Nichol, 2021) (arXiv)
Optional: Diffusion Models: A Comprehensive Survey of Methods and Applications (Yang et al., 2022) (arXiv)

Defining Efficiency (Oct 30)

Content:

How do we define efficiency?
Different places where a method can be efficient (e.g. memory, latency, token cost for APIs)
Brief review of hardware for inference

Slides: HTML, PDF

Code: N/A

Reading Material

Systems

Yu et al. (2022): “Orca: A Distributed Serving System” - OSDI 2022
Kwon et al. (2023): “Efficient Memory Management for LLM Serving with PagedAttention” - SOSP 2023
Zhong et al. (2024): “DistServe: Disaggregating Prefill and Decoding” - OSDI 2024

Analysis

Kaplan et al. (2020): “Scaling Laws for Neural Language Models” - arXiv:2001.08361
Chowdhery et al. (2022): “PaLM: Scaling Language Modeling with Pathways” - arXiv:2204.02311
Williams et al. (2009): “Roofline: An Insightful Visual Performance Model” - Communications of the ACM
Ahia et al. (2023): “Do All Languages Cost the Same?” - arXiv:2305.13707
Casson, A. (2024): “Transformer FLOPs” - adamcasson.com

Optimization

Dao et al. (2022): “FlashAttention: Fast and Memory-Efficient Exact Attention” - NeurIPS 2022
Lin et al. (2023): “AWQ: Activation-aware Weight Quantization” - arXiv:2306.00978
Dettmers et al. (2022): “LLM.int8(): 8-bit Matrix Multiplication” - arXiv:2208.07339

Energy

Ren et al. (2024): “Carbon Footprint Analysis of LLMs” - Nature Scientific Reports
Mei et al. (2013): “GPU DVFS on Energy Conservation” - IGCC 2013
Patterson et al. (2021): “Carbon Emissions and Large Neural Network Training” - arXiv:2104.10350

Cost & Infrastructure

Yuan et al. (2024): “LLM Inference Unveiled: Survey and Roofline Model Insights” - arXiv:2410.12094
OpenAI (2025): “API Pricing”
Anthropic (2025): “Prompt Caching with Claude”
Together AI (2025): “Pricing”
Google AI (2025): “Gemini API Pricing”
AWS (2025): “Amazon EC2 Pricing”
Google Cloud (2025): “GPU Pricing”
Azure (2025): “Linux Virtual Machine Pricing”
Modal (2025): “Pricing” & “NVIDIA B200 Pricing”
SF Compute (2025): “GPU Pricing”
Prime Intellect (2025): “Prime Compute”
DataCrunch (Oct 2025): “Cloud GPU Pricing Comparison in 2025”
Fin AI (2024): “Cost of Serving LLMs”

No Class - Democracy Day (Nov 4)

No Class - Democracy Day

Inference and Hardware (Nov 6)

Content:

Overview of hardware relevant to LLM inference (GPUs, TPUs, accelerators)
Memory bandwidth, compute, and latency considerations
Parallelism strategies and deployment tradeoffs

Slides: TBA

Code: TBA

Reading Material

Prefix Sharing and KV Cache Optimizations (Nov 11)

Content:

Prefix sharing
KV cache reuse
Key-value cache compression
Model compression
Brief quantization overview

Slides: TBA

Code: TBA

Reading Material

Reference: Keep the Cost Down: A Review on Methods to Optimize LLM’s KV-Cache Consumption (arXiv)
Reference: Model Compression and Efficient Inference for Large Language Models: A Survey (arXiv)

Draft Models and Speculative Decoding (Nov 13)

Content:

Draft models
Speculative decoding
Other latency improving methods

Slides: TBA

Code: TBA

Reading Material

Linearizing Attention and Sparse Models (Nov 18)

Content:

Linearizing attention
Sparse models

Slides: TBA

Code: TBA

Reading Material

Assignments

Building MLC-LLM, a Universal LLM Deployment Engine (Nov 20)

Content:

Building MLC-LLM
Universal LLM deployment engine

Slides: TBA

Code: TBA

Reading Material

Assignments

Library Implementation and Optimizations (Nov 25)

Content:

Library implementations
Lazy softmax
Flash attention
How do vLLM/SGLang/similar speed up generation?

Slides: TBA

Code: TBA

Reading Material

Reference: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (arXiv)
Reference: Self-attention Does Not Need O(n^2) Memory (arXiv)

Assignments

No Class - Thanksgiving (Nov 27)

No Class - Thanksgiving

Shared Task Results and Poster Sessions (Dec 1)

Content:

Shared task results
Poster sessions

Slides: N/A

Code: N/A
