Measuring Efficiency in LLM Inference

Graham Neubig

Carnegie Mellon University Language Technologies Institute

Today’s Goal

Learn to measure LLM inference efficiency across key dimensions

  • Computational efficiency (FLOPs)
  • Time efficiency (latency, throughput)
  • Memory metrics
  • Energy consumption
  • Economic costs

Five Dimensions of Efficiency

  1. Computation
  2. Time
  3. Memory
  4. Energy
  5. Economics

No single model dominates all dimensions

Reference: Artificial Analysis Model Comparison

Part 1: Computational Efficiency

FLOPs (per Token)

A simple approximation (Kaplan et al., 2020):

\[\text{FLOPs} \approx 2 \times L \times (4d^2 + 8nd^2)\]

Where:

  • \(L\) = layers, \(d\) = model dimension
  • \(n\) = sequence length (number of tokens)

FLOPs Calculation Example

LLaMA-2 7B model (Touvron et al., 2023):

  • \(N = 7 \times 10^9\) parameters
  • \(L = 32\) layers
  • \(d = 4096\) (hidden dimension)
  • Sequence length \(n = 2048\)

Using the Kaplan et al. formula for the forward pass over the full sequence:

\[\text{FLOPs} \approx 2 \times 32 \times (4 \times 4096^2 + 8 \times 2048 \times 4096^2)\]
\[= 2 \times 32 \times (67M + 275B) \approx 17.6 \text{ TFLOPs}\]

Per token (forward only): \(17.6 \text{ TFLOPs} / 2048 \approx 8.6\) GFLOPs
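
As a sanity check, a minimal sketch reproducing the arithmetic above with the LLaMA-2 7B values plugged in:

def forward_flops(L: int, d: int, n: int) -> float:
    # FLOPs ≈ 2 * L * (4*d^2 + 8*n*d^2) for one forward pass over n tokens
    return 2 * L * (4 * d**2 + 8 * n * d**2)

total = forward_flops(L=32, d=4096, n=2048)
print(f"Total: {total / 1e12:.1f} TFLOPs")            # ≈ 17.6 TFLOPs
print(f"Per token: {total / 2048 / 1e9:.1f} GFLOPs")  # ≈ 8.6 GFLOPs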

Model FLOPs Utilization (MFU)

\[\text{MFU} = \frac{\text{Achieved FLOPs/s}}{\text{Peak Hardware FLOPs/s}}\]

Measured Examples:

| System | MFU | Source |
|--------|-----|--------|
| GPT-3 | 21% | Chowdhery et al. (2022) |
| PaLM 540B | 46% | Chowdhery et al. (2022) |
| LLaMA 7B (naive) | 19% | Estimated from A100 specs |
| LLaMA 7B (optimized) | 45% | Dao et al. (2022) |

When we care:

  • Measuring cost efficiency (esp. at training time)
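
A minimal sketch of the MFU formula for inference, assuming the ~8.6 GFLOPs/token forward cost from the earlier example and an A100's ~312 TFLOP/s dense FP16/BF16 tensor-core peak; the throughput value is an illustrative placeholder:

flops_per_token = 8.6e9        # forward FLOPs/token from the LLaMA-2 7B example
tokens_per_second = 3500       # measured serving throughput (placeholder value)
peak_flops = 312e12            # A100 dense FP16/BF16 tensor-core peak

mfu = (flops_per_token * tokens_per_second) / peak_flops
print(f"MFU = {mfu:.1%}")      # ≈ 9.6% with these placeholder numbers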

Part 2: Time Metrics

Time to First Token (TTFT)

Latency from request to first token

What affects TTFT:

  • Compute efficiency
  • Batch size
  • KV cache management

When we care:

  • Interactive streaming applications like ChatGPT
  • User starts reading the response as it’s generated

Time Per Output Token (TPOT)

Average time between tokens after the first

What affects TPOT:

  • Memory bandwidth (decode phase)
  • Batch size
  • KV cache management

When we care:

  • Streaming applications (must be reasonably fast)
  • Non-streaming applications (TPOT × output tokens ≈ total generation time)
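
A minimal timing sketch for TTFT and TPOT; stream_tokens is a hypothetical client function that yields output tokens as they arrive (e.g., from a streaming API):

import time

def measure_ttft_tpot(stream_tokens, prompt):
    start = time.perf_counter()
    arrival_times = [time.perf_counter() for _ in stream_tokens(prompt)]
    ttft = arrival_times[0] - start                      # time to first token
    tpot = (arrival_times[-1] - arrival_times[0]) / max(len(arrival_times) - 1, 1)
    return ttft, tpot                                    # seconds, seconds/token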

Throughput Metrics

Tokens per Second:

\[\text{Throughput} = \frac{\text{Total tokens generated}}{\text{Time elapsed}}\]

When we care:

  • High-volume batch processing on fully saturated hardware
  • Serving multiple concurrent users
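
A sketch of aggregate throughput over a batch of concurrent requests; run_batch and the .tokens attribute are hypothetical stand-ins for whatever serving client is in use:

import time

start = time.perf_counter()
outputs = run_batch(requests)                  # hypothetical: run all requests to completion
elapsed = time.perf_counter() - start
total_tokens = sum(len(o.tokens) for o in outputs)
print(f"Throughput: {total_tokens / elapsed:.0f} tokens/s")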

Goodput

Maximum request rate at which ≥90% of requests meet the SLA (Zhong et al., 2024)

Example measurements:

  • System A: 100 req/s offered, 50% meet the SLA → ~50 SLA-compliant req/s
  • System B: 60 req/s offered, 95% meet the SLA → ~57 SLA-compliant req/s

System B has the higher goodput despite lower raw throughput: goodput captures the latency-throughput trade-off
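
A sketch of the accounting behind the example: given an offered request rate and per-request latencies from a load test, compute SLA attainment and the rate of SLA-compliant requests:

def sla_attainment(latencies_s, sla_s):
    # fraction of requests that finished within the SLA
    return sum(l <= sla_s for l in latencies_s) / len(latencies_s)

def compliant_rate(offered_rate, latencies_s, sla_s):
    # requests per second that meet the SLA at this offered rate
    return offered_rate * sla_attainment(latencies_s, sla_s)

# From the slide: A = 100 req/s at 50% attainment, B = 60 req/s at 95% attainment
print(100 * 0.50, 60 * 0.95)   # 50.0 vs. 57.0 SLA-compliant req/s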

Part 3: Memory Metrics

Prefill vs Decode Characteristics

Prefill (Input Processing):

  • Arithmetic intensity: 50-200 FLOPs/byte
  • Compute-bound (Yuan et al., 2024)
  • Use MFU for measurement

Decode (Token Generation):

  • Arithmetic intensity: 1-10 FLOPs/byte
  • Memory-bound (Yuan et al., 2024)
  • Use MBU for measurement

Memory Bandwidth Utilization (MBU)

\[\text{MBU} = \frac{\text{Achieved memory bandwidth}}{\text{Peak memory bandwidth}}\]

Example: LLaMA 7B Decode

  • Model size: 14GB (FP16)
  • Time per token: 14ms (measured)
  • Required bandwidth: 14 GB / 14 ms = 1 TB/s
  • A100 peak: 2 TB/s (NVIDIA specifications)
  • MBU = 50%

Each decode step loads entire model from memory
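
A minimal sketch of the arithmetic in this example, using the assumed values from the slide (14 GB of FP16 weights, 14 ms per token, 2 TB/s A100 peak):

model_bytes = 14e9           # LLaMA 7B weights in FP16
time_per_token = 0.014       # seconds per decode step (measured)
peak_bandwidth = 2e12        # A100 peak HBM bandwidth, bytes/s

required_bandwidth = model_bytes / time_per_token           # = 1e12 bytes/s (1 TB/s)
print(f"MBU = {required_bandwidth / peak_bandwidth:.0%}")   # 50%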

KV Cache Memory

\[\text{KV Cache} = 2 \times L \times h \times d_h \times n \times b \times \text{bytes}\]

Where:

  • \(L\) = layers, \(h\) = attention heads, \(d_h\) = head dimension
  • \(n\) = sequence length, \(b\) = batch size, bytes = bytes per cached value (2 for FP16)

Example: LLaMA-2 7B (Touvron et al. 2023)

  • Batch=8, seq_len=2048
  • KV cache size: ≈ 8.6 GB
  • On a 24GB GPU: ~20K tokens max across all requests
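
A sketch of the KV-cache formula with the LLaMA-2 7B shapes (32 layers, 32 heads, head dimension 128, FP16):

def kv_cache_bytes(L, h, d_h, n, b, bytes_per_val=2):
    # leading 2 accounts for storing both K and V at every layer
    return 2 * L * h * d_h * n * b * bytes_per_val

print(f"{kv_cache_bytes(32, 32, 128, 2048, 8) / 1e9:.1f} GB")   # ≈ 8.6 GB

# Tokens that fit in the memory left on a 24 GB GPU after 14 GB of FP16 weights:
per_token = kv_cache_bytes(32, 32, 128, n=1, b=1)               # ≈ 0.5 MB per token
print(f"~{(24e9 - 14e9) / per_token / 1e3:.0f}K tokens")        # ≈ 19-20K tokens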

Memory Efficiency for Serving

When serving streaming requests, memory can get fragmented.

Traditional allocation:

  • Memory waste: 40-70% (Kwon et al., 2023)

PagedAttention (vLLM):

  • Memory waste: <4% (Kwon et al., 2023)
  • Effective batch size: 2-4× larger

Quantization Impact

Quantization can shrink model size (and memory traffic) at the cost of some accuracy loss

| Format | Bytes/Param | 7B Model Size | Bandwidth Reduction (vs. FP32) |
|--------|-------------|---------------|-------------------------------|
| FP32 | 4 | 28 GB | 1× |
| FP16 | 2 | 14 GB | 2× |
| INT8 | 1 | 7 GB | 4× |
| INT4 | 0.5 | 3.5 GB | 8× |
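
A sketch of the arithmetic behind the table: size is just parameters × bytes per parameter, and the bandwidth reduction follows the same ratio (here taken relative to FP32):

for fmt, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    size_gb = 7e9 * bytes_per_param / 1e9
    print(f"{fmt}: {size_gb:.1f} GB, {4 / bytes_per_param:.0f}x less bandwidth than FP32")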

Part 4: Energy Metrics

Energy Consumption Units

  • Joules per token
  • kWh per million tokens
  • Watts (instantaneous power)

GPT-3 Measurements:

  • Power draw: ~400W (A100) (NVIDIA specifications)
  • At 10 tokens/sec: 40 joules/token
  • For 1M tokens: ~11 kWh

Measuring Energy

Hardware level:

nvidia-smi --query-gpu=power.draw --format=csv -l 1
# Output: 380.00 W

Software level:

start_energy = get_gpu_energy()  # cumulative GPU energy in joules (placeholder helper; see below)
output = model.generate(inputs, max_new_tokens=100)  # generate 100 tokens
energy_used = get_gpu_energy() - start_energy
energy_per_token = energy_used / 100  # joules per generated token
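
The get_gpu_energy() call above is a placeholder; one possible implementation, assuming an NVIDIA GPU recent enough to expose NVML's cumulative energy counter (Volta or newer) and the nvidia-ml-py (pynvml) package:

import pynvml

pynvml.nvmlInit()
_handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def get_gpu_energy():
    # NVML reports cumulative energy since driver load, in millijoules
    return pynvml.nvmlDeviceGetTotalEnergyConsumption(_handle) / 1000.0   # joules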

Carbon Footprint Measurements

Operational emissions vary by location:

  • Iceland: 8-24g CO₂/kWh (ON Power, 2020; LCA studies)
  • Poland: 662-750g CO₂/kWh (Ember, 2023-2024)
  • ~30-90× difference

Measured CO₂ per 500-word page (Ren et al., 2024):

  • Llama-3-70B: 15g
  • Llama-3-8B: 2.1g
  • Gemma-2B: 0.18g

Embodied carbon (from manufacturing): 17-45% of total (Microsoft, Amazon, Meta disclosures)
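
A small sketch combining the energy and carbon numbers above (~11 kWh per million tokens from the earlier GPT-3/A100 example, with one grid-intensity value from each end of the range):

kwh_per_million_tokens = 11
for region, g_co2_per_kwh in [("Iceland", 24), ("Poland", 662)]:
    kg_co2 = kwh_per_million_tokens * g_co2_per_kwh / 1000
    print(f"{region}: {kg_co2:.2f} kg CO2 per 1M tokens")   # ≈ 0.26 vs. ≈ 7.3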

Part 5: Economic Metrics

Cost per Token Evolution (October 2025)

| Model | Input ($/M) | Cached Input ($/M) | Output ($/M) | Provider |
|-------|-------------|--------------------|--------------|----------|
| GPT-5 | 1.25 | - | 10.00 | OpenAI |
| GPT-5 mini | 0.25 | - | 2.00 | OpenAI |
| GPT-5 nano | 0.05 | - | 0.40 | OpenAI |
| Claude Sonnet 4.5 | 3.00 | 0.30 | 15.00 | Anthropic |
| Claude Haiku 4.5 | 1.00 | 0.10 | 5.00 | Anthropic |
| Claude Opus 4.1 | 15.00 | 1.50 | 75.00 | Anthropic |
| Gemini 2.5 Pro | 1.25 | 0.125 | 10.00 | Google |
| Gemini 2.5 Flash | 0.30 | 0.03 | 2.50 | Google |
| Gemini 2.5 Flash-Lite | 0.10 | 0.01 | 0.40 | Google |
| Llama 4 Maverick | 0.27 | - | 0.85 | Together AI |
| Llama 4 Scout | 0.18 | - | 0.59 | Together AI |
| DeepSeek-V3 | 0.60 | - | 1.70 | Together AI |
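
A sketch of per-request cost using the prices in the table ($ per million tokens); the request sizes are made-up illustrative values:

def request_cost(input_tokens, output_tokens, in_price, out_price):
    # prices are $ per 1M tokens
    return (input_tokens * in_price + output_tokens * out_price) / 1e6

# 2,000 input + 500 output tokens:
print(request_cost(2000, 500, 3.00, 15.00))   # Claude Sonnet 4.5: ≈ $0.0135
print(request_cost(2000, 500, 0.30, 2.50))    # Gemini 2.5 Flash:  ≈ $0.0019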

Token Efficiency by Language

Average tokens per 1,000 words:

| Language | Tokens | Ratio to English |
|----------|--------|------------------|
| English | 770 | 1.00× |
| Spanish | 790 | 1.03× |
| German | 920 | 1.19× |
| Russian | 1,100 | 1.43× |
| Arabic | 1,150 | 1.49× |
| Hindi | 1,400 | 1.82× |
| Japanese | 1,300 | 1.69× |

Source: Ahia et al. (2023): “Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models”

Cost Impact of Tokenization

GPT-4 API for 100,000 words ($10/M input, $30/M output):

| Language | Tokens | Total Cost | Extra Cost |
|----------|--------|------------|------------|
| English | 77,000 | $3.08 | - |
| Spanish | 79,000 | $3.16 | +3% |
| Russian | 110,000 | $4.40 | +43% |
| Hindi | 140,000 | $5.60 | +82% |
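
A sketch showing that the extra-cost column follows directly from the per-language token counts above (Ahia et al., 2023), since the same per-token price applies regardless of language:

english_tokens = 770
for lang, tokens in [("Spanish", 790), ("Russian", 1100), ("Hindi", 1400)]:
    print(f"{lang}: +{tokens / english_tokens - 1:.0%} cost vs. English")
# Spanish: +3%, Russian: +43%, Hindi: +82%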

GPU Costs (October 2025, dollar/GPU/hour)

| GPU | Memory | AWS | GCP | Azure | Modal | SF Compute | Prime Intellect |
|-----|--------|-----|-----|-------|-------|------------|-----------------|
| L4 | 24 GB | - | - | - | 0.80 | - | - |
| A100 (40GB) | 40 GB | 2.74 | 4.05 | - | 2.10 | - | - |
| A100 (80GB) | 80 GB | - | 6.25 | 3.67 | 2.50 | - | 0.79 |
| H100 (80GB) | 80 GB | 6.88 | 11.06 | 6.98 | 3.95 | 1.26-1.55* | 1.49 |
| B200 | 180-192 GB | 14.24 | - | - | 6.25 | - | - |

  • 5-18× price variation across providers for the same GPU
  • Specialized/spot providers (SF Compute, Prime Intellect) are 75-90% cheaper than hyperscalers
  • Serverless (Modal, $3.95/hr for an H100) is competitive as well (vs. $6.88-11.06/hr at hyperscalers)
  • For high-traffic applications (>100M tokens/month), self-hosting can reduce costs by 50-80% vs. commercial APIs.

Cost Components

  • GPU / accelerator hardware depreciation & amortized cost
  • Energy (power + cooling + PUE overhead)
  • Storage (model weights, caches, logs, backups)
  • Networking & data transfer (ingress, egress, inter-node links)
  • Memory / host CPU / DRAM overhead
  • Deployment infrastructure & overhead (rack space, ops staff, monitoring)
  • Idle / warm-up / inefficiencies (under-utilised hardware, cold start)

Questions + Discussion