Measuring Efficiency in LLM Inference
Graham Neubig
Today’s Goal
Learn to measure LLM inference efficiency across key dimensions
- Computational efficiency (FLOPs)
- Time efficiency (latency, throughput)
- Memory metrics
- Energy consumption
- Economic costs
Five Dimensions of Efficiency
- Computation
- Time
- Memory
- Energy
- Economics
No single model dominates all dimensions
Reference: Artificial Analysis Model Comparison
Part 1: Computational Efficiency
FLOPs (per Token)
A simple approximation (Kaplan et al., 2020):
\[\text{FLOPs} \approx 2 \times L \times (4d^2 + 8nd^2)\]
Where:
- \(L\) = layers, \(d\) = model dimension
- \(n\) = sequence length in tokens
FLOPs Calculation Example
LLaMA-2 7B model (Touvron et al., 2023):
- \(N = 7 \times 10^9\) parameters
- \(L = 32\) layers
- \(d = 4096\) (hidden dimension)
- Sequence length \(T = 2048\)
Using Kaplan et al. formula for forward pass:
\[\text{FLOPs} \approx 2 \times 32 \times (4 \times 4096^2 + 8 \times 2048 \times 4096^2)\]
\[= 2 \times 32 \times (67M + 275B) \approx 17.6 \text{ TFLOPs}\]
Per token (forward only): \(\approx 8.6\) GFLOPs
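A minimal sketch of this calculation in Python, plugging in the LLaMA-2 7B shapes above (the formula is the Kaplan et al. approximation, not an exact per-architecture FLOP count):

# Approximate forward-pass FLOPs using the formula above
L = 32    # transformer layers (LLaMA-2 7B)
d = 4096  # hidden dimension
n = 2048  # sequence length in tokens
flops_forward = 2 * L * (4 * d**2 + 8 * n * d**2)
print(f"Forward pass: {flops_forward / 1e12:.1f} TFLOPs")           # ~17.6 TFLOPs
print(f"Per token:    {flops_forward / n / 1e9:.1f} GFLOPs/token")  # ~8.6 GFLOPs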
Model FLOPs Utilization (MFU)
\[\text{MFU} = \frac{\text{Achieved FLOPs/s}}{\text{Peak Hardware FLOPs/s}}\]
Measured Examples:
| System | MFU | Source |
|---|---|---|
| GPT-3 | 21% | Chowdhery et al. (2022) |
| PaLM 540B | 46% | Chowdhery et al. (2022) |
| LLaMA 7B (naive) | 19% | Estimated from A100 specs |
| LLaMA 7B (optimized) | 45% | Dao et al. (2022) |
When we care:
- Measuring cost efficiency (esp. at training time)
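A minimal sketch of estimating MFU from serving measurements, assuming the achieved rate is (per-token FLOPs) × (measured tokens/s) and using the A100's 312 TFLOP/s dense FP16/BF16 peak (the throughput number here is an assumption, not a measurement):

flops_per_token = 8.6e9      # forward-pass FLOPs/token from the LLaMA-2 7B example
tokens_per_second = 16_000   # assumed measured aggregate serving throughput
peak_flops = 312e12          # A100 dense FP16/BF16 peak FLOPs/s
mfu = (flops_per_token * tokens_per_second) / peak_flops
print(f"MFU: {mfu:.1%}")     # ~44% under these assumptions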
Time to First Token (TTFT)
Latency from request to first token
What affects TTFT:
- Compute efficiency
- Batch size
- KV cache management
When we care:
- Interactive streaming applications like ChatGPT
- User starts reading the response as it’s generated
Time Per Output Token (TPOT)
Average time between tokens after the first
What affects TPOT:
- Memory bandwidth (decode phase)
- Batch size
- KV cache management
When we care:
- Streaming applications (must be reasonably fast)
- Non-streaming applications (TPOT * tokens = total time)
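A minimal sketch of measuring TTFT and TPOT from any token stream (the `token_stream` argument is assumed to be an iterator that yields tokens as they are generated, e.g. a Hugging Face or vLLM streaming generator):

import time

def measure_ttft_tpot(token_stream):
    # Record a timestamp for each token as it arrives
    start = time.perf_counter()
    stamps = [time.perf_counter() for _ in token_stream]
    ttft = stamps[0] - start                                        # time to first token
    tpot = (stamps[-1] - stamps[0]) / max(len(stamps) - 1, 1)       # avg gap after the first
    return ttft, tpot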
Throughput Metrics
Tokens per Second:
\[\text{Throughput} = \frac{\text{Total tokens generated}}{\text{Time elapsed}}\]
When we care:
- High-volume batch processing on fully saturated hardware
- Serving multiple concurrent users
Goodput
Maximum request rate where ≥90% meet SLA (Zhong et al., 2024)
Example measurements:
- System A: 100 req/s, but only 50% meet the SLA → 50 "good" req/s
- System B: 60 req/s, 95% meet the SLA → 57 "good" req/s
Goodput captures the latency-throughput trade-off: B is the better serving configuration despite lower raw throughput
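A minimal sketch of the simplified goodput calculation used in the example above, assuming per-request latencies collected over a measurement window and a single latency SLA (Zhong et al. define goodput against separate TTFT/TPOT SLOs; this collapses them into one target):

def goodput(latencies_s, window_s, sla_s=1.0):
    # Requests per second that completed within the SLA
    good = sum(1 for lat in latencies_s if lat <= sla_s)
    return good / window_s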
Prefill vs Decode Characteristics
Prefill (Input Processing):
- Arithmetic intensity: 50-200 FLOPs/byte
- Compute-bound (Yuan et al., 2024)
- Use MFU for measurement
Decode (Token Generation):
- Arithmetic intensity: 1-10 FLOPs/byte
- Memory-bound (Yuan et al., 2024)
- Use MBU for measurement
Memory Bandwidth Utilization (MBU)
\[\text{MBU} = \frac{\text{Achieved memory bandwidth}}{\text{Peak memory bandwidth}}, \quad \text{Achieved} \approx \frac{\text{bytes read per decode step}}{\text{time per token}}\]
Example: LLaMA 7B Decode
- Model size: 14 GB (FP16)
- Time per token: 14 ms (measured)
- Required bandwidth: 14 GB / 14 ms = 1 TB/s
- A100 peak: 2 TB/s (NVIDIA specifications)
- MBU = 1/2 = 50%
Each decode step loads the entire model from memory
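A minimal sketch of this estimate, assuming each decode step reads the full FP16 weights once and ignoring KV-cache reads:

model_bytes = 14e9        # LLaMA 7B weights in FP16
time_per_token = 0.014    # seconds, measured
peak_bandwidth = 2e12     # A100 HBM peak, bytes/s
mbu = (model_bytes / time_per_token) / peak_bandwidth
print(f"MBU: {mbu:.0%}")  # ~50%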
KV Cache Memory
\[\text{KV Cache} = 2 \times L \times h \times d_h \times n \times b \times \text{bytes}\]
Where \(L\) = layers, \(h\) = attention heads, \(d_h\) = head dimension, \(n\) = sequence length, \(b\) = batch size
Example: LLaMA-2 7B (Touvron et al., 2023)
- Batch = 8, seq_len = 2048, FP16 (2 bytes)
- KV cache size: \(2 \times 32 \times 32 \times 128 \times 2048 \times 8 \times 2\) bytes ≈ 8.6 GB (≈0.5 MB per token)
- On a 24 GB GPU (14 GB of weights + ~10 GB of cache): ~20K tokens max across all requests
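A minimal sketch of this sizing with the LLaMA-2 7B shapes (32 layers, 32 attention heads, head dimension 128):

def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # K and V caches: one K and one V vector of size heads x head_dim per token per layer
    return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_elem

size = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=2048, batch=8)
print(f"KV cache: {size / 1e9:.1f} GB")   # ~8.6 GB at FP16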
Memory Efficiency for Serving
When serving many concurrent requests with unpredictable lengths, KV-cache memory becomes fragmented.
Traditional allocation:
- Memory waste: 40-70% (Kwon et al., 2023)
PagedAttention (vLLM):
- Memory waste: <4% (Kwon et al., 2023)
- Effective batch size: 2-4× larger
Quantization Impact
Quantization shrinks model size (and the bytes moved per decode step) at the cost of some accuracy loss
| Format | Bytes/Param | 7B Model Size | Bandwidth Reduction (vs. FP32) |
|---|---|---|---|
| FP32 | 4 | 28 GB | 1× |
| FP16 | 2 | 14 GB | 2× |
| INT8 | 1 | 7 GB | 4× |
| INT4 | 0.5 | 3.5 GB | 8× |
Energy Consumption Units
- Joules per token
- kWh per million tokens
- Watts (instantaneous power)
GPT-3 Measurements:
- Power draw: ~400 W per A100 (NVIDIA specifications)
- At 10 tokens/sec: 400 W / 10 tok/s = 40 joules/token
- For 1M tokens: \(4 \times 10^7\) J ≈ 11 kWh
Measuring Energy
Hardware level:
nvidia-smi --query-gpu=power.draw --format=csv -l 1
# Output: 380.00 W
Software level:
import pynvml                                                 # NVIDIA NVML bindings
pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(gpu)    # millijoules since driver load
output = model.generate(inputs, max_new_tokens=100)
energy_used = (pynvml.nvmlDeviceGetTotalEnergyConsumption(gpu) - start_mj) / 1000  # joules
energy_per_token = energy_used / 100
Carbon Footprint Measurements
Operational emissions vary by location:
- Iceland: 8-24g CO₂/kWh (ON Power, 2020; LCA studies)
- Poland: 662-750g CO₂/kWh (Ember, 2023-2024)
- ~30-90× difference
Measured CO₂ per 500-word page (Ren et al., 2024):
- Llama-3-70B: 15g
- Llama-3-8B: 2.1g
- Gemma-2B: 0.18g
Embodied carbon (from manufacturing): 17-45% of total (Microsoft, Amazon, Meta disclosures)
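A minimal sketch of converting measured energy into operational CO₂ with the grid intensities above (embodied carbon excluded; the 11 kWh figure is the per-million-token estimate from the energy slide):

def operational_co2_g(energy_kwh, grid_g_per_kwh):
    # Operational emissions = energy consumed x carbon intensity of the local grid
    return energy_kwh * grid_g_per_kwh

print(operational_co2_g(11, 24))    # Iceland: ~264 g CO2 per 1M tokens
print(operational_co2_g(11, 700))   # Poland: ~7,700 g CO2 per 1M tokens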
Cost per Token (October 2025, USD per million tokens)
| Model | Input ($/M) | Cached Input ($/M) | Output ($/M) | Provider |
|---|---|---|---|---|
| GPT-5 | 1.25 | - | 10.00 | OpenAI |
| GPT-5 mini | 0.25 | - | 2.00 | OpenAI |
| GPT-5 nano | 0.05 | - | 0.40 | OpenAI |
| Claude Sonnet 4.5 | 3.00 | 0.30 | 15.00 | Anthropic |
| Claude Haiku 4.5 | 1.00 | 0.10 | 5.00 | Anthropic |
| Claude Opus 4.1 | 15.00 | 1.50 | 75.00 | Anthropic |
| Gemini 2.5 Pro | 1.25 | 0.125 | 10.00 | Google |
| Gemini 2.5 Flash | 0.30 | 0.03 | 2.50 | Google |
| Gemini 2.5 Flash-Lite | 0.10 | 0.01 | 0.40 | Google |
| Llama 4 Maverick | 0.27 | - | 0.85 | Together AI |
| Llama 4 Scout | 0.18 | - | 0.59 | Together AI |
| DeepSeek-V3 | 0.60 | - | 1.70 | Together AI |
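A minimal sketch of estimating per-request cost from one row of this table, assuming cached prompt tokens are billed at the cached-input rate and the remainder at the normal input rate (prices are the Claude Sonnet 4.5 row; token counts are illustrative):

def request_cost_usd(input_tok, cached_tok, output_tok,
                     in_price, cached_price, out_price):
    # Prices are USD per million tokens
    return ((input_tok - cached_tok) * in_price
            + cached_tok * cached_price
            + output_tok * out_price) / 1e6

# 20K-token prompt with 15K served from the prompt cache, 1K-token reply
print(request_cost_usd(20_000, 15_000, 1_000, 3.00, 0.30, 15.00))  # ~$0.035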
Token Efficiency by Language
Average tokens per 1,000 words:
| Language | Tokens | Ratio to English |
|---|---|---|
| English | 770 | 1.0× |
| Spanish | 790 | 1.03× |
| German | 920 | 1.19× |
| Russian | 1,100 | 1.43× |
| Arabic | 1,150 | 1.49× |
| Hindi | 1,400 | 1.82× |
| Japanese | 1,300 | 1.69× |
Source: Ahia et al. (2023): “Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models”
Cost Impact of Tokenization
GPT-4 API for 100,000 words ($10/M input, $30/M output; input and output assumed to be the same length):
| Language | Tokens | Total Cost | Extra Cost |
|---|---|---|---|
| English | 77,000 | $3.08 | - |
| Spanish | 79,000 | $3.16 | +3% |
| Russian | 110,000 | $4.40 | +43% |
| Hindi | 140,000 | $5.60 | +82% |
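A minimal sketch reproducing these figures under the equal-length assumption above:

PRICE_IN, PRICE_OUT = 10.00, 30.00    # GPT-4, USD per million tokens

def doc_cost(tokens):
    # The same token count is billed once as input and once as output
    return tokens * (PRICE_IN + PRICE_OUT) / 1e6

english, hindi = doc_cost(77_000), doc_cost(140_000)   # $3.08, $5.60
print(f"Hindi premium: +{hindi / english - 1:.0%}")    # +82%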
GPU Costs (October 2025, $ per GPU-hour)
| GPU | Memory | AWS | GCP | Azure | Modal | SF Compute | Prime Intellect |
|---|---|---|---|---|---|---|---|
| L4 | 24 GB | - | - | - | 0.80 | - | - |
| A100 (40GB) | 40 GB | 2.74 | 4.05 | - | 2.10 | - | - |
| A100 (80GB) | 80 GB | - | 6.25 | 3.67 | 2.50 | - | 0.79 |
| H100 (80GB) | 80 GB | 6.88 | 11.06 | 6.98 | 3.95 | 1.26-1.55* | 1.49 |
| B200 | 180-192 GB | 14.24 | - | - | 6.25 | - | - |
- 5-18× price variation across providers for the same GPU
- Specialized/spot providers (SF Compute, Prime Intellect) are 75-90% cheaper than hyperscalers
- Serverless (Modal, $3.95/hr for an H100) is competitive as well (vs. $6.88-11.06 at the hyperscalers)
- For high-traffic applications (>100M tokens/month), self-hosting can reduce costs by 50-80% vs. commercial APIs (see the back-of-envelope sketch below)
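A minimal back-of-envelope sketch of that comparison, under loudly assumed values for GPU rate, utilization, and sustained throughput (GPU rental only; the cost components listed below add further overhead):

gpu_per_hour = 1.49          # rented H100, Prime Intellect price above
utilization = 0.70           # assumed average utilization
tokens_per_second = 2_000    # assumed sustained output throughput on that GPU

tokens_per_hour = tokens_per_second * 3600 * utilization
self_hosted_per_m = gpu_per_hour / (tokens_per_hour / 1e6)
print(f"Self-hosted: ${self_hosted_per_m:.2f} / M output tokens")    # ~$0.30
print("API:         $0.85 / M output tokens (Llama 4 Maverick above)")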
Cost Components
- GPU / accelerator hardware depreciation & amortized cost
- Energy (power + cooling + PUE overhead)
- Storage (model weights, caches, logs, backups)
- Networking & data transfer (ingress, egress, inter-node links)
- Memory / host CPU / DRAM overhead
- Deployment infrastructure & overhead (rack space, ops staff, monitoring)
- Idle / warm-up / inefficiencies (under-utilized hardware, cold starts)