Measuring Efficiency in LLM Inference
Graham Neubig
Today’s Goal
Learn to measure LLM inference efficiency across key dimensions
- Computational efficiency (FLOPs)
- Time efficiency (latency, throughput)
- Memory metrics
- Energy consumption
- Economic costs
Five Dimensions of Efficiency
- Computation
- Time
- Memory
- Energy
- Economics
No single model dominates all dimensions
Reference: Artificial Analysis Model Comparison
Part 1: Computational Efficiency
FLOPs (per Token)
A simple approximation (Kaplan et al., 2020):
\[\text{FLOPs} \approx 2 \times L \times (4d^2 + 8nd^2)\]
Where:
- \(L\) = layers, \(d\) = model dimension
- \(n\) = sequence length in tokens
FLOPs Calculation Example
LLaMA-2 7B model (Touvron et al., 2023):
- \(N = 7 \times 10^9\) parameters
- \(L = 32\) layers
- \(d = 4096\) (hidden dimension)
- Sequence length \(T = 2048\)
Using Kaplan et al. formula for forward pass:
\[\text{FLOPs} \approx 2 \times 32 \times (4 \times 4096^2 + 8 \times 2048 \times 4096^2)\]
\[= 2 \times 32 \times (67M + 275B) \approx 17.6 \text{ TFLOPs}\]
Per token (forward only): \(\approx 8.6\) GFLOPs
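A minimal sketch of this calculation in Python, plugging in the LLaMA-2 7B shapes above (the formula is the Kaplan et al. approximation, not an exact per-architecture FLOP count):

# Approximate forward-pass FLOPs using the formula above
L = 32    # transformer layers (LLaMA-2 7B)
d = 4096  # hidden dimension
n = 2048  # sequence length in tokens
flops_forward = 2 * L * (4 * d**2 + 8 * n * d**2)
print(f"Forward pass: {flops_forward / 1e12:.1f} TFLOPs")           # ~17.6 TFLOPs
print(f"Per token:    {flops_forward / n / 1e9:.1f} GFLOPs/token")  # ~8.6 GFLOPs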
Model FLOPs Utilization (MFU)
\[\text{MFU} = \frac{\text{Achieved FLOPs/s}}{\text{Peak Hardware FLOPs/s}}\]
Measured Examples:
| System | MFU | Source |
|---|---|---|
| GPT-3 | 21% | Chowdhery et al. (2022) |
| PaLM 540B | 46% | Chowdhery et al. (2022) |
| LLaMA 7B (naive) | 19% | Estimated from A100 specs |
| LLaMA 7B (optimized) | 45% | Dao et al. (2022) |
When we care:
- Measuring cost efficiency (esp. at training time)
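A minimal sketch of estimating MFU from serving measurements, assuming the achieved rate is (per-token FLOPs) × (measured tokens/s) and using the A100's 312 TFLOP/s dense FP16/BF16 peak (the throughput number here is an assumption, not a measurement):

flops_per_token = 8.6e9      # forward-pass FLOPs/token from the LLaMA-2 7B example
tokens_per_second = 16_000   # assumed measured aggregate serving throughput
peak_flops = 312e12          # A100 dense FP16/BF16 peak FLOPs/s
mfu = (flops_per_token * tokens_per_second) / peak_flops
print(f"MFU: {mfu:.1%}")     # ~44% under these assumptions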
Time to First Token (TTFT)
Latency from request to first token
What affects TTFT:
- Compute efficiency
- Batch size
- KV cache management
When we care:
- Interactive streaming applications like ChatGPT
- User starts reading the response as it’s generated
Time Per Output Token (TPOT)
Average time between tokens after the first
What affects TPOT:
- Memory bandwidth (decode phase)
- Batch size
- KV cache management
When we care:
- Streaming applications (must be reasonably fast)
- Non-streaming applications (TPOT * tokens = total time)
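A minimal sketch of measuring TTFT and TPOT from any token stream (the `token_stream` argument is assumed to be an iterator that yields tokens as they are generated, e.g. a Hugging Face or vLLM streaming generator):

import time

def measure_ttft_tpot(token_stream):
    # Record a timestamp for each token as it arrives
    start = time.perf_counter()
    stamps = [time.perf_counter() for _ in token_stream]
    ttft = stamps[0] - start                                        # time to first token
    tpot = (stamps[-1] - stamps[0]) / max(len(stamps) - 1, 1)       # avg gap after the first
    return ttft, tpot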
Throughput Metrics
Tokens per Second:
\[\text{Throughput} = \frac{\text{Total tokens generated}}{\text{Time elapsed}}\]
When we care:
- High-volume batch processing on fully saturated hardware
- Serving multiple concurrent users
Goodput
Maximum request rate where ≥90% meet SLA (Zhong et al., 2024)
Example measurements:
- System A: 100 req/s, but only 50% meet the SLA → 50 "good" req/s
- System B: 60 req/s, 95% meet the SLA → 57 "good" req/s
Goodput captures the latency-throughput trade-off: B is the better serving configuration despite lower raw throughput
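A minimal sketch of the simplified goodput calculation used in the example above, assuming per-request latencies collected over a measurement window and a single latency SLA (Zhong et al. define goodput against separate TTFT/TPOT SLOs; this collapses them into one target):

def goodput(latencies_s, window_s, sla_s=1.0):
    # Requests per second that completed within the SLA
    good = sum(1 for lat in latencies_s if lat <= sla_s)
    return good / window_s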
Prefill vs Decode Characteristics
Prefill (Input Processing):
- Arithmetic intensity: 50-200 FLOPs/byte
- Compute-bound (Yuan et al., 2024)
- Use MFU for measurement
Decode (Token Generation):
- Arithmetic intensity: 1-10 FLOPs/byte
- Memory-bound (Yuan et al., 2024)
- Use MBU for measurement
Memory Bandwidth Utilization (MBU)
\[\text{MBU} = \frac{\text{Achieved memory bandwidth}}{\text{Peak memory bandwidth}}, \quad \text{Achieved} \approx \frac{\text{bytes read per decode step}}{\text{time per token}}\]
Example: LLaMA 7B Decode
- Model size: 14 GB (FP16)
- Time per token: 14 ms (measured)
- Required bandwidth: 14 GB / 14 ms = 1 TB/s
- A100 peak: 2 TB/s (NVIDIA specifications)
- MBU = 1/2 = 50%
Each decode step loads the entire model from memory
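A minimal sketch of this estimate, assuming each decode step reads the full FP16 weights once and ignoring KV-cache reads:

model_bytes = 14e9        # LLaMA 7B weights in FP16
time_per_token = 0.014    # seconds, measured
peak_bandwidth = 2e12     # A100 HBM peak, bytes/s
mbu = (model_bytes / time_per_token) / peak_bandwidth
print(f"MBU: {mbu:.0%}")  # ~50%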
KV Cache Memory
\[\text{KV Cache} = 2 \times L \times h \times d_h \times n \times b \times \text{bytes}\]
Where \(L\) = layers, \(h\) = attention heads, \(d_h\) = head dimension, \(n\) = sequence length, \(b\) = batch size
Example: LLaMA-2 7B (Touvron et al., 2023)
- Batch = 8, seq_len = 2048, FP16 (2 bytes)
- KV cache size: \(2 \times 32 \times 32 \times 128 \times 2048 \times 8 \times 2\) bytes ≈ 8.6 GB (≈0.5 MB per token)
- On a 24 GB GPU (14 GB of weights + ~10 GB of cache): ~20K tokens max across all requests
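A minimal sketch of this sizing with the LLaMA-2 7B shapes (32 layers, 32 attention heads, head dimension 128):

def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # K and V caches: one K and one V vector of size heads x head_dim per token per layer
    return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_elem

size = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=2048, batch=8)
print(f"KV cache: {size / 1e9:.1f} GB")   # ~8.6 GB at FP16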
Memory Efficiency for Serving
When serving many concurrent requests with unpredictable lengths, KV-cache memory becomes fragmented.
Traditional allocation:
- Memory waste: 40-70% (Kwon et al., 2023)
PagedAttention (vLLM):
- Memory waste: <4% (Kwon et al., 2023)
- Effective batch size: 2-4× larger
Quantization Impact
Quantization shrinks model size (and the bytes moved per decode step) at the cost of some accuracy loss
| Format | Bytes/Param | 7B Model Size | Bandwidth Reduction (vs. FP32) |
|---|---|---|---|
| FP32 | 4 | 28 GB | 1× |
| FP16 | 2 | 14 GB | 2× |
| INT8 | 1 | 7 GB | 4× |
| INT4 | 0.5 | 3.5 GB | 8× |
Energy Consumption Units
- Joules per token
- kWh per million tokens
- Watts (instantaneous power)
GPT-3 Measurements:
- Power draw: ~400 W per A100 (NVIDIA specifications)
- At 10 tokens/sec: 400 W / 10 tok/s = 40 joules/token
- For 1M tokens: \(4 \times 10^7\) J ≈ 11 kWh
Measuring Energy
Hardware level:
nvidia-smi --query-gpu=power.draw --format=csv -l 1
# Output: 380.00 W
Software level:
import pynvml                                                 # NVIDIA NVML bindings
pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(gpu)    # millijoules since driver load
output = model.generate(inputs, max_new_tokens=100)
energy_used = (pynvml.nvmlDeviceGetTotalEnergyConsumption(gpu) - start_mj) / 1000  # joules
energy_per_token = energy_used / 100
Carbon Footprint Measurements
Operational emissions vary by location:
- Iceland: 8-24g CO₂/kWh (ON Power, 2020; LCA studies)
- Poland: 662-750g CO₂/kWh (Ember, 2023-2024)
- ~30-90× difference
Measured CO₂ per 500-word page (Ren et al., 2024):
- Llama-3-70B: 15g
- Llama-3-8B: 2.1g
- Gemma-2B: 0.18g
Embodied carbon (from manufacturing): 17-45% of total (Microsoft, Amazon, Meta disclosures)
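A minimal sketch of converting measured energy into operational CO₂ with the grid intensities above (embodied carbon excluded; the 11 kWh figure is the per-million-token estimate from the energy slide):

def operational_co2_g(energy_kwh, grid_g_per_kwh):
    # Operational emissions = energy consumed x carbon intensity of the local grid
    return energy_kwh * grid_g_per_kwh

print(operational_co2_g(11, 24))    # Iceland: ~264 g CO2 per 1M tokens
print(operational_co2_g(11, 700))   # Poland: ~7,700 g CO2 per 1M tokens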
Cost per Token (October 2025, USD per million tokens)
| Model | Input ($/M) | Cached Input ($/M) | Output ($/M) | Provider |
|---|---|---|---|---|
| GPT-5 | 1.25 | - | 10.00 | OpenAI |
| GPT-5 mini | 0.25 | - | 2.00 | OpenAI |
| GPT-5 nano | 0.05 | - | 0.40 | OpenAI |
| Claude Sonnet 4.5 | 3.00 | 0.30 | 15.00 | Anthropic |
| Claude Haiku 4.5 | 1.00 | 0.10 | 5.00 | Anthropic |
| Claude Opus 4.1 | 15.00 | 1.50 | 75.00 | Anthropic |
| Gemini 2.5 Pro | 1.25 | 0.125 | 10.00 | Google |
| Gemini 2.5 Flash | 0.30 | 0.03 | 2.50 | Google |
| Gemini 2.5 Flash-Lite | 0.10 | 0.01 | 0.40 | Google |
| Llama 4 Maverick | 0.27 | - | 0.85 | Together AI |
| Llama 4 Scout | 0.18 | - | 0.59 | Together AI |
| DeepSeek-V3 | 0.60 | - | 1.70 | Together AI |
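A minimal sketch of estimating per-request cost from one row of this table, assuming cached prompt tokens are billed at the cached-input rate and the remainder at the normal input rate (prices are the Claude Sonnet 4.5 row; token counts are illustrative):

def request_cost_usd(input_tok, cached_tok, output_tok,
                     in_price, cached_price, out_price):
    # Prices are USD per million tokens
    return ((input_tok - cached_tok) * in_price
            + cached_tok * cached_price
            + output_tok * out_price) / 1e6

# 20K-token prompt with 15K served from the prompt cache, 1K-token reply
print(request_cost_usd(20_000, 15_000, 1_000, 3.00, 0.30, 15.00))  # ~$0.035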
Token Efficiency by Language
Average tokens per 1,000 words:
| Language | Tokens | Ratio to English |
|---|---|---|
| English | 770 | 1.0× |
| Spanish | 790 | 1.03× |
| German | 920 | 1.19× |
| Russian | 1,100 | 1.43× |
| Arabic | 1,150 | 1.49× |
| Hindi | 1,400 | 1.82× |
| Japanese | 1,300 | 1.69× |
Source: Ahia et al. (2023): “Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models”
Cost Impact of Tokenization
GPT-4 API for 100,000 words ($10/M input, $30/M output; input and output assumed to be the same length):
| Language | Tokens | Total Cost | Extra Cost |
|---|---|---|---|
| English | 77,000 | $3.08 | - |
| Spanish | 79,000 | $3.16 | +3% |
| Russian | 110,000 | $4.40 | +43% |
| Hindi | 140,000 | $5.60 | +82% |
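A minimal sketch reproducing these figures under the equal-length assumption above:

PRICE_IN, PRICE_OUT = 10.00, 30.00    # GPT-4, USD per million tokens

def doc_cost(tokens):
    # The same token count is billed once as input and once as output
    return tokens * (PRICE_IN + PRICE_OUT) / 1e6

english, hindi = doc_cost(77_000), doc_cost(140_000)   # $3.08, $5.60
print(f"Hindi premium: +{hindi / english - 1:.0%}")    # +82%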
GPU Costs (October 2025, $ per GPU-hour)
| GPU | Memory | AWS | GCP | Azure | Modal | SF Compute | Prime Intellect |
|---|---|---|---|---|---|---|---|
| L4 | 24 GB | - | - | - | 0.80 | - | - |
| A100 (40GB) | 40 GB | 2.74 | 4.05 | - | 2.10 | - | - |
| A100 (80GB) | 80 GB | - | 6.25 | 3.67 | 2.50 | - | 0.79 |
| H100 (80GB) | 80 GB | 6.88 | 11.06 | 6.98 | 3.95 | 1.26-1.55* | 1.49 |
| B200 | 180-192 GB | 14.24 | - | - | 6.25 | - | - |
- 5-18× price variation across providers for the same GPU
- Specialized/spot providers (SF Compute, Prime Intellect) are 75-90% cheaper than hyperscalers
- Serverless (Modal, $3.95/hr for an H100) is competitive as well (vs. $6.88-11.06 at the hyperscalers)
- For high-traffic applications (>100M tokens/month), self-hosting can reduce costs by 50-80% vs. commercial APIs (see the back-of-envelope sketch below)
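A minimal back-of-envelope sketch of that comparison, under loudly assumed values for GPU rate, utilization, and sustained throughput (GPU rental only; the cost components listed below add further overhead):

gpu_per_hour = 1.49          # rented H100, Prime Intellect price above
utilization = 0.70           # assumed average utilization
tokens_per_second = 2_000    # assumed sustained output throughput on that GPU

tokens_per_hour = tokens_per_second * 3600 * utilization
self_hosted_per_m = gpu_per_hour / (tokens_per_hour / 1e6)
print(f"Self-hosted: ${self_hosted_per_m:.2f} / M output tokens")    # ~$0.30
print("API:         $0.85 / M output tokens (Llama 4 Maverick above)")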
Cost Components
- GPU / accelerator hardware depreciation & amortized cost
- Energy (power + cooling + PUE overhead)
- Storage (model weights, caches, logs, backups)
- Networking & data transfer (ingress, egress, inter-node links)
- Memory / host CPU / DRAM overhead
- Deployment infrastructure & overhead (rack space, ops staff, monitoring)
- Idle / warm-up / inefficiencies (under-utilized hardware, cold starts)