While building the nano-train project, I created a dump_model_info() function that started as a simple model architecture dumper but gradually evolved into a comprehensive performance analysis tool capable of identifying roofline bottlenecks and providing actionable optimization guidance.
Initial Motivation: Simple Model Architecture Dumping
The function began with a modest goal: just dump the model architecture. I wanted to understand:
- Model structure and layers
- Parameter counts per module
- Basic configuration details
- Memory footprint estimates
This was purely for documentation and understanding the models I was working with.
Evolution: Discovering AI Could Do More
As I worked with the function, I realized that once I had access to the model’s computational graph and architecture, I could extract much more valuable information:
What Became Possible
- FLOPs Calculation: Count theoretical floating-point operations for each phase
  - Training: Full forward + backward passes
  - Prefill: Initial prompt processing
  - Decode: Auto-regressive token generation
- Byte Traffic Analysis: Model memory access patterns
  - Weight streaming (bytes read from HBM for parameters)
  - Activations (input/output tensors for each layer)
  - KV cache (read/write for attention mechanisms)
  - Temporary buffers (intermediate computations)
- Roofline Analysis: Determine performance bottlenecks
  - Arithmetic intensity: FLOPs per byte of memory traffic
  - Ridge point: Where compute-bound becomes memory-bound
  - Regime classification: compute-bound vs HBM-bound vs network-bound
- Time Modeling: Predict step times under different scenarios
  - Compute time: Based on FLOPs and peak compute throughput
  - HBM time: Based on bytes and memory bandwidth
  - Network time: For distributed training/inference
  - Bottleneck identification: Which resource limits performance
- Sensitivity Analysis: Understand what knobs actually matter
  - Batch size effects on throughput
  - Sequence length impacts on prefill/decode
  - KV cache dtype trade-offs
  - MoE routing parameters
Comprehensive Report Structure
The final dump_model_info() generates detailed markdown reports like deepseek_model_report.md with:
1. Architecture Overview
- Model fingerprint (family, attention type, MoE configuration)
- Parameter distribution by module
- Memory requirements breakdown
- Architecture diagrams (Mermaid)
2. Analytical Model
- FLOPs decomposition: Per-module operation counts
- Byte accounting: Weights, activations, KV, temporaries
- Execution assumptions: Naive vs efficient kernels (WRF, fusion factors)
- Time estimation: T_est = max(T_comp, T_hbm, T_net)
3. Roofline Analysis
- Regime classification: Compute-bound vs memory-bound for each phase
- Arithmetic intensity: AI_hbm = FLOPs / bytes_hbm
- Chip ceilings: Plot points against hardware roofline curves
- Batch sweeps: How decode behavior changes with batch size
- Sequence sweeps: How prefill behavior changes with prompt length
4. Sensitivity Analysis
- Knob ranking: Which parameters most affect performance
- Combinatorial grid: Full factorial sweep of configuration space
- Regime transitions: When bottlenecks shift from compute to memory
5. Optimization Guidance
- Next prioritizations: What to optimize based on bottleneck analysis
- Byte dominance tests: Which memory terms matter most
- Critical batch sizes: Where compute-bound transitions occur
Key Technical Insights
1. Separating Theory from Realizable Performance
Critical distinction: F_theory (algorithmic FLOPs) vs F_realizable (peak-equivalent compute cost after applying a utilization model)
# Theory: pure mathematical operations
F_theory = 2 * M * K * N # GEMM FLOPs
# Realizable: accounts for tensor core utilization
eta_tc = min(1.0, M_eff / B_sat) # Utilization factor
F_tensorcore = ... # Tensor-core-eligible FLOPs
F_realizable = F_tensorcore / eta_tc + (F_theory - F_tensorcore) / eta_scalar
Why it matters: Tiny-batch decode and thin GEMMs can leave tensor cores under-saturated, so F_theory overestimates achievable performance.
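A small numeric sketch of the utilization model above; the M_eff, B_sat, and eta_scalar values are hypothetical, chosen only to show how a thin GEMM inflates the realizable cost:

```python
# Hypothetical values, purely to illustrate the utilization model above.
M, K, N = 8, 7168, 7168        # thin, decode-style GEMM (8 tokens in flight)
M_eff, B_sat = 8, 256          # effective rows vs. rows needed to saturate tensor cores
eta_scalar = 0.05              # assumed efficiency of non-tensor-core FLOPs

F_theory = 2 * M * K * N                      # algorithmic GEMM FLOPs
eta_tc = min(1.0, M_eff / B_sat)              # utilization factor: 8 / 256 = 0.03125
F_tensorcore = F_theory                       # assume every FLOP is tensor-core eligible
F_realizable = F_tensorcore / eta_tc + (F_theory - F_tensorcore) / eta_scalar

print(F_realizable / F_theory)                # ~32x: F_theory badly understates the compute cost
```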
2. Weight Residency Factor (WRF) Model
Effective streamed weight bytes depend on how many times weights are reused:
W_eff = W / WRF # Effective streamed bytes
Different module families have different reuse patterns:
- Attention: WRF_attn = 4.0 (efficient mode), weights reused across the sequence
- Dense: WRF_dense = 4.0, reused across the batch
- MoE: WRF_moe = 2.0, less reuse due to routing
Impact: Weight traffic can dominate memory bandwidth, especially for large models.
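A minimal sketch of how streamed weight bytes fall out of the WRF model; the per-family byte totals are placeholders, only the WRF values come from above:

```python
# Hypothetical per-family weight-byte totals (bf16 weights), for illustration only.
weight_bytes = {"attn": 12e9, "dense": 4e9, "moe": 650e9}
wrf = {"attn": 4.0, "dense": 4.0, "moe": 2.0}    # efficient-mode WRF values from above

# W_eff = W / WRF, summed over module families, gives the streamed weight traffic.
bytes_weights = sum(w / wrf[name] for name, w in weight_bytes.items())
print(bytes_weights / 1e9)                       # effective GB streamed per step
```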
3. KV Cache Dominance in Long Context
Decode arithmetic intensity declines with KV length:
# As L → ∞, KV reads dominate
OI_inf_hbm(L) ~ 1/L # Asymptotic intensity
Implication: Long-context serving can never become compute-bound if KV cache bytes dominate memory traffic.
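A toy sweep (all constants hypothetical) showing the decline; per-token FLOPs are treated as roughly fixed so the KV read term is the only thing growing with context length L, as in the asymptotic claim above:

```python
# Toy constants (hypothetical) for one decode step.
flops_per_token = 70e9
kv_bytes_per_pos = 70e3          # KV-cache bytes read per cached position
other_bytes = 350e6              # weights/activations, independent of L

for L in (1_000, 16_000, 128_000):
    bytes_hbm = other_bytes + kv_bytes_per_pos * L
    print(L, round(flops_per_token / bytes_hbm, 2))   # arithmetic intensity falls toward ~1/L
```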
4. Experts ≠ Runtime Cost
Common fallacy: Parameter share equals runtime cost share.
Reality from analysis:
- Experts hold ~97.8% of parameters (DeepSeek-V3)
- But attention/activation terms can dominate runtime
- Weight streaming depends on WRF, not parameter count
- KV cache traffic scales with sequence length, not parameters
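A back-of-the-envelope sketch of why parameter share and streamed-byte share diverge under top-k routing; the counts below are hypothetical, DeepSeek-style round numbers:

```python
# Hypothetical DeepSeek-style counts, only to illustrate the point.
n_experts, top_k = 256, 8
expert_params = 630e9                 # experts hold the overwhelming parameter share
other_params = 14e9                   # attention, embeddings, shared layers

# Per token, only top_k of n_experts are touched, so streamed bytes diverge from parameter share.
active_expert_params = expert_params * top_k / n_experts
param_share = expert_params / (expert_params + other_params)
streamed_share = active_expert_params / (active_expert_params + other_params)
print(round(param_share, 3), round(streamed_share, 3))   # ~0.978 vs ~0.58
```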
5. Naive vs Efficient Execution Models
| Aspect | Naive | Efficient |
|---|---|---|
| Attention | Materializes S×S score/prob matrices | Flash attention (no S×S HBM traffic) |
| Activation fusion | No fusion | Fused kernels reduce HBM trips |
| Elementwise ops | Full temporaries | Reduced temporary buffers |
| WRF | 1.0 (no reuse) | 2-4× (weight reuse) |
Result: Efficient mode can shift regimes from HBM-bound to compute-bound.
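A quick sketch of the first row of the table, with hypothetical shapes, showing how much HBM traffic the materialized score/prob matrices add in naive mode:

```python
# Hypothetical shapes: one attention layer, bf16 activations.
B, S, H, bytes_per_el = 1, 8192, 128, 2

# Naive: the S x S score and probability matrices round-trip through HBM
# (write scores, read scores, write probs, read probs).
naive_sxs_bytes = 4 * B * H * S * S * bytes_per_el

# Efficient (flash-style): the S x S intermediates never leave on-chip memory.
flash_sxs_bytes = 0

print(naive_sxs_bytes / 1e9)     # ~68.7 GB of avoidable HBM traffic for this single layer
```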
Implementation Patterns
Modular FLOP Computation
def _compute_attention_flops_prefill_mla(
B, S, H, h, r_q, r_kv, d_nope, d_rope, d_v
):
"""MLA-specific FLOP calculation for prefill phase"""
# Q projection: B*S*H*r_q
# KV projection: 2*B*S*H*r_kv
# Attention: 8*B*S*H^2 + 4*B*S^2*H + ...
return total_flops
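One way such per-module calculators might be composed into a phase total; the helper names and composition below are hypothetical, not the actual nano-train API:

```python
def _compute_mlp_flops(B, S, H, d_ff):
    """Dense MLP block: up- and down-projection GEMMs (illustrative)."""
    return 2 * B * S * H * d_ff * 2   # 2*M*K*N per GEMM, two GEMMs

def compute_prefill_flops(cfg):
    """Sum per-module contributions over all layers for the prefill phase."""
    total = 0
    for _ in range(cfg["num_layers"]):
        total += _compute_attention_flops_prefill_mla(**cfg["attn"])
        total += _compute_mlp_flops(**cfg["mlp"])
    return total
```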
Byte Decomposition
bytes_hbm = (
bytes_weights + # Streaming parameter reads
bytes_activations + # Input/output tensors
bytes_kv + # KV cache read/write
bytes_temporary # Intermediate buffers
)
Time Model
T_comp = F_realizable / P_peak
T_hbm = bytes_hbm / BW_hbm
T_net = bytes_net / BW_net
T_est = max(T_comp, T_hbm, T_net)
regime = argmax(T_comp, T_hbm, T_net)
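A directly runnable version of the same selection, with placeholder per-step totals (the FLOP and byte counts are not taken from any real model):

```python
# Placeholder per-step totals; P_peak in FLOP/s, bandwidths in bytes/s.
F_realizable, bytes_hbm, bytes_net = 2.0e13, 4.0e10, 5.0e9
P_peak, BW_hbm, BW_net = 1.0e15, 3.0e12, 4.5e11

times = {
    "compute": F_realizable / P_peak,   # T_comp
    "hbm": bytes_hbm / BW_hbm,          # T_hbm
    "network": bytes_net / BW_net,      # T_net
}
T_est = max(times.values())
regime = max(times, key=times.get)      # the resource bounding the step
print(regime, f"{T_est * 1e3:.1f} ms")  # here: compute-bound at 20.0 ms
```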
Roofline Workflow
Step 1: Classify Limiting Regime
- Compare AI_hbm to OI_knee = P_peak / BW_hbm
- Inspect T_comp, T_hbm, T_net to find the bottleneck
Step 2: Locate Dominant Byte Term
- Calculate share(weights), share(kv), share(temporary)
- Identify which memory traffic dominates
Step 3: Map to Optimization Family
- Compute-bound: Improve utilization, fusion
- HBM(weight)-bound: Increase residency, compression
- HBM(KV)-bound: KV format/dtype/layout optimization
- Network-bound: Topology, compression, overlap
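Putting the three steps together, a condensed sketch; the thresholds and the bottleneck-to-optimization mapping are simplifications of what the full report encodes:

```python
def classify_and_recommend(flops, bytes_by_term, bytes_net, P_peak, BW_hbm, BW_net):
    """Steps 1-3: limiting regime, dominant byte term, optimization family (illustrative)."""
    bytes_hbm = sum(bytes_by_term.values())
    times = {"compute": flops / P_peak,
             "hbm": bytes_hbm / BW_hbm,
             "network": bytes_net / BW_net}
    regime = max(times, key=times.get)                     # Step 1: limiting regime

    dominant = max(bytes_by_term, key=bytes_by_term.get)   # Step 2: dominant byte term

    if regime == "compute":                                # Step 3: map to optimization family
        return "improve utilization, kernel fusion"
    if regime == "network":
        return "topology, compression, compute/comm overlap"
    return {"weights": "increase weight residency, compression",
            "kv": "KV format/dtype/layout optimization"}.get(dominant, "fuse kernels to cut temporaries")
```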
Practical Examples
DeepSeek-V3 Analysis
Training:
- Efficient mode: AI_hbm = 652.03 > OI_knee = 412.29 → compute-bound
- Naive mode: AI_hbm = 319.1 < OI_knee → HBM-bound
- Conclusion: Efficient kernels (flash attention, fusion) are critical
Prefill:
- Efficient mode: AI_hbm = 434.68 > OI_knee → compute-bound
- Similar to training: optimization focus on compute path
Decode:
- Efficient mode: AI_hbm = 7.70 << OI_knee → HBM-bound
- Remains memory-bound across all sampled batches
- Conclusion: Batching policy and KV management are key
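Using only the numbers above (and assuming the same OI_knee = 412.29 applies to decode), the roofline puts a hard ceiling on decode compute utilization, which is why the guidance points at batching and KV management rather than kernel tuning:

```python
AI_hbm, OI_knee = 7.70, 412.29
max_compute_utilization = AI_hbm / OI_knee   # attainable FLOP/s divided by P_peak
print(f"{max_compute_utilization:.1%}")      # ~1.9% of peak, regardless of kernel quality
```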
Sensitivity Insights
For decode, ranked by effect size on T_est:
- top-k experts: 189.4% impact
- hidden scale: 153.4% impact
- KV length (L): 9.4% impact
- KV dtype bytes: 3.3% impact
Actionable insight: MoE routing parameters matter more than KV cache dtype for this model.
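A minimal sketch of how such a one-knob-at-a-time effect-size ranking can be produced; the toy T_est model and knob ranges below are placeholders, not the inputs behind the percentages above:

```python
def t_est(top_k, kv_len, kv_bytes_per_pos, p_peak=1e15, bw_hbm=3e12):
    """Toy decode step-time model; the real tool derives this from the full FLOP/byte model."""
    flops = 2e11 * top_k                                   # active-expert GEMM work (placeholder)
    bytes_hbm = 4e9 * top_k + kv_bytes_per_pos * kv_len + 2e9
    return max(flops / p_peak, bytes_hbm / bw_hbm)

base = {"top_k": 8, "kv_len": 8192, "kv_bytes_per_pos": 70e3}
ranges = {"top_k": (4, 16), "kv_len": (2048, 32768), "kv_bytes_per_pos": (35e3, 140e3)}

for knob, (lo, hi) in ranges.items():
    ts = [t_est(**{**base, knob: v}) for v in (lo, hi)]
    swing = 100.0 * (max(ts) - min(ts)) / min(ts)          # effect size on T_est, in percent
    print(f"{knob}: {swing:.1f}%")
```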
Key Takeaways
Tool Evolution Lessons
- Start simple, iterate fast: Basic architecture dump → comprehensive analysis
- Follow the data: Once you have FLOPs/bytes, analysis naturally extends
- Static analysis is powerful: No runtime profiling needed for bottleneck identification
- Automation scales: Manual analysis for one model → automated for any model
Technical Insights
- Roofline is actionable: Tells you exactly what to optimize
- Bottlenecks differ by phase: Training/prefill often compute-bound, decode often memory-bound
- Bytes matter as much as FLOPs: Memory traffic can limit performance even with fast compute
- Model architecture determines regime: MLA attention, MoE routing, tensor parallelism all affect bottlenecks
Design Patterns
- Modular computation: Separate FLOP/byte calculators per module type
- Configurable assumptions: Pluggable execution models (naive/efficient)
- Sweep-based analysis: Grid search over configuration space
- Visualization for insight: Roofline plots, regime tables, sensitivity charts
Future Enhancements
- Measured performance: Integrate actual profiler data
- Cost modeling: Dollar estimates for cloud deployments
- Power modeling: Energy consumption predictions
- Autotuning: Automatic optimization parameter search
- Multi-chip analysis: TP/DP/PP scaling behavior
Resources
- Example report: deepseek_model_report.md
- Implementation: nano-train