Learned two key patterns for how SGLang supports models: (1) native support for models like GLM-5, achieved by mapping their architecture onto the existing DeepSeek-V2 implementation, and (2) a Transformers-backend fallback for unsupported models, with performance-critical hot paths replaced by SGLang implementations.
Part 1: GLM-5 Native Support via DeepSeek-V2 Mapping
Both SGLang and vLLM support GLM-5 models by mapping the GlmMoeDsaForCausalLM architecture onto their existing DeepSeek-V2 family implementations, enabling native DeepSeek-style attention (MLA/DSA), MoE routing, and KV cache paths without requiring a Transformers backend.
Core Architecture Mapping
Both inference engines use the same strategy: map GLM-5 to DeepSeek-V2 code paths. The HF config for GLM-5 specifies:
```python
architectures = ["GlmMoeDsaForCausalLM"]
model_type = "glm_moe_dsa"
```
Both engines recognize this architecture and route it to their DeepSeek-V2 implementation:
```python
class GlmMoeDsaForCausalLM(DeepseekV2ForCausalLM):
    pass
```
This simple inheritance gives GLM-5 full access to DeepSeek’s optimized MLA/DSA attention, MoE routing, and KV cache management.
SGLang: Registration-Based Native Support
Registration Flow
- Model Registry Auto-Discovery: SGLang's `ModelRegistry.register("sglang.srt.models")` loads all model modules and their `EntryClass` exports (see the sketch below)
- GLM-5 Native Class: `sglang/python/sglang/srt/models/glm4_moe.py` defines `GlmMoeDsaForCausalLM` and exports it via `EntryClass`
- Architecture Detection: the model loader checks whether `config.architectures` is in the registry; if found, it uses the native class, otherwise it falls back to `TransformersForCausalLM`
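As a rough illustration of what such `EntryClass`-based auto-discovery involves, here is a minimal sketch (a hypothetical helper, not SGLang's actual code in `registry.py`):

```python
import importlib
import pkgutil

def discover_entry_classes(package_name: str) -> dict[str, type]:
    """Sketch of EntryClass-style auto-discovery: import every module in a
    package and index its EntryClass export(s) by class name."""
    registry: dict[str, type] = {}
    package = importlib.import_module(package_name)
    for info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{info.name}")
        entry = getattr(module, "EntryClass", None)
        if entry is None:
            continue
        # EntryClass may export a single class or a list of classes
        classes = entry if isinstance(entry, (list, tuple)) else [entry]
        for cls in classes:
            registry[cls.__name__] = cls
    return registry
```

With this shape, a call like `discover_entry_classes("sglang.srt.models")` would end up mapping `"GlmMoeDsaForCausalLM"` to the class defined in `glm4_moe.py`.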
DSA/NSA Attention Path
```python
# sglang/python/sglang/srt/configs/model_config.py
is_deepseek_nsa(hf_config)  # Returns True for GlmMoeDsaForCausalLM
```
When is_deepseek_nsa() returns true, SGLang:
- Selects the `nsa` attention backend in server args
- Adds GLM-5-specific handling for SM100 devices
- Routes through `DeepseekV2AttentionMLA` with DSA/NSA logic
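A minimal sketch of what such a predicate might look like, assuming it keys on the architecture names in the HF config (the architecture set here is illustrative, not copied from SGLang):

```python
# Illustrative architecture set; the real check in model_config.py may differ.
NSA_ARCHITECTURES = {"GlmMoeDsaForCausalLM", "DeepseekV32ForCausalLM"}

def is_deepseek_nsa(hf_config) -> bool:
    """True when the checkpoint declares a DeepSeek sparse-attention
    (NSA/DSA) architecture."""
    archs = getattr(hf_config, "architectures", None) or []
    return any(arch in NSA_ARCHITECTURES for arch in archs)
```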
Forward Path
```
DeepseekV2ForCausalLM.forward()
→ DeepseekV2Model.forward()
  → DecoderLayer loop:
      - Input layernorm
      - DeepseekV2AttentionMLA (DSA/NSA, RadixAttention)
      - Post-attn layernorm
      - MLP or MoE (based on n_routed_experts)
→ LogitsProcessor + lm_head
```
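Schematically, each decoder layer follows the standard pre-norm residual pattern. A simplified PyTorch sketch (module names are stand-ins, not the actual SGLang classes; `nn.RMSNorm` needs PyTorch ≥ 2.4):

```python
import torch
from torch import nn

class DecoderLayerSketch(nn.Module):
    """Pre-norm residual layer: attention then MLP/MoE, each wrapped in
    RMSNorm + residual, mirroring the loop above."""
    def __init__(self, attn: nn.Module, mlp_or_moe: nn.Module, hidden_size: int):
        super().__init__()
        self.input_layernorm = nn.RMSNorm(hidden_size)
        self.post_attention_layernorm = nn.RMSNorm(hidden_size)
        self.self_attn = attn     # stands in for DeepseekV2AttentionMLA
        self.mlp = mlp_or_moe     # dense MLP or MoE block, depending on the layer

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = hidden_states + self.self_attn(self.input_layernorm(hidden_states))
        hidden_states = hidden_states + self.mlp(self.post_attention_layernorm(hidden_states))
        return hidden_states
```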
Files:
- `sglang/python/sglang/srt/models/registry.py`
- `sglang/python/sglang/srt/models/glm4_moe.py`
- `sglang/python/sglang/srt/models/deepseek_v2.py`
- `sglang/python/sglang/srt/model_loader/utils.py`
vLLM: Registry Mapping Resolution
Registry Mapping
```python
# vllm/vllm/model_executor/models/registry.py
_TEXT_GENERATION_MODELS = {
    "GlmMoeDsaForCausalLM": ("deepseek_v2", "GlmMoeDsaForCausalLM"),
    ...
}
```
This mapping tells vLLM: when you see GlmMoeDsaForCausalLM in the config, load the deepseek_v2 module and use the GlmMoeDsaForCausalLM class defined there.
Config Conversion
```python
# vllm/vllm/transformers_utils/model_arch_config_convertor.py
is_deepseek_mla(hf_config)  # Returns True for model_type == "glm_moe_dsa"
```
When is_deepseek_mla() returns true, vLLM enables the correct head dimension logic for MLA attention.
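The head-dimension point is the key MLA quirk: the KV cache stores the compressed latent plus the decoupled RoPE key, not per-head K/V. A sketch of the resulting dimension using the standard DeepSeek config fields (an illustrative helper, not vLLM's code):

```python
def mla_kv_cache_head_dim(hf_config) -> int:
    # MLA caches the compressed KV latent plus the RoPE key part, so the
    # effective cached head dim is kv_lora_rank + qk_rope_head_dim
    # (e.g., 512 + 64 = 576 for DeepSeek-V2/V3-style configs).
    return hf_config.kv_lora_rank + hf_config.qk_rope_head_dim
```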
Speculative Decoding Override
For speculative decoding, GLM-5 gets additional special handling:
```python
# vllm/vllm/config/speculative.py
if hf_config.model_type in ("deepseek_v3", "deepseek_v32", "glm_moe_dsa"):
    # Convert to deepseek_mtp with architectures=["DeepSeekMTPModel"]
    ...
```
Forward Path
```
DeepseekV2ForCausalLM.forward()
→ DeepseekV2Model.forward()
  → DecoderLayer loop:
      - Input RMSNorm
      - DeepseekAttention or DeepseekV2MLAAttention
      - Post-attn RMSNorm
      - MLP or MoE (based on n_routed_experts)
→ LogitsProcessor (last PP rank)
```
Files:
- `vllm/vllm/model_executor/models/registry.py`
- `vllm/vllm/model_executor/models/deepseek_v2.py`
- `vllm/vllm/transformers_utils/model_arch_config_convertor.py`
- `vllm/vllm/config/speculative.py`
Key Differences
| Aspect | SGLang | vLLM |
|---|---|---|
| Registration | EntryClass export in model module | _TEXT_GENERATION_MODELS dictionary mapping |
| Class location | glm4_moe.py (separate file) | deepseek_v2.py (inline definition) |
| Attention selection | is_deepseek_nsa() + server args | use_mla flag + cache_config |
| Special handling | SM100 device logic | Speculative decoding MTP mapping |
| Fallback condition | Arch not registered OR --model-impl transformers | model_impl=transformers OR arch not resolved |
Verification Checklist
SGLang
```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("zai-org/GLM-5-FP8", trust_remote_code=True)
print("architectures:", cfg.architectures)  # Should include GlmMoeDsaForCausalLM
print("model_type:", cfg.model_type)        # Should be glm_moe_dsa

from sglang.srt.models.registry import ModelRegistry
print("GlmMoeDsaForCausalLM" in ModelRegistry.get_supported_archs())  # Should be True
```
vLLM
```python
from vllm.model_executor.models import registry

print(registry._TEXT_GENERATION_MODELS.get("GlmMoeDsaForCausalLM"))
# Should output: ("deepseek_v2", "GlmMoeDsaForCausalLM")
```
Practical Implications
- No Transformers overhead: GLM-5 runs through native, optimized DeepSeek code paths
- Full MoE support: Inherits DeepSeek-V2’s expert routing and load balancing
- MLA/DSA attention: Uses compressed KV cache and sparse attention optimizations
- Tensor parallelism: Works with DeepSeek-V2’s distributed inference strategies
- When fallback occurs: Only if the HF config changes the architecture name OR the explicit `model_impl=transformers` flag is used
Part 2: Transformers Backend Fallback for Unsupported Models
When a model's HF `architectures` entry is not registered in SGLang, it falls back to `TransformersForCausalLM`: the model is built via Transformers, and the hot paths (attention and linear layers) are then replaced with SGLang implementations for performance.
Design Pattern: Adapter/Facade + Hot-Path Swap
Core idea: Transformers builds the structure, SGLang swaps hot modules.
The decisive factors are:
- The HF `architectures` entry resolves to a usable Transformers class
- SGLang can wrap it with its backend
From --model to Model Init (Unsupported Model)
Example: `--model acme-ai/AcmeLM-1` where the HF config has `architectures = ["AcmeLMForCausalLM"]` (not in the SGLang registry)
Step-by-step:
- CLI parse: `--model` → `ServerArgs.model_path`
- ModelConfig: `get_config()` pulls the HF config and determines `architectures`
- Model class selection: not registered → fallback to `TransformersForCausalLM` (see the sketch after the file list)
- Weights: downloaded from HF and loaded
Files:
- `python/sglang/srt/server_args.py`
- `python/sglang/srt/configs/model_config.py`
- `python/sglang/srt/model_loader/utils.py`
- `python/sglang/srt/models/registry.py`
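The selection step above boils down to a registry lookup with a fallback. A hypothetical sketch of that decision (the real logic in `model_loader/utils.py` differs in detail):

```python
def resolve_model_class(architectures: list[str], registry: dict[str, type],
                        model_impl: str = "auto") -> type:
    """Return the native class when one is registered; otherwise fall back
    to the generic Transformers-backed wrapper. Sketch only."""
    if model_impl != "transformers":  # an explicit flag forces the fallback
        for arch in architectures:
            if arch in registry:
                return registry[arch]  # native SGLang implementation
    return registry["TransformersForCausalLM"]  # generic fallback wrapper
```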
Three-Layer Replacement Strategy
1. Model Construction via Transformers
```python
AutoModel.from_config(
    config,
    attn_implementation="sglang",
    trust_remote_code=True,
)
```

- `AutoModel.from_config` builds the model from the HF config
- `attn_implementation="sglang"` redirects attention calls to SGLang
File: python/sglang/srt/models/transformers.py
2. Attention: Call-Site Routing (Not Class Replacement)
SGLang registers a custom attention function and routes HF attention through it into RadixAttention:
```python
ALL_ATTENTION_FUNCTIONS["sglang"] = sglang_flash_attention_forward
```
This is entry-point swapping, not class rewriting.
File: python/sglang/srt/models/transformers.py
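Functions registered this way follow Transformers' attention-interface contract: they receive the attention module plus the projected q/k/v tensors and return `(output, attn_weights)`. A schematic stand-in, using SDPA where the real hook would dispatch into RadixAttention:

```python
import torch.nn.functional as F
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

def sglang_attention_sketch(module, query, key, value, attention_mask=None, **kwargs):
    # Stand-in body: the real sglang_flash_attention_forward hands q/k/v to
    # RadixAttention and SGLang's KV cache instead of computing here.
    out = F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask)
    return out, None  # the interface expects (attn_output, attn_weights)

# After registration, attn_implementation="sglang" routes every HF attention
# call through this function.
ALL_ATTENTION_FUNCTIONS["sglang"] = sglang_attention_sketch
```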
3. Linear Layers: In-Place Object Replacement
`tensor_parallel()` walks the model and replaces `nn.Linear` instances with:

- `ColumnParallelLinear`
- `RowParallelLinear`

This requires `base_model_tp_plan` in the config; no plan means no TP replacement (see the sketch below).
File: python/sglang/srt/models/transformers.py
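The mechanic is plain object replacement via `setattr` on the parent module, guided by the plan's name patterns. A self-contained sketch that only logs what it would swap, assuming pattern entries shaped like HF's `base_model_tp_plan` (e.g. `"layers.*.self_attn.q_proj": "colwise"`):

```python
import fnmatch
import torch.nn as nn

def tensor_parallel_sketch(model: nn.Module, tp_plan: dict[str, str]) -> None:
    """Walk the module tree; for each nn.Linear whose qualified name matches
    a plan pattern, a real implementation would setattr() a ColumnParallelLinear
    or RowParallelLinear in its place. This sketch only reports the matches."""
    for parent_name, parent in model.named_modules():
        for child_name, child in parent.named_children():
            if not isinstance(child, nn.Linear):
                continue
            qualified = f"{parent_name}.{child_name}".lstrip(".")
            for pattern, style in tp_plan.items():
                if fnmatch.fnmatch(qualified, pattern):
                    # e.g. setattr(parent, child_name, ColumnParallelLinear(...))
                    print(f"would replace {qualified} with a {style}-parallel linear")
```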
4. Embedding + LM Head: Parallel Versions
- Input embedding → `VocabParallelEmbedding`
- Output head → `ParallelLMHead`
- Weight tying supported
File: python/sglang/srt/models/transformers.py
Performance Impact by Module
| Module | Performance impact | SGLang handling (Transformers backend) | Notes |
|---|---|---|---|
| Attention | Dominant compute + KV traffic | Hook → RadixAttention | Entry-point swap, not class rewrite |
| Linear/MLP | Large matmuls, TP critical | Replace via base_model_tp_plan | No TP plan → no replacement |
| Embedding/LM head | Large vocab, comm heavy | Parallel embedding + head | Only I/O layers replaced |
| MoE routing | Gating/dispatch overhead | Not replaced | Remains HF logic |
| KV cache dtype | Memory/bandwidth tradeoff | Global config | Independent of model class |
| CUDA Graph | Reduce launch overhead | Capture/replay for decode | Can be disabled |
| Scheduling | Throughput/latency | Scheduler-level | Backend-agnostic |
Performance Bottleneck Analysis
| Area | Impact | Handling | Knobs |
|---|---|---|---|
| Prefill vs decode | Prefill compute, decode bandwidth | Scheduler separates | --chunked-prefill-size, --max-prefill-tokens |
| Batch sizing | Throughput vs latency | Scheduler policies | --schedule-policy, --max-running-requests |
| TP/DP comms | All-reduce cost | TP linears | --tp-size, --disable-custom-all-reduce |
| KV cache | Memory & bandwidth | Configured centrally | --kv-cache-dtype |
| CUDA Graph | Kernel launch overhead | ModelRunner capture | --disable-cuda-graph, --cuda-graph-bs |
| MoE experts | Routing + dispatch | Not replaced | Needs native model class |
| Quantization | Load + speed | Generic loader | --quantization, --load-format |
| Memory headroom | Limits batch | Memory management | --mem-fraction-static |
Runtime Forward Path (Condensed)
- Request → Tokenizer → Scheduler → ModelRunner
- HF model forward executes:
  - Attention routed to `RadixAttention`
  - MLP uses TP linears
- Output logits via `ParallelLMHead`
What This Means for Unsupported Models
- No native class → No model-specific kernels
- Load depends on Transformers (local install or `--trust-remote-code`)
- Performance is generic: TP + RadixAttention, but no custom MoE or model-specific tuning
- MoE limitation: Transformers backend doesn’t replace MoE routing
Performance Optimization Advice
- Enable CUDA Graph for decode (don’t disable it)
- Choose TP size to avoid communication bottlenecks
- Tune KV cache dtype for concurrency/VRAM tradeoff
- Adjust prefill chunking to reduce tail latency
- For best MoE performance, add a native model class
Transformers Backend vs Native Support
| Aspect | Native Support (GLM-5) | Transformers Backend (Unsupported) |
|---|---|---|
| Model class | Registered in SGLang registry | Falls back to TransformersForCausalLM |
| Implementation | Direct DeepSeek-V2 inheritance | Wrapped via Transformers, then hot-path swap |
| Attention | Native DeepSeek MLA/DSA | RadixAttention via call-site routing |
| MoE routing | Native DeepSeek MoE | Not replaced (HF native logic) |
| TP support | Native DeepSeek TP | Via base_model_tp_plan if available |
| Custom kernels | Model-specific optimizations | Generic optimizations only |
Key Code Locations
- CLI args: `python/sglang/srt/server_args.py`
- Config load: `python/sglang/srt/configs/model_config.py`
- HF config utils: `python/sglang/srt/utils/hf_transformers_utils.py`
- Model selection: `python/sglang/srt/model_loader/utils.py`
- Transformers backend: `python/sglang/srt/models/transformers.py`
- Weight loader: `python/sglang/srt/model_loader/loader.py`
Overall Key Takeaways
For GLM-5 (Native Support)
- GLM-5 ≈ DeepSeek-V2: Both engines treat GLM-5 as a DeepSeek-V2 family model through simple class inheritance
- Native over Transformers: Preference for native implementation with Transformers fallback only when necessary
- Architecture name matters: The `GlmMoeDsaForCausalLM` string in the HF config unlocks native support
- Full optimization pipeline: GLM-5 gets all DeepSeek optimizations (MLA attention, DSA sparsity, MoE routing, TP distribution)
- Simple but powerful: A one-line class definition enables the entire DeepSeek optimization stack
For Unsupported Models (Transformers Backend)
- Adapter/Facade pattern: TransformersForCausalLM wraps arbitrary HF models
- Hot-path swap: Attention (call routing), Linear (object replacement), Embedding (parallel)
- Performance is generic: Works for all models, but model-specific features (MoE) get HF treatment
- TP requires config: Linear layer replacement only works if `base_model_tp_plan` exists in the config
- When to add a native class: For model-specific optimizations (especially MoE routing)