A decoder-only language model combining Mamba SSM, Multi-Head Attention with RoPE, and Fine-Grained Mixture-of-Experts. One of only three published architectures unifying all three paradigms.
Linear-time selective state updates for efficient local sequence modelling. 12 layers with SwiGLU FFN.
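As a rough illustration of what a linear-time selective state update looks like, the sketch below runs a scalar recurrence whose decay and write strength depend on the current input. It is a toy under stated assumptions, not the model's actual Mamba block: the function name `selective_scan`, the scalar weights `w_delta`, `w_b`, `w_c`, and the single-channel state are all hypothetical simplifications.

```python
import math


def softplus(z: float) -> float:
    """Smooth positive function, used here to keep the step size > 0."""
    return math.log1p(math.exp(z))


def selective_scan(xs, a=1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Toy scalar selective SSM: one hidden state, one pass over the sequence.

    All parameters are illustrative assumptions, not values from the model.
    """
    h = 0.0
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)   # input-dependent step size
        a_bar = math.exp(-delta * a)    # decay factor in (0, 1) when a > 0
        b_bar = delta * (w_b * x)       # input-dependent write strength
        h = a_bar * h + b_bar * x       # state update: O(1) work per token
        ys.append((w_c * x) * h)        # input-dependent readout
    return ys


outputs = selective_scan([0.5, -1.0, 2.0, 0.0])
print(len(outputs))
```

Because each token touches the state exactly once, the cost grows linearly with sequence length, in contrast to the quadratic cost of full attention.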
16-head attention with Rotary Position Embeddings for global context. 4 layers in Jamba-style interleaving.
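The following sketch shows the core idea behind Rotary Position Embeddings: each pair of dimensions is rotated by an angle proportional to the token position, so the query-key dot product depends only on the relative offset. The adjacent-pair layout and `base=10000.0` follow the common convention and are assumptions here; the model's implementation may pair dimensions differently.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate adjacent dimension pairs of `vec` by position-dependent angles."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2-D rotation
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Same relative offset (2) at different absolute positions.
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 5), rope(k, 3))
print(abs(score_a - score_b) < 1e-9)
```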
32 routed + 2 shared experts with top-2 routing and loss-free EMA load balancing. Zero dead experts.
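To make the routed-plus-shared layout concrete, here is a minimal sketch of top-2 routing over 32 routed experts with 2 always-on shared experts. Several details are assumptions rather than facts from the source: the gate weights are a softmax over the two selected logits, the shared experts are summed without gating, and the toy experts are simple scalar functions.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top2(logits):
    """Return [(expert_index, gate_weight), ...] for the two highest logits."""
    top = sorted(range(len(logits)), key=lambda e: -logits[e])[:2]
    weights = softmax([logits[e] for e in top])  # renormalise over the pair
    return list(zip(top, weights))


def moe_layer(x, logits, routed_experts, shared_experts):
    """Weighted sum of the two routed experts plus every shared expert."""
    routed = sum(w * routed_experts[e](x) for e, w in route_top2(logits))
    shared = sum(f(x) for f in shared_experts)  # shared experts see every token
    return routed + shared


# Hypothetical toy experts: routed expert i scales its input by i.
routed_experts = [lambda x, s=i: s * x for i in range(32)]
shared_experts = [lambda x: x, lambda x: 0.5 * x]

logits = [0.0] * 32
logits[5] = 2.0
logits[9] = 2.0

out = moe_layer(1.0, logits, routed_experts, shared_experts)
print(out)  # 0.5 * 5 + 0.5 * 9 + 1.0 + 0.5 = 8.5
```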
| Model | Parameters | Attention | SSM | MoE | Compute Budget | Organisation |
|---|---|---|---|---|---|---|
| Jamba | 52B | ✓ | ✓ Mamba | ✓ 16 experts | Enterprise | AI21 Labs |
| Zamba | 7B | ✓ Shared | ✓ Mamba | ✗ | Institutional | Zyphra |
| Mixtral | 46.7B | ✓ | ✗ | ✓ 8 experts | Enterprise | Mistral AI |
| HybridMoE Titan | 450M | ✓ RoPE | ✓ Mamba | ✓ 32+2 experts | ~$50 | Independent |
Replaces dynamic torch.nonzero() with deterministic fixed-shape tensor operations, enabling full gradient checkpointing across all 16 layers. This reduces VRAM use from more than 24 GB to 5.84 GB.
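The contrast between index-gathering dispatch and a fixed-shape formulation can be sketched in plain Python, with lists standing in for tensors. This is one possible fixed-shape scheme (a dense gate matrix with zeros for unselected experts), shown as an assumption; the source does not say which fixed-shape operations the model uses, and `expert_fn` is a hypothetical toy expert.

```python
def expert_fn(expert_id, x):
    """Hypothetical toy expert: scales the input by (expert_id + 1)."""
    return (expert_id + 1) * x


def moe_dynamic(tokens, assignments, num_experts):
    """Index-gathering dispatch: per-expert work size depends on the routing."""
    out = [0.0] * len(tokens)
    for e in range(num_experts):
        # Analogue of torch.nonzero(): a list whose length varies per batch.
        idx = [t for t, routed in enumerate(assignments) if e in dict(routed)]
        for t in idx:
            out[t] += dict(assignments[t])[e] * expert_fn(e, tokens[t])
    return out


def moe_fixed_shape(tokens, assignments, num_experts):
    """Dense dispatch: a [tokens x experts] gate matrix of constant shape."""
    gates = [[0.0] * num_experts for _ in tokens]
    for t, routed in enumerate(assignments):
        for e, w in routed:
            gates[t][e] = w  # unselected experts keep a zero gate
    return [
        sum(gates[t][e] * expert_fn(e, x) for e in range(num_experts))
        for t, x in enumerate(tokens)
    ]


tokens = [1.0, 2.0, 3.0]
# Each token lists its two (expert, gate_weight) pairs.
assignments = [[(0, 0.75), (2, 0.25)], [(1, 0.5), (2, 0.5)], [(0, 0.5), (1, 0.5)]]

dynamic_out = moe_dynamic(tokens, assignments, num_experts=3)
fixed_out = moe_fixed_shape(tokens, assignments, num_experts=3)
print(dynamic_out, fixed_out)
```

Both paths produce the same values, but only the dense version has shapes that are independent of the routing decision. In this sketch the dense form also evaluates every expert on every token, a compute cost the real implementation may avoid by other means.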
EMA-based bias updates on router logits maintain perfect expert utilisation (std < 0.032) without competing auxiliary loss terms in the training objective.
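A bias-based balancing step of this kind might look like the sketch below: each expert's load is tracked with an exponential moving average, and a per-expert bias added to the router logits is nudged toward uniform load, with no extra loss term. The update rule, `decay`, `step`, and the function names are assumptions for illustration; the source does not give the exact formula.

```python
def select_experts(logits, bias, k=2):
    """Pick the k experts with the highest biased score (logit + bias)."""
    scores = [l + b for l, b in zip(logits, bias)]
    return sorted(range(len(scores)), key=lambda e: -scores[e])[:k]


def update_bias(bias, ema_load, counts, decay=0.9, step=0.01):
    """One balancing step; returns (new_bias, new_ema_load).

    Assumed rule: raise the bias of under-used experts and lower it for
    over-used ones, in proportion to the gap from uniform load.
    """
    total = sum(counts) or 1
    target = 1.0 / len(bias)
    new_ema = [
        decay * m + (1.0 - decay) * (c / total)
        for m, c in zip(ema_load, counts)
    ]
    new_bias = [b + step * (target - m) for b, m in zip(bias, new_ema)]
    return new_bias, new_ema


# Toy run: a router that always prefers expert 0, then expert 1.
num_experts = 4
logits = [2.0, 1.0, 0.5, 0.1]
bias = [0.0] * num_experts
ema_load = [1.0 / num_experts] * num_experts

for _ in range(500):
    counts = [0] * num_experts
    for e in select_experts(logits, bias):
        counts[e] += 1
    bias, ema_load = update_bias(bias, ema_load, counts)

print([round(b, 3) for b in bias])
```

The bias only influences which experts are selected, so the training objective itself carries no auxiliary balancing term.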
1:3 pattern: every 4th layer pairs attention with MoE for global processing; the intervening layers use Mamba for efficient local computation.
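The interleaving can be written down as a small layout function. The sketch assumes the attention layer sits at the end of each group of four (layers 4, 8, 12, 16); the labels and the function name `build_layer_pattern` are hypothetical.

```python
def build_layer_pattern(num_layers: int = 16, attention_every: int = 4) -> list[str]:
    """Return a layer-type label for each layer, in order."""
    pattern = []
    for i in range(num_layers):
        # Every `attention_every`-th layer (4th, 8th, ...) is attention + MoE.
        if (i + 1) % attention_every == 0:
            pattern.append("attention+moe")
        else:
            pattern.append("mamba+swiglu")
    return pattern


layers = build_layer_pattern()
print(layers.count("mamba+swiglu"), layers.count("attention+moe"))  # 12 4
```

With 16 layers this yields the 12 Mamba layers and 4 attention layers described above.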
Complete pre-training on a single NVIDIA L4 GPU (24 GB) on AWS g6.2xlarge. Demonstrates that cutting-edge hybrid architectures don't require institutional compute.
This paper is pending arXiv submission for cs.LG (Machine Learning). As a first-time independent submitter, the author needs just 1 endorsement from an established arXiv author.
Contact Author on HuggingFace →