🔬 Independent ML Research — Project Inkblot

HybridMoE Titan

A decoder-only language model that combines Mamba state-space (SSM) layers, multi-head attention with RoPE, and a fine-grained Mixture-of-Experts. To the author's knowledge, it is one of only three published architectures that unify all three paradigms.

450M

Parameters

3

Paradigms

32+2

Experts

~$50

Total Compute

Three Paradigms, One Model

🌀

Mamba SSM

Linear-time selective state updates for efficient local sequence modelling. 12 layers with SwiGLU FFN.

🔭

Multi-Head Attention

16-head attention with Rotary Position Embeddings for global context. 4 layers in Jamba-style interleaving.
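For readers new to Rotary Position Embeddings, here is a minimal sketch of the rotation applied to a single head vector. It uses the common interleaved-pair formulation with a base of 10000; the function names `rope_rotate` and `dot` are illustrative and not taken from the Titan codebase, whose exact RoPE variant is not stated here.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    assert d % 2 == 0, "RoPE rotates dimensions in pairs"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    """Plain dot product, standing in for the attention score."""
    return sum(x * y for x, y in zip(a, b))
```

The useful property is that the score between a rotated query and a rotated key depends only on their relative offset, so positions 5 and 3 give the same score as positions 12 and 10.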

⚡

Mixture-of-Experts

32 routed + 2 shared experts with top-2 routing and loss-free EMA load balancing. Zero dead experts.
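A rough sketch of how top-2 routing with always-on shared experts could combine outputs. This is illustrative only: `top2_route`, `moe_forward`, the scalar toy experts, and the choice to softmax over just the two selected logits are assumptions, not details confirmed for Titan.

```python
import math


def top2_route(logits):
    """Return [(expert_index, gate), ...] for the two highest router logits."""
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:2]
    m = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - m) for e in top]  # stable softmax over the pair
    z = sum(exps)
    return [(e, w / z) for e, w in zip(top, exps)]


def moe_forward(x, logits, routed, shared):
    """Shared experts process every token; routed experts are gated top-2."""
    y = sum(f(x) for f in shared)
    for e, gate in top2_route(logits):
        y += gate * routed[e](x)
    return y


# Toy setup: 32 routed experts and 2 shared experts acting on a scalar.
routed = [lambda x, k=k: k * x for k in range(32)]
shared = [lambda x: x, lambda x: 2 * x]
```

Each token therefore activates four expert functions in this sketch (2 shared + 2 routed), regardless of how many routed experts exist.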

How It Compares

Model	Parameters	Attention	SSM	MoE	Compute Budget	Organisation
Jamba	52B	✅	✅ Mamba	✅ 16 experts	Enterprise	AI21 Labs
Zamba	7B	✅ Shared	✅ Mamba	❌	Institutional	Zyphra
Mixtral	46.7B	✅	❌	✅ 8 experts	Enterprise	Mistral AI
HybridMoE Titan	450M	✅ RoPE	✅ Mamba	✅ 32+2 experts	~$50	Independent 🧑‍💻

Key Innovations

🧩 Fixed-Shape Dispatch

Replaces dynamic torch.nonzero() indexing with deterministic fixed-shape tensor operations, enabling full gradient checkpointing across all 16 layers. This cut VRAM use from >24 GB to 5.84 GB.
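The exact dispatch mechanism is not described here, so the following is only one common way to get fixed shapes: give each expert a slot table of fixed capacity and pad unused slots with a zero gate. The names `dispatch_fixed` and `combine_fixed`, the `capacity` parameter, and the drop-on-overflow behaviour are all assumptions for illustration; plain Python lists stand in for tensors.

```python
def dispatch_fixed(assignments, num_experts, capacity):
    """Place (token, gate) picks into a [num_experts][capacity] slot table.

    assignments[t] is the list of (expert, gate) picks for token t.
    Unused slots hold (0, 0.0): a valid token index with zero gate, so the
    combine step needs no data-dependent branching or indexing.
    """
    slots = [[(0, 0.0)] * capacity for _ in range(num_experts)]
    fill = [0] * num_experts  # running per-expert count (a cumsum in tensor code)
    dropped = 0
    for t, picks in enumerate(assignments):
        for e, gate in picks:
            if fill[e] < capacity:
                slots[e][fill[e]] = (t, gate)
                fill[e] += 1
            else:
                dropped += 1  # over capacity: this pick is skipped
    return slots, dropped


def combine_fixed(x, slots, experts):
    """Run every slot of every expert and scatter-add the gated outputs."""
    y = [0.0] * len(x)
    for e, row in enumerate(slots):
        for t, gate in row:
            y[t] += gate * experts[e](x[t])
    return y
```

The slot table always has `num_experts * capacity` entries no matter how tokens are routed, which is the property that makes recomputation under checkpointing deterministic. The sketch shows only that shape invariant, not the memory savings.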

⚖️ Loss-Free Load Balancing

EMA-based bias updates on router logits keep expert utilisation near-uniform (std < 0.032) without adding an auxiliary loss term that competes with the training objective.
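A minimal sketch of what an EMA-driven bias update might look like. It assumes a sign-based step in the style of published auxiliary-loss-free balancing schemes; `update_router_bias`, `decay`, and `step` are hypothetical names and values, and Titan's actual update rule may differ.

```python
def update_router_bias(bias, ema_load, counts, decay=0.99, step=0.01):
    """Nudge per-expert router biases toward uniform load.

    counts[e] is the number of tokens expert e received in the latest batch.
    The bias would be added to router logits for expert selection only, so no
    balancing term enters the loss.
    """
    total = max(sum(counts), 1)
    target = 1.0 / len(counts)  # uniform share of the load
    new_bias, new_ema = [], []
    for b, ema, c in zip(bias, ema_load, counts):
        ema = decay * ema + (1.0 - decay) * (c / total)
        direction = (ema < target) - (ema > target)  # +1 under-used, -1 over-used
        new_bias.append(b + step * direction)
        new_ema.append(ema)
    return new_bias, new_ema
```

Called once per training step, this raises the bias of experts whose smoothed load is below the uniform share and lowers it for overloaded ones.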

🔄 Jamba-Style Interleaving

A 1:3 pattern: every 4th layer pairs attention with MoE for global processing, and the three layers in between use Mamba for efficient local computation.
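The layer layout can be written down as a short schedule. This sketch assumes the attention + MoE block sits last in each group of four (layers 4, 8, 12, 16); the block labels and the function name `layer_schedule` are illustrative.

```python
def layer_schedule(num_layers=16, period=4):
    """Label each layer: every `period`-th is attention + MoE, the rest Mamba."""
    return [
        "attention+moe" if (i + 1) % period == 0 else "mamba+swiglu"
        for i in range(num_layers)
    ]


schedule = layer_schedule()
```

With the defaults this gives 12 Mamba layers and 4 attention layers, matching the layer counts quoted for the model.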

💰 Sub-$50 Training

Complete pre-training ran on a single NVIDIA L4 GPU (24 GB) on an AWS g6.2xlarge instance, showing that cutting-edge hybrid architectures don't require institutional compute.

Project Timeline

Early 2026

Architecture Design

Designed the triple-hybrid architecture from scratch — Mamba + Attention + MoE with Jamba-style interleaving.

Q1 2026

Titan v1 Training

Pre-trained the 450M model on a bilingual EN-PL corpus. Reached a perplexity (PPL) of 27.5 at step 42,850 on a single L4 GPU.

Q2 2026

Paper & Publication

Full research paper written. Published on Hugging Face. Zenodo DOI pending. arXiv submission awaiting endorsement.

Now

Titan v2 Training

Custom 64K tokeniser, cleaned dataset with domain mixing, MuonClip optimiser. Actively training.

Future

Specialisation

SFT on Polish biomedical instructions. Clinical reasoning, omics analysis, metabolic pathway interpretation.