🔬 Independent ML Research: Project Inkblot

HybridMoE Titan

A decoder-only language model combining Mamba SSM, Multi-Head Attention with RoPE, and Fine-Grained Mixture-of-Experts. One of only three published architectures unifying all three paradigms.

450M Parameters
3 Paradigms
32+2 Experts
~$50 Total Compute

Three Paradigms, One Model

🌀 Mamba SSM

Linear-time selective state updates for efficient local sequence modelling. 12 layers with SwiGLU FFN.

🔭 Multi-Head Attention

16-head attention with Rotary Position Embeddings for global context. 4 layers in Jamba-style interleaving.

⚡ Mixture-of-Experts

32 routed + 2 shared experts with top-2 routing and loss-free EMA load balancing. Zero dead experts.

How It Compares

| Model | Parameters | Attention | SSM | MoE | Compute Budget | Organisation |
|---|---|---|---|---|---|---|
| Jamba | 52B | ✅ | ✅ Mamba | ✅ 16 experts | Enterprise | AI21 Labs |
| Zamba | 7B | ✅ Shared | ✅ Mamba | ❌ | Institutional | Zyphra |
| Mixtral | 46.7B | ✅ | ❌ | ✅ 8 experts | Enterprise | Mistral AI |
| HybridMoE Titan | 450M | ✅ RoPE | ✅ Mamba | ✅ 32+2 experts | ~$50 | Independent 🧑‍💻 |

Key Innovations

🧩 Fixed-Shape Dispatch

Replaces dynamic torch.nonzero() with deterministic fixed-shape tensor operations, enabling full gradient checkpointing across all 16 layers. This reduced VRAM use from >24 GB to 5.84 GB.
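One way to get routing tensors whose shapes never depend on the routing decisions is a capacity-slot scheme built from one-hot masks and a cumulative sum. The sketch below is an illustration under that assumption, not the project's confirmed implementation: the function name `fixed_shape_dispatch`, the `capacity` parameter, and the policy of dropping tokens that overflow an expert's slots are all hypothetical.

```python
import torch
import torch.nn.functional as F


def fixed_shape_dispatch(x, router_logits, top_k, capacity):
    """Route tokens to experts without data-dependent shapes.

    x:             [T, D] token activations
    router_logits: [T, E] router scores
    Returns expert_in [E, C, D] and combine [T, E, C], where C = capacity.
    """
    num_tokens, num_experts = router_logits.shape
    gates = F.softmax(router_logits, dim=-1)
    top_w, top_idx = gates.topk(top_k, dim=-1)                # [T, K]
    top_w = top_w / top_w.sum(dim=-1, keepdim=True)           # renormalise over the K picks

    # One-hot expert assignment, flattened so first choices get slots first.
    assign = F.one_hot(top_idx, num_classes=num_experts).to(x.dtype)       # [T, K, E]
    flat = assign.permute(1, 0, 2).reshape(top_k * num_tokens, num_experts)

    # Slot index of each assignment inside its expert's queue (0 where unassigned).
    pos = (flat.cumsum(dim=0) - 1) * flat                     # [K*T, E]
    keep = flat * (pos < capacity).to(x.dtype)                # overflow is dropped (assumption)
    slot = F.one_hot(pos.long().clamp(max=capacity - 1), num_classes=capacity).to(x.dtype)

    dispatch = (keep.unsqueeze(-1) * slot).reshape(top_k, num_tokens, num_experts, capacity)
    weights = top_w.t().reshape(top_k, num_tokens, 1, 1)
    combine = (dispatch * weights).sum(dim=0)                 # [T, E, C], carries router gradients
    dispatch_mask = dispatch.sum(dim=0)                       # [T, E, C], entries are 0 or 1

    expert_in = torch.einsum("tec,td->ecd", dispatch_mask, x)  # [E, C, D]
    return expert_in, combine


# Usage: run the experts on expert_in, then scatter results back to token order.
x = torch.randn(6, 4)
logits = torch.randn(6, 8)
expert_in, combine = fixed_shape_dispatch(x, logits, top_k=2, capacity=6)
expert_out = expert_in                                        # stand-in for the expert FFNs
y = torch.einsum("tec,ecd->td", combine, expert_out)          # [T, D]
```

Every intermediate shape here is a function of the token count, expert count, top-k, and capacity only, so a recomputed forward pass under gradient checkpointing sees the same shapes as the original pass. The trade-off is the dense [T, E, C] mask, which costs memory that an index-based gather would avoid.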

โš–๏ธ Loss-Free Load Balancing

EMA-based bias updates on router logits maintain near-uniform expert utilisation (std < 0.032) without adding competing auxiliary loss terms to the training objective.
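The idea can be sketched in plain Python: a per-expert bias is added to the router logits for expert selection only, and after each batch it is nudged up for under-used experts and down for over-used ones, based on an exponential moving average (EMA) of the observed load. This is a toy simulation under assumptions, not the project's code: the sign-based update rule, the `decay` and `step_size` values, and all function names are hypothetical.

```python
import random
import statistics


def top_k(scores, k):
    """Indices of the k largest scores."""
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]


def batch_load(logits_batch, bias, k=2):
    """Fraction of routed assignments each expert receives in one batch."""
    counts = [0] * len(bias)
    for logits in logits_batch:
        # The bias only influences which experts are selected.
        biased = [logit + b for logit, b in zip(logits, bias)]
        for expert in top_k(biased, k):
            counts[expert] += 1
    total = sum(counts)
    return [c / total for c in counts]


def update_bias(bias, ema_load, load, decay=0.9, step_size=0.01):
    """Update the EMA of the load, then push each bias toward a uniform load."""
    target = 1.0 / len(bias)
    for expert, frac in enumerate(load):
        ema_load[expert] = decay * ema_load[expert] + (1.0 - decay) * frac
        if ema_load[expert] > target:
            bias[expert] -= step_size      # over-used: make selection less likely
        elif ema_load[expert] < target:
            bias[expert] += step_size      # under-used: make selection more likely


def simulate(num_experts=8, tokens=128, steps=400, seed=0):
    """Return the load std before and after bias balancing on a skewed router."""
    rng = random.Random(seed)
    skew = [0.3 * e for e in range(num_experts)]   # router initially favours later experts

    def sample(n):
        return [[s + rng.gauss(0.0, 1.0) for s in skew] for _ in range(n)]

    bias = [0.0] * num_experts
    ema_load = [1.0 / num_experts] * num_experts
    std_before = statistics.pstdev(batch_load(sample(2000), bias))
    for _ in range(steps):
        update_bias(bias, ema_load, batch_load(sample(tokens), bias))
    std_after = statistics.pstdev(batch_load(sample(2000), bias))
    return std_before, std_after
```

Calling `simulate()` returns the load standard deviation before and after balancing. Because the correction lives in a bias rather than in the loss, the language-modelling gradient is never traded off against a balancing term.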

🔄 Jamba-Style Interleaving

1:3 attention-to-Mamba pattern: every 4th layer pairs attention with MoE for global processing, while the intervening layers use Mamba for efficient local computation.
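The layer layout described above can be written down as a small schedule builder. This is a minimal sketch: the function name and the string labels are hypothetical, and placing the attention layer last in each group of four is an assumption about where "every 4th layer" falls.

```python
def build_layer_schedule(num_layers=16, attention_every=4):
    """Return a (mixer, feed-forward) pair for each layer in the stack."""
    schedule = []
    for index in range(num_layers):
        if (index + 1) % attention_every == 0:
            # Every 4th layer: global attention paired with the MoE block.
            schedule.append(("attention", "moe"))
        else:
            # Remaining layers: Mamba paired with a dense SwiGLU FFN.
            schedule.append(("mamba", "swiglu_ffn"))
    return schedule


schedule = build_layer_schedule()
mamba_layers = sum(1 for mixer, _ in schedule if mixer == "mamba")
attention_layers = sum(1 for mixer, _ in schedule if mixer == "attention")
print(mamba_layers, attention_layers)  # prints: 12 4
```

With the defaults this reproduces the 12 Mamba layers and 4 attention layers stated for the 16-layer model.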

💰 Sub-$50 Training

Complete pre-training on a single NVIDIA L4 GPU (24 GB) on AWS g6.2xlarge. Demonstrates that cutting-edge hybrid architectures don't require institutional compute.

Project Timeline

Early 2026
Architecture Design
Designed the triple-hybrid architecture from scratch: Mamba + Attention + MoE with Jamba-style interleaving.
Q1 2026
Titan v1 Training
Pre-trained the 450M model on a multilingual EN-PL corpus. Reached a perplexity (PPL) of 27.5 at step 42,850 on a single L4 GPU.
Q2 2026
Paper & Publication
Full research paper written. Published on HuggingFace. Zenodo DOI pending. arXiv submission awaiting endorsement.
Now
Titan v2 Training
Custom 64K tokeniser, cleaned dataset with domain mixing, MuonClip optimiser. Actively training.
Future
Specialisation
SFT on Polish biomedical instructions. Clinical reasoning, omics analysis, metabolic pathway interpretation.

Explore the Project

๐Ÿ™ Help Get This on arXiv

This paper is pending arXiv submission to cs.LG (Machine Learning). As a first-time independent submitter, the author needs just one endorsement from an established arXiv author.

Contact Author on HuggingFace →