LLM Pre-training, Fine-Tuning and Reinforcement Learning at practical hardware limits
(C++/CUDA core, Python wrapper, BF16, FP8, NF4, NVFP4)
What is Surogate?
Surogate is an extremely fast, production-grade LLM training framework engineered to operate at practical hardware limits, delivering near–speed-of-light throughput, low-latency execution, and predictable multi-GPU/multi-node scaling.
By combining a native C++/CUDA execution engine, a low-overhead Python DSL, an AOT-based auto-differentiation engine, and a highly optimized multi-threaded scheduler, Surogate achieves industry-leading Speed-Of-Light (SOL) utilization on NVIDIA GPUs — outperforming existing training toolkits by a wide margin.
✨ Highlights
Surogate is built for developers and enterprises that need fast experimentation, scalability, and predictable outcomes — whether running on-premise, in private clouds, or inside turnkey systems such as the DenseMAX Appliance.
- 🔧 Pre-training + Fine-tuning: full fine-tuning, LoRA/QLoRA
- 🔧 BF16, FP8 and NVFP4 Reinforcement Learning: advanced GRPO training and evaluation with custom, deterministic environments
- 🔧 RL Environments: predictable environments for RL training
- 🖥️...🖥️ Native multi-GPU training with the multi-threaded backend
- 🖥️...🖥️ Native multi-Node DDP training with Ray
- ⚡ Native C++/CUDA engine for near–Speed-Of-Light (SOL) throughput
- 🔥 Python DSL with AOT auto-differentiation for adding new model architectures
- ⚖️ Smart CPU Offloading for weights, gradients, activations, quants
- 📜 Pre-built training recipes (see the config sketch after this list):
  - 💎 BF16: Baseline recipe using `bfloat16` for all GEMMs, designed for maximum numerical accuracy. No quantization is applied.
  - 🔥 FP8: Native `FP8` training delivering extreme performance, with `E4M3` used for activations and weights and `E5M2` for gradients. Uses per-tensor delayed scaling to provide stable training.
  - 🔥 NVFP4: Native CUTLASS `FP4 E2M1` training with two-level block scaling for extreme performance and memory efficiency on Blackwell GPUs (SM100+: B200, B300, RTX 50xx series). Uses stochastic rounding and random Hadamard Transforms for numerical stability. Supports NVIDIA B200, B300, RTX 5070, 5080, 5090!
- ⚡ BnB/FP8/NVFP4 QLoRA: support for a variety of QLoRA configurations, including online quantization (FP8, NVFP4, BnB) or loading pre-quantized weights (FP8, NVFP4)
- 👌 Optimizers: AdamW 8-bit, NorMuon
- 🖥️ Runs on all NVIDIA GPUs: sm80, sm86, sm89, sm90, sm100, sm103, sm120, sm121
- 🧪 Mixed-precision training: Mix different dtypes for GEMMs, model weights, gradients, and LoRA recipes to create your own flavor.
- 🧬 Adaptive Training: built-in automated training monitoring with automatic phase detection, multi-criteria early stopping (convergence, compute-efficiency, divergence, plateau), auto LR management, MoE imbalance detection, Chinchilla token budgeting and dynamic epoch adjustment
- 🎨 Dedicated MoE Features: Expert Parallelism, Least-Loaded EP load-balancing, MoE training metrics, Imbalance detection
- 🥞 Stacked LoRA training: Train a LoRA adapter on top of another LoRA adapter to skip offline merging into the base model.
- 🛡️ Designed for reliability: deterministic configs, explicit recipes, and a clear C++ core
- 🧠 Supported models: Qwen2.5, Qwen3, Qwen3 MoE, Llama 3+, Nemotron Nano. Models can be added easily; please create a PR if you need a specific model.
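Each training run is driven by a single YAML config file, which is what the `sft` command in the Docker and install examples below consumes. The sketch below is illustrative only: every field name in it (`model`, `recipe`, `lora`, `output_dir`) is an assumption rather than the documented Surogate schema, so treat it as a shape hint and use the Quickstart guides for real configs.

```bash
# Illustrative sketch: all keys below are hypothetical placeholders, not the
# documented Surogate config schema. See the Quickstart guides for working configs.
cat > config.yaml <<'EOF'
model: Qwen/Qwen2.5-7B-Instruct    # hypothetical key: one of the supported model families
recipe: fp8                        # hypothetical key: bf16 | fp8 | nvfp4
lora:                              # hypothetical block: omit for full fine-tuning
  rank: 16
output_dir: /home/surogate/output  # hypothetical key: must match the directory you mount
EOF
```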
Quickstart
Option A: Run using Docker (recommended)
Surogate provides three Docker images, one per supported CUDA version. Currently only the x86-64 architecture is supported.
| CUDA | Image | Recommended NVIDIA Driver | Minimum NVIDIA Driver |
|---|---|---|---|
| 12.8.1 | ghcr.io/invergent-ai/surogate:latest-cu128 | >= 570.124.06 | >= 525 |
| 12.9.1 | ghcr.io/invergent-ai/surogate:latest-cu129 | >= 575.57.08 | >= 525 |
| 13.1 | ghcr.io/invergent-ai/surogate:latest-cu13 | >= 590.48.01 | >= 580 |
docker run --gpus=all -v /my/local/config.yaml:/home/surogate/config.yaml -v /my/local/output_dir:<OUTPUT_DIR_FROM_CONFIG_YAML> <IMAGE> sft config.yaml
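A filled-in example with the CUDA 12.8 image (the container-side output path here is illustrative; it must match the output directory set in your config.yaml):

```bash
# Example run: mount your config and an output directory, then launch SFT.
# /home/surogate/output is an assumed path and must match config.yaml's output directory.
docker run --gpus=all \
  -v /my/local/config.yaml:/home/surogate/config.yaml \
  -v /my/local/output_dir:/home/surogate/output \
  ghcr.io/invergent-ai/surogate:latest-cu128 \
  sft config.yaml
```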
Option B: Install via script
curl -LsSf https://surogate.ai/install.sh | sh
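After installation, training is launched from the same YAML config as in the Docker example. Assuming the installer puts a `surogate` executable on your PATH (the exact entry point may differ; check the Installation guide), an SFT run looks like:

```bash
# Assumed CLI entry point; the subcommand mirrors the Docker image's `sft config.yaml`.
surogate sft config.yaml
```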
Follow these guides to run your first training:
- Installation
- Training modes: Pretraining vs Full Fine-Tuning vs LoRA
- Quickstart: SFT
- Quickstart: Pre-training
- Quickstart: GRPO
Hardware / Requirements
- NVIDIA GPU + recent driver
- CUDA 12.8, 12.9, or 13, plus NCCL and cuDNN
- Linux x86_64
Supported NVIDIA GPUs:
- SM80: A100, A30
- SM86: A2, A16, A10, A40, RTX 3050, RTX 3060, RTX 3070, RTX 3080, RTX 3090, A2000, A3000, A4000, A5000, A6000
- SM89: L4, L40, L40S, RTX 4050, RTX 4060, RTX 4070, RTX 4080, RTX 4090, RTX 2000 Ada, RTX 4000 SFF Ada, RTX 4000 Ada, RTX 4500 Ada, RTX 5000 Ada, RTX 6000 Ada
- SM90: H100, H200, GH200
- SM100: B200, GB200
- SM103: B300, GB300
- SM120: RTX PRO 6000/5000/4000/2500/2000 Blackwell, RTX 5050, RTX 5060, RTX 5070, RTX 5080, RTX 5090
- SM121: DGX Spark
Learn More
- How Surogate Works: Deep dive into the C++/CUDA engine and multi-threaded scheduler.
- Examples Library: Pre-built configurations for Qwen, Llama, and MoE models.
- User Guides: Advanced documentation on precision, memory, scaling, and more.
- Technical Reference: Comprehensive CLI and API reference.