# Performance
This page is a work in progress. It will collect practical performance tuning tips and describe major optimizations implemented in Surogate.
## High-Level Optimizations
This section describes high-level optimizations implemented in the Surogate framework to improve memory efficiency and training speed.
### Kernel Fusions
#### LoRA Fusions
#### Fused RoPE
Rotary Position Embedding (RoPE) applies position-dependent rotations to query and key vectors. The standard implementation precomputes a frequency tensor (`freq_cis`) and stores it in GPU memory; the fused implementation computes these frequencies on the fly inside the kernel.
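For reference, the per-pair rotation RoPE applies, where $d$ is `head_dim`, $t$ is the token position, and $b$ is the frequency base (commonly $10000$; the base and pairing convention Surogate uses are not specified here):

$$
\theta_i = b^{-2i/d}, \qquad
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos(t\theta_i) & -\sin(t\theta_i) \\ \sin(t\theta_i) & \cos(t\theta_i) \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},
\qquad i = 0,\dots,\tfrac{d}{2}-1.
$$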
**Enable via config:**

```yaml
use_fused_rope: true
```
**How it works**

Standard RoPE (a sketch follows the list):

- Precompute a `freq_cis` tensor of shape `(T, 2*head_dim)` containing cos/sin values for all positions
- Store it in GPU global memory
- During the forward/backward pass, load these values from global memory for each head
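A hypothetical sketch of this standard path, to make the cost concrete: build the `(T, 2*head_dim)` table on the host and keep a copy resident in GPU global memory. The function name, default base, and cos/sin layout are illustrative assumptions, not Surogate's actual API.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Precompute the freq_cis table and upload it to global memory once.
float* build_freq_cis(int T, int head_dim, float base = 10000.0f) {
    std::vector<float> host((size_t)T * 2 * head_dim);
    for (int t = 0; t < T; ++t) {
        for (int i = 0; i < head_dim / 2; ++i) {
            float theta = std::pow(base, -2.0f * i / head_dim);
            float angle = t * theta;
            // First head_dim values: cos (frequencies duplicated across both
            // halves); last head_dim values: sin. The layout is an assumption.
            float c = std::cos(angle), s = std::sin(angle);
            size_t row = (size_t)t * 2 * head_dim;
            host[row + i]                            = c;
            host[row + head_dim / 2 + i]             = c;
            host[row + head_dim + i]                 = s;
            host[row + head_dim + head_dim / 2 + i]  = s;
        }
    }
    float* d_freq = nullptr;
    cudaMalloc(&d_freq, host.size() * sizeof(float));
    cudaMemcpy(d_freq, host.data(), host.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    return d_freq;  // stays resident in global memory until freed
}
```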
Fused RoPE (a sketch follows the list):

- No precomputed tensor allocation
- Each CUDA block computes cos/sin values on the fly using `sincosf()`
- Results are cached in shared memory and reused across all heads in the block
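A minimal sketch of the fused idea, not Surogate's actual kernel: one block handles one (batch, position) pair, threads compute cos/sin with `sincosf()` into shared memory once, and every head in the block reuses them. The `(B, T, n_heads, head_dim)` layout and interleaved pairing are assumptions.

```cuda
__global__ void fused_rope_kernel(float* q,  // (B, T, n_heads, head_dim)
                                  int T, int n_heads, int head_dim,
                                  float base /* e.g. 10000.0f */) {
    extern __shared__ float smem[];          // head_dim floats: cos | sin
    float* s_cos = smem;
    float* s_sin = smem + head_dim / 2;

    int t = blockIdx.x % T;                  // token position for this block

    // Compute this position's angles on the fly; no global-memory table.
    for (int i = threadIdx.x; i < head_dim / 2; i += blockDim.x) {
        float theta = powf(base, -2.0f * i / head_dim);
        sincosf(t * theta, &s_sin[i], &s_cos[i]);
    }
    __syncthreads();

    // Rotate every head at this position, reusing the cached cos/sin values.
    float* pos = q + (size_t)blockIdx.x * n_heads * head_dim;
    for (int idx = threadIdx.x; idx < n_heads * (head_dim / 2);
         idx += blockDim.x) {
        int h = idx / (head_dim / 2);
        int i = idx % (head_dim / 2);
        float* head = pos + (size_t)h * head_dim;
        float x0 = head[2 * i], x1 = head[2 * i + 1];  // interleaved pairing
        head[2 * i]     = x0 * s_cos[i] - x1 * s_sin[i];
        head[2 * i + 1] = x0 * s_sin[i] + x1 * s_cos[i];
    }
}
```

A launch over such a tensor would then look roughly like `fused_rope_kernel<<<B * T, 128, head_dim * sizeof(float)>>>(q, T, n_heads, head_dim, 10000.0f);`, with the same kernel applied to `k`.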
**When to use**

Fused RoPE is beneficial when:

- Training with long sequence lengths, where the `freq_cis` allocation is significant
- Memory is constrained and every MB matters
- The model uses many attention layers (the savings compound)
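To put a rough number on the first point, assuming an fp32 table and, for illustration, `head_dim = 128`: at `T = 32768`, the `(T, 2*head_dim)` table holds 32768 × 256 = 8,388,608 floats, i.e. 32 MiB of global memory that the fused kernel never allocates.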