
Performance

This page is a work in progress. It will collect practical performance tuning tips and describe major optimizations implemented in Surogate.


High-Level Optimizations

This section describes high-level optimizations implemented in the Surogate framework to improve memory efficiency and training speed.

Kernel Fusions

LoRA Fusions

Fused RoPE

Rotary Position Embedding (RoPE) applies position-dependent rotations to query and key vectors. The standard implementation precomputes a frequency tensor (freq_cis) and stores it in GPU memory; the fused implementation computes these frequencies on the fly inside the kernel.

Enable via config:

  use_fused_rope: true

How it works

Standard RoPE:

  1. Precompute a freq_cis tensor of shape (T, 2*head_dim) containing cos/sin values for all positions
  2. Store in GPU global memory
  3. During forward/backward, load these values from global memory for each head (a host-side sketch follows this list)
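
The following CUDA C++ sketch illustrates this precompute step. The helper name fill_freq_cis, the base of 10000, fp32 storage, and the rotate-half layout (head_dim cos values, then head_dim sin values per position) are assumptions for illustration, not Surogate's exact code:

  // fill_freq_cis is a hypothetical helper, not Surogate's actual API.
  // It builds a (T, 2*head_dim) table (head_dim cos values, then
  // head_dim sin values per position) and uploads it to GPU global
  // memory, where attention kernels later read it for every head.
  #include <cuda_runtime.h>
  #include <cmath>
  #include <vector>

  void fill_freq_cis(float* d_freq_cis, int T, int head_dim) {
      std::vector<float> h(static_cast<size_t>(T) * 2 * head_dim);
      for (int t = 0; t < T; ++t) {
          float* cos_row = &h[static_cast<size_t>(t) * 2 * head_dim];
          float* sin_row = cos_row + head_dim;
          for (int i = 0; i < head_dim / 2; ++i) {
              // Frequency of rotation pair i: 10000^(-2i/head_dim).
              float inv_freq = powf(10000.0f, -2.0f * i / head_dim);
              float angle = t * inv_freq;
              // Each pair's values cover both halves of the head
              // (rotate-half convention).
              cos_row[i] = cos_row[i + head_dim / 2] = cosf(angle);
              sin_row[i] = sin_row[i + head_dim / 2] = sinf(angle);
          }
      }
      // One-time allocation that stays resident in global memory.
      cudaMemcpy(d_freq_cis, h.data(), h.size() * sizeof(float),
                 cudaMemcpyHostToDevice);
  }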

Fused RoPE:

  1. No precomputed tensor allocation
  2. Each CUDA block computes cos/sin values on the fly using sincosf()
  3. Results are cached in shared memory and reused across all heads in the block (see the kernel sketch after this list)
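
A minimal kernel sketch of this scheme is shown below, assuming a (T, n_heads, head_dim) layout for q and k, the rotate-half convention, and one block per sequence position; the name fused_rope_kernel and this exact mapping are illustrative rather than Surogate's actual kernel:

  // Dynamic shared memory holds head_dim floats: head_dim/2 cos values
  // followed by head_dim/2 sin values for this block's position.
  __global__ void fused_rope_kernel(float* q, float* k, int T,
                                    int n_heads, int head_dim) {
      extern __shared__ float smem[];
      float* s_cos = smem;
      float* s_sin = smem + head_dim / 2;

      int t = blockIdx.x;  // one block per sequence position
      for (int i = threadIdx.x; i < head_dim / 2; i += blockDim.x) {
          float inv_freq = powf(10000.0f, -2.0f * i / head_dim);
          // Sine and cosine in one call; no global-memory reads.
          sincosf(t * inv_freq, &s_sin[i], &s_cos[i]);
      }
      __syncthreads();

      // Reuse the cached values for every head at this position.
      for (int h = 0; h < n_heads; ++h) {
          size_t base = (static_cast<size_t>(t) * n_heads + h) * head_dim;
          for (int i = threadIdx.x; i < head_dim / 2; i += blockDim.x) {
              float q0 = q[base + i], q1 = q[base + i + head_dim / 2];
              q[base + i]                = q0 * s_cos[i] - q1 * s_sin[i];
              q[base + i + head_dim / 2] = q0 * s_sin[i] + q1 * s_cos[i];
              float k0 = k[base + i], k1 = k[base + i + head_dim / 2];
              k[base + i]                = k0 * s_cos[i] - k1 * s_sin[i];
              k[base + i + head_dim / 2] = k0 * s_sin[i] + k1 * s_cos[i];
          }
      }
  }

A launch such as fused_rope_kernel<<<T, 128, head_dim * sizeof(float)>>>(q, k, T, n_heads, head_dim) takes no freq_cis argument, because nothing is precomputed.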

When to use

Fused RoPE is beneficial when:

  • Training with long sequence lengths, where the freq_cis allocation is significant (see the sizing example after this list)
  • Memory-constrained scenarios where every MB matters
  • The model uses many attention layers (savings compound)
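
As a rough sizing illustration (assuming fp32 storage): with T = 32768 and head_dim = 128, the (T, 2*head_dim) table occupies 32768 × 256 × 4 bytes = 32 MB, an allocation that fused RoPE avoids entirely.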

Multi-threaded multi-GPU

Chunked Attention

Chunked LM


See also