
Quantized LoRA (QLoRA)

QLoRA enables memory-efficient fine-tuning by quantizing the frozen base model weights while training LoRA adapters in higher precision. Surogate supports three QLoRA quantization formats:

| Aspect | FP8 QLoRA | FP4 QLoRA | NF4 QLoRA (BitsAndBytes) |
| --- | --- | --- | --- |
| Format | E4M3 (fwd), E5M2 (bwd) | E2M1 (both) | NF4 (4-bit normal float) |
| Scaling | Per-tensor delayed | Two-level block (FP8 + FP32) | Per-block absmax (+ double quant) |
| GPU Requirement | SM89+ (Ada, Hopper, Blackwell) | SM100+ (Blackwell only) | Any CUDA GPU |
| Memory Compression | ~50% vs FP16 | ~75% vs FP16 | ~75% vs FP16 |

QLoRA vs Recipes

QLoRA determines how the frozen base model weights are stored and used during the forward pass. The base weights remain quantized and are never updated.

Recipes determine the precision format used for LoRA adapter computations, activations, and gradients during training.

You can combine any QLoRA format with any compatible recipe:

QLoRA (base weights) + Recipe (LoRA training) = Full Configuration
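A minimal PyTorch sketch of this split, using a hypothetical QLoRALinear module and a caller-supplied dequantize function (not Surogate's actual API): the base weight stays quantized and frozen, while only the LoRA factors receive gradients.

```python
# Minimal sketch of the QLoRA split (hypothetical class, not Surogate's API).
import torch
import torch.nn as nn

class QLoRALinear(nn.Module):
    def __init__(self, w_q, dequantize, in_features, out_features, rank=16, alpha=32):
        super().__init__()
        self.w_q = w_q                # frozen base weight in its quantized format
        self.dequantize = dequantize  # format-specific dequantization (FP8 / FP4 / NF4)
        # Only the low-rank adapters are trainable, kept in higher precision.
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Base path: dequantize on the fly; its compute precision is set by the recipe.
        base = x @ self.dequantize(self.w_q).t()
        # LoRA path: small trainable update added on top of the frozen base output.
        lora = (x @ self.lora_a.t()) @ self.lora_b.t() * self.scaling
        return base + lora
```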

FP8 QLoRA

FP8 QLoRA stores base model weights in FP8 format, reducing memory by ~50% compared to FP16/BF16.

How It Works

Base weights are quantized to FP8 using two formats optimized for their use cases:

| Format | Exponent | Mantissa | Max Value | Use Case |
| --- | --- | --- | --- | --- |
| E4M3 | 4 bits | 3 bits | 448 | Forward pass (higher precision) |
| E5M2 | 5 bits | 2 bits | 57344 | Backward pass (larger dynamic range) |

Delayed Scaling: Scale factors are computed from the previous iteration's abs-max values (history window of 1024 by default), providing more stable training than just-in-time scaling.
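A minimal sketch of how delayed scaling turns the amax history into a scale factor, assuming a hypothetical DelayedScaler helper (the margin and history semantics mirror the parameters below, but the exact formula may differ from Surogate's implementation):

```python
# Illustrative sketch of FP8 delayed scaling (not the actual implementation).
from collections import deque

FP8_E4M3_MAX = 448.0  # largest representable E4M3 magnitude

class DelayedScaler:
    def __init__(self, amax_history_len=1024, margin=0, amax_compute_algo="MAX"):
        self.history = deque(maxlen=amax_history_len)  # abs-max values from past iterations
        self.margin = margin
        self.algo = amax_compute_algo

    def update(self, tensor_abs_max):
        # Record the abs-max observed in the current iteration.
        self.history.append(tensor_abs_max)

    def scale(self):
        # Derive the scale from past iterations, not the current tensor ("delayed").
        if not self.history:
            return 1.0
        amax = max(self.history) if self.algo == "MAX" else self.history[-1]
        # margin leaves extra headroom by shifting the exponent of the scale.
        return FP8_E4M3_MAX / amax / (2 ** self.margin)
```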

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| qlora_fp8 | false | Enable FP8 QLoRA |
| margin | 0 | Margin for scale factor computation |
| amax_history_len | 1024 | Length of amax history window |
| amax_compute_algo | MAX | Algorithm: MAX or MOST_RECENT |
| reduce_amax | true | Reduce amax across distributed group |
| skip_quant_first_layers | 0 | Skip quantization for first N layers |
| skip_quant_last_layers | 0 | Skip quantization for last N layers |

Compatible Recipes

| Recipe | Use Case |
| --- | --- |
| bf16 | Maximum LoRA accuracy, any GPU (Recommended) |
| fp8-hybrid | Faster LoRA compute on SM89+ GPUs |
| nvfp4 | Maximum speed on Blackwell (experimental) |

Example

qlora_fp8: true
skip_quant_first_layers: 1
skip_quant_last_layers: 2
recipe: bf16
lora: true
lora_rank: 16

FP4 QLoRA

FP4 QLoRA stores base model weights in NVIDIA's FP4 E2M1 format, reducing memory by ~75% compared to FP16/BF16. Requires Blackwell GPUs (SM100+).

How It Works

FP4 E2M1 provides extreme compression with only 8 representable values per sign:

| Property | Value |
| --- | --- |
| Exponent bits | 2 |
| Mantissa bits | 1 |
| Value range | ±6 |
| Storage | 2 values per byte (4 bits each) |

Two-Level Block Scaling (sketched after this list):

  • Level 1: FP8 E4M3 scales per block (16 values for activations, 16x16 for weights)
  • Level 2: FP32 global amax baked into block scales
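A rough NumPy sketch of the two-level scheme for one weight block (illustrative; quantize_block_fp4 is a hypothetical helper, and rounding of the block scale to E4M3 is omitted):

```python
# Illustrative two-level block scaling for FP4 E2M1 (not the actual kernel).
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable magnitudes
E4M3_MAX, E2M1_MAX = 448.0, 6.0

def quantize_block_fp4(block, global_amax):
    # Level 2: a per-tensor FP32 scale derived from the global amax, "baked"
    # into the block scales so (element * block_scale * global_scale) spans the data.
    global_scale = global_amax / (E4M3_MAX * E2M1_MAX)
    # Level 1: per-block scale, stored in FP8 E4M3 (E4M3 rounding omitted here).
    block_scale = np.abs(block).max() / E2M1_MAX / global_scale
    # Quantize each element to the nearest E2M1 grid point.
    scaled = block / (block_scale * global_scale)
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID).argmin(axis=1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q, block_scale, global_scale  # dequant: q * block_scale * global_scale
```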

Stability Techniques:

  • Random Hadamard Transform (RHT): Spreads outliers before quantization
  • Stochastic Rounding: Prevents quantization bias accumulation in gradients (sketched after this list)
  • Four-Over-Six (4/6) Adaptive Scaling: Selects optimal scale per block
  • Layer Skipping: Keep critical layers (embedding, lm_head) in BF16
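For example, stochastic rounding can be sketched as rounding to a neighboring grid point with probability proportional to proximity, so rounding errors average out across many steps instead of drifting in one direction (toy illustration, not the actual kernel):

```python
# Toy illustration of stochastic rounding to a discrete grid.
import numpy as np

def stochastic_round(x, grid):
    grid = np.sort(grid)
    hi_idx = np.clip(np.searchsorted(grid, x), 1, len(grid) - 1)
    lo, hi = grid[hi_idx - 1], grid[hi_idx]
    p_up = np.clip((x - lo) / (hi - lo), 0.0, 1.0)  # closer to hi -> more likely to round up
    return np.where(np.random.rand(*np.shape(x)) < p_up, hi, lo)

grid = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # e.g. E2M1 magnitudes
print(stochastic_round(np.array([0.3, 1.2, 2.6]), grid))   # one possible draw: [0.5 1.0 3.0]
```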

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| qlora_fp4 | false | Enable FP4 QLoRA |
| skip_quant_first_layers | 0 | Skip FP4 for first N layers |
| skip_quant_last_layers | 0 | Skip FP4 for last N layers |
| backend | cutlass | Backend: cudnn or cutlass |

Compatible Recipes

| Recipe | Use Case |
| --- | --- |
| nvfp4 | Maximum speed, full FP4 pipeline (Recommended) |
| bf16 | Higher LoRA accuracy, slower |
| fp8-hybrid | Balance of speed and accuracy |

Example

qlora_fp4: true
recipe: nvfp4
lora: true
lora_rank: 16
skip_quant_first_layers: 1
skip_quant_last_layers: 4

NF4 QLoRA (BitsAndBytes)

NF4 QLoRA uses the BitsAndBytes NF4 (NormalFloat4) quantization format, providing ~75% memory reduction with broad GPU compatibility.

How It Works

NF4 is a 4-bit data type optimized for normally distributed weights:

| Property | Value |
| --- | --- |
| Bits per value | 4 |
| Storage | 2 values per byte |
| Quantization levels | 16 levels mapped to normal distribution quantiles |
| Block size | Configurable (default: 64 values per block) |

Block-wise Quantization (sketched after this list):

  • Weights are divided into blocks (default 64 values)
  • Each block stores an FP32 absmax scale factor
  • Values are quantized to 4-bit indices into a fixed NF4 lookup table
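A minimal NumPy sketch of this scheme (illustrative rather than the bitsandbytes kernels; the NF4 level values are reproduced approximately):

```python
# Illustrative NF4 block-wise quantization (not the bitsandbytes implementation).
import numpy as np

# The 16 NF4 levels, derived from standard-normal quantiles scaled to [-1, 1]
# (values shown approximately).
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def quantize_nf4_block(block):
    absmax = np.abs(block).max()                      # per-block FP32 scale
    idx = np.abs(block[:, None] / absmax - NF4_LEVELS).argmin(axis=1)
    return idx.astype(np.uint8), absmax               # 4-bit indices + scale

def dequantize_nf4_block(idx, absmax):
    return NF4_LEVELS[idx] * absmax

weights = np.random.randn(64).astype(np.float32)      # one block of 64 values
idx, absmax = quantize_nf4_block(weights)
print(np.abs(dequantize_nf4_block(idx, absmax) - weights).max())  # quantization error
```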

Double Quantization (optional, sketched after this list):

  • Absmax scales are further quantized to INT8
  • Groups of 256 blocks share an FP32 scale and offset
  • Reduces scale overhead from 4 bytes to ~1 byte per block
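A sketch of that scale-compression step (illustrative; it assumes the block count divides evenly into groups, and the exact offset and rounding choices may differ from bitsandbytes):

```python
# Illustrative double quantization of the per-block absmax scales.
import numpy as np

def double_quantize(absmax_scales, group_size=256):
    # Assumes len(absmax_scales) is a multiple of group_size.
    groups = absmax_scales.reshape(-1, group_size)
    offset = groups.mean(axis=1, keepdims=True)                  # shared FP32 offset per group
    centered = groups - offset
    scale = np.abs(centered).max(axis=1, keepdims=True) / 127.0  # shared FP32 scale per group
    q = np.round(centered / scale).astype(np.int8)               # ~1 byte per block instead of 4
    return q, scale, offset  # dequant: q * scale + offset
```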

Memory Layout

For a weight tensor with N elements using block size 64:

| Component | Size (bytes) | With Double Quant |
| --- | --- | --- |
| NF4 data | N / 2 | N / 2 |
| Absmax scales | (N / 64) × 4 | (N / 64) × 1 |
| Double quant scales/offsets | — | (N / 64 / 256) × 8 |
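As a rough check of the table, a worked example for a hypothetical 4096 × 4096 weight with block size 64:

```python
# Worked example: 4096 x 4096 weight, N = 16,777,216 values, block size 64.
N = 4096 * 4096
nf4_data    = N // 2                # 8,388,608 bytes (~8.0 MiB) of packed 4-bit data
scales_fp32 = (N // 64) * 4         # 1,048,576 bytes (~1.0 MiB) without double quant
scales_int8 = (N // 64) * 1         #   262,144 bytes (~0.25 MiB) with double quant
dq_overhead = (N // 64 // 256) * 8  #     8,192 bytes of per-group FP32 scale + offset
print(nf4_data + scales_fp32)                 # total without double quant
print(nf4_data + scales_int8 + dq_overhead)   # total with double quant
```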

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| qlora_bnb | false | Enable BitsAndBytes NF4 QLoRA |
| qlora_bnb_block_size | 64 | Block size for quantization (64 or 128) |
| qlora_bnb_double_quant | true | Enable double quantization for scales |

GPU Compatibility

Unlike FP8 and FP4 QLoRA which require specific GPU architectures, NF4 QLoRA works on any CUDA GPU. The dequantization happens on-the-fly during forward and backward passes.

Compatible Recipes

| Recipe | Use Case |
| --- | --- |
| bf16 | Best accuracy, broad compatibility (Recommended) |
| fp8-hybrid | Faster compute on SM89+ GPUs |

Example

model: Qwen/Qwen3-4B
lora: true
lora_rank: 16
lora_alpha: 32

qlora_bnb: true
qlora_bnb_block_size: 64
qlora_bnb_double_quant: true

recipe: bf16
