Speed Benchmarks

All numbers are total throughput in tokens per second (TPS).

Single GPU

NVIDIA RTX 5090 32GB, CUDA 12.9

| Model | Unsloth NF4 | Unsloth BF16 | Surogate BF16 | Surogate FP8 | Surogate QoFP8 | Surogate FP4 | Surogate QoFP4 | Surogate QoNF4 |
|---|---|---|---|---|---|---|---|---|
| Qwen3 0.6B | 19.1k | 22.1k | 30.1k | 35.9k | 32.7k | 36.4k | 31.3k | 32.3k |
| Qwen3 1.7B | 12k | 12.6k | 14.0k | 18.8k | 17.4k | 20.7k | 17.0k | 17.6k |
| Qwen3 4B | 6k | 6.1k | 6.8k | 8.8k | 8.3k | 10.8k | 8.1k | 8.6k |
| Qwen3 8B | 3.4k | 3.5k | 3.8k | 5.4k | 5.0k | 6.9k | 4.9k | 5.0k |
| Qwen/Qwen3-30B-A3B | 0.016k | OOM | OOM | OOM | 0.5k | OOM | 0.5k | 0.5k |

Relative Speedup vs Unsloth NF4

| Model | BF16 | FP8 | QoFP8 | FP4 | QoFP4 | QoNF4 |
|---|---|---|---|---|---|---|
| Qwen3 0.6B | 1.57x | 1.88x | 1.71x | 1.91x | 1.64x | 1.69x |
| Qwen3 1.7B | 1.17x | 1.57x | 1.45x | 1.73x | 1.42x | 1.47x |
| Qwen3 4B | 1.13x | 1.47x | 1.38x | 1.80x | 1.35x | 1.43x |
| Qwen3 8B | 1.12x | 1.59x | 1.47x | 2.03x | 1.44x | 1.47x |
| Qwen/Qwen3-30B-A3B | - | - | 12.50x | - | 12.50x | 18.75x |

NVIDIA H100 80GB HBM3

| Model | Unsloth NF4 | Unsloth BF16 | Surogate BF16 | Surogate FP8 | Surogate QoFP8 | Surogate QoNF4 |
|---|---|---|---|---|---|---|
| Qwen3 0.6B | 18k | 21.3k | 53.9k | 51.2k | 16.5k | 16.6k |
| Qwen3 1.7B | 18k | 20.8k | 32.8k | 33.0k | 15.8k | 16.1k |
| Qwen3 4B | 11.5k | 12.4k | 15.9k | 17.0k | 11.2k | 11.0k |
| Qwen3 8B | 8.2k | 8.9k | 10.2k | 11.6k | 9.6k | 9.4k |
| Qwen3 14B | 5.2k | 5.6k | 6.0k | 7.2k | 6.5k | 6.1k |
| Qwen3 32B | 2.4k | 2.6k | TODO | TODO | 2.8k | TODO |

NVIDIA H200

| Model | Unsloth NF4 | Unsloth BF16 | Surogate BF16 | Surogate FP8 | Surogate QFP8 | Surogate QoNF4 |
|---|---|---|---|---|---|---|
| Qwen3 0.6B | 18.3k | 21.7k | TODO | TODO | TODO | TODO |
| Qwen3 1.7B | 18.3k | 21.4k | TODO | TODO | TODO | TODO |
| Qwen3 4B | 12.1k | 12.8k | TODO | TODO | TODO | TODO |
| Qwen3 8B | 8.4k | 9.1k | TODO | TODO | TODO | TODO |
| Qwen3 14B | TODO | TODO | TODO | TODO | TODO | TODO |
| Qwen3 32B | TODO | TODO | TODO | TODO | TODO | TODO |

NVIDIA B200

| Model | Unsloth NF4 | Unsloth BF16 | Surogate BF16 | Surogate FP8 | Surogate QFP8 | Surogate FP4 | Surogate QFP4 | Surogate QoNF4 |
|---|---|---|---|---|---|---|---|---|
| Qwen3 0.6B | 17k | 19.1k | TODO | TODO | TODO | TODO | TODO | TODO |
| Qwen3 1.7B | 16.7k | 20.3k | TODO | TODO | TODO | TODO | TODO | TODO |
| Qwen3 4B | 13.1k | 14.8k | TODO | TODO | TODO | TODO | TODO | TODO |
| Qwen3 8B | 11.3k | 12.4k | TODO | TODO | TODO | TODO | TODO | TODO |
| Qwen3 14B | TODO | 8.6k | TODO | TODO | TODO | TODO | TODO | TODO |
| Qwen3 32B | TODO | 4.2k | TODO | TODO | TODO | TODO | TODO | TODO |

NVIDIA B300 SXM6 AC

| Model | Unsloth NF4 | Unsloth BF16 | Surogate BF16 | Surogate FP8 | Surogate QFP8 | Surogate FP4 | Surogate QFP4 | Surogate QoNF4 |
|---|---|---|---|---|---|---|---|---|
| Qwen3 0.6B | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
| Qwen3 1.7B | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
| Qwen3 4B | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
| Qwen3 8B | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
| Qwen3 14B | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
| Qwen3 32B | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO |

Multi-GPU

4x NVIDIA RTX 5090 32GB, CUDA 12.9

| Model | Surogate BF16 | Surogate FP8 | Surogate QoFP8 | Surogate FP4 | Surogate QoFP4 | Surogate QoNF4 |
|---|---|---|---|---|---|---|
| Qwen3 0.6B | 111k | 131.3k | 120.5k | 136.2k | 118.5k | 120.0k |
| Qwen3 1.7B | 53.1k | 70.8k | 66.7k | 79.0k | 65k | 66.4k |
| Qwen3 4B | 25.8k | 34.0k | 32.2k | 41.3k | 31.4k | 32.0k |
| Qwen3 8B | 14.6k | 20.8k | 19.8k | 27.1k | 19.3k | 19.7k |
| Qwen/Qwen3-30B-A3B | OOM | OOM | 1.4k | OOM | 2.4k | 2.3k |
| openai/gpt-oss-20B | - | - | 3.8k | - | 2.2k | - |

Notes:

  • Expert Parallelism: 4
  • gpt-oss-20B is pre-quantized to MXFP4, so only QLoRA training is possible.

Benchmark configuration

dataset = 10000 samples
max_seq_length = 2048
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
packing = True
lora_rank = 16
lora_alpha = 32
lora_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

Formulas used

$$\text{Tokens/sec} = \frac{\text{Batch Size} \times \text{Grad Accum Steps} \times \text{Max Seq Length} \times \text{Num GPUs}}{\text{sec/iter}}$$

or, equivalently,

$$\text{Tokens/sec} = (\text{iter/sec}) \times \text{Batch Size} \times \text{Grad Accum Steps} \times \text{Max Seq Length} \times \text{Num GPUs}$$
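
Worked example: with the benchmark configuration above on a single GPU, one optimizer step covers 2 × 4 × 2048 × 1 = 16,384 tokens. The snippet below plugs a hypothetical 3.0 sec/iter into the formula; the iteration time is illustrative, not a measured benchmark value.

# Sketch of the TPS formula above; sec_per_iter is a made-up example value.
batch_size = 2              # per_device_train_batch_size
grad_accum_steps = 4        # gradient_accumulation_steps
max_seq_length = 2048
num_gpus = 1
sec_per_iter = 3.0          # hypothetical, not a benchmark measurement

tokens_per_iter = batch_size * grad_accum_steps * max_seq_length * num_gpus
tokens_per_sec = tokens_per_iter / sec_per_iter
print(f"{tokens_per_iter} tokens/iter -> {tokens_per_sec / 1000:.1f}k TPS")
# 16384 tokens/iter -> 5.5k TPS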

Surogate install

curl -sSL https://surogate.ai/install.sh | bash
source .venv/bin/activate

Configurations used:

  • Surogate BF16: ./benchmarks/benchmark_sft.sh "Qwen/Qwen3-0.6B" bf16
  • Surogate FP8: ./benchmarks/benchmark_sft.sh "Qwen/Qwen3-0.6B" fp8
  • Surogate QFP8: ./benchmarks/benchmark_sft.sh "Qwen/Qwen3-0.6B" qfp8
  • Surogate FP4: ./benchmarks/benchmark_sft.sh "Qwen/Qwen3-0.6B" fp4
  • Surogate QFP4: ./benchmarks/benchmark_sft.sh "Qwen/Qwen3-0.6B" qfp4
  • Surogate NF4: ./benchmarks/benchmark_sft.sh "Qwen/Qwen3-0.6B" qbnb
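
To run the whole grid rather than one configuration at a time, the same script can be driven from a small loop. A minimal sketch, assuming it is launched from the repository root; only benchmark_sft.sh and the precision arguments come from the list above, the model list is an example.

import subprocess

models = ["Qwen/Qwen3-0.6B", "Qwen/Qwen3-1.7B", "Qwen/Qwen3-4B"]  # example subset
precisions = ["bf16", "fp8", "qfp8", "fp4", "qfp4", "qbnb"]       # arguments from the list above

for model in models:
    for precision in precisions:
        print(f"=== {model} / {precision} ===")
        # Invokes the benchmark script exactly as in the configurations above.
        subprocess.run(["./benchmarks/benchmark_sft.sh", model, precision], check=True)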

Unsloth install

apt install -y python3-dev
uv venv --python=3.12
source .venv/bin/activate
uv pip install unsloth

Accuracy Benchmarks

We studied the impact of the recipes supported by Surogate using ro_gsm8k, a Romanian translation of the original gsm8k dataset.

Qwen/Qwen3-0.6B was chosen as the reference model. Its measured accuracy on ro_gsm8k before fine-tuning is close to 0, which makes it a good way to see how well fine-tuning teaches the model the new dataset.

Summary table

| Precision / Config | Accuracy | Stderr |
|---|---|---|
| BF16 | 0.2085 | 0.0095 |
| FP8 | 0.1888 | 0.0108 |
| FP4 | 0.1880 | 0.0108 |
| QBnB | 0.0940 | 0.0080 |
| QFP8 + fp8-hybrid | 0.1531 | 0.0099 |
| QFP8 + bf16 | 0.1698 | 0.0103 |
| QFP4 | 0.1600 | 0.0101 |

Loss charts

[Loss curves for the BF16, FP8, FP4, QBnB, QFP8, and QFP4 runs]

Config used:

per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 2e-4
warmup_steps: 20
weight_decay: 0.001
lr_scheduler_type: linear
lora_dropout: 0
lora_rank: 16
lora_alpha: 32
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
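
For reference, a rough mapping of these hyperparameters onto Hugging Face PEFT and transformers objects; this is an assumed, illustrative equivalent, not Surogate's own configuration format, and the output_dir is a placeholder.

from peft import LoraConfig
from transformers import TrainingArguments

# Assumed mapping of the hyperparameters above onto standard Hugging Face
# objects; Surogate's native config format may differ.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="output/benchmark_sft_bf16",  # placeholder path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_steps=20,
    weight_decay=0.001,
    lr_scheduler_type="linear",
    bf16=True,
)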

Commands used:

VLLM_ALLOW_RUNTIME_LORA_UPDATING=True vllm serve Qwen/Qwen3-0.6B --max-model-len 2048 --max-lora-rank 64 --enable-lora --lora-modules adapter=/home/densemax2/work/flavius/surogate/output/benchmark_sft_bf16/adapter/ --port 8001
lm-eval --model local-completions --model_args model=adapter,base_url=http://localhost:8001/v1/completions,num_concurrent=50,max_retries=3,tokenized_requests=False,tokenizer=Qwen/Qwen3-0.6B --task gsm8k --num_fewshot 0 --output_path ./base
curl -X POST http://localhost:8001/v1/load_lora_adapter -H "Content-Type: application/json" -d '{"lora_name": "adapter", "lora_path": "/home/densemax2/work/flavius/surogate/output/benchmark_sft_qfp8/adapter"}'
curl -X POST http://localhost:8001/v1/unload_lora_adapter -H "Content-Type: application/json" -d '{"lora_name": "adapter"}'
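
The load/unload endpoints above allow every adapter to be evaluated against a single running vLLM server without restarting it. A minimal sketch in Python, assuming the adapters sit under output/benchmark_sft_<config>/adapter; the directory layout and output paths are inferred from the commands above and may differ.

import subprocess
import requests

# Illustrative automation of the manual curl/lm-eval steps above.
BASE = "http://localhost:8001"
CONFIGS = ["bf16", "fp8", "fp4", "qbnb", "qfp8", "qfp4"]

for cfg in CONFIGS:
    adapter_path = f"output/benchmark_sft_{cfg}/adapter"  # assumed layout

    # Hot-swap the LoRA adapter on the running vLLM server.
    requests.post(f"{BASE}/v1/load_lora_adapter",
                  json={"lora_name": "adapter", "lora_path": adapter_path}).raise_for_status()

    # Same lm-eval invocation as above, run against the swapped-in adapter.
    subprocess.run([
        "lm-eval", "--model", "local-completions",
        "--model_args",
        f"model=adapter,base_url={BASE}/v1/completions,num_concurrent=50,"
        "max_retries=3,tokenized_requests=False,tokenizer=Qwen/Qwen3-0.6B",
        "--task", "gsm8k", "--num_fewshot", "0",
        "--output_path", f"./results_{cfg}",  # output path is illustrative
    ], check=True)

    requests.post(f"{BASE}/v1/unload_lora_adapter",
                  json={"lora_name": "adapter"}).raise_for_status()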