# Qwen 3 Pre-training (PT)
This example demonstrates how to pre-train a Qwen 3 model using Surogate's high-performance FP8 hybrid recipe.
## Configuration Highlights

- Model: Qwen/Qwen3-0.6B
- Precision: fp8-hybrid (native FP8 training on Hopper/Blackwell)
- Optimizer: normuon (optimized for faster convergence)
- Batch Size: 8 per device with 4 GPUs (effective batch size 32; see the arithmetic below)
- Dataset: HuggingFaceFW/fineweb-2
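The effective batch size is simply the per-device batch multiplied by the data-parallel degree and the gradient accumulation steps; with this config:

```
effective_batch_size = per_device_train_batch_size × gpus × gradient_accumulation_steps
                     = 8 × 4 × 1
                     = 32
```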
## Running the example

```bash
surogate pt examples/pt/qwen3.yaml
```
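The same invocation works with any config path, so a copy you have edited runs the same way (the path below is hypothetical):

```bash
surogate pt my-configs/qwen3-custom.yaml
```

Training outputs are written under the config's `output_dir` (`./output` here).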
## Config File (examples/pt/qwen3.yaml)
```yaml
model: Qwen/Qwen3-0.6B
output_dir: ./output
per_device_train_batch_size: 8
gradient_accumulation_steps: 1
sequence_len: 2048
recipe: fp8-hybrid
optimizer: normuon
gpus: 4
datasets:
  - path: "HuggingFaceFW/fineweb-2"
    subset: ron_Latn
    split: train
    type: text
```
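To eyeball the data before committing to a long run, you can stream the same FineWeb-2 subset (`ron_Latn` is Romanian text in Latin script) with the Hugging Face `datasets` library. This sketch is independent of Surogate and only assumes `datasets` is installed:

```python
# Minimal sketch: stream a couple of rows from the exact subset the
# config trains on, without downloading the full dump.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceFW/fineweb-2",
    name="ron_Latn",   # same subset as the config
    split="train",     # same split as the config
    streaming=True,    # iterate lazily instead of downloading everything
)
for row in ds.take(2):
    print(row["text"][:200])  # FineWeb-2 rows carry the document in a "text" field
```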