Qwen 3 Pre-training (PT)

This example demonstrates how to pre-train a Qwen 3 model using Surogate's high-performance fp8-hybrid recipe.

Configuration Highlights

  • Model: Qwen/Qwen3-0.6B
  • Precision: fp8-hybrid (native FP8 training on Hopper/Blackwell GPUs; see the hardware check after this list)
  • Optimizer: normuon (optimized for faster convergence)
  • Batch Size: 8 per device on 4 GPUs with no gradient accumulation (effective batch size 8 × 1 × 4 = 32)
  • Dataset: HuggingFaceFW/fineweb-2 (ron_Latn subset)
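
The fp8-hybrid recipe relies on native FP8 tensor cores, which NVIDIA introduced with Hopper (compute capability 9.0). The following is a minimal sketch for checking your GPUs before training; it is a generic PyTorch check, not part of Surogate:

import torch

# Native FP8 requires compute capability 9.0+ (Hopper) or newer
# (Blackwell). Older GPUs would need a bf16/fp16 recipe instead.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    ok = (major, minor) >= (9, 0)
    print(f"GPU {i}: {name} (SM {major}.{minor}) -> "
          f"FP8 {'supported' if ok else 'not supported'}")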

Running the example

surogate pt examples/pt/qwen3.yaml
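
Before launching, you can sanity-check the effective batch size implied by the config below. This is a minimal sketch assuming PyYAML is installed; Surogate derives this value itself, so the script is purely illustrative:

import yaml

# Load the same config file the CLI consumes.
with open("examples/pt/qwen3.yaml") as f:
    cfg = yaml.safe_load(f)

# Effective batch size = per-device batch x grad accumulation x GPU count.
effective = (
    cfg["per_device_train_batch_size"]
    * cfg["gradient_accumulation_steps"]
    * cfg["gpus"]
)
print(effective)  # 8 * 1 * 4 = 32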

Config File (examples/pt/qwen3.yaml)

model: Qwen/Qwen3-0.6B
output_dir: ./output

per_device_train_batch_size: 8
gradient_accumulation_steps: 1
sequence_len: 2048

recipe: fp8-hybrid
optimizer: normuon
gpus: 4

datasets:
- path: "HuggingFaceFW/fineweb-2"
  subset: ron_Latn
  split: train
  type: text
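
To inspect the training data outside of Surogate, you can stream a few records of the configured subset with the Hugging Face datasets library. This is a hedged sketch; it assumes the corpus exposes the standard text column, which FineWeb-2 does:

from datasets import load_dataset

# Stream the Romanian (ron_Latn) subset so nothing is downloaded in full.
ds = load_dataset(
    "HuggingFaceFW/fineweb-2",
    name="ron_Latn",
    split="train",
    streaming=True,
)

# Print the first few documents, truncated for readability.
for i, row in enumerate(ds):
    if i == 3:
        break
    print(row["text"][:200].replace("\n", " "))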