Configuration Reference

This section provides a comprehensive reference for all configuration options available in Surogate. Each option is described in detail, including its purpose, default value, and possible values.

General Settings

| Option | Type | Default | Description |
|---|---|---|---|
| run_name | string | auto-generated | A descriptor for the run. If not provided, a unique name is generated automatically. |
| apply_recommended_values | bool | false | Whether to apply recommended configuration values. |
| num_epochs | int | 3 | Total number of training epochs to perform. |
| output_dir | string | "output" | The output directory where the model predictions and checkpoints will be written. |
| checkpoint_dir | string | null | Directory to save checkpoints during training. If null, defaults to output_dir. |
| resume_from_checkpoint | bool | true | Continue from a checkpoint. If enabled, uses the latest checkpoint. |
| save_steps | int | 50 | Number of steps between saving checkpoints. |
| save_total_limit | int | 5 | Limit the total number of checkpoints. Older checkpoints in output_dir are deleted. |
| from_scratch | bool | false | Train from scratch (random initialization) instead of fine-tuning a pre-trained model. |

Model Settings

| Option | Type | Default | Description |
|---|---|---|---|
| model | string | required | Path or HuggingFace model identifier (e.g., "Qwen/Qwen3-0.6B"). |
| model_type | string | auto-detect | Type of the model group. Automatically detected from the model config if not specified. |
| sequence_len | int | 1024 | Maximum sequence length for training. Samples exceeding this length are truncated. |
| max_model_len | int | null | Maximum model length for RoPE scaling. Automatically detected from the model config if not specified. |
| rope_scaling | string | null | Type of RoPE scaling. Pass a string like "linear", "dynamic", or "yarn" along with max_model_len to automatically configure rope_scaling. Alternatively, pass a JSON string like '{"factor": 2.0, "type": "yarn"}' to directly override the rope_scaling in the model's config. |
| torch_dtype | string | auto-detect | PyTorch data type for model weights. Options: "bfloat16", "float16", "float32". Automatically detected from the model config if not specified. |
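
For example, to enable YaRN RoPE scaling for a longer context, set rope_scaling together with max_model_len; the target length below is illustrative:

rope_scaling: "yarn"
max_model_len: 8192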

Recomputation

Recomputation trades compute for memory by recomputing activations during the backward pass instead of storing them.

| Option | Type | Default | Description |
|---|---|---|---|
| recompute | bool | true | Enable activation recomputation. false saves all activations (fastest, most memory). true recomputes intermediates from checkpoints (saves VRAM, small compute overhead). |

Offloading Options

Offloading options move tensors to host (CPU) memory to reduce GPU memory usage at the cost of increased data transfer overhead.

| Option | Type | Default | Description |
|---|---|---|---|
| offload_residual | bool | false | Offload residuals (of the FFN block) to pinned host memory. Combined with recompute, total activation memory becomes independent of network depth. |
| offload_master | bool | false | Store master weights in pinned host memory. |
| offload_quants | bool | false | Store quantized weights in pinned host memory. Requires persistent_quants. |
| offload_optimizer | bool | false | Store optimizer state in pinned host memory. Slows down the optimizer step drastically, but with enough gradient accumulation steps the overall contribution becomes negligible. |
| offload_grads | bool | false | Offload gradients to pinned host memory. Requires shard_gradients=true or zero_level >= 2. |
| persistent_quants | bool | false | Avoid re-quantization of weights. Increases memory, but when combined with offload_quants the additional memory is placed on the host. In PCIe settings, this can lead to significant speed-ups. Requires shard_weights. |
| use_zero_copy | bool | false | Use ZeroCopy memory access instead of double-buffered cudaMemcpy for offloaded optimizer states. DMA is slower on consumer cards but faster on professional cards. |
| use_write_combined | bool | false | Use write-combined memory for offloaded tensors. May improve PCIe throughput in some situations. |
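
As a sketch of a memory-lean setup under these options (values are illustrative, not a recommendation): combining recompute with offload_residual makes activation memory independent of depth, and a large gradient accumulation count amortizes the slower offloaded optimizer step:

recompute: true
offload_residual: true
offload_optimizer: true
gradient_accumulation_steps: 16  # large accumulation amortizes the slow offloaded optimizer step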

Multi-GPU Training (ZeRO) Options

These options apply to single-node multi-GPU training. For multi-node distributed training, see Multi-Node Distributed Training.

| Option | Type | Default | Description |
|---|---|---|---|
| zero_level | int | 1 | ZeRO redundancy optimization level: 1 = sharded optimizer states (default), 2 = sharded gradients + optimizer states, 3 = sharded weights + gradients + optimizer states. |
| shard_weights | bool | false | Shard model weights across data-parallel processes. Enables more effective offloading and reduces memory consumption. |
| shard_gradients | bool | false | Shard gradients across data-parallel processes. Enables more effective offloading and reduces memory consumption. |
| use_all_to_all_reduce | bool | false | Use an all-to-all-based reduce algorithm (combine with memcpy_send_recv). |
| memcpy_all_gather | bool | false | Use memcpy for all-gather operations (threads backend only). Generally gets better bandwidth utilization on PCIe and does not consume SM resources. |
| memcpy_send_recv | bool | false | Use memcpy for send/receive operations (threads backend only). |
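
As an illustrative combination of this section with the offloading options above (GPU count is a placeholder), zero_level: 2 shards gradients and optimizer states, which also satisfies the requirement of offload_grads:

gpus: 4
zero_level: 2
offload_grads: true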

Multi-Node Distributed Training

Configuration for training across multiple machines using Ray and NCCL. See the Multi-Node Training Guide for detailed setup instructions.

| Option | Type | Default | Description |
|---|---|---|---|
| distributed.ray_address | string | "auto" | Ray cluster address. Options: "auto" (connect to an existing cluster), "local" (start a local instance), "ray://host:port" (connect to a specific head). |
| distributed.num_nodes | int | 1 | Total number of nodes to use for training. Set to > 1 to enable multi-node training. |
| distributed.gpus_per_node | int | 0 | Number of GPUs per node. If 0, uses the value from the gpus config parameter. |
| distributed.worker_output_dir | string | null | Base directory for worker-local tokenized data. Each worker creates a node-{rank}/ subdirectory. If null, uses /tmp/surogate-{run_name}/ on each node. |

Example configuration:

distributed:
  ray_address: "auto"
  num_nodes: 2
  gpus_per_node: 8
  worker_output_dir: /shared/surogate-data

Hardware Settings

| Option | Type | Default | Description |
|---|---|---|---|
| gpus | int | 1 | Number of GPUs to use for training. Use 0 for all available GPUs. |
| use_cuda_graphs | bool | true | Enable CUDA graphs for performance. |

Mixed Precision & Recipe Options

| Option | Type | Default | Description |
|---|---|---|---|
| recipe | string | "bf16" | Mixed precision training recipe. Options: "bf16" (default), "fp8_hybrid", "nvfp4", "nvfp4_quartet". |
| gradient_dtype | string | null | Dtype for activation gradients / backward matmul policy. Defaults to matmul-dtype. Note: recipes may override the backward dtype. |
| master_dtype | string | null | Master weight dtype for optimizer updates (e.g., FP32 for stable full fine-tuning). Defaults to model-dtype. |
| use_fused_rope | bool | false | Use a fused RoPE kernel with on-the-fly cos/sin computation (saves memory, reduces bandwidth). |

FP8 Recipe Options

| Option | Type | Default | Description |
|---|---|---|---|
| fp8_amax_history | int | 16 | FP8 delayed scaling amax history length (for the fp8_hybrid recipe). |

FP4/NVFP4 Recipe Options

| Option | Type | Default | Description |
|---|---|---|---|
| fp4_backend | string | "cutlass" | FP4 matmul backend: "cutlass" (default) or "cudnn" (for the nvfp4 recipe). |

Layer Quantization Skip Options

| Option | Type | Default | Description |
|---|---|---|---|
| skip_quant_first_layers | int | 0 | Skip quantization for the first N transformer decoder layers (embedding layers are always kept in BF16). |
| skip_quant_last_layers | int | 0 | Skip quantization for the last N transformer decoder layers (lm_head layers are always kept in BF16). |
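
A sketch of an FP8 run that leaves the outermost decoder layers unquantized; the layer counts are illustrative and depend on the model:

recipe: fp8_hybrid
fp8_amax_history: 16
skip_quant_first_layers: 1
skip_quant_last_layers: 1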

Optimizer Settings

| Option | Type | Default | Description |
|---|---|---|---|
| optimizer | string | "adamw_8bit" | Optimizer type. Options: "adamw_8bit" (8-bit AdamW), "normuon" (NorMuon hybrid). |
| learning_rate | float | 2e-4 | The initial learning rate for the optimizer. |
| lr_scheduler_type | string | "linear" | Learning rate schedule function: "linear", "cosine", or "wsd". |
| warmup_ratio | float | 0.0 | Ratio of total training steps used for linear warmup from 0 to learning_rate. |
| warmup_steps | int | 0 | Number of steps for linear warmup. Overrides warmup_ratio if set. |
| cooldown_steps | int | 0 | Number of steps for linear cooldown from learning_rate to final_lr_fraction * learning_rate. |
| final_lr_fraction | float | 0.0 | Final learning rate as a fraction of the initial learning rate. |
| weight_decay | float | 0.1 | Weight decay applied to all layers except bias and LayerNorm weights. |
| max_grad_norm | float | 1.0 | Maximum gradient norm for gradient clipping. 0.0 disables clipping. |
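
For example, a "wsd" schedule with explicit warmup and cooldown might look like the following; the step counts and final fraction are placeholders:

lr_scheduler_type: wsd
learning_rate: 2e-4
warmup_steps: 100
cooldown_steps: 500
final_lr_fraction: 0.1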

AdamW 8-bit Optimizer Parameters

Used when optimizer: "adamw_8bit" (default).

| Option | Type | Default | Description |
|---|---|---|---|
| adamw_beta1 | float | 0.9 | The beta1 parameter for the AdamW optimizer. |
| adamw_beta2 | float | 0.999 | The beta2 parameter for the AdamW optimizer. |
| adamw_epsilon | float | 1e-8 | The epsilon parameter for the AdamW optimizer. |

NorMuon Optimizer Parameters

Used when optimizer: "normuon". NorMuon uses a hybrid approach: AdamW for embeddings/norms/lm_head, and orthogonalized momentum for 2D weight matrices.

| Option | Type | Default | Description |
|---|---|---|---|
| normuon_momentum | float | 0.95 | Momentum coefficient for orthogonalized momentum updates in 2D weight matrices. |
| normuon_beta2 | float | 0.95 | Second moment coefficient for variance tracking in the NorMuon optimizer. |
| normuon_cautious_wd | bool | true | Enable cautious weight decay that only applies decay when gradient and momentum align. |
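
A minimal sketch for switching to NorMuon; the values shown are simply the documented defaults:

optimizer: normuon
normuon_momentum: 0.95
normuon_beta2: 0.95
normuon_cautious_wd: true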

Training Loop Settings

| Option | Type | Default | Description |
|---|---|---|---|
| per_device_train_batch_size | int | 2 | Batch size per device during training/evaluation. |
| gradient_accumulation_steps | int | 4 | Number of update steps to accumulate gradients before performing a backward/update pass. Effective batch size = batch_size × grad_accumulation × num_gpus. |
| max_steps | int | -1 | Total number of training steps. -1 derives the value from epochs and dataset size. |
| eval_steps | int | 100 | Run evaluation every N optimizer steps. |
| train_vision | bool | null | If true, run the vision encoder during training to process images/videos. If false, train on text only. If null and the model is multimodal, defaults to true. |
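
For example, with the settings below on two GPUs the effective batch size is 2 × 4 × 2 = 16 samples per optimizer step:

per_device_train_batch_size: 2
gradient_accumulation_steps: 4
gpus: 2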

Dataset Settings

| Option | Type | Default | Description |
|---|---|---|---|
| datasets | list | null | List of datasets for training. Each dataset should specify path, type, and other dataset-specific options. See Dataset Configuration Options below. |
| validation_datasets | list | null | List of datasets for validation during training. If not provided, uses validation_split_ratio to create a validation split from the training data. Uses the same format as datasets. |
| validation_split_ratio | float | 0.1 | Ratio of training data to use for validation if no validation_datasets are provided. Value between 0.0 and 1.0. |
| train_seed | int | 1234 | Random seed for the training dataloader. Controls shuffling and sampling order. |
| eval_seed | int | 1234 | Random seed for the evaluation dataloader. Controls shuffling and sampling order. |
| dataloader_num_workers | int | auto | Number of subprocesses to use for data loading. 0 means data will be loaded in the main process. Defaults to a value based on the CPU count. |
| sample_packing | bool | true | Whether to enable sample packing to fit multiple data samples into a single sequence. Packing reduces the number of samples in the dataset; adjust gradient accumulation steps and learning rate accordingly for packed datasets. |

Dataset Configuration Options

Each dataset in the datasets or validation_datasets list is configured with the following options. Dataset type determines which additional fields are required.

Base Dataset Options (All Types)

| Option | Type | Default | Description |
|---|---|---|---|
| path | string | required | HuggingFace dataset repo, s3:// URL, gs:// URL, or path to a local file or directory. |
| type | string | required | Dataset type. Options: "text", "instruction", "conversation", "auto" (auto-detect format). |
| subset | string | null | HuggingFace dataset subset/configuration name to load (e.g., "default" for datasets with multiple configurations). |
| split | string | "train" | Dataset split to load. Common values: "train", "test", "validation". |
| samples | int | null | Limit the number of samples to use from this dataset. If not specified, uses all available samples. |

Text Dataset Options (type: "text")

For pre-training or continued pre-training on raw text data.

| Option | Type | Default | Description |
|---|---|---|---|
| text_field | string | "text" | Name of the column in the dataset that contains the raw text content. |

Example:

datasets:
  - path: "HuggingFaceFW/fineweb-edu"
    type: text
    text_field: text
    split: train
    samples: 100000

Instruction Dataset Options (type: "instruction")

For instruction-following datasets with system/instruction/input/output format.

| Option | Type | Default | Description |
|---|---|---|---|
| instruction_field | string | required | Name of the column containing the instruction/question. |
| output_field | string | required | Name of the column containing the expected output/answer. |
| input_field | string | null | Name of the column containing additional input context (optional). |
| system_prompt_type | string | null | How to provide the system prompt. Options: "field" (from a dataset column), "fixed" (same for all samples), null. |
| system_prompt_field | string | null | Name of the column containing system prompts (required when system_prompt_type: "field"). |
| system_prompt | string | null | Fixed system prompt text to use for all samples (required when system_prompt_type: "fixed"). |
| prompt_format | string | null | Custom prompt format template. Use {system}, {instruction}, {input}, {output} as placeholders. |
| prompt_format_no_input | string | null | Custom prompt format when there is no input field. Use {system}, {instruction}, {output} as placeholders. |

Example:

datasets:
  - path: "yahma/alpaca-cleaned"
    type: instruction
    instruction_field: instruction
    input_field: input
    output_field: output
    system_prompt_type: fixed
    system_prompt: "You are a helpful AI assistant."

Conversation Dataset Options (type: "conversation")

For multi-turn conversational datasets in chat format.

| Option | Type | Default | Description |
|---|---|---|---|
| messages_field | string | "messages" | Name of the column containing the list of conversation messages. |
| system_field | string | null | Name of the column containing the system prompt for the conversation (optional). |
| tools_field | string | null | Name of the column containing tool/function definitions for function calling. |
| message_property_mappings | dict | {"role": "role", "content": "content", ...} | Mapping of message property names if the dataset uses non-standard field names. |

Example:

datasets:
  - path: "HuggingFaceH4/ultrachat_200k"
    type: conversation
    messages_field: messages
    split: train_sft
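
For datasets that store messages under non-standard keys (for example a ShareGPT-style schema with "from"/"value" fields), message_property_mappings remaps them. The dataset path, field names, and mapping direction below are hypothetical; check your dataset's schema before copying this:

datasets:
  - path: "your-org/sharegpt-style-dataset"  # hypothetical dataset
    type: conversation
    messages_field: conversations
    message_property_mappings:
      role: from
      content: value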

Memory Optimization Settings

| Option | Type | Default | Description |
|---|---|---|---|
| lmhead_chunks | int | 1 | Split the LM-head computation into N chunks to reduce the logit tensor size by a factor of N. |
| attn_bwd_chunks | int | 1 | Split the attention backward pass into N chunks to save workspace memory. |
| init_projections_to_zero | bool | false | Initialize projection weights (FFN down and attention out) to zero. Only used when training from scratch. |

LoRA Settings

| Option | Type | Default | Description |
|---|---|---|---|
| lora | bool | true | Whether to use LoRA adapters for training. |
| lora_rank | int | 16 | Rank for LoRA adapters. |
| lora_alpha | int | 32 | Alpha value for LoRA adapters. |
| lora_dropout | float | 0.05 | Dropout rate for LoRA adapters. |
| lora_dtype | string | "fp32" | Data type for LoRA adapters: "bf16" or "fp32". |
| lora_target_modules | list | ["all"] | List of module names to apply LoRA adapters to. |
| train_router | bool | false | Train the MoE router gate during LoRA fine-tuning. Only applies to MoE models. |
| adapter_path | string | null | Path to a PEFT adapter directory to merge into the base weights before training. Requires lora: true. Not supported with pre-quantized models. |
| merge_adapter | bool | false | Whether to merge LoRA adapters into the base model after training. |
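
A sketch of resuming from an existing PEFT adapter and merging the trained adapters into the base model afterwards; the adapter path is a placeholder:

lora: true
lora_rank: 16
lora_alpha: 32
adapter_path: ./previous-run/adapter  # placeholder path to a PEFT adapter directory
merge_adapter: true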

MoE Settings

MoE (Mixture-of-Experts) settings control router loss coefficients for load balancing during training.

| Option | Type | Default | Description |
|---|---|---|---|
| router_aux_loss_coef | float | null | MoE auxiliary (load balancing) loss coefficient. null uses the model config default. |
| router_z_loss_coef | float | null | MoE z-loss (router logit regularization) coefficient. null uses the model config default. |

Understanding MoE Losses:

  • Auxiliary Loss (aux_loss): Encourages load balancing across experts. Higher values enforce more even token distribution but may reduce model capacity. Typical range: 0.001-0.1.
  • Z-Loss (z_loss): Regularizes router logits to prevent them from growing too large, which can cause routing collapse. Typical range: 0.0001-0.01.
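
Leaving both coefficients at null keeps the model config defaults. If you do override them, values inside the typical ranges above are a reasonable starting point; the numbers below are illustrative only:

router_aux_loss_coef: 0.01
router_z_loss_coef: 0.001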

QLoRA Settings

| Option | Type | Default | Description |
|---|---|---|---|
| qlora_fp4 | bool | false | Enable NVFP4 QLoRA mode (base weights quantized to FP4 E2M1). Requires a Blackwell GPU (SM100+). |
| qlora_fp8 | bool | false | Enable FP8 QLoRA mode (base weights quantized to FP8 with per-block scales). |
| qlora_bnb | bool | false | Enable BitsAndBytes NF4 QLoRA mode (base weights quantized to NF4 with per-block absmax). Works on any CUDA GPU. |
| qlora_block_size | int | 128 | Block size for FP8 QLoRA quantization. Valid values: 64, 128, 256. |
| qlora_bnb_block_size | int | 64 | Block size for BnB NF4 QLoRA quantization. Valid values: 64, 128, 256, 512. |
| qlora_bnb_double_quant | bool | true | Enable double quantization for BnB (quantize absmax values to INT8 for extra memory savings). |
| qlora_four_over_six | bool | true | Enable Four Over Six (4/6) adaptive block scaling for NVFP4 QLoRA quantization. Evaluates both max=4 and max=6 scaling per block and selects the lower-error option. |
| qlora_selective_expert_dequant | bool | false | Enable selective expert dequantization for MoE models to reduce dequant buffer memory. When enabled, only the experts actually selected by the router for each forward pass are dequantized, rather than all experts. |
| qlora_offload_experts | bool | false | Offload expert weights in QLoRA MoE models to host memory. Works at the layer level (loads/unloads an entire layer's experts as a group). |
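
A minimal sketch of a BitsAndBytes NF4 QLoRA run (works on any CUDA GPU); the block-size values shown are simply the documented defaults:

lora: true
qlora_bnb: true
qlora_bnb_block_size: 64
qlora_bnb_double_quant: true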

Chat Template Settings

Chat template settings control how conversations are formatted for training and inference.

| Option | Type | Default | Description |
|---|---|---|---|
| use_chat_template | bool | true | Whether to use a chat template for training. |
| template | string | auto | The chat template to use. Automatically detected from the model if not specified. Available templates are defined in CHAT_TEMPLATE_MAPPING. |
| system | string | null | Override the default system prompt in the template. Use \n for newlines. |
| max_length | int | null | Maximum length for tokenized conversations. Defaults to sequence_len if not specified. |
| truncation_strategy | string | "delete" | How to handle conversations exceeding max_length. Options: "delete" (skip sample), "left" (truncate from start), "right" (truncate from end), "split" (split into multiple samples). |
| padding_side | string | "right" | Which side to pad sequences on. Options: "left", "right". |
| padding_free | bool | false | Enable padding-free training for more efficient packing. |
| loss_scale | string | "default" | Loss scaling strategy. Options: "default", or a custom scaling configuration. |
| sequence_parallel_size | int | 1 | Sequence parallelism size for distributed training across the sequence dimension. |
| response_prefix | string | null | Prefix to add before model responses during inference. Use \n for newlines. |
| max_pixels | int | null | Maximum number of pixels for vision models (multimodal only). |
| norm_bbox | string | null | Bounding box normalization strategy for vision models. Options: "norm1000", "none", null. |
| agent_template | string | null | Template for agent-style conversations (advanced usage). |

Logging & Reporting

| Option | Type | Default | Description |
|---|---|---|---|
| report_to | list | null | Report results and logs to the specified platforms. Options: "wandb", "aim". |
| log_file | string | null | Where to save the training log. If null, no log file is written. |
| log_gpu_util | int | 100 | Interval for logging GPU utilization. |

WandB (Weights & Biases) Settings

| Option | Type | Default | Description |
|---|---|---|---|
| wandb_project | string | null | WandB project name for logging. |
| wandb_name | string | run_name | WandB run name for logging. Defaults to the value of run_name. |
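
For example, to log to Weights & Biases (the project and run names below are placeholders):

report_to:
  - wandb
wandb_project: my-project        # placeholder
wandb_name: qwen3-finetune-run   # placeholder; defaults to run_name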

Aim Settings

| Option | Type | Default | Description |
|---|---|---|---|
| aim_experiment | string | null | Aim experiment name for logging. |
| aim_repo | string | null | Aim repository path for logging. Uses the default if not specified. |
| aim_name | string | run_name | Aim run name for logging. Defaults to the value of run_name. |

Debugging Options

| Option | Type | Default | Description |
|---|---|---|---|
| debug_time_breakdown | bool | false | Enable a detailed training timing breakdown for debugging. |
| debug_memory_breakdown | bool | false | Print a detailed memory breakdown after model allocation (useful for QLoRA optimization). |

Training Diagnostics & Automation

These options control automatic training monitoring, early stopping, and compute-optimal adjustments. All are disabled by default and safe to enable — they only add diagnostics or automation on top of the normal training loop.

| Option | Type | Default | Description |
|---|---|---|---|
| auto_lr_reduction | bool | false | Detect loss spikes and gradient explosions, then permanently reduce the learning rate. Monitors a rolling window of loss/grad-norm values; when an anomaly is detected (loss > mean + 3σ, or grad_norm > 10× average), the LR schedule is scaled down by 50%. Up to 5 reductions. |
| early_stop | bool | false | Multi-criteria early stopping. Stops training when ANY of: (1) convergence score > 0.85 for 5 consecutive evals, (2) compute efficiency (loss reduction per FLOP) drops below 50% of peak, (3) training diverges for 200+ consecutive steps, (4) loss plateaus for 500+ consecutive steps. Uses the 6N approximation for FLOPs/token. |
| epoch_adjustment | bool | false | Automatically adjust num_epochs to match the Chinchilla-optimal token budget (20× model parameters). If the dataset is smaller than the budget, increases epochs; if larger, decreases them. Only applies when max_steps is not explicitly set. |

Always-on diagnostics (no config flag required):

  • Plateau detection: Warns when training loss stops improving over a rolling window. No automatic action taken.
  • Phase detection: Classifies training into WARMUP / CONVERGING / PLATEAU / UNSTABLE / DIVERGING phases. Phase transitions are logged and the current phase is shown in the step log output.
  • Chinchilla token budget: Printed at training start — shows the Chinchilla-optimal token count (20 × params) alongside planned tokens, so you can gauge training sufficiency at a glance.
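
As a worked example of the token budget above: a 0.6B-parameter model such as Qwen/Qwen3-0.6B has a Chinchilla-optimal budget of roughly 20 × 0.6B ≈ 12B tokens, which you can compare against the planned-token count printed at training start.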

Recipe Comparison

| Recipe | Format | GPU Requirement | Use Case |
|---|---|---|---|
| bf16 | BF16 forward/backward | Any CUDA GPU | Baseline, maximum compatibility |
| fp8_hybrid | FP8 E4M3 fwd / E5M2 bwd | SM89+ (Ada, Hopper, Blackwell) | 2x throughput, minimal accuracy loss |
| nvfp4 | FP4 E2M1 with block scaling | SM100+ (Blackwell only) | Maximum memory efficiency |
| nvfp4_quartet | FP4 E2M1 quartet scaling | SM100+ (Blackwell only) | Higher-accuracy FP4 training |

Example Configuration

# Model
model: Qwen/Qwen3-0.6B
model_type: qwen # auto-detected if not specified
sequence_len: 2048
max_model_len: 2048
torch_dtype: bfloat16 # auto-detected if not specified

# Output
output_dir: ./output
save_steps: 100
save_total_limit: 3

# Training
num_epochs: 3
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 2e-4
lr_scheduler_type: cosine
warmup_ratio: 0.03

# Dataset
datasets:
  # Conversation dataset (most common for fine-tuning)
  - path: "mlabonne/FineTome-100k"
    type: conversation
    messages_field: conversations
    split: train
  # Or use instruction dataset format
  # - path: "yahma/alpaca-cleaned"
  #   type: instruction
  #   instruction_field: instruction
  #   input_field: input
  #   output_field: output
validation_split_ratio: 0.1
train_seed: 1234
eval_seed: 1234
sample_packing: true
dataloader_num_workers: 4

# Chat Template
use_chat_template: true
template: qwen # auto-detected if not specified
truncation_strategy: delete
padding_side: right

# LoRA
lora: true
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
lora_dtype: fp32

# Memory optimization
recompute: true
recipe: bf16

# Hardware
gpus: 1
use_cuda_graphs: true