
Config Reference

This page is the canonical reference for all configuration options available in Surogate. Each option is described with its purpose, type, default value, and accepted values.

General Settings

| Option | Type | Default | Description |
|---|---|---|---|
| run_name | string | auto-generated | A descriptor for the run. If not provided, a unique name is generated automatically. |
| apply_recommended_values | bool | false | Whether to apply recommended configuration values. |
| num_epochs | int | 3 | Total number of training epochs to perform. |
| output_dir | string | "output" | The output directory where the model predictions and checkpoints will be written. |
| checkpoint_dir | string | "output" | Directory to save checkpoints during training. If None, defaults to output_dir. |
| resume_from_checkpoint | bool | false | Continue from a checkpoint. If enabled, uses the latest checkpoint. |
| save_steps | int | 50 | Number of steps between saving checkpoints. |
| save_total_limit | int | 5 | Limit the total number of checkpoints. Deletes older checkpoints in output_dir. |
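
For illustration, a minimal checkpointing setup might look like the following sketch (the run name and checkpoint directory are hypothetical examples, not defaults):

```yaml
# Illustrative checkpointing block; values are example choices, not recommendations.
run_name: my-finetune-run        # hypothetical; auto-generated if omitted
output_dir: ./output
checkpoint_dir: ./output/checkpoints
save_steps: 50                   # checkpoint every 50 steps
save_total_limit: 5              # keep only the 5 most recent checkpoints
resume_from_checkpoint: false
```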

Model Settings

| Option | Type | Default | Description |
|---|---|---|---|
| model | string | required | Path or HuggingFace model identifier (e.g., "Qwen/Qwen3-0.6B"). |
| model_type | string | auto-detect | Type of the model group. Automatically detected from the model config if not specified. |
| sequence_len | int | model's max | Maximum sequence length for training. Defaults to the model's max_model_len. |
| max_model_len | int | auto-detect | Maximum model length for RoPE scaling. Automatically detected from the model config if not specified. |
| rope_scaling | string | null | Type of RoPE scaling. Pass a string like "linear", "dynamic", or "yarn" along with max_model_len to automatically configure rope_scaling. Alternatively, pass a JSON string like '{"factor": 2.0, "type": "yarn"}' to directly override the rope_scaling in the model's config. |
| torch_dtype | string | auto-detect | PyTorch data type for model weights. Options: "bfloat16", "float16", "float32". Automatically detected from the model config if not specified. |

Recomputation Options

Recomputation options trade compute for memory by recomputing activations during the backward pass instead of storing them.

| Option | Type | Default | Description |
|---|---|---|---|
| recompute_swiglu | bool | true | Recompute the SwiGLU activation during the backward pass. As SwiGLU sits at the widest part of the model, this yields substantial memory savings at moderate compute cost. |
| recompute_rmsnorm | bool | true | Recompute RMSNorm activations during the backward pass to save memory. |
| recompute_ffn | bool | true | Recompute Feed-Forward Network (FFN) activations during the backward pass. Implies recompute_swiglu. |
| recompute_qkv | bool | true | Recompute QKV projections during the backward pass to save memory. |
| recompute_att | bool | true | Recompute the attention block during the backward pass. Implies recompute_qkv. |
| recompute_block | bool | true | Recompute the entire Transformer block during the backward pass to save memory. |
| recompute_lora | bool | true | Recompute ln1/ln2 activations during the LoRA backward pass instead of storing them per layer. Only effective when LoRA is enabled. Requires and sets recompute_block to true. When used with offload_residual, CUDA graphs are disabled. |

Recomputation Hierarchy

The recomputation options form a hierarchy:

  • recompute_block → implies recompute_att, recompute_ffn, recompute_rmsnorm
  • recompute_att → implies recompute_qkv
  • recompute_ffn → implies recompute_swiglu
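
For example, leaving recompute_block enabled already covers the finer-grained options; the sketch below shows that coarsest setting alongside a commented-out fine-grained alternative (an illustrative combination, not a recommendation):

```yaml
# Coarsest setting: recomputing whole blocks implies recompute_att, recompute_ffn,
# recompute_rmsnorm and, transitively, recompute_qkv and recompute_swiglu.
recompute_block: true

# Fine-grained alternative (commented out): keep only the cheapest recomputation.
# recompute_block: false
# recompute_att: false
# recompute_ffn: false
# recompute_swiglu: true
```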

Offloading Options

Offloading options move tensors to host (CPU) memory to reduce GPU memory usage at the cost of increased data transfer overhead.

| Option | Type | Default | Description |
|---|---|---|---|
| offload_residual | bool | false | Offload residuals (of the FFN block) to pinned host memory. Combined with recompute_block, total activation memory becomes independent of network depth. |
| offload_master | bool | false | Store master weights in pinned host memory. |
| offload_quants | bool | false | Store quantized weights in pinned host memory. Requires persistent_quants. |
| offload_optimizer | bool | false | Store optimizer state in pinned host memory. Slows down the optimizer step drastically, but with enough gradient accumulation steps the overall contribution becomes negligible. |
| offload_grads | bool | false | Offload gradients to pinned host memory. |
| persistent_quants | bool | false | Avoid re-quantization of weights. Increases memory, but when combined with offload_quants, the additional memory is placed on the host. In PCIe settings, this can lead to significant speed-ups. Requires shard_weights. |
| use_zero_copy | bool | false | Use ZeroCopy memory access instead of double-buffered cudaMemcpy for offloaded optimizer states. DMA is slower on consumer cards but faster on professional cards. |
| use_write_combined | bool | false | Use write-combined memory for offloaded tensors. May improve PCIe throughput in some situations. |
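
As an illustration, a low-VRAM combination could look like the sketch below; whether it pays off depends on your GPU and PCIe bandwidth, and the dependency chain (offload_quants → persistent_quants → shard_weights) must be respected:

```yaml
# Example low-VRAM combination (illustrative; tune for your GPU and interconnect).
recompute_block: true        # with offload_residual, activation memory becomes depth-independent
offload_residual: true
offload_optimizer: true      # slow optimizer step, amortized by gradient accumulation
offload_grads: true
shard_weights: true          # required by persistent_quants
persistent_quants: true      # required by offload_quants
offload_quants: true
```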

Distributed Training (ZeRO) Options

| Option | Type | Default | Description |
|---|---|---|---|
| zero_level | int | 1 | ZeRO redundancy optimization level: 1 = sharded optimizer states (default), 2 = sharded gradients + optimizer states, 3 = sharded weights + gradients + optimizer states. |
| shard_weights | bool | false | Shard model weights across data-parallel processes. Enables more effective offloading and reduces memory consumption. |
| shard_gradients | bool | false | Shard gradients across data-parallel processes. Enables more effective offloading and reduces memory consumption. |
| use_all_to_all_reduce | bool | false | Use all-to-all-based reduce algorithm (combine with memcpy_send_recv). |
| memcpy_all_gather | bool | false | Use memcpy for all-gather operations (threads backend only). Generally gets better bandwidth utilization on PCIe and does not consume SM resources. |
| memcpy_send_recv | bool | false | Use memcpy for send/receive operations (threads backend only). |
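
For example, a sketch that shards gradients and optimizer states across data-parallel ranks (the memcpy options are shown commented out because they apply to the threads backend only):

```yaml
# ZeRO level 2: shard gradients + optimizer states across data-parallel processes.
zero_level: 2
# Threads-backend-only communication tweaks (illustrative):
# use_all_to_all_reduce: true
# memcpy_send_recv: true
# memcpy_all_gather: true
```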

Hardware Settings

| Option | Type | Default | Description |
|---|---|---|---|
| gpus | int | 1 | Number of GPUs to use for training. Use 0 for all available GPUs. |
| use_cuda_graphs | bool | true | Enable CUDA graphs for performance. Automatically disabled for QLoRA and when recompute_lora conflicts with offload_residual. |

Mixed Precision & Recipe Options

| Option | Type | Default | Description |
|---|---|---|---|
| recipe | string | "bf16" | Mixed precision training recipe. Options: "bf16" (default), "fp8_hybrid", "nvfp4". |
| gradient_dtype | string | null | Dtype for activation gradients / backward matmul policy. Defaults to matmul-dtype. Note: recipes may override backward dtype. |
| master_dtype | string | null | Master weight dtype for optimizer updates (e.g., FP32 for stable full fine-tuning). Defaults to model-dtype. |
| use_fused_rope | bool | false | Use fused RoPE kernel with on-the-fly cos/sin computation (saves memory, reduces bandwidth). |
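
For instance, selecting the FP8 hybrid recipe is a one-line change (GPU requirements are listed in the recipe comparison further below); the fused RoPE kernel shown here is optional:

```yaml
# Example: FP8 hybrid mixed-precision recipe (requires SM89+).
recipe: fp8_hybrid
use_fused_rope: true         # optional; saves memory and bandwidth
# gradient_dtype and master_dtype are left at their defaults (matmul / model dtype).
```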

FP8 Recipe Options

| Option | Type | Default | Description |
|---|---|---|---|
| fp8_amax_history | int | 1024 | FP8 delayed scaling amax history length (for fp8_hybrid recipe). |

FP4/NVFP4 Recipe Options

| Option | Type | Default | Description |
|---|---|---|---|
| fp4_backend | string | "cutlass" | FP4 matmul backend: "cutlass" (default) or "cudnn" (for nvfp4 recipe). |

Layer Quantization Skip Options

| Option | Type | Default | Description |
|---|---|---|---|
| skip_quant_first_layers | int | 0 | Skip quantization for the first N transformer layers (embedding layers kept in BF16). |
| skip_quant_last_layers | int | 0 | Skip quantization for the last N transformer layers (lm_head layers kept in BF16). |
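
As a sketch, an NVFP4 run that keeps the first and last transformer layers in BF16 could be configured like this (layer counts are illustrative):

```yaml
# Example: NVFP4 recipe with the outermost transformer layers left unquantized.
recipe: nvfp4
fp4_backend: cutlass
skip_quant_first_layers: 1
skip_quant_last_layers: 1
```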

Optimizer Settings

| Option | Type | Default | Description |
|---|---|---|---|
| optimizer | string | "adamw_8bit" | Optimizer type. Options: "adamw_8bit" (8-bit AdamW), "normuon" (NorMuon hybrid). |
| learning_rate | float | 2e-4 | The initial learning rate for the optimizer. |
| lr_scheduler_type | string | "linear" | Learning rate schedule function: "linear", "cosine", or "wsd". |
| warmup_ratio | float | 0.0 | Ratio of total training steps used for linear warmup from 0 to learning_rate. |
| warmup_steps | int | 0 | Number of steps for linear warmup. Overrides warmup_ratio if set. |
| cooldown_steps | int | 0 | Number of steps for linear cooldown from learning_rate to final_lr_fraction * learning_rate. |
| final_lr_fraction | float | 0.0 | Final learning rate as a fraction of the initial learning rate. |
| weight_decay | float | 0.1 | Weight decay applied to all layers except bias and LayerNorm weights. |
| max_grad_norm | float | 1.0 | Maximum gradient norm for gradient clipping. 0.0 disables clipping. |
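
As an example, the sketch below pairs cooldown_steps and final_lr_fraction with the "wsd" schedule; treating them as the decay phase of that schedule is an assumption here, and the values are illustrative:

```yaml
# Example warmup-stable-decay ("wsd") style schedule (illustrative values).
lr_scheduler_type: wsd
learning_rate: 2e-4
warmup_steps: 100            # overrides warmup_ratio when set
cooldown_steps: 500          # linear cooldown at the end of training
final_lr_fraction: 0.1       # decay to 10% of the initial learning rate
weight_decay: 0.1
max_grad_norm: 1.0
```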

AdamW 8-bit Optimizer Parameters

Used when optimizer: "adamw_8bit" (default).

| Option | Type | Default | Description |
|---|---|---|---|
| adamw_beta1 | float | 0.9 | The beta1 parameter for the AdamW optimizer. |
| adamw_beta2 | float | 0.999 | The beta2 parameter for the AdamW optimizer. |
| adamw_epsilon | float | 1e-8 | The epsilon parameter for the AdamW optimizer. |

NorMuon Optimizer Parameters

Used when optimizer: "normuon". NorMuon uses a hybrid approach: AdamW for embeddings/norms/lm_head, and orthogonalized momentum for 2D weight matrices.

| Option | Type | Default | Description |
|---|---|---|---|
| normuon_momentum | float | 0.95 | Momentum coefficient for orthogonalized momentum updates in 2D weight matrices. |
| normuon_beta2 | float | 0.95 | Second moment coefficient for variance tracking in the NorMuon optimizer. |
| normuon_cautious_wd | bool | true | Enable cautious weight decay that only applies decay when gradient and momentum align. |
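
For example, switching to NorMuon while stating its defaults explicitly:

```yaml
# Example: hybrid NorMuon optimizer (AdamW for embeddings/norms/lm_head,
# orthogonalized momentum for 2D weight matrices). Values shown are the defaults.
optimizer: normuon
normuon_momentum: 0.95
normuon_beta2: 0.95
normuon_cautious_wd: true
```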

Training Loop Settings

| Option | Type | Default | Description |
|---|---|---|---|
| per_device_train_batch_size | int | 2 | Batch size per device during training/evaluation. |
| gradient_accumulation_steps | int | 4 | Number of update steps to accumulate gradients before performing a backward/update pass. Effective batch size = batch_size × grad_accumulation × num_gpus. |
| max_steps | int | -1 | Total number of training steps. -1 derives the step count from num_epochs and dataset size. |
| eval_steps | int | 100 | Run evaluation every N optimizer steps. |
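
With the defaults below on a single GPU, the effective batch size works out to 2 × 4 × 1 = 8 samples per optimizer step:

```yaml
# Effective batch size = per_device_train_batch_size × gradient_accumulation_steps × num_gpus
#                      = 2 × 4 × 1 = 8 samples per optimizer step.
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
gpus: 1
max_steps: -1                # derive the step count from num_epochs and dataset size
eval_steps: 100
```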

Dataset Settings

| Option | Type | Default | Description |
|---|---|---|---|
| datasets | list | null | List of datasets for training. Each dataset should specify path, type, and other dataset-specific options. See Dataset Configuration Options below. |
| validation_datasets | list | null | List of datasets for validation during training. If not provided, uses validation_split_ratio to create a validation split from the training data. Uses the same format as datasets. |
| validation_split_ratio | float | 0.1 | Ratio of training data to use for validation if no validation_datasets are provided. Value between 0.0 and 1.0. |
| train_seed | int | 1234 | Random seed for the training dataloader. Controls shuffling and sampling order. |
| eval_seed | int | 1234 | Random seed for the evaluation dataloader. Controls shuffling and sampling order. |
| dataloader_num_workers | int | auto | Number of subprocesses to use for data loading. 0 means data will be loaded in the main process. Defaults to an optimal value based on CPU count. |
| sample_packing | bool | true | Whether to enable sample packing to fit multiple data samples into a single sequence. Packing reduces the number of samples in the dataset; adjust gradient accumulation steps and learning rate accordingly for packed datasets. |
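
As a sketch, an explicit validation set can be supplied instead of relying on validation_split_ratio; the dataset paths below are placeholders, not real repositories:

```yaml
# Hypothetical dataset paths, for illustration only.
datasets:
  - path: "my-org/train-set"
    type: conversation
validation_datasets:
  - path: "my-org/held-out-set"
    type: conversation
train_seed: 1234
eval_seed: 1234
sample_packing: true         # adjust accumulation steps and learning rate for packed data
```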

Dataset Configuration Options

Each dataset in the datasets or validation_datasets list is configured with the following options. Dataset type determines which additional fields are required.

Base Dataset Options (All Types)

| Option | Type | Default | Description |
|---|---|---|---|
| path | string | required | HuggingFace dataset repo, s3:// URL, gs:// URL, or path to a local file or directory. |
| type | string | required | Dataset type. Options: "text", "instruction", "conversation", "auto" (auto-detect format). |
| subset | string | null | HuggingFace dataset subset/configuration name to load (e.g., "default" for datasets with multiple configurations). |
| split | string | "train" | Dataset split to load. Common values: "train", "test", "validation". |
| samples | int | null | Limit the number of samples to use from this dataset. If not specified, uses all available samples. |

Text Dataset Options (type: "text")

For pre-training or continued pre-training on raw text data.

| Option | Type | Default | Description |
|---|---|---|---|
| text_field | string | "text" | Name of the column in the dataset that contains the raw text content. |

Example:

```yaml
datasets:
  - path: "HuggingFaceFW/fineweb-edu"
    type: text
    text_field: text
    split: train
    samples: 100000
```

Instruction Dataset Options (type: "instruction")

For instruction-following datasets with system/instruction/input/output format.

| Option | Type | Default | Description |
|---|---|---|---|
| instruction_field | string | required | Name of the column containing the instruction/question. |
| output_field | string | required | Name of the column containing the expected output/answer. |
| input_field | string | null | Name of the column containing additional input context (optional). |
| system_prompt_type | string | null | How to provide the system prompt. Options: "field" (from a dataset column), "fixed" (same for all samples), null. |
| system_prompt_field | string | null | Name of the column containing system prompts (required when system_prompt_type: "field"). |
| system_prompt | string | null | Fixed system prompt text to use for all samples (required when system_prompt_type: "fixed"). |
| prompt_format | string | null | Custom prompt format template. Use {system}, {instruction}, {input}, {output} as placeholders. |
| prompt_format_no_input | string | null | Custom prompt format used when no input field is present. Use {system}, {instruction}, {output} as placeholders. |

Example:

```yaml
datasets:
  - path: "yahma/alpaca-cleaned"
    type: instruction
    instruction_field: instruction
    input_field: input
    output_field: output
    system_prompt_type: fixed
    system_prompt: "You are a helpful AI assistant."
```

Conversation Dataset Options (type: "conversation")

For multi-turn conversational datasets in chat format.

| Option | Type | Default | Description |
|---|---|---|---|
| messages_field | string | "messages" | Name of the column containing the list of conversation messages. |
| system_field | string | null | Name of the column containing the system prompt for the conversation (optional). |
| tools_field | string | null | Name of the column containing tool/function definitions for function calling. |
| message_property_mappings | dict | {"role": "role", "content": "content", ...} | Mapping of message property names if the dataset uses non-standard field names. |

Example:

```yaml
datasets:
  - path: "HuggingFaceH4/ultrachat_200k"
    type: conversation
    messages_field: messages
    split: train_sft
```

Memory Optimization Settings

| Option | Type | Default | Description |
|---|---|---|---|
| lmhead_chunks | int | 1 | Split LM-head computation into N chunks to reduce the logit tensor size by a factor of N. |
| attn_bwd_chunks | int | 1 | Split the attention backward pass into N chunks to save workspace memory. |
| init_projections_to_zero | bool | false | Initialize projection weights (FFN down and attention out) to zero. Only used when training from scratch. |
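
For instance, chunking the LM head into four pieces shrinks the logit tensor to a quarter of its size (the values below are illustrative):

```yaml
# Trade a little extra compute for a smaller logit tensor and attention workspace.
lmhead_chunks: 4             # logit tensor reduced by a factor of 4
attn_bwd_chunks: 2           # attention backward pass split into 2 chunks
```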

LoRA Settings

| Option | Type | Default | Description |
|---|---|---|---|
| lora | bool | true | Whether to use LoRA adapters for training. |
| lora_rank | int | 16 | Rank for LoRA adapters. |
| lora_alpha | int | 32 | Alpha value for LoRA adapters. |
| lora_dropout | float | 0.05 | Dropout rate for LoRA adapters. |
| lora_dtype | string | "fp32" | Data type for LoRA adapters: "bf16" or "fp32". |
| lora_target_modules | list | ["all-linear"] | List of module names to apply LoRA adapters to. |
| merge_adapter | bool | false | Whether to merge LoRA adapters into the base model after training. |

QLoRA Settings

| Option | Type | Default | Description |
|---|---|---|---|
| qlora_fp4 | bool | false | Enable NVFP4 QLoRA mode (base weights quantized to FP4 E2M1). Requires a Blackwell GPU (SM100+). |
| qlora_fp8 | bool | false | Enable FP8 QLoRA mode (base weights quantized to FP8 with per-block scales). |
| qlora_bnb | bool | false | Enable BitsAndBytes NF4 QLoRA mode (base weights quantized to NF4 with per-block absmax). Works on any CUDA GPU. |
| qlora_block_size | int | 128 | Block size for FP8 QLoRA quantization. Valid values: 64, 128, 256. |
| qlora_bnb_block_size | int | 64 | Block size for BnB NF4 QLoRA quantization. Valid values: 64, 128, 256, 512. |
| qlora_bnb_double_quant | bool | true | Enable double quantization for BnB (quantize absmax values to INT8 for extra memory savings). |
| qlora_four_over_six | bool | true | Enable Four Over Six (4/6) adaptive block scaling for NVFP4 QLoRA quantization. Evaluates both max=4 and max=6 scaling per block and selects the lower-error option. |
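
A sketch of a BitsAndBytes NF4 QLoRA setup (works on any CUDA GPU; CUDA graphs are disabled automatically for QLoRA):

```yaml
# Example: NF4 QLoRA; LoRA adapters train on top of NF4-quantized base weights.
lora: true
qlora_bnb: true
qlora_bnb_block_size: 64
qlora_bnb_double_quant: true
```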

Chat Template Settings

Chat template settings control how conversations are formatted for training and inference.

| Option | Type | Default | Description |
|---|---|---|---|
| use_chat_template | bool | true | Whether to use a chat template for training. |
| template | string | auto | The chat template to use. Automatically detected from the model if not specified. Available templates are defined in CHAT_TEMPLATE_MAPPING. |
| system | string | null | Override the default system prompt in the template. Use \n for newlines. |
| max_length | int | null | Maximum length for tokenized conversations. Defaults to sequence_len if not specified. |
| truncation_strategy | string | "delete" | How to handle conversations exceeding max_length. Options: "delete" (skip sample), "left" (truncate from start), "right" (truncate from end), "split" (split into multiple samples). |
| padding_side | string | "right" | Which side to pad sequences on. Options: "left", "right". |
| padding_free | bool | false | Enable padding-free training for more efficient packing. |
| loss_scale | string | "default" | Loss scaling strategy. Options: "default", or a custom scaling configuration. |
| sequence_parallel_size | int | 1 | Sequence parallelism size for distributed training across the sequence dimension. |
| response_prefix | string | null | Prefix to add before model responses during inference. Use \n for newlines. |
| max_pixels | int | null | Maximum number of pixels for vision models (multimodal only). |
| norm_bbox | string | null | Bounding box normalization strategy for vision models. Options: "norm1000", "none", null. |
| agent_template | string | null | Template for agent-style conversations (advanced usage). |
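
For example, a sketch that overrides the template's system prompt and splits over-long conversations instead of dropping them (the prompt text is a placeholder):

```yaml
# Example chat-template overrides (illustrative values).
use_chat_template: true
template: qwen               # usually auto-detected from the model
system: "You are a concise assistant.\nAnswer briefly."
truncation_strategy: split   # split conversations exceeding max_length into multiple samples
padding_side: right
```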

Logging & Reporting

| Option | Type | Default | Description |
|---|---|---|---|
| report_to | list | null | Report results and logs to the specified platforms. Options: "wandb", "aim". |
| log_file | string | auto-generated | Where to save the training log. Defaults to {output_dir}/log-{run_name}-{timestamp}.json. |
| log_gpu_util | int | 100 | Interval for logging GPU utilization. |

WandB (Weights & Biases) Settings

| Option | Type | Default | Description |
|---|---|---|---|
| wandb_project | string | "Surogate" | WandB project name for logging. |
| wandb_name | string | run_name | WandB run name for logging. Defaults to the value of run_name. |

Aim Settings

| Option | Type | Default | Description |
|---|---|---|---|
| aim_experiment | string | "Surogate" | Aim experiment name for logging. |
| aim_repo | string | null | Aim repository path for logging. Uses the default if not specified. |
| aim_name | string | run_name | Aim run name for logging. Defaults to the value of run_name. |
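
As an illustration, logging to Weights & Biases only requires enabling the reporter and, optionally, naming the project and run; the names below are placeholders:

```yaml
# Example: report metrics to WandB (names are illustrative).
report_to:
  - wandb
wandb_project: Surogate
wandb_name: my-finetune-run  # defaults to run_name if omitted
```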

Debugging Options

| Option | Type | Default | Description |
|---|---|---|---|
| debug_time_breakdown | bool | false | Enable detailed training timing breakdown for debugging. |
| debug_memory_breakdown | bool | false | Print a detailed memory breakdown after model allocation (useful for QLoRA optimization). |

Recipe Comparison

| Recipe | Format | GPU Requirement | Use Case |
|---|---|---|---|
| bf16 | BF16 forward/backward | Any CUDA GPU | Baseline, maximum compatibility |
| fp8_hybrid | FP8 E4M3 fwd / E5M2 bwd | SM89+ (Ada, Hopper, Blackwell) | 2x throughput, minimal accuracy loss |
| nvfp4 | FP4 E2M1 with block scaling | SM100+ (Blackwell only) | Maximum memory efficiency |

Example Configuration

```yaml
# Model
model: Qwen/Qwen3-0.6B
model_type: qwen # auto-detected if not specified
sequence_len: 2048
max_model_len: 2048
torch_dtype: bfloat16 # auto-detected if not specified

# Output
output_dir: ./output
save_steps: 100
save_total_limit: 3

# Training
num_epochs: 3
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 2e-4
lr_scheduler_type: cosine
warmup_ratio: 0.03

# Dataset
datasets:
  # Conversation dataset (most common for fine-tuning)
  - path: "mlabonne/FineTome-100k"
    type: conversation
    messages_field: conversations
    split: train
  # Or use instruction dataset format
  # - path: "yahma/alpaca-cleaned"
  #   type: instruction
  #   instruction_field: instruction
  #   input_field: input
  #   output_field: output
validation_split_ratio: 0.1
train_seed: 1234
eval_seed: 1234
sample_packing: true
dataloader_num_workers: 4

# Chat Template
use_chat_template: true
template: qwen # auto-detected if not specified
truncation_strategy: delete
padding_side: right

# LoRA
lora: true
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
lora_dtype: fp32

# Memory optimization
recompute_block: true
recipe: bf16

# Hardware
gpus: 1
use_cuda_graphs: true
```

See also