Training

This guide covers training models in visdet.

Quick Start

To train a model with a single GPU:

python tools/train.py ${CONFIG_FILE}
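
For example, using the Faster R-CNN config that also appears in the multi-GPU example below:

python tools/train.py configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py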

Multi-GPU Training

For distributed training across multiple GPUs:

bash tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM}

Example with 8 GPUs:

bash tools/dist_train.sh configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py 8

Datasets

COCO 2017 (local)

python tools/misc/download_dataset.py --dataset-name coco2017 --save-dir data/coco --unzip --delete
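
Assuming the script mirrors mmdetection's download_dataset.py, this leaves the standard COCO 2017 layout under data/coco/:

data/coco/
├── annotations/   # instances_train2017.json, instances_val2017.json, ...
├── train2017/     # training images
├── val2017/       # validation images
└── test2017/      # test images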

COCO 2017 (Modal Volume)

Populate a persistent Modal Volume with COCO under /root/data/coco/:

VISDET_COCO_VOLUME=visdet-coco modal run tools/modal/download_coco2017_to_volume.py

Mount the same volume at /root/data in your training app so configs using data/coco/ work unchanged.
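
A minimal sketch of the training side, assuming Modal's current Python API and an image that already contains the visdet repo (the app name, GPU type, and image setup are illustrative, not part of visdet):

import modal

# Volume populated by the download script above.
vol = modal.Volume.from_name("visdet-coco")

app = modal.App("visdet-train")  # app name is illustrative

@app.function(gpu="A100", volumes={"/root/data": vol})
def train():
    import subprocess
    # Configs referencing data/coco/ resolve to /root/data/coco/
    # when run from /root.
    subprocess.run(
        ["python", "tools/train.py",
         "configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py"],
        check=True,
    )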

Configuration

Training behavior is controlled through configuration files. See the Configuration Guide for details.

Key Configuration Options

  • Learning Rate: scale with the effective batch size (samples per GPU × number of GPUs); the common rule is linear scaling, as in the sketch below
  • Epochs: total number of passes over the training set
  • Batch Size: samples per GPU; the effective batch size is this multiplied by the number of GPUs
  • Optimizer: the optimization algorithm, e.g. SGD, Adam, or AdamW
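
A minimal config sketch covering these options, in the same dict-based style as the logging and checkpoint snippets below (the values are illustrative defaults, not visdet recommendations):

# 8 GPUs × 2 samples/GPU = effective batch size 16; lr scaled accordingly
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
runner = dict(type='EpochBasedRunner', max_epochs=12)  # total epochs
data = dict(samples_per_gpu=2, workers_per_gpu=2)      # per-GPU batch size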

Monitoring Training

TensorBoard

Enable TensorBoard logging in your config:

log_config = dict(
    interval=50,  # log every 50 iterations
    hooks=[
        dict(type='TextLoggerHook'),        # plain-text console/file logs
        dict(type='TensorboardLoggerHook')  # TensorBoard event files
    ])

Then run:

tensorboard --logdir=work_dirs/

Checkpoints

Model checkpoints are saved to work_dirs/ by default. Configure checkpoint behavior:

checkpoint_config = dict(interval=1)  # Save every epoch
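
If the checkpoint hook follows the mmcv convention (an assumption, not confirmed for visdet), you can also cap how many checkpoints are kept on disk:

checkpoint_config = dict(
    interval=1,        # save every epoch
    max_keep_ckpts=3,  # delete all but the 3 most recent checkpoints
)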

Advanced Topics

Mixed Precision Training

Enable FP16 (mixed precision) training to speed up training and reduce memory usage:

python tools/train.py ${CONFIG_FILE} --fp16
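
In mmdetection-style codebases the same thing can be set in the config instead of on the command line; whether visdet honors this key is an assumption:

fp16 = dict(loss_scale=512.)  # static loss scale for mixed precision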

Resume Training

Resume from a checkpoint:

python tools/train.py ${CONFIG_FILE} --resume-from ${CHECKPOINT_FILE}
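
For example, picking up the latest checkpoint from the default work_dirs layout (a work directory named after the config is an mmdetection-style convention, assumed here for visdet):

python tools/train.py configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py --resume-from work_dirs/faster_rcnn_r50_fpn_1x_coco/latest.pth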

Fine-tuning

Load pretrained weights and train:

python tools/train.py ${CONFIG_FILE} --load-from ${CHECKPOINT_FILE}
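
Unlike --resume-from, which also restores the optimizer state and epoch counter, --load-from initializes only the model weights and starts a fresh training schedule (the mmdetection convention, assumed to carry over to visdet). The same effect can typically be had in the config; the path below is illustrative:

load_from = 'checkpoints/faster_rcnn_r50_fpn_1x_coco.pth'  # pretrained weights (illustrative path)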

See Also