Training

This page describes how to train models on a single GPU or multiple GPUs, and how to monitor training progress.

Training Configuration

Training in this framework is controlled through configuration files. See the Configuration guide for details.

Single GPU Training

To train a model on a single GPU:

python tools/train.py ${CONFIG_FILE} [optional arguments]

Optional Arguments

  • --work-dir ${WORK_DIR}: Override the working directory specified in the config file.
  • --resume-from ${CHECKPOINT_FILE}: Resume training from a previous checkpoint file.
  • --no-validate: Disable evaluation of the checkpoint during training.
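
For example, to train with a custom working directory and resume from an earlier checkpoint (the paths and the latest.pth checkpoint name below are placeholders, not fixed conventions of the framework):

# hypothetical paths; adjust to your own run
python tools/train.py configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
    --work-dir work_dirs/my_experiment \
    --resume-from work_dirs/my_experiment/latest.pth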

Multi-GPU Training

This framework supports distributed training with multiple GPUs using torch.distributed.launch or Slurm.

Using torch.distributed.launch

bash ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]

Example with 8 GPUs:

bash ./tools/dist_train.sh configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py 8

Using Slurm

If you run on a cluster managed by Slurm:

bash ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR}
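
For example, to launch a job on a partition named dev (the partition name, job name, GPU count, and the GPUS environment variable are assumptions; check slurm_train.sh for the variables it actually reads):

# hypothetical invocation; assumes the script reads a GPUS variable
GPUS=16 bash ./tools/slurm_train.sh dev faster_rcnn configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py work_dirs/faster_rcnn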

Training Tips

Learning Rate

The default learning rate in config files is set for 8 GPUs. If you train with a different number of GPUs, scale the learning rate linearly with the GPU count (this assumes the number of samples per GPU stays the same):

lr = base_lr * num_gpus / 8
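
For example, if a config sets a base learning rate of 0.02 for 8 GPUs and you train on 4 GPUs, use lr = 0.02 * 4 / 8 = 0.01.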

Mixed Precision Training

You can enable automatic mixed precision training by adding --fp16 to the training command:

python tools/train.py ${CONFIG_FILE} --fp16
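
Because dist_train.sh also accepts optional arguments, the same flag can be passed to a distributed run, for example with 8 GPUs:

bash ./tools/dist_train.sh ${CONFIG_FILE} 8 --fp16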

Monitoring Training

TensorBoard

This framework supports TensorBoard for monitoring training progress. To use it, add the following to your config file:

log_config = dict(
    interval=50,  # log every 50 training iterations
    hooks=[
        dict(type='TextLoggerHook'),        # plain-text console/file logs
        dict(type='TensorboardLoggerHook')  # TensorBoard event files
    ])

Then run:

tensorboard --logdir=work_dirs/
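
To inspect a single run, point --logdir at that run's working directory. The directory name below assumes the working directory is named after the config file, which may not match your setup:

# hypothetical directory name; use your actual work dir
tensorboard --logdir=work_dirs/faster_rcnn_r50_fpn_1x_coco --port=6007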

See Also