# Training

This page covers training workflows.
## Training Configuration

Training in this framework is controlled through configuration files. See the Configuration guide for details.
## Single GPU Training

To train a model on a single GPU:
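The exact entry point is not shown on this page; a typical invocation, assuming a `tools/train.py` script as in MMCV-style frameworks, looks like:

```shell
# CONFIG_FILE is the path to your training config, e.g. configs/my_config.py.
# tools/train.py is an assumed entry point; adjust to your repository layout.
python tools/train.py ${CONFIG_FILE}
```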
### Optional Arguments

- `--work-dir ${WORK_DIR}`: Override the working directory specified in the config file.
- `--resume-from ${CHECKPOINT_FILE}`: Resume training from a previous checkpoint file.
- `--no-validate`: Disable evaluation of the checkpoint during training.
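Putting these together, a hypothetical command that resumes from a checkpoint and writes to a custom working directory (paths and the `tools/train.py` entry point are placeholders):

```shell
# Resume an interrupted run, keeping all outputs in one directory.
python tools/train.py ${CONFIG_FILE} \
    --work-dir ./work_dirs/my_experiment \
    --resume-from ./work_dirs/my_experiment/latest.pth
```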
## Multi-GPU Training

This framework supports distributed training on multiple GPUs using `torch.distributed.launch` or Slurm.
### Using torch.distributed.launch

Example with 8 GPUs:
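A sketch of the launch command, assuming the training script accepts a `--launcher pytorch` flag as MMCV-style frameworks do:

```shell
# Spawn 8 worker processes, one per GPU, on a single node.
python -m torch.distributed.launch --nproc_per_node=8 \
    tools/train.py ${CONFIG_FILE} --launcher pytorch
```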
### Using Slurm

If you run on a cluster managed with Slurm:
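A sketch using `srun` (the partition name, job name, and GPU counts are placeholders, and the `--launcher slurm` flag is an assumption based on MMCV-style frameworks):

```shell
# Request 8 GPUs on one node and launch one task per GPU.
srun -p ${PARTITION} --job-name=train \
    --gres=gpu:8 --ntasks=8 --ntasks-per-node=8 \
    python -u tools/train.py ${CONFIG_FILE} --launcher slurm
```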
## Training Tips
### Learning Rate

The default learning rate in the config files assumes 8 GPUs. If you train with a different number of GPUs (and therefore a different total batch size), scale the learning rate accordingly, commonly in linear proportion to the total batch size.
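The linear scaling rule can be sketched as follows; the base values below are illustrative, not taken from any particular config:

```python
def scale_lr(base_lr: float, base_gpus: int, gpus: int) -> float:
    """Scale the learning rate linearly with the number of GPUs.

    Assumes the per-GPU batch size stays fixed, so the total batch
    size is proportional to the GPU count.
    """
    return base_lr * gpus / base_gpus

# Example: a config tuned for 8 GPUs with lr=0.02, run on 4 GPUs.
print(scale_lr(0.02, 8, 4))  # 0.01
```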
### Mixed Precision Training

You can enable automatic mixed precision training by adding `--fp16` to the training command:
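For example (again assuming a `tools/train.py` entry point):

```shell
# Train with automatic mixed precision enabled.
python tools/train.py ${CONFIG_FILE} --fp16
```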
## Monitoring Training

### TensorBoard

This framework supports TensorBoard for monitoring training progress. To use it, add the following to your config file:
```python
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        dict(type='TensorboardLoggerHook')
    ])
```
Then run:
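The command to run was not shown here; a typical invocation points TensorBoard at the directory that contains the event files (the exact subdirectory depends on your `work_dir` setting):

```shell
# Serve the training curves at http://localhost:6006 by default.
tensorboard --logdir=${WORK_DIR}
```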