Reproducible ImageNet training with Ignite

This example provides a script and tools for running reproducible experiments that train neural networks on the ImageNet dataset.

Features:


There are three possible options: 1) experiment tracking with MLflow, 2) experiment tracking with Polyaxon, or 3) experiment tracking with TRAINS.

Experiment tracking with TRAINS or MLflow is better suited to a local machine with GPU(s). Experiment tracking with Polyaxon requires Polyaxon to be installed on a machine, cluster, or cloud, and experiments are then scheduled with polyaxon-cli. Choose one option and skip the descriptions of the others.

Implementation details

Files tree description:

code
  |___ dataflow : module that provides data loaders and various transformers
  |___ scripts : executable training script
  |___ utils : other helper modules

configs
  |___ train : training python configuration files  
  
experiments 
  |___ mlflow : MLflow related files
  |___ plx : Polyaxon related files
  |___ trains : requirements.txt to install Trains python package
 
notebooks : jupyter notebooks to check specific parts from code modules 

Code and configs

We use py_config_runner package to execute python scripts with python configuration files.

Training script

The training script is located in code/scripts and contains

  • training.py, a single training script that can use one of the MLflow / Polyaxon / Trains experiment tracking systems.

The training script contains the run method that py_config_runner requires in order to run a script with a configuration.
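The idea of running a script with a Python configuration file can be sketched with the standard library alone. This is a minimal illustration, not py_config_runner's actual implementation: the helper load_config, the file name baseline_config.py, the config attributes, and the exact run signature are all assumptions made for the example.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_config(path):
    """Import a Python configuration file as a module object (hypothetical helper)."""
    spec = importlib.util.spec_from_file_location("config", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def run(config, **kwargs):
    """Entry point of the script: receives the loaded configuration.

    The signature is an assumption for this sketch.
    """
    return f"training for {config.num_epochs} epochs on {config.device}"


# Write a tiny, hypothetical configuration file and run the script with it.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "baseline_config.py"
    config_path.write_text("num_epochs = 5\ndevice = 'cuda'\n")
    config = load_config(config_path)
    message = run(config)

print(message)
```

Because the configuration is ordinary Python, it can build real objects (models, optimizers, data loaders) rather than only listing scalar values.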

The training script and the Python configuration file divide responsibilities as follows. The configuration file, itself a Python script, defines the components needed for neural network training:

  • Dataflow: training/validation/train evaluation data loaders with custom data augmentations
  • Model
  • Optimizer
  • Criterion
  • LR scheduler
  • other parameters: device, number of epochs, etc
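Since the configuration must supply every component in the list above, a script could check for them before training starts. The following is a minimal sketch; the attribute names in REQUIRED_COMPONENTS and the function missing_components are hypothetical and may not match the names used in the actual configuration files.

```python
from types import SimpleNamespace

# Hypothetical attribute names, one per component in the list above.
REQUIRED_COMPONENTS = (
    "train_loader",
    "val_loader",
    "train_eval_loader",
    "model",
    "optimizer",
    "criterion",
    "lr_scheduler",
    "device",
    "num_epochs",
)


def missing_components(config):
    """Return the required component names that the config does not define."""
    return [name for name in REQUIRED_COMPONENTS if not hasattr(config, name)]


# Stand-in for a loaded configuration module; the values are placeholders.
config = SimpleNamespace(
    model="resnet50",
    optimizer="sgd",
    criterion="cross_entropy",
    device="cuda",
    num_epochs=5,
)

missing = missing_components(config)
print(missing)
```

Failing early on an incomplete configuration is cheaper than discovering a missing data loader after the model has been built and moved to the device.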

The training script uses these components to set up and run the training and validation loops. By default, a process group with the "nccl" backend is initialized for the distributed configuration (even for a single GPU).

The training script is generic, uses the ignite.distributed API, and adapts the training components to the provided distributed configuration (e.g. it uses the DistributedDataParallel model wrapper, uses distributed sampling, scales the batch size, etc.).
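The arithmetic behind batch-size scaling and distributed sampling can be sketched in plain Python. This is an illustration under stated assumptions, not the ignite.distributed code: the function names are hypothetical, the total batch size is assumed to be split evenly across processes, and the sampling sketch ignores shuffling and padding.

```python
def per_process_settings(total_batch_size, num_workers, world_size, nproc_per_node):
    """Split a total batch size and a worker count across processes (hypothetical helper)."""
    batch_size = total_batch_size
    if total_batch_size >= world_size:
        # Each process handles an equal share of the total batch.
        batch_size = total_batch_size // world_size
    # Divide data loading workers among the processes on one node, rounding up.
    workers = (num_workers + nproc_per_node - 1) // nproc_per_node
    return batch_size, workers


def shard_indices(num_samples, rank, world_size):
    """Indices seen by one process when samples are dealt out round-robin."""
    return list(range(rank, num_samples, world_size))


# Two processes on one node, e.g. one per GPU.
settings = per_process_settings(total_batch_size=256, num_workers=10,
                                world_size=2, nproc_per_node=2)
shard = shard_indices(num_samples=10, rank=1, world_size=2)
print(settings, shard)
```

With this split, the configuration can state a single total batch size, and the effective batch per optimization step stays the same whether training runs on one GPU or several.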

Configurations

Results

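| Model     | Training Top-1 Accuracy | Training Top-5 Accuracy | Test Top-1 Accuracy | Test Top-5 Accuracy |
|-----------|-------------------------|-------------------------|---------------------|---------------------|
| ResNet-50 | 78%                     | 92%                     | 77%                 | 94%                 |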
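For reference, Top-k accuracy counts a prediction as correct when the true class is among the k highest-scoring classes. The sketch below uses made-up scores for a three-class problem; the function top_k_accuracy is written for this illustration and is not the metric implementation used by the training script.

```python
def top_k_accuracy(scores, targets, k):
    """Fraction of samples whose target is among the k highest-scoring classes."""
    correct = 0
    for row, target in zip(scores, targets):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        correct += target in top_k
    return correct / len(targets)


# Made-up class scores for four samples and three classes.
scores = [
    [0.10, 0.70, 0.20],
    [0.50, 0.30, 0.20],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
]
targets = [1, 1, 0, 0]

top1 = top_k_accuracy(scores, targets, k=1)
top2 = top_k_accuracy(scores, targets, k=2)
print(top1, top2)
```

Top-k accuracy can only grow as k grows, which is why the Top-5 columns are higher than the Top-1 columns.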

Acknowledgements

Part of the training was done within the Tesla GPU Test Drive on 2 Nvidia V100 GPUs.
