From 9ea10a68861576fbbe428c8d64f3ca435e0cd0fc Mon Sep 17 00:00:00 2001 From: Indu Date: Mon, 14 May 2018 23:55:08 +0000 Subject: [PATCH 01/20] First draft --- example/multihost_training/README.md | 231 +++++++++++++++++++++++++++ 1 file changed, 231 insertions(+) create mode 100644 example/multihost_training/README.md diff --git a/example/multihost_training/README.md b/example/multihost_training/README.md new file mode 100644 index 000000000000..dc9d80642562 --- /dev/null +++ b/example/multihost_training/README.md @@ -0,0 +1,231 @@ +# Distributed Training with Gluon + +Deep learning models are usually trained using GPUs because GPUs can do a lot more computations in parallel that CPUs. But even with the modern GPUs, big models could take several days to train. Training can be done faster by using multiple GPUs like described in this tutorial. However only a certain number of GPUs can be attached to one host. To make the training even faster, we need to use multiple GPUs attached to multiple hosts. + +In this tutorial, we will show how to train a model faster using multiple GPUs connected to multiple hosts. + +[pic of multiple GPUs connected to multiple hosts] + +We will use data parallelism to distribute the training which involves splitting the training data across hosts and GPUs. Since the hosts are working with different subset of the training data in parallel, the training completes lot faster. + +In this tutorial, we will train a LeNet network using MNIST data using two hosts each having four GPUs. + +## Distributed Training Architecture: + +Training models using multiple hosts and GPUs involves working with three different types of processes - worker, parameter server and scheduler. + +[pic: Distributed Training Architecture] + +### Parameter Server: +The parameters of the model needs to be shared with all hosts since multiple hosts are working together to train one model. To make this sharing efficient, the parameters are split across multiple hosts. 
A parameter server in each host stores a subset of parameters. At the end of every iteration, each host communicates with every other host to update all parameters of the model. + +### Worker: +Each host has a worker process which in each iteration fetches a batch of data, runs forward and backward pass on all GPUs the host, computes the parameter updates and sends those updates to the parameter servers in each host. Since we have multiple workers to train the model, each worker only needs to train using 1/N part of the training data where N is the number of workers (which is same as the number of hosts). + +### Scheduler: +Scheduler is responsible for scheduling the workers and parameter servers. There is only one scheduler in the entire cluster. + +## Moving to distributed training: + +In this section, we will explain the changes that needs to be done to convert a single-host-single-GPU training script to a multi-host-multi-GPU training script. + +### Step 1: Use a distributed key-value store: + +Like mentioned above, in distributed training, parameters are split into N parts and stored across N hosts. This is done automatically by the distributed key-value store. User only needs to create the distributed kv store and ask the Trainer to use the created store. + +``` +store = mxnet.kv.create('dist') +``` + +It is the job of the trainer to take the gradients computed in the backward pass and update the parameters of the model. We'll tell the trainer to store and update the parameters in the distributed kv store we just created instead of doing it in GPU of CPU memory. For example, + +``` +trainer = gluon.Trainer(net.collect_params(), + 'sgd', {'learning_rate': .1}, + kvstore=store) +``` + +## Step 2: Split the training data: + +In distributed training using data parallelism, training data is split into equal parts across all workers and each worker uses its subset of the training data for training. 
For example, if we had two machines, each running a worker, each worker managing four GPU's we'll split the data like shown below. Note that we don't split the data depending on the number of GPUs but split it depending on the number of workers. + +[img: splitting data] + +Each worker can find out the total number of workers in the cluster and its own rank which is an integer between 0 and N-1 where N is the number of workers. + +``` +store = kv.create('dist') +print("Total number of workers: %d" % store.num_workers) +print("This worker's rank: %d" % store.rank) +``` + +Knowing the number of workers and a particular worker's rank, it is easy to split the dataset into partitions and pick one partition to train depending on the rank of the worker. Here is a sampler that does exactly that. + +``` +class SplitSampler(gluon.data.sampler.Sampler): + """ Split the dataset into `num_parts` parts and sample from the part with index `part_index` + Parameters + ---------- + length: int + Number of examples in the dataset + num_parts: int + Partition the data into multiple parts + part_index: int + The index of the part to read from + """ + def __init__(self, length, num_parts=1, part_index=0): + # Compute the length of each partition + self.part_len = length // num_parts + # Compute the start index for this partition + self.start = self.part_len * part_index + # Compute the end index for this partition + self.end = self.start + self.part_len + + def __iter__(self): + # Extract examples between `start` and `end`, shuffle and return them. 
+ indices = list(range(self.start, self.end)) + random.shuffle(indices) + return iter(indices) + + def __len__(self): + return self.part_len +``` + +We can then create a DataLoader using the SplitSampler like shown below: + +``` +# Load the training data +train_data = gluon.data.DataLoader( + gluon.data.vision.MNIST(train=True, transform=transform), + batch_size, + sampler=SplitSampler(60000, store.num_workers, store.rank)) +``` + +## Step 3: Training with multiple GPUs + +Note that we didn't split the dataset by the number of GPUs. We split it by the number of workers which usually translates to number of machines. It is the worker's responsibility to split the partition it has across multiple GPUs it might have and run the training in parallel across multiple GPUs. + +First we need to specify the list of GPUs we want to use for training: + +``` +ctx = [mx.gpu(i) for i in range(gpus_per_machine)] +``` + +We can then train a batch like shown below: + +``` +# Train a batch using multiple GPUs +def train_batch(batch, ctx, net, trainer): + + # Split and load data into multiple GPUs + data = batch[0] + data = gluon.utils.split_and_load(data, ctx) + + # Split and load label into multiple GPUs + label = batch[1] + label = gluon.utils.split_and_load(label, ctx) + + # Run the forward and backward pass + forward_backward(net, data, label) + + # Update the parameters + this_batch_size = batch[0].shape[0] + trainer.step(this_batch_size) +``` + +Here is the code that runs the forward (computing loss) and backward (computing gradients) pass on multiple GPUs: + +``` +# We'll use cross entropy loss since we are doing multiclass classification +loss = gluon.loss.SoftmaxCrossEntropyLoss() + +# Run one forward and backward pass on multiple GPUs +def forward_backward(net, data, label): + + # Ask autograd to remember the forward pass + with autograd.record(): + # Compute the loss on all GPUs + losses = [loss(net(X), Y) for X, Y in zip(data, label)] + + # Run the backward pass 
(calculate gradients) on all GPUs + for l in losses: + l.backward() +``` + +Given ‘train_batch’, training an epoch is simple: + +``` +for batch in train_data: + # Train the batch using multiple GPUs + train_batch(batch, ctx, net, trainer) +``` + +## Final Step: Launching the distributed training + +Note that there are several processes that needs to be launched on multiple machines to do distributed training. One worker and one parameter server needs to be launched on each of the machine. Scheduler needs to be launched on one of the machines. While this can be done manually, MXNet provides the launch.py tool to do this easily. + +For example, the following command launches distributed training on two machines: + +``` +python ~/mxnet/tools/launch.py -n 2 -s 2 -H hosts \ + --sync-dst-dir /home/ubuntu/dist \ + --launcher ssh \ + "python /home/ubuntu/dist/dist.py" +``` + +- `-n 2` specifies the number of workers that must be launched +- `-s 2` specifies the number of parameter servers that must be launched. +- `--sync-dst-dir` specifies a destination location where the contents of the current directory with be rsync'd +- `--launcher ssh` tells launch.py to use ssh to login to each machine in the cluster and launch processes. +- `"python /home/ubuntu/dist/dist.py"` is the command that will get executed in each of the launched processes. +- Finally, `-H hosts` specifies the list of hosts in the cluster to be used for distributed training. + +Let's take a look at the `hosts` file. + +``` +~/dist$ cat hosts +d1 +d2 +``` + +'d1' and 'd2' are the hostnames of the machines I want to use for distributed training. 'launch.py' should be able to ssh into these machines by providing just the hostname on the command line. 
For example: + +``` +~/dist$ ssh d1 +Welcome to Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-1049-aws x86_64) + + * Documentation: https://help.ubuntu.com + * Management: https://landscape.canonical.com + * Support: https://ubuntu.com/advantage + + Get cloud support with Ubuntu Advantage Cloud Guest: + http://www.ubuntu.com/business/services/cloud + +0 packages can be updated. +0 updates are security updates. + + +Last login: Wed Jan 31 18:06:45 2018 from 72.21.198.67 +``` + +Note that I did not have to provide any kind of authentication to login to the machine. This can be done through multiple methods. One easy way is to specify the ssh certificates in ~/.ssh/config. Example: + +``` +~$ cat ~/.ssh/config +Host d1 + HostName ec2-34-201-108-233.compute-1.amazonaws.com + port 22 + user ubuntu + IdentityFile /home/ubuntu/test.pem + IdentitiesOnly yes + +Host d2 + HostName ec2-34-238-232-97.compute-1.amazonaws.com + port 22 + user ubuntu + IdentityFile /home/ubuntu/test.pem + IdentitiesOnly yes +``` + +A better way is to use ssh agent forwarding. Check this article for more details. + From 9723c3f6505f079ccdb7f9f01c3380b97fc6e312 Mon Sep 17 00:00:00 2001 From: Indu Date: Tue, 15 May 2018 04:55:09 +0000 Subject: [PATCH 02/20] Python syntax highlighting --- example/multihost_training/README.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/example/multihost_training/README.md b/example/multihost_training/README.md index dc9d80642562..6ada3204e212 100644 --- a/example/multihost_training/README.md +++ b/example/multihost_training/README.md @@ -33,13 +33,13 @@ In this section, we will explain the changes that needs to be done to convert a Like mentioned above, in distributed training, parameters are split into N parts and stored across N hosts. This is done automatically by the distributed key-value store. User only needs to create the distributed kv store and ask the Trainer to use the created store. 
-``` +```python store = mxnet.kv.create('dist') ``` It is the job of the trainer to take the gradients computed in the backward pass and update the parameters of the model. We'll tell the trainer to store and update the parameters in the distributed kv store we just created instead of doing it in GPU of CPU memory. For example, -``` +```python trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .1}, kvstore=store) @@ -53,7 +53,7 @@ In distributed training using data parallelism, training data is split into equa Each worker can find out the total number of workers in the cluster and its own rank which is an integer between 0 and N-1 where N is the number of workers. -``` +```python store = kv.create('dist') print("Total number of workers: %d" % store.num_workers) print("This worker's rank: %d" % store.rank) @@ -61,7 +61,7 @@ print("This worker's rank: %d" % store.rank) Knowing the number of workers and a particular worker's rank, it is easy to split the dataset into partitions and pick one partition to train depending on the rank of the worker. Here is a sampler that does exactly that. -``` +```python class SplitSampler(gluon.data.sampler.Sampler): """ Split the dataset into `num_parts` parts and sample from the part with index `part_index` Parameters @@ -93,7 +93,7 @@ class SplitSampler(gluon.data.sampler.Sampler): We can then create a DataLoader using the SplitSampler like shown below: -``` +```python # Load the training data train_data = gluon.data.DataLoader( gluon.data.vision.MNIST(train=True, transform=transform), @@ -107,13 +107,13 @@ Note that we didn't split the dataset by the number of GPUs. 
We split it by the First we need to specify the list of GPUs we want to use for training: -``` +```python ctx = [mx.gpu(i) for i in range(gpus_per_machine)] ``` We can then train a batch like shown below: -``` +```python # Train a batch using multiple GPUs def train_batch(batch, ctx, net, trainer): @@ -135,7 +135,7 @@ def train_batch(batch, ctx, net, trainer): Here is the code that runs the forward (computing loss) and backward (computing gradients) pass on multiple GPUs: -``` +```python # We'll use cross entropy loss since we are doing multiclass classification loss = gluon.loss.SoftmaxCrossEntropyLoss() @@ -154,7 +154,7 @@ def forward_backward(net, data, label): Given ‘train_batch’, training an epoch is simple: -``` +```python for batch in train_data: # Train the batch using multiple GPUs train_batch(batch, ctx, net, trainer) From 84b4417387155c8c459931f2352b79c7adeca97a Mon Sep 17 00:00:00 2001 From: Indu Date: Tue, 15 May 2018 06:15:30 +0000 Subject: [PATCH 03/20] Polishing --- example/multihost_training/README.md | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/example/multihost_training/README.md b/example/multihost_training/README.md index 6ada3204e212..3589d136b0ee 100644 --- a/example/multihost_training/README.md +++ b/example/multihost_training/README.md @@ -1,8 +1,8 @@ # Distributed Training with Gluon -Deep learning models are usually trained using GPUs because GPUs can do a lot more computations in parallel that CPUs. But even with the modern GPUs, big models could take several days to train. Training can be done faster by using multiple GPUs like described in this tutorial. However only a certain number of GPUs can be attached to one host. To make the training even faster, we need to use multiple GPUs attached to multiple hosts. +Deep learning models are usually trained using GPUs because GPUs can do a lot more computations in parallel that CPUs. 
But even with the modern GPUs, big models could take several days to train. Training can be done faster by using multiple GPUs like described in [this](https://gluon.mxnet.io/chapter07_distributed-learning/multiple-gpus-gluon.html) tutorial. However only a certain number of GPUs can be attached to one host (typically 8 or 16). To make the training even faster, we can use multiple GPUs attached to multiple hosts. -In this tutorial, we will show how to train a model faster using multiple GPUs connected to multiple hosts. +In this tutorial, we will show how to train a model faster using multihost distributed training. [pic of multiple GPUs connected to multiple hosts] @@ -12,7 +12,7 @@ In this tutorial, we will train a LeNet network using MNIST data using two hosts ## Distributed Training Architecture: -Training models using multiple hosts and GPUs involves working with three different types of processes - worker, parameter server and scheduler. +Multihost distributed training involves working with three different types of processes - worker, parameter server and scheduler. [pic: Distributed Training Architecture] @@ -20,7 +20,7 @@ Training models using multiple hosts and GPUs involves working with three differ The parameters of the model needs to be shared with all hosts since multiple hosts are working together to train one model. To make this sharing efficient, the parameters are split across multiple hosts. A parameter server in each host stores a subset of parameters. At the end of every iteration, each host communicates with every other host to update all parameters of the model. ### Worker: -Each host has a worker process which in each iteration fetches a batch of data, runs forward and backward pass on all GPUs the host, computes the parameter updates and sends those updates to the parameter servers in each host. 
Since we have multiple workers to train the model, each worker only needs to train using 1/N part of the training data where N is the number of workers (which is same as the number of hosts). +Each host has a worker process which in each iteration fetches a batch of data, runs forward and backward pass on all GPUs in the host, computes the parameter updates and sends those updates to the parameter servers in each host. Since we have multiple workers to train the model, each worker only needs to train using 1/N part of the training data where N is the number of workers (which is same as the number of hosts). ### Scheduler: Scheduler is responsible for scheduling the workers and parameter servers. There is only one scheduler in the entire cluster. @@ -31,7 +31,7 @@ In this section, we will explain the changes that needs to be done to convert a ### Step 1: Use a distributed key-value store: -Like mentioned above, in distributed training, parameters are split into N parts and stored across N hosts. This is done automatically by the distributed key-value store. User only needs to create the distributed kv store and ask the Trainer to use the created store. +Like mentioned above, in distributed training, parameters are split into N parts and distributed across N hosts. This is done automatically by the distributed key-value store. User only needs to create the distributed kv store and ask the Trainer to use the created store. ```python store = mxnet.kv.create('dist') @@ -47,7 +47,7 @@ trainer = gluon.Trainer(net.collect_params(), ## Step 2: Split the training data: -In distributed training using data parallelism, training data is split into equal parts across all workers and each worker uses its subset of the training data for training. For example, if we had two machines, each running a worker, each worker managing four GPU's we'll split the data like shown below. 
Note that we don't split the data depending on the number of GPUs but split it depending on the number of workers. +In distributed training (using data parallelism), training data is split into equal parts across all workers and each worker uses its subset of the training data for training. For example, if we had two machines, each running a worker, each worker managing four GPUs we'll split the data like shown below. Note that we don't split the data depending on the number of GPUs but split it depending on the number of workers. [img: splitting data] @@ -91,7 +91,7 @@ class SplitSampler(gluon.data.sampler.Sampler): return self.part_len ``` -We can then create a DataLoader using the SplitSampler like shown below: +We can then create a `DataLoader` using the `SplitSampler` like shown below: ```python # Load the training data @@ -105,7 +105,7 @@ train_data = gluon.data.DataLoader( Note that we didn't split the dataset by the number of GPUs. We split it by the number of workers which usually translates to number of machines. It is the worker's responsibility to split the partition it has across multiple GPUs it might have and run the training in parallel across multiple GPUs. -First we need to specify the list of GPUs we want to use for training: +To train with multiple GPUs, we first need to specify the list of GPUs we want to use for training: ```python ctx = [mx.gpu(i) for i in range(gpus_per_machine)] @@ -152,7 +152,7 @@ def forward_backward(net, data, label): l.backward() ``` -Given ‘train_batch’, training an epoch is simple: +Given `train_batch`, training an epoch is simple: ```python for batch in train_data: @@ -162,7 +162,7 @@ for batch in train_data: ## Final Step: Launching the distributed training -Note that there are several processes that needs to be launched on multiple machines to do distributed training. One worker and one parameter server needs to be launched on each of the machine. Scheduler needs to be launched on one of the machines. 
While this can be done manually, MXNet provides the launch.py tool to do this easily. +Note that there are several processes that needs to be launched on multiple machines to do distributed training. One worker and one parameter server needs to be launched on each host. Scheduler needs to be launched on one of the hosts. While this can be done manually, MXNet provides the `launch.py` tool to make this easy. For example, the following command launches distributed training on two machines: @@ -176,7 +176,7 @@ python ~/mxnet/tools/launch.py -n 2 -s 2 -H hosts \ - `-n 2` specifies the number of workers that must be launched - `-s 2` specifies the number of parameter servers that must be launched. - `--sync-dst-dir` specifies a destination location where the contents of the current directory with be rsync'd -- `--launcher ssh` tells launch.py to use ssh to login to each machine in the cluster and launch processes. +- `--launcher ssh` tells `launch.py` to use ssh to login to each machine in the cluster and launch processes. - `"python /home/ubuntu/dist/dist.py"` is the command that will get executed in each of the launched processes. - Finally, `-H hosts` specifies the list of hosts in the cluster to be used for distributed training. @@ -208,7 +208,7 @@ Welcome to Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-1049-aws x86_64) Last login: Wed Jan 31 18:06:45 2018 from 72.21.198.67 ``` -Note that I did not have to provide any kind of authentication to login to the machine. This can be done through multiple methods. One easy way is to specify the ssh certificates in ~/.ssh/config. Example: +Note that no authentication information was provided to login to the host. This can be done using multiple methods. One easy way is to specify the ssh certificates in ~/.ssh/config. Example: ``` ~$ cat ~/.ssh/config @@ -227,5 +227,5 @@ Host d2 IdentitiesOnly yes ``` -A better way is to use ssh agent forwarding. Check this article for more details. +A better way is to use ssh agent forwarding. 
Check [this](https://aws.amazon.com/blogs/security/securely-connect-to-linux-instances-running-in-a-private-amazon-vpc/) article for more details. From a4f0c968d33a70071ba6ce3fd18de76b0d59c493 Mon Sep 17 00:00:00 2001 From: Indu Date: Tue, 15 May 2018 08:19:26 +0000 Subject: [PATCH 04/20] Add distributed MNIST --- example/multihost_training/dist_mnist.py | 176 +++++++++++++++++++++++ 1 file changed, 176 insertions(+) create mode 100644 example/multihost_training/dist_mnist.py diff --git a/example/multihost_training/dist_mnist.py b/example/multihost_training/dist_mnist.py new file mode 100644 index 000000000000..1f8dd436ea6d --- /dev/null +++ b/example/multihost_training/dist_mnist.py @@ -0,0 +1,176 @@ +from __future__ import print_function +import numpy as np +import mxnet as mx +from mxnet import nd, autograd, gluon +from mxnet import kv +import random + +# Create a distributed key-value store +store = kv.create('dist') + +# MNIST images are 28x28. Total pixels in input layer is 28x28 = 784 +num_inputs = 784 +# Classify the images into one of the 10 digits +num_outputs = 10 + +# 64 images per GPU in a batch +batch_size_per_gpu = 64 +# How many epochs to run the training +epochs = 2 + +# How many GPUs per machine +gpus_per_machine = 1 +# Effective batch size across all GPUs +batch_size = batch_size_per_gpu * gpus_per_machine + +# Create the context (a list of all GPUs to be used for training) +ctx = [mx.gpu(i) for i in range(gpus_per_machine)] + +# Convert to float 32 +# Having channel as the first dimension makes computation more efficient. Hence the (2,0,1) transpose. 
+# Dividing by 255 normalizes the input between 0 and 1 +def transform(data, label): + return nd.transpose(data.astype(np.float32), (2,0,1))/255, label.astype(np.float32) + +class SplitSampler(gluon.data.sampler.Sampler): + """ Split the dataset into `num_parts` parts and sample from the part with index `part_index` + + Parameters + ---------- + length: int + Number of examples in the dataset + num_parts: int + Partition the data into multiple parts + part_index: int + The index of the part to read from + """ + def __init__(self, length, num_parts=1, part_index=0): + # Compute the length of each partition + self.part_len = length // num_parts + # Compute the start index for this partition + self.start = self.part_len * part_index + # Compute the end index for this partition + self.end = self.start + self.part_len + + def __iter__(self): + # Extract examples between `start` and `end`, shuffle and return them. + indices = list(range(self.start, self.end)) + random.shuffle(indices) + return iter(indices) + + def __len__(self): + return self.part_len + +# Load the training data +train_data = gluon.data.DataLoader(gluon.data.vision.MNIST(train=True, transform=transform), + batch_size, sampler=SplitSampler(60000, store.num_workers, store.rank)) +# Load the test data +test_data = gluon.data.DataLoader(gluon.data.vision.MNIST(train=False, transform=transform), + batch_size, shuffle=False) + +# Create a sequential network +net = gluon.nn.Sequential() + +with net.name_scope(): + + # First convolution + net.add(gluon.nn.Conv2D(channels=20, kernel_size=5, activation='relu')) + net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2)) + + # Second convolution + net.add(gluon.nn.Conv2D(channels=50, kernel_size=5, activation='relu')) + net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2)) + + # Flatten the output before the fully connected layers + net.add(gluon.nn.Flatten()) + + # First fully connected layer with 512 neurons + net.add(gluon.nn.Dense(512, activation="relu")) + + # 
Second fully connected layer with as many neurons as the number of classes + net.add(gluon.nn.Dense(num_outputs)) + +# Initialize the parameters with Xavier initializer +net.collect_params().initialize(mx.init.Xavier(), ctx=ctx) + +# SoftmaxCrossEntropy is the most common choice of loss function for multiclass classification +softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss() + +# Use SGD optimizer with a learning rate of 0.1 +trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .1}, kvstore=store) + +# Evaluate accuracy of the given network using the given data +def evaluate_accuracy(data_iterator, net): + + acc = mx.metric.Accuracy() + + # Iterate through data and label + for i, (data, label) in enumerate(data_iterator): + + # Get the data and label into the GPU + data = data.as_in_context(ctx[0]) + label = label.as_in_context(ctx[0]) + + # Get network's output, which is a vector of unnormalized class scores (logits) + # Apply argmax on the class scores to get network's classification. 
+ output = net(data) + predictions = nd.argmax(output, axis=1) + + # Give network's prediction and the correct label to update the metric + acc.update(preds=predictions, labels=label) + + # Return the accuracy + return acc.get()[1] + +# We'll use cross entropy loss since we are doing multiclass classification +loss = gluon.loss.SoftmaxCrossEntropyLoss() + +# Run one forward and backward pass on multiple GPUs +def forward_backward(net, data, label): + + # Ask autograd to remember the forward pass + with autograd.record(): + # Compute the loss on all GPUs + losses = [loss(net(X), Y) for X, Y in zip(data, label)] + + # Run the backward pass (calculate gradients) on all GPUs + for l in losses: + l.backward() + +# Train a batch using multiple GPUs +def train_batch(batch, ctx, net, trainer): + + # Split and load data into multiple GPUs + data = batch[0] + data = gluon.utils.split_and_load(data, ctx) + + # Split and load label into multiple GPUs + label = batch[1] + label = gluon.utils.split_and_load(label, ctx) + + # Run the forward and backward pass + forward_backward(net, data, label) + + # Update the parameters + this_batch_size = batch[0].shape[0] + trainer.step(this_batch_size) + +# Run as many epochs as required +for epoch in range(epochs): + + # Iterate through batches and run training using multiple GPUs + batch_num = 1 + for batch in train_data: + + # Print progress once in a while + if batch_num % 50 == 0: + print("Worker %d processing batch %d" % (store.rank, batch_num)) + + # Train the batch using multiple GPUs + train_batch(batch, ctx, net, trainer) + + batch_num += 1 + + # Print test accuracy after every epoch + test_accuracy = evaluate_accuracy(test_data, net) + print("Epoch %d: Test_acc %f" % (epoch, test_accuracy)) From 286cbbd09e7c32cb8c45d5dfcb4bf8413d3fb3ab Mon Sep 17 00:00:00 2001 From: Indu Date: Tue, 15 May 2018 08:25:43 +0000 Subject: [PATCH 05/20] rename --- example/multihost_training/{dist_mnist.py => mnist_dist.py} | 0 1 file changed, 0 
insertions(+), 0 deletions(-) rename example/multihost_training/{dist_mnist.py => mnist_dist.py} (100%) diff --git a/example/multihost_training/dist_mnist.py b/example/multihost_training/mnist_dist.py similarity index 100% rename from example/multihost_training/dist_mnist.py rename to example/multihost_training/mnist_dist.py From 90f1ca3d390da606bafd246af5677df14f9e5874 Mon Sep 17 00:00:00 2001 From: Indu Date: Tue, 15 May 2018 17:32:34 +0000 Subject: [PATCH 06/20] Polishing --- example/multihost_training/README.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/example/multihost_training/README.md b/example/multihost_training/README.md index 3589d136b0ee..a60cfb35e50a 100644 --- a/example/multihost_training/README.md +++ b/example/multihost_training/README.md @@ -1,6 +1,6 @@ # Distributed Training with Gluon -Deep learning models are usually trained using GPUs because GPUs can do a lot more computations in parallel that CPUs. But even with the modern GPUs, big models could take several days to train. Training can be done faster by using multiple GPUs like described in [this](https://gluon.mxnet.io/chapter07_distributed-learning/multiple-gpus-gluon.html) tutorial. However only a certain number of GPUs can be attached to one host (typically 8 or 16). To make the training even faster, we can use multiple GPUs attached to multiple hosts. +Deep learning models are usually trained using GPUs because GPUs can do a lot more computations in parallel than CPUs. But even with modern GPUs, it could take several days to train big models. Training can be done faster by using multiple GPUs as described in [this](https://gluon.mxnet.io/chapter07_distributed-learning/multiple-gpus-gluon.html) tutorial. However, only a certain number of GPUs can be attached to one host (typically 8 or 16). To make the training even faster, we can use multiple GPUs attached to multiple hosts. 
In this tutorial, we will show how to train a model faster using multihost distributed training. @@ -162,15 +162,15 @@ for batch in train_data: ## Final Step: Launching the distributed training -Note that there are several processes that needs to be launched on multiple machines to do distributed training. One worker and one parameter server needs to be launched on each host. Scheduler needs to be launched on one of the hosts. While this can be done manually, MXNet provides the `launch.py` tool to make this easy. +Note that there are several processes that need to be launched on multiple machines to do distributed training. One worker and one parameter server need to be launched on each host. The scheduler needs to be launched on one of the hosts. While this can be done manually, MXNet provides the [`launch.py`](https://github.com/apache/incubator-mxnet/blob/master/tools/launch.py) tool to make this easy. For example, the following command launches distributed training on two machines: ``` python ~/mxnet/tools/launch.py -n 2 -s 2 -H hosts \ - --sync-dst-dir /home/ubuntu/dist \ + --sync-dst-dir /home/ubuntu/mnist_dist \ --launcher ssh \ - "python /home/ubuntu/dist/dist.py" + "python /home/ubuntu/mnist_dist/mnist_dist.py" ``` - `-n 2` specifies the number of workers that must be launched @@ -188,7 +188,7 @@ d1 d2 ``` -'d1' and 'd2' are the hostnames of the machines I want to use for distributed training. 'launch.py' should be able to ssh into these machines by providing just the hostname on the command line. For example: +'d1' and 'd2' are the hostnames of the hosts we want to use for distributed training. `launch.py` should be able to ssh into these hosts by providing just the hostname on the command line. For example: ``` ~/dist$ ssh d1 @@ -208,7 +208,7 @@ Welcome to Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-1049-aws x86_64) Last login: Wed Jan 31 18:06:45 2018 from 72.21.198.67 ``` -Note that no authentication information was provided to login to the host. 
This can be done using multiple methods. One easy way is to specify the ssh certificates in ~/.ssh/config. Example: +Note that no authentication information was provided to login to the host. This can be done using multiple methods. One easy way is to specify the ssh certificates in `~/.ssh/config`. Example: ``` ~$ cat ~/.ssh/config @@ -216,14 +216,14 @@ Host d1 HostName ec2-34-201-108-233.compute-1.amazonaws.com port 22 user ubuntu - IdentityFile /home/ubuntu/test.pem + IdentityFile /home/ubuntu/my_key.pem IdentitiesOnly yes Host d2 HostName ec2-34-238-232-97.compute-1.amazonaws.com port 22 user ubuntu - IdentityFile /home/ubuntu/test.pem + IdentityFile /home/ubuntu/my_key.pem IdentitiesOnly yes ``` From c7f06be21b12919d2c33955888bb8f394a06b780 Mon Sep 17 00:00:00 2001 From: Indu Date: Tue, 15 May 2018 17:36:02 +0000 Subject: [PATCH 07/20] Add images --- example/multihost_training/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/example/multihost_training/README.md b/example/multihost_training/README.md index a60cfb35e50a..3f14b6b1a892 100644 --- a/example/multihost_training/README.md +++ b/example/multihost_training/README.md @@ -4,7 +4,7 @@ Deep learning models are usually trained using GPUs because GPUs can do a lot mo In this tutorial, we will show how to train a model faster using multihost distributed training. -[pic of multiple GPUs connected to multiple hosts] +![Multiple GPUs connected to multiple hosts](distributed_training.svg) We will use data parallelism to distribute the training which involves splitting the training data across hosts and GPUs. Since the hosts are working with different subset of the training data in parallel, the training completes lot faster. @@ -14,7 +14,7 @@ In this tutorial, we will train a LeNet network using MNIST data using two hosts Multihost distributed training involves working with three different types of processes - worker, parameter server and scheduler. 
-[pic: Distributed Training Architecture] +![Distributed training architecture](dist_train_arch.png) ### Parameter Server: The parameters of the model needs to be shared with all hosts since multiple hosts are working together to train one model. To make this sharing efficient, the parameters are split across multiple hosts. A parameter server in each host stores a subset of parameters. At the end of every iteration, each host communicates with every other host to update all parameters of the model. @@ -49,7 +49,7 @@ trainer = gluon.Trainer(net.collect_params(), In distributed training (using data parallelism), training data is split into equal parts across all workers and each worker uses its subset of the training data for training. For example, if we had two machines, each running a worker, each worker managing four GPUs we'll split the data like shown below. Note that we don't split the data depending on the number of GPUs but split it depending on the number of workers. -[img: splitting data] +![Splitting data](split_data.png) Each worker can find out the total number of workers in the cluster and its own rank which is an integer between 0 and N-1 where N is the number of workers. 
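As an illustration of the rank-based split described in the hunk above, here is a minimal sketch in plain Python. `shard_bounds` is a hypothetical helper and is not taken from `mnist_dist.py`; in an MXNet script the two inputs would typically come from the distributed key-value store (`store.num_workers` and `store.rank`). Giving the leftover samples to the lowest ranks is an assumption made here so that no sample is dropped when the dataset size is not a multiple of N.

```python
def shard_bounds(num_samples, num_workers, rank):
    """Return the [start, end) index range of the training set owned by `rank`.

    Hypothetical helper for illustration; not part of MXNet or mnist_dist.py.
    """
    if not 0 <= rank < num_workers:
        raise ValueError("rank must be between 0 and num_workers - 1")
    base, extra = divmod(num_samples, num_workers)
    # Assumption: the first `extra` workers each take one additional sample.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


# Two workers (one per host) sharing the 60,000 MNIST training images.
for rank in range(2):
    print(rank, shard_bounds(60000, 2, rank))
```

The split depends only on the number of workers, not on the number of GPUs per host, which matches the note in the README text; each worker would then divide its own batches across its local GPUs.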
From 8a1f0517eb7b98eeffa35c3819f72323390f4596 Mon Sep 17 00:00:00 2001 From: Indu Date: Tue, 15 May 2018 17:37:23 +0000 Subject: [PATCH 08/20] Add images --- example/multihost_training/dist_train_arch.png | Bin 0 -> 20415 bytes .../multihost_training/distributed_training.svg | 3 +++ example/multihost_training/split_data.png | Bin 0 -> 15881 bytes 3 files changed, 3 insertions(+) create mode 100644 example/multihost_training/dist_train_arch.png create mode 100644 example/multihost_training/distributed_training.svg create mode 100644 example/multihost_training/split_data.png
diff --git a/example/multihost_training/dist_train_arch.png b/example/multihost_training/dist_train_arch.png new file mode 100644 index 0000000000000000000000000000000000000000..017a217e42bcf1bf38db482faaa38d4912396f2d GIT binary patch literal 20415 [base85-encoded PNG payload omitted]
diff --git a/example/multihost_training/distributed_training.svg b/example/multihost_training/distributed_training.svg new file mode 100644 --- /dev/null +++ b/example/multihost_training/distributed_training.svg @@ -0,0 +1,3 @@ +[SVG markup lost in extraction; surviving text: Produced by OmniGraffle 6.6.1, 2017-07-23 10:21:45 +0000; Canvas 1, Layer 1; four groups of GPU 0, GPU 1, GPU 2, GPU 3, PCIe Switch, CPU; Network Switch]
diff --git a/example/multihost_training/split_data.png b/example/multihost_training/split_data.png new file mode 100644 index 0000000000000000000000000000000000000000..55ae10b040c29b96d99a25d16b8e6f046077c6ee GIT binary patch literal 15881 [base85-encoded PNG payload omitted]
From 960eef5ac25612f0437104a9284a0142e024ba73 Mon Sep 17 00:00:00 2001 From: Indu Date: Tue, 15 May 2018 18:02:59 +0000 Subject: [PATCH 09/20] Link to the example python file. Minor edits. --- example/multihost_training/README.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/example/multihost_training/README.md b/example/multihost_training/README.md index 3f14b6b1a892..d300a1bbc527 100644 --- a/example/multihost_training/README.md +++ b/example/multihost_training/README.md @@ -1,4 +1,4 @@ -# Distributed Training with Gluon +# Distributed Training using Gluon Deep learning models are usually trained using GPUs because GPUs can do a lot more computations in parallel that CPUs. But even with the modern GPUs, it could take several days to train big models. Training can be done faster by using multiple GPUs like described in [this](https://gluon.mxnet.io/chapter07_distributed-learning/multiple-gpus-gluon.html) tutorial. However only a certain number of GPUs can be attached to one host (typically 8 or 16). To make the training even faster, we can use multiple GPUs attached to multiple hosts.
@@ -6,7 +6,7 @@ In this tutorial, we will show how to train a model faster using multihost distr ![Multiple GPUs connected to multiple hosts](distributed_training.svg) -We will use data parallelism to distribute the training which involves splitting the training data across hosts and GPUs. Since the hosts are working with different subset of the training data in parallel, the training completes lot faster. +We will use data parallelism to distribute the training which involves splitting the training data across GPUs attached to multiple hosts. Since the hosts are working with different subset of the training data in parallel, the training completes lot faster. In this tutorial, we will train a LeNet network using MNIST data using two hosts each having four GPUs. @@ -17,21 +17,21 @@ Multihost distributed training involves working with three different types of pr ![Distributed training architecture](dist_train_arch.png) ### Parameter Server: -The parameters of the model needs to be shared with all hosts since multiple hosts are working together to train one model. To make this sharing efficient, the parameters are split across multiple hosts. A parameter server in each host stores a subset of parameters. At the end of every iteration, each host communicates with every other host to update all parameters of the model. +The parameters of the model needs to be shared with all hosts since multiple hosts are working together to train one model. To make this sharing efficient, the parameters are split across multiple hosts. A parameter server in each host stores a subset of parameters. In the figure above, parameters are split evenly between the two hosts. At the end of every iteration, each host communicates with every other host to update all parameters of the model. 
### Worker: -Each host has a worker process which in each iteration fetches a batch of data, runs forward and backward pass on all GPUs in the host, computes the parameter updates and sends those updates to the parameter servers in each host. Since we have multiple workers to train the model, each worker only needs to train using 1/N part of the training data where N is the number of workers (which is same as the number of hosts). +Each host has a worker process which in each iteration fetches a batch of data, runs forward and backward pass on all GPUs in the host, computes the parameter updates and sends those updates to the parameter servers in each host. Since we have multiple workers to train the model, each worker only needs to process 1/N part of the training data where N is the number of workers. ### Scheduler: Scheduler is responsible for scheduling the workers and parameter servers. There is only one scheduler in the entire cluster. ## Moving to distributed training: -In this section, we will explain the changes that needs to be done to convert a single-host-single-GPU training script to a multi-host-multi-GPU training script. +[mnist_dist.py](mnist_dist.py) contains code that trains a LeNet network using distributed training. In this section we'll walk through parts of that file that are unique to distributed training. ### Step 1: Use a distributed key-value store: -Like mentioned above, in distributed training, parameters are split into N parts and distributed across N hosts. This is done automatically by the distributed key-value store. User only needs to create the distributed kv store and ask the Trainer to use the created store. +Like mentioned above, in distributed training, parameters are split into N parts and distributed across N hosts. This is done automatically by the distributed key-value store. User only needs to create the distributed kv store and ask the `Trainer` to use the created store. 
```python store = mxnet.kv.create('dist') From b3859ea456a2655183ede3985ff6bb1fe085985d Mon Sep 17 00:00:00 2001 From: Indu Date: Tue, 15 May 2018 19:20:04 +0000 Subject: [PATCH 10/20] Minor changes --- example/multihost_training/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/example/multihost_training/README.md b/example/multihost_training/README.md index d300a1bbc527..951f13195ee7 100644 --- a/example/multihost_training/README.md +++ b/example/multihost_training/README.md @@ -2,13 +2,13 @@ Deep learning models are usually trained using GPUs because GPUs can do a lot more computations in parallel that CPUs. But even with the modern GPUs, it could take several days to train big models. Training can be done faster by using multiple GPUs like described in [this](https://gluon.mxnet.io/chapter07_distributed-learning/multiple-gpus-gluon.html) tutorial. However only a certain number of GPUs can be attached to one host (typically 8 or 16). To make the training even faster, we can use multiple GPUs attached to multiple hosts. -In this tutorial, we will show how to train a model faster using multihost distributed training. +In this tutorial, we will show how to train a model faster using multi-host distributed training. ![Multiple GPUs connected to multiple hosts](distributed_training.svg) We will use data parallelism to distribute the training which involves splitting the training data across GPUs attached to multiple hosts. Since the hosts are working with different subset of the training data in parallel, the training completes lot faster. -In this tutorial, we will train a LeNet network using MNIST data using two hosts each having four GPUs. +In this tutorial, we will train a LeNet network using MNIST dataset using two hosts each having four GPUs. ## Distributed Training Architecture: @@ -31,7 +31,7 @@ Scheduler is responsible for scheduling the workers and parameter servers. 
There ### Step 1: Use a distributed key-value store: -Like mentioned above, in distributed training, parameters are split into N parts and distributed across N hosts. This is done automatically by the distributed key-value store. User only needs to create the distributed kv store and ask the `Trainer` to use the created store. +Like mentioned above, in distributed training, parameters are split into N parts and distributed across N hosts. This is done automatically by the [distributed key-value store](https://mxnet.incubator.apache.org/tutorials/python/kvstore.html). User only needs to create the distributed kv store and ask the `Trainer` to use the created store. ```python store = mxnet.kv.create('dist') From ec5016c270202c97da41a98176380e29085cde3d Mon Sep 17 00:00:00 2001 From: Indu Date: Tue, 15 May 2018 19:21:27 +0000 Subject: [PATCH 11/20] Use images from web-data --- example/multihost_training/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/example/multihost_training/README.md b/example/multihost_training/README.md index 951f13195ee7..fe6dcab820d7 100644 --- a/example/multihost_training/README.md +++ b/example/multihost_training/README.md @@ -4,7 +4,7 @@ Deep learning models are usually trained using GPUs because GPUs can do a lot mo In this tutorial, we will show how to train a model faster using multi-host distributed training. -![Multiple GPUs connected to multiple hosts](distributed_training.svg) +![Multiple GPUs connected to multiple hosts](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/example/distributed_training/distributed_training.svg) We will use data parallelism to distribute the training which involves splitting the training data across GPUs attached to multiple hosts. Since the hosts are working with different subset of the training data in parallel, the training completes lot faster. 
@@ -14,7 +14,7 @@ In this tutorial, we will train a LeNet network using MNIST dataset using two ho Multihost distributed training involves working with three different types of processes - worker, parameter server and scheduler. -![Distributed training architecture](dist_train_arch.png) +![Distributed training architecture](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/example/distributed_training/dist_train_arch.png) ### Parameter Server: The parameters of the model needs to be shared with all hosts since multiple hosts are working together to train one model. To make this sharing efficient, the parameters are split across multiple hosts. A parameter server in each host stores a subset of parameters. In the figure above, parameters are split evenly between the two hosts. At the end of every iteration, each host communicates with every other host to update all parameters of the model. @@ -49,7 +49,7 @@ trainer = gluon.Trainer(net.collect_params(), In distributed training (using data parallelism), training data is split into equal parts across all workers and each worker uses its subset of the training data for training. For example, if we had two machines, each running a worker, each worker managing four GPUs we'll split the data like shown below. Note that we don't split the data depending on the number of GPUs but split it depending on the number of workers. -![Splitting data](split_data.png) +![Splitting data](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/example/distributed_training/split_data.png) Each worker can find out the total number of workers in the cluster and its own rank which is an integer between 0 and N-1 where N is the number of workers. 
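Review note on the last hunk of PATCH 11/20: the README text there says each worker learns the worker count and its own rank (0 to N-1) and trains on its own part of the data. The sketch below shows one way a worker could turn those two numbers into an index range. The helper name `shard_for_worker` is hypothetical and is not taken from `mnist_dist.py`; in MXNet the inputs would presumably come from the `rank` and `num_workers` attributes of the distributed kvstore. The README asks for equal parts; giving any remainder to the lowest ranks is an assumption of this sketch.

```python
def shard_for_worker(num_samples, rank, num_workers):
    """Return the half-open index range [start, end) owned by one worker.

    Hypothetical helper for illustration: splits num_samples as evenly as
    possible across num_workers, with the first (num_samples % num_workers)
    ranks receiving one extra sample each.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if not 0 <= rank < num_workers:
        raise ValueError("rank must be in [0, num_workers)")
    base, extra = divmod(num_samples, num_workers)
    # Ranks below `extra` own (base + 1) samples; the rest own `base`.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


if __name__ == "__main__":
    # Two workers sharing the 60000 MNIST training images (the two-host setup
    # described in the README); the split depends on workers, not on GPUs.
    for rank in range(2):
        print(rank, shard_for_worker(60000, rank, 2))
```

With two workers and 60000 samples, rank 0 would own indices 0 to 29999 and rank 1 would own 30000 to 59999. Each worker would then divide its own range further across its local GPUs.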
From a025b86dace8145e2067c6b04dfd4a0cea718de7 Mon Sep 17 00:00:00 2001
From: Indu
Date: Tue, 15 May 2018 19:25:54 +0000
Subject: [PATCH 12/20] Rename folder

---
 .../README.md | 0
 .../dist_train_arch.png | Bin
 .../distributed_training.svg | 0
 .../mnist_dist.py | 0
 .../split_data.png | Bin
 5 files changed, 0 insertions(+), 0 deletions(-)
 rename example/{multihost_training => distributed_training}/README.md (100%)
 rename example/{multihost_training => distributed_training}/dist_train_arch.png (100%)
 rename example/{multihost_training => distributed_training}/distributed_training.svg (100%)
 rename example/{multihost_training => distributed_training}/mnist_dist.py (100%)
 rename example/{multihost_training => distributed_training}/split_data.png (100%)

diff --git a/example/multihost_training/README.md b/example/distributed_training/README.md
similarity index 100%
rename from example/multihost_training/README.md
rename to example/distributed_training/README.md
diff --git a/example/multihost_training/dist_train_arch.png b/example/distributed_training/dist_train_arch.png
similarity index 100%
rename from example/multihost_training/dist_train_arch.png
rename to example/distributed_training/dist_train_arch.png
diff --git a/example/multihost_training/distributed_training.svg b/example/distributed_training/distributed_training.svg
similarity index 100%
rename from example/multihost_training/distributed_training.svg
rename to example/distributed_training/distributed_training.svg
diff --git a/example/multihost_training/mnist_dist.py b/example/distributed_training/mnist_dist.py
similarity index 100%
rename from example/multihost_training/mnist_dist.py
rename to example/distributed_training/mnist_dist.py
diff --git a/example/multihost_training/split_data.png b/example/distributed_training/split_data.png
similarity index 100%
rename from example/multihost_training/split_data.png
rename to example/distributed_training/split_data.png
From f260f5366d4ec90c6cd4d25df96f26a33a8fa2a5 Mon Sep 17 00:00:00 2001
From: Indu
Date: Tue, 15 May 2018 19:26:37 +0000
Subject: [PATCH 13/20] Remove images from example folder

---
 .../distributed_training/dist_train_arch.png | Bin 20415 -> 0 bytes
 .../distributed_training.svg | 3 ---
 example/distributed_training/split_data.png | Bin 15881 -> 0 bytes
 3 files changed, 3 deletions(-)
 delete mode 100644 example/distributed_training/dist_train_arch.png
 delete mode 100644 example/distributed_training/distributed_training.svg
 delete mode 100644 example/distributed_training/split_data.png

diff --git a/example/distributed_training/dist_train_arch.png b/example/distributed_training/dist_train_arch.png
deleted file mode 100644
index 017a217e42bcf1bf38db482faaa38d4912396f2d..0000000000000000000000000000000000000000
Binary files a/example/distributed_training/dist_train_arch.png and /dev/null differ
[distributed_training.svg: the 3 deleted lines of SVG markup did not survive extraction. Surviving text: "Produced by OmniGraffle 6.6.1 2017-07-23 10:21:45 +0000"; diagram labels: four hosts, each with GPU 0 to GPU 3, a PCIe Switch and a CPU, joined by a Network Switch]
diff --git a/example/distributed_training/split_data.png b/example/distributed_training/split_data.png
deleted file mode 100644
index 55ae10b040c29b96d99a25d16b8e6f046077c6ee..0000000000000000000000000000000000000000
Binary files a/example/distributed_training/split_data.png and /dev/null differ
From de9a19898c91845290e237c1b5e2a5d51eb8b4cd Mon Sep 17 00:00:00 2001
From: Indu
Date: Tue, 15 May 2018 23:14:33 +0000
Subject: [PATCH 14/20] Add license header

---
 example/distributed_training/mnist_dist.py | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/example/distributed_training/mnist_dist.py b/example/distributed_training/mnist_dist.py
index 1f8dd436ea6d..907e793ec1e3 100644
--- a/example/distributed_training/mnist_dist.py
+++ b/example/distributed_training/mnist_dist.py
@@ -1,3 +1,22 @@
+#!/usr/bin/env python
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.
See the License for the +# specific language governing permissions and limitations +# under the License. + from __future__ import print_function import numpy as np import mxnet as mx From 27335a39064e123f2065040ef720e9eab6ef63fc Mon Sep 17 00:00:00 2001 From: Indu Date: Tue, 15 May 2018 23:46:03 +0000 Subject: [PATCH 15/20] Use png image instead of svg --- example/distributed_training/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/example/distributed_training/README.md b/example/distributed_training/README.md index fe6dcab820d7..8fd2916edb16 100644 --- a/example/distributed_training/README.md +++ b/example/distributed_training/README.md @@ -4,7 +4,7 @@ Deep learning models are usually trained using GPUs because GPUs can do a lot mo In this tutorial, we will show how to train a model faster using multi-host distributed training. -![Multiple GPUs connected to multiple hosts](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/example/distributed_training/distributed_training.svg) +![Multiple GPUs connected to multiple hosts](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/example/distributed_training/distributed_training.png) We will use data parallelism to distribute the training which involves splitting the training data across GPUs attached to multiple hosts. Since the hosts are working with different subset of the training data in parallel, the training completes lot faster. 
From 23966afab9d608a6f30fa30143b2527e6f125939 Mon Sep 17 00:00:00 2001
From: Indu
Date: Tue, 15 May 2018 23:49:53 +0000
Subject: [PATCH 16/20] Add distributed training tutorial to tutorials index

---
 docs/tutorials/index.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md
index ae0851425be0..32c4a16a8e0d 100644
--- a/docs/tutorials/index.md
+++ b/docs/tutorials/index.md
@@ -40,6 +40,7 @@ Select API:
 * Practitioner Guides
     * [Multi-GPU training](http://gluon.mxnet.io/chapter07_distributed-learning/multiple-gpus-gluon.html) External link
     * [Checkpointing and Model Serialization (a.k.a. saving and loading)](/tutorials/gluon/save_load_params.html) External link ([Alternative](http://gluon.mxnet.io/chapter03_deep-neural-networks/serialization.html))
+    * [Distributed Training](https://github.com/apache/incubator-mxnet/tree/master/example/distributed_training)
     * [Inference using an ONNX model](/tutorials/onnx/inference_on_onnx_model.html)
     * [Fine-tuning an ONNX model on Gluon](/tutorials/onnx/fine_tuning_gluon.html)
     * [Visualizing Decisions of Convolutional Neural Networks](/tutorials/vision/cnn_visualization.html)

From 5866c4e0928a27b9fd6f8349b9d2723e33ebd33b Mon Sep 17 00:00:00 2001
From: Indu
Date: Sat, 2 Jun 2018 15:16:12 -0700
Subject: [PATCH 17/20] Use CIFAR-10 instead of MNIST.
---
 example/distributed_training/README.md        | 13 ++--
 .../{mnist_dist.py => cifar10_dist.py}        | 71 +++++++------------
 2 files changed, 32 insertions(+), 52 deletions(-)
 rename example/distributed_training/{mnist_dist.py => cifar10_dist.py} (76%)

diff --git a/example/distributed_training/README.md b/example/distributed_training/README.md
index 8fd2916edb16..9c2d09c76fb8 100644
--- a/example/distributed_training/README.md
+++ b/example/distributed_training/README.md
@@ -8,7 +8,7 @@ In this tutorial, we will show how to train a model faster using multi-host dist

 We will use data parallelism to distribute the training which involves splitting the training data across GPUs attached to multiple hosts. Since the hosts are working with different subset of the training data in parallel, the training completes lot faster.

-In this tutorial, we will train a LeNet network using MNIST dataset using two hosts each having four GPUs.
+In this tutorial, we will train a ResNet18 network using CIFAR-10 dataset using two hosts each having four GPUs.

 ## Distributed Training Architecture:

@@ -27,7 +27,7 @@ Scheduler is responsible for scheduling the workers and parameter servers. There

 ## Moving to distributed training:

-[mnist_dist.py](mnist_dist.py) contains code that trains a LeNet network using distributed training. In this section we'll walk through parts of that file that are unique to distributed training.
+[cifar10_dist.py](cifar10_dist.py) contains code that trains a ResNet18 network using distributed training. In this section we'll walk through parts of that file that are unique to distributed training.

 ### Step 1: Use a distributed key-value store:

@@ -41,7 +41,7 @@ It is the job of the trainer to take the gradients computed in the backward pass

 ```python
 trainer = gluon.Trainer(net.collect_params(),
-                        'sgd', {'learning_rate': .1},
+                        'adam', {'learning_rate': .001},
                         kvstore=store)
 ```

@@ -95,10 +95,9 @@ We can then create a `DataLoader` using the `SplitSampler` like shown below:

 ```python
 # Load the training data
-train_data = gluon.data.DataLoader(
-                gluon.data.vision.MNIST(train=True, transform=transform),
-                batch_size,
-                sampler=SplitSampler(60000, store.num_workers, store.rank))
+train_data = gluon.data.DataLoader(gluon.data.vision.CIFAR10(train=True, transform=transform),
+                                   batch_size,
+                                   sampler=SplitSampler(50000, store.num_workers, store.rank))
 ```

 ## Step 3: Training with multiple GPUs
diff --git a/example/distributed_training/mnist_dist.py b/example/distributed_training/cifar10_dist.py
similarity index 76%
rename from example/distributed_training/mnist_dist.py
rename to example/distributed_training/cifar10_dist.py
index 907e793ec1e3..506afbbe081a 100644
--- a/example/distributed_training/mnist_dist.py
+++ b/example/distributed_training/cifar10_dist.py
@@ -18,27 +18,27 @@
 # under the License.

 from __future__ import print_function
-import numpy as np
+import random, sys
+
 import mxnet as mx
-from mxnet import nd, autograd, gluon
-from mxnet import kv
-import random
+from mxnet import autograd, gluon, kv, nd
+from mxnet.gluon.model_zoo import vision
+
+import numpy as np

 # Create a distributed key-value store
 store = kv.create('dist')

-# MNIST images are 28x28. Total pixels in input layer is 28x28 = 784
-num_inputs = 784
 # Clasify the images into one of the 10 digits
 num_outputs = 10

 # 64 images in a batch
 batch_size_per_gpu = 64
 # How many epochs to run the training
-epochs = 2
+epochs = 5

 # How many GPUs per machine
-gpus_per_machine = 1
+gpus_per_machine = 4
 # Effective batch size across all GPUs
 batch_size = batch_size_per_gpu * gpus_per_machine

@@ -81,33 +81,16 @@ def __len__(self):
         return self.part_len

 # Load the training data
-train_data = gluon.data.DataLoader(gluon.data.vision.MNIST(train=True, transform=transform),
-                                   batch_size, sampler=SplitSampler(60000, store.num_workers, store.rank))
+train_data = gluon.data.DataLoader(gluon.data.vision.CIFAR10(train=True, transform=transform),
+                                   batch_size,
+                                   sampler=SplitSampler(50000, store.num_workers, store.rank))
+
 # Load the test data
-test_data = gluon.data.DataLoader(gluon.data.vision.MNIST(train=False, transform=transform),
+test_data = gluon.data.DataLoader(gluon.data.vision.CIFAR10(train=False, transform=transform),
                                   batch_size, shuffle=False)

-# Create a sequential network
-net = gluon.nn.Sequential()
-
-with net.name_scope():
-
-    # First convolution
-    net.add(gluon.nn.Conv2D(channels=20, kernel_size=5, activation='relu'))
-    net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
-
-    # Second convolution
-    net.add(gluon.nn.Conv2D(channels=50, kernel_size=5, activation='relu'))
-    net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
-
-    # Flatten the output before the fully connected layers
-    net.add(gluon.nn.Flatten())
-
-    # First fully connected layers with 512 neurons
-    net.add(gluon.nn.Dense(512, activation="relu"))
-
-    # Second fully connected layer with as many neurons as the number of classes
-    net.add(gluon.nn.Dense(num_outputs))
+# Use ResNet from model zoo
+net = vision.resnet18_v1()

 # Initialize the parameters with Xavier initializer
 net.collect_params().initialize(mx.init.Xavier(), ctx=ctx)
@@ -115,21 +98,21 @@ def __len__(self):
 # SoftmaxCrossEntropy is the most common choice of loss function for multiclass classification
 softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()

-# Use SGD optimizer with a learning rate of 0.1
-trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .1}, kvstore=store)
+# Use Adam optimizer. Ask trainer to use the distributed kv store.
+trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': .001}, kvstore=store)

 # Evaluate accuracy of the given network using the given data
 def evaluate_accuracy(data_iterator, net):

     acc = mx.metric.Accuracy()
-    
+
     # Iterate through data and label
     for i, (data, label) in enumerate(data_iterator):
-        
+
         # Get the data and label into the GPU
         data = data.as_in_context(ctx[0])
         label = label.as_in_context(ctx[0])
-        
+
         # Get network's output which is a probability distribution
         # Apply argmax on the probability distribution to get network's classification.
         output = net(data)
@@ -158,7 +141,7 @@ def forward_backward(net, data, label):

 # Train a batch using multiple GPUs
 def train_batch(batch, ctx, net, trainer):
-    
+
     # Split and load data into multiple GPUs
     data = batch[0]
     data = gluon.utils.split_and_load(data, ctx)
@@ -169,7 +152,7 @@ def train_batch(batch, ctx, net, trainer):

     # Run the forward and backward pass
     forward_backward(net, data, label)
-    
+
     # Update the parameters
     this_batch_size = batch[0].shape[0]
     trainer.step(this_batch_size)
@@ -181,15 +164,13 @@ def train_batch(batch, ctx, net, trainer):
     batch_num = 1
     for batch in train_data:

-        # Print progress once in a while
-        if batch_num % 50 == 0:
-            print("Worker %d processing batch %d" % (store.rank, batch_num))
-
         # Train the batch using multiple GPUs
         train_batch(batch, ctx, net, trainer)

         batch_num += 1
-    
+
     # Print test accuracy after every epoch
     test_accuracy = evaluate_accuracy(test_data, net)
-    print("Epoch %d: Test_acc %f" % (epoch, test_accuracy))
+    print("Epoch %d: Test_acc %f" % (epoch, test_accuracy))
+    sys.stdout.flush()
+

From 0d6cf2b8072f3d6053114203f63f93154d40dacc Mon Sep 17 00:00:00 2001
From: Indu
Date: Sat, 2 Jun 2018 15:25:37 -0700
Subject: [PATCH 18/20] Fix language errors

---
 example/distributed_training/README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/example/distributed_training/README.md b/example/distributed_training/README.md
index 9c2d09c76fb8..ebca94fff386 100644
--- a/example/distributed_training/README.md
+++ b/example/distributed_training/README.md
@@ -6,7 +6,7 @@ In this tutorial, we will show how to train a model faster using multi-host dist

 ![Multiple GPUs connected to multiple hosts](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/example/distributed_training/distributed_training.png)

-We will use data parallelism to distribute the training which involves splitting the training data across GPUs attached to multiple hosts. Since the hosts are working with different subset of the training data in parallel, the training completes lot faster.
+We will use data parallelism to distribute the training which involves splitting the training data across GPUs attached to multiple hosts. Since the hosts are working with different subset of the training data in parallel, the training completes a lot faster.

 In this tutorial, we will train a ResNet18 network using CIFAR-10 dataset using two hosts each having four GPUs.

@@ -174,8 +174,8 @@ python ~/mxnet/tools/launch.py -n 2 -s 2 -H hosts \

 - `-n 2` specifies the number of workers that must be launched
 - `-s 2` specifies the number of parameter servers that must be launched.
-- `--sync-dst-dir` specifies a destination location where the contents of the current directory with be rsync'd
-- `--launcher ssh` tells `launch.py` to use ssh to login to each machine in the cluster and launch processes.
+- `--sync-dst-dir` specifies a destination location where the contents of the current directory will be rsync'd
+- `--launcher ssh` tells `launch.py` to use ssh to log in to each machine in the cluster and launch processes.
 - `"python /home/ubuntu/dist/dist.py"` is the command that will get executed in each of the launched processes.
 - Finally, `-H hosts` specifies the list of hosts in the cluster to be used for distributed training.

From 2207d5be41d84e1cc7acb768cde58f92ef0fc172 Mon Sep 17 00:00:00 2001
From: Indu Bharathi
Date: Mon, 4 Jun 2018 12:37:27 -0700
Subject: [PATCH 19/20] Add a sample output from distributed training

---
 example/distributed_training/README.md | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/example/distributed_training/README.md b/example/distributed_training/README.md
index ebca94fff386..e94ec992b7af 100644
--- a/example/distributed_training/README.md
+++ b/example/distributed_training/README.md
@@ -167,9 +167,9 @@ For example, the following command launches distributed training on two machines

 ```
 python ~/mxnet/tools/launch.py -n 2 -s 2 -H hosts \
-    --sync-dst-dir /home/ubuntu/mnist_dist \
+    --sync-dst-dir /home/ubuntu/cifar10_dist \
     --launcher ssh \
-    "python /home/ubuntu/mnist_dist/mnist_dist.py"
+    "python /home/ubuntu/cifar10_dist/cifar10_dist.py"
 ```

 - `-n 2` specifies the number of workers that must be launched
@@ -228,3 +228,23 @@ Host d2

 A better way is to use ssh agent forwarding. Check [this](https://aws.amazon.com/blogs/security/securely-connect-to-linux-instances-running-in-a-private-amazon-vpc/) article for more details.

+Here is a sample output from running distributed training:
+
+```
+$ python ~/mxnet/tools/launch.py -n 2 -s 2 -H hosts --sync-dst-dir /home/ubuntu/cifar10_dist --launcher ssh "python /home/ubuntu/cifar10_dist/cifar10_dist.py"
+2018-06-03 05:30:05,609 INFO rsync /home/ubuntu/cifar10_dist/ -> a1:/home/ubuntu/cifar10_dist
+2018-06-03 05:30:05,879 INFO rsync /home/ubuntu/cifar10_dist/ -> a2:/home/ubuntu/cifar10_dist
+Epoch 0: Test_acc 0.467400
+Epoch 0: Test_acc 0.466800
+Epoch 1: Test_acc 0.568500
+Epoch 1: Test_acc 0.571300
+Epoch 2: Test_acc 0.586300
+Epoch 2: Test_acc 0.594000
+Epoch 3: Test_acc 0.659200
+Epoch 3: Test_acc 0.653300
+Epoch 4: Test_acc 0.681200
+Epoch 4: Test_acc 0.687900
+```
+
+Note that the output from all hosts is merged and printed to the console.
+

From 8fcc69d56bfaf5309df018ab371f3c4f2e338aa9 Mon Sep 17 00:00:00 2001
From: Indu Bharathi
Date: Mon, 4 Jun 2018 12:39:42 -0700
Subject: [PATCH 20/20] Add the output of store.num_workers and store.rank

---
 example/distributed_training/README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/example/distributed_training/README.md b/example/distributed_training/README.md
index e94ec992b7af..b0b0447725b5 100644
--- a/example/distributed_training/README.md
+++ b/example/distributed_training/README.md
@@ -59,6 +59,11 @@ print("Total number of workers: %d" % store.num_workers)
 print("This worker's rank: %d" % store.rank)
 ```

+```
+Total number of workers: 2
+This worker's rank: 0
+```
+
 Knowing the number of workers and a particular worker's rank, it is easy to split the dataset into partitions and pick one partition to train depending on the rank of the worker. Here is a sampler that does exactly that.

 ```python