Introduction

TensorFlow's distributed strategies make it much easier to seamlessly scale up heavy training workloads across multiple hardware accelerators - be it GPUs or even TPUs. That said, distributed training has long been a challenge, especially when it comes to neural network training. The primary challenges that come with distributed training are as follows:

All of these may sound very daunting if you think of the training process end-to-end. Thankfully, libraries like TensorFlow let us incorporate distributed training very easily - be it for tf.keras models with the classic compile/fit paradigm or for custom training loops. This report, however, only deals with the former. If you are interested in learning more about distributed training for custom training loops, be sure to check this tutorial out.

In the first half of this report, we will discuss some pointers to keep in mind when choosing VMs on Google Cloud Platform for distributed training, since I conducted my experiments on a GCP VM. Those pointers should also hold for any other platform of your choice. We will then go over the steps required to distribute the training workloads of tf.keras models across multiple GPUs in a single machine. Finally, we will conclude by analyzing the system metrics from this wandb run summary.

In this report, I will show you how to seamlessly integrate tf.distribute.MirroredStrategy to distribute the training workloads of tf.keras models across multiple GPUs. Distributed training is particularly useful when you have very large datasets and the need to scale training becomes pressing - at that point it's unrealistic to train on a single hardware accelerator (a GPU in this case), hence the need for distributed training.

GitHub repository

Towards the end of the report, we will look at two methods that can make distributed training more effective: a. pre-fetching data so that it's ready for the model to consume as soon as it finishes an epoch, and b. tuning the batch size.

A huge shoutout to Martin Gorner of Google (ML Product Manager on the Kaggle team) for guiding me in preparing this report.

System set-up, costs, etc

We primarily have two options to perform distributed training with GCP -

Compute Engine lets you create virtual machines with a number of different software and hardware configurations, suitable for a variety of tasks, not just training deep learning models. AI Platform Notebooks, on the other hand, give us pre-configured JupyterLab instances with the flexibility of customization.

In my experience, I have found the process of setting up a Compute Engine instance to be more involved than spinning up an AI Platform Notebook instance. Let's see how they differ from the perspective of costs.

Here are my system configurations:

Compute Engine, for the above, would cost me -

[Figure: Compute Engine cost estimate]

And, AI Platform Notebooks would cost me the following -

[Figure: AI Platform Notebooks cost estimate]

As you can see, there's a difference in cost between the two, yet the latter (AI Platform Notebooks) is essentially click-and-go. As a practitioner, I want to spend my time on the things that matter with respect to my expertise rather than reinvent the wheel when it's not needed. Hence, I chose to go with AI Platform Notebooks. For more thorough coverage of setting up and using AI Platform Notebooks, refer to this guide.

To be able to use multiple GPUs in one AI Platform Notebook instance, you first need to apply for a quota increase. You can check out this thread to learn more about it.
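
Once the quota is approved and the instance is up, it's worth confirming that TensorFlow can actually see all the GPUs before going any further. Here's a quick check (not part of the original write-up):

import tensorflow as tf

# Should list four GPU devices on the multi-GPU instance described above
print(tf.config.list_physical_devices("GPU"))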

Show me the code

Note that we are fine-tuning this network, as opposed to just pre-computing bottleneck features and feeding them to the classification top. Hence, EXTRACTOR.trainable is set to True. With trainable set to False, we would just be training a shallow classification head on top of a frozen feature extractor, and we likely wouldn't see any advantage from distributed training since the trainable portion of the network would be so small.

We get about 94% accuracy on the validation set. But that is not the point of concern here - we want to speed up model training by distributing it.

It takes approximately 2090 seconds to train, as you can see in the table below (see the row 1-gpu-ft).

When fine-tuning networks, it's good practice to use a learning rate schedule with a ramp-up. The idea is to start with a low learning rate so that the pre-trained weights of the base network don't get destroyed, then increase the learning rate before decaying it again. The schedule looks like so -

[Figure: learning rate schedule with ramp-up followed by decay]
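
Here's a minimal sketch of such a schedule using a Keras callback. The start, peak, and decay values below are illustrative placeholders rather than the exact ones used for the experiments; the shape is what matters - ramp up from a small learning rate, then decay.

import tensorflow as tf

# Illustrative values - tune these for your own setup
start_lr = 1e-5      # low LR so the pre-trained weights aren't destroyed early on
max_lr = 1e-3        # peak LR reached after the ramp-up
ramp_up_epochs = 3
decay = 0.75         # multiplicative decay applied after the peak

def lr_schedule(epoch):
    if epoch < ramp_up_epochs:
        # Linear ramp-up from start_lr to max_lr
        return start_lr + (max_lr - start_lr) * epoch / ramp_up_epochs
    # Exponential decay after the peak
    return max_lr * decay ** (epoch - ramp_up_epochs)

lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule)
# Pass callbacks=[lr_callback] to model.fit to activate the schedule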

tf.distribute.MirroredStrategy performs parameter updates synchronously using an all-reduce algorithm by default. However, TensorFlow 2.x supports asynchronous parameter updates as well. Explaining their details is out of scope for this report. In case you are interested in learning more about them, here are some very good resources:

Okay, back to code!

As a starting point, let's first train an image classifier to distinguish between cats and dogs on a single K80 GPU. We will use a MobileNetV2 network (pre-trained on ImageNet) as our base architecture and append a classification head on top of it. In code, it looks like so -

# Imports for the model definition
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout
from tensorflow.keras.models import Model

# Load the MobileNetV2 model but exclude the classification layers
EXTRACTOR = MobileNetV2(weights='imagenet', include_top=False,
                        input_shape=(224, 224, 3))

# We are fine-tuning, so the feature extractor stays trainable
EXTRACTOR.trainable = True

# Construct the head of the model that will be placed on top of
# the base model
class_head = EXTRACTOR.output
class_head = GlobalAveragePooling2D()(class_head)
class_head = Dense(512, activation="relu")(class_head)
class_head = Dropout(0.5)(class_head)
class_head = Dense(1)(class_head)  # single logit for cat vs. dog

# Create the new model
pet_classifier = Model(inputs=EXTRACTOR.input, outputs=class_head)
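
The compilation and training calls aren't shown above, so here's a rough sketch of that step. The optimizer, metric, and the train_batches/valid_batches pipelines and step counts are assumptions on my part; the loss uses from_logits=True because the final Dense(1) layer has no activation.

import tensorflow as tf

# Assumed compile/fit step - not shown verbatim in the report
pet_classifier.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    # threshold=0.0 because the model outputs raw logits
    metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)],
)

history = pet_classifier.fit(
    train_batches,                    # tf.data pipeline (built as shown later)
    validation_data=valid_batches,
    steps_per_epoch=steps_per_epoch,  # needed because the datasets use repeat()
    validation_steps=validation_steps,
    epochs=10,
)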

Training this fella for 10 epochs gives us a good result -

Fine-tuning with LR schedules (single GPU)

Although there's a bit of overfitting behavior in the network this time, you can also see an improvement in performance (~98% validation accuracy). Model training time is not affected by this either (~2080 seconds, see 1-gpu-ft-lrs in the table below).

Let's now quickly see if the learning schedule had any effect on the training.

Porting the model to multiple GPUs

As you can see, the accuracies (~98% in this case) and losses remain roughly the same as in the plots above.

The model training time has come down, though - 1046 seconds (see 4-gpu-ft-no-pref in the table below). Woah! A whopping 2x speedup. This can be improved further, and we will see how in a moment. But first, let's do some analysis on the GPU metrics.

Now, in order to distribute this training across the four GPUs, we first need to define the MirroredStrategy scope -

strategy = tf.distribute.MirroredStrategy()
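
Before moving on, it can be handy to confirm how many devices the strategy actually picked up. This quick sanity check isn't part of the original code, but it should report 4 on the instance used here:

# MirroredStrategy performs synchronous all-reduce updates (NCCL on GPUs by default)
print("Number of replicas in sync:", strategy.num_replicas_in_sync)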

After that, we can create and compile our model within the scope context -

with strategy.scope():
    model = get_training_model()
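
In case you're wondering what goes inside get_training_model, here's a rough sketch: it simply wraps the model definition from earlier together with the compilation step. The optimizer, loss configuration, and metric are my assumptions, not necessarily what the original code used.

# Assumes `import tensorflow as tf` and the Keras imports from the earlier snippet
def get_training_model():
    # Same architecture as before: MobileNetV2 base + classification head
    extractor = MobileNetV2(weights="imagenet", include_top=False,
                            input_shape=(224, 224, 3))
    extractor.trainable = True

    class_head = extractor.output
    class_head = GlobalAveragePooling2D()(class_head)
    class_head = Dense(512, activation="relu")(class_head)
    class_head = Dropout(0.5)(class_head)
    class_head = Dense(1)(class_head)

    model = Model(inputs=extractor.input, outputs=class_head)

    # Compilation happens inside the strategy scope as well
    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
        metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)],
    )
    return model

When model.fit is called afterwards, Keras splits each global batch across the four replicas and aggregates the gradients with all-reduce.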

get_training_model includes the model definition shown above along with the compilation step. So, what we are doing is creating and compiling the model within the scope of the MirroredStrategy. After this, everything stays exactly the same - you just call model.fit. Our learning rate schedule will also change a bit since we are now distributing the training across four GPUs (note the Y-values).

[Figure: learning rate schedule scaled for four GPUs]

The performance did not change much (at least in terms of accuracy and loss) as you can see in the below figure -

Analyzing GPU metrics

As a deep learning practitioner, your goal should be to maximize GPU utilization while reducing the time the GPU spends accessing memory to fetch data. An obvious way to reduce that data-fetching time is to pre-fetch the data while an epoch is still running.

GPUs are not as cheap as other commodity hardware, so it's important to ensure GPU utilization stays as high as possible. Let's quickly see how we are doing there. Here are the graphs for GPU utilization and the time spent by the GPU accessing memory to fetch data (single GPU). As we can see, GPU utilization is high most of the time, which is good.

When we use multiple GPUs, we get -

(You see multiple lines because there are four GPUs for which the metrics are being calculated here)

As we can see, the average GPU utilization across the four GPUs is far from ideal, but the memory access time has come down considerably since the work is now distributed across multiple GPUs. As for the utilization, the culprit is the dataset's relatively small size - with a larger dataset, we could expect better GPU utilization. In the next section, we will discuss two common techniques for pushing the utilization metrics even further.

We can also see that the lines in the above two plots are fairly smooth, which indicates that all the GPUs are operating under the same load. It's safe to say this smoothness is to be expected in a multi-GPU setting. If you see weird spikes, it might be a sign that the GPUs are not operating under the same load, and you might want to even that out.

Two methods to further improve performance

Although the improvement in GPU utilization and memory access time is very tiny, it's still there.

The performance takes a noticeable hit with the larger batch size, but on the other hand, the training time comes down further to ~887 seconds (see 4-gpu-ft-largerbs in the table below).

With the same batch size of 32, we get -

Method #1

TensorFlow's data API offers a number of features for improving model training when streaming the input data is a bottleneck. For example, ideally, while the model is training, the data for the next epoch should already be prepared so that the model does not have to wait for it. If it does need to wait, that waiting adds to the overall training time.

prefetch lets us instruct TensorFlow to have the next batch of data ready as soon as the model is done with the current one. It even lets us specify how many elements the system should fetch beforehand. But what if we want the system to decide that for us based on the available bandwidth of the system processes and hardware? We can specify that as well with tf.data.experimental.AUTOTUNE -

# Prepare batches and randomly shuffle the training images (this time with prefetch)
train_batches = train.shuffle(1024).repeat().batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)
valid_batches = valid.repeat().batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)
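
In case you're wondering where train and valid come from - the report doesn't show that part. A minimal sketch, assuming the TensorFlow Datasets cats_vs_dogs split and MobileNetV2 preprocessing (your pipeline may well differ), could look like this:

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

def preprocess(image, label):
    # Resize to MobileNetV2's expected input shape and scale pixels to [-1, 1]
    image = tf.image.resize(image, (224, 224))
    image = preprocess_input(tf.cast(image, tf.float32))
    return image, label

# Hypothetical 80/20 train/validation split of the TFDS cats_vs_dogs dataset
(raw_train, raw_valid), info = tfds.load(
    "cats_vs_dogs",
    split=["train[:80%]", "train[80%:]"],
    as_supervised=True,
    with_info=True,
)

train = raw_train.map(preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE)
valid = raw_valid.map(preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE)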

Method #2

The second thing we can do is play with the batch size. Since we are using multiple GPUs with a synchronous parameter-update scheme, each GPU receives a slice of the global batch and trains on it. If we use too large a batch size, it might be difficult for the GPUs to distribute it properly among themselves; on the other hand, if we use too small a batch size, the GPUs might go under-utilized. So we need to find the sweet spot. Here are some general suggestions (these come from Martin Gorner's notebook):

# Scale the global batch size with the number of replicas so that
# each GPU (replica) still sees batch_size_per_replica examples per step
batch_size_per_replica = 32
batch_size = batch_size_per_replica * strategy.num_replicas_in_sync

Note that we are using a larger batch size now - with four replicas and 32 examples per replica, the global batch size becomes 128 - whereas in the previous experiment we used a batch size of 16.

The above-mentioned notebook by Martin contains a number of tips and tricks for optimizing model performance when using distributed training, including:

There are a number of suggestions available in this guide as well.

Okay, enough talking! It's now time to tie the above-discussed methods together and see the results -

LR Schedule + BS of 16 + Pre-fetch

As you can see, we are able to retain the same performance, and the model training time also comes down by a tiny bit (~6 seconds; see 4-gpu-ft-pref in the table below).

In terms of GPU metrics -

There's almost no change in the GPU metrics, as is evident in the plots above. When using multiple GPUs, generally the more data you have, the better utilization you get out of them. It's also important to keep in mind that using a larger batch size in these situations might hurt model performance, as we saw above.

You are encouraged to check out two amazing benchmarks provided below for two different Kaggle competitions:

They also provide cost trade-offs between a number of different hardware configurations, which should help you make the best decision for your cost budget.

Conclusion and next steps

It's important to keep in mind that when using distributed training, the bigger the dataset, the better the utilization of the accelerators. Be it TPUs or multiple GPUs, this holds true.

When using multiple GPUs (whether in a single machine or across multiple machines/clusters), it's very important to account for the synchronization cost incurred while the accelerators coordinate with one another. This video explains some of the trade-offs involved.

I hope you got a sense of how easy it is to distribute training workloads for your tf.keras models. As next steps, you might want to experiment with the different tips and tricks shared in this report and in the resources I mentioned. If you are more into custom training loops, you might want to try mixed-precision training along with distributed training there as well.
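
For reference, in recent TF 2.x releases, turning on mixed precision for Keras models is typically a one-liner placed before the model is built (the API lived under an experimental namespace in older versions, so treat this as a sketch):

from tensorflow.keras import mixed_precision

# Compute in float16 where it is numerically safe, keep variables in float32
mixed_precision.set_global_policy("mixed_float16")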

If you have any feedback to share with me, you can do so via a Tweet here. I would really appreciate it.