GPU workload distribution affects training time

Goal: finetune a pretrained net

The objective is to finetune a network pre-trained on ImageNet to classify a photo of an animal into one of 10 taxonomic classes (insect, bird, mammal, etc.). The photos come from iNaturalist 2017 and are very similar to ImageNet photos.
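For concreteness, here is a minimal sketch of this finetuning setup in Keras, assuming the Inception V3 base (the other run would swap in InceptionResNetV2); the head layers and their sizes are illustrative assumptions, not the exact configuration used in these runs.

```python
from keras.applications.inception_v3 import InceptionV3
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

NUM_CLASSES = 10  # taxonomic classes: insect, bird, mammal, ...

# ImageNet-pretrained base, without the original 1000-way classifier head.
base = InceptionV3(weights="imagenet", include_top=False)

# Small classification head for the 10 iNaturalist super-classes
# (layer sizes here are illustrative, not the values from these experiments).
x = GlobalAveragePooling2D()(base.output)
x = Dense(1024, activation="relu")(x)
outputs = Dense(NUM_CLASSES, activation="softmax")(x)

model = Model(inputs=base.input, outputs=outputs)
```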

Two Inception variants as base networks

I compare two pretrained networks that are both extensions of Inception: the newer Inception-ResNet V2 (shown in shades of red above) and the older Inception V3 (shown in shades of violet/blue). One might expect the newer variant to outperform the older one (it certainly does on ImageNet).

Both networks are trained on 8 GPUs on GCP. The only difference between the two experiments is the base architecture, with pretrained weights loaded from Keras. See the run diff panel below for an explicit side-by-side comparison of the hyperparameters. The distributed training simply uses the data-parallel training utility function in Keras, multi_gpu_model().
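As a rough sketch of the data-parallel setup (the optimizer and its hyperparameters below are assumptions, not the values from the run diff panel):

```python
from keras.optimizers import SGD
from keras.utils import multi_gpu_model

# Replicate the model across the 8 GPUs; Keras splits each batch into 8 sub-batches,
# runs them in parallel on the replicas, and merges the results.
parallel_model = multi_gpu_model(model, gpus=8)

parallel_model.compile(
    optimizer=SGD(lr=0.001, momentum=0.9),  # illustrative hyperparameters
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# parallel_model.fit(...) then trains with each global batch split across the 8 GPUs.
```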

Inception V3 outperforms Inception-ResNet V2 substantially?

Note that the training accuracy (right image above, dense dotted lines) and training loss (right image above, dashed lines) are very similar. However, the validation accuracy (the metric we really care about) is much higher for IN V3 than for IRN V2, by roughly 20%.

Next questions

To optimize for efficiency, I may choose the better-performing Inception V3 base network for further experiments. However, some interesting questions arise: why is the workload distributed so unevenly with IRN V2? Does the Keras multi_gpu_model() behave differently in these two cases? Was my GCP configuration somehow off during one of these experiments? Is the finetuning task not suited for IRN V2 for more subtle reasons? More detailed investigation of system metrics could help explain this.
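One way to dig into those system metrics is to poll per-GPU utilization and memory during training and compare the two runs. The snippet below is a hypothetical monitoring helper using pynvml; it was not part of the original experiments.

```python
import time
import pynvml

def log_gpu_utilization(interval_sec=10):
    """Periodically print utilization and memory use for every visible GPU."""
    pynvml.nvmlInit()
    count = pynvml.nvmlDeviceGetCount()
    try:
        while True:
            for i in range(count):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
                print(f"GPU {i}: util={util.gpu}%  mem={mem.used / 1e9:.1f} GB")
            time.sleep(interval_sec)
    finally:
        pynvml.nvmlShutdown()
```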

Inception-ResNet V2 takes 6 times longer to train than Inception V3

This is a surprisingly large difference. The GPU usage across the two models may help explain it: consider Inception-ResNet V2 (IRN V2, red colors) and Inception V3 (IN V3, blue colors) below. In IRN V2, GPU 0 does far more work, 6-8X that of the other 7 GPUs (dropping at the end of the first epoch, at around 46 minutes). When training Inception V3, all 8 GPUs share the work approximately evenly.
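One knob worth checking here (an assumption on my part, not something verified in these runs) is where multi_gpu_model() merges the replica outputs and weights: the cpu_merge flag (default True) controls whether merging happens on the CPU or on a GPU, and merging on a GPU tends to concentrate extra work on a single device.

```python
from keras.utils import multi_gpu_model

# Keep weight/output merging on the CPU (the default) so no single GPU carries
# the extra merge work; cpu_merge=False would move merging onto a GPU instead.
parallel_model = multi_gpu_model(model, gpus=8, cpu_merge=True)
```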