One architecture trains 6X faster

I tried training two versions of Inception on image classification with Keras, running data-parallel across 8 GPUs: the 2015 Inception V3 and the newer 2016 Inception-ResNet-V2.
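As a rough sketch, a data-parallel run like this can be set up with the stock keras.applications models and the keras.utils.multi_gpu_model helper; the optimizer, loss, and class count below are placeholders rather than the exact settings I used.

```python
# Minimal sketch of a data-parallel Keras run on 8 GPUs.
# Assumes the stock keras.applications models and keras.utils.multi_gpu_model;
# hyperparameters here are illustrative placeholders.
from keras.applications.inception_v3 import InceptionV3
from keras.applications.inception_resnet_v2 import InceptionResNetV2
from keras.utils import multi_gpu_model


def build_parallel(arch="inception_v3", num_classes=1000, gpus=8):
    if arch == "inception_v3":
        base = InceptionV3(weights=None, classes=num_classes)
    else:
        base = InceptionResNetV2(weights=None, classes=num_classes)

    # Replicate the model on each GPU; each batch is split into sub-batches,
    # one per GPU, and the sub-batch outputs are merged to compute the loss.
    parallel = multi_gpu_model(base, gpus=gpus)
    parallel.compile(optimizer="rmsprop",
                     loss="categorical_crossentropy",
                     metrics=["accuracy"])
    return parallel


# Example usage (x_train / y_train are whatever image dataset you train on):
# model = build_parallel("inception_resnet_v2")
# model.fit(x_train, y_train, batch_size=256, epochs=10)
```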

I expected the newer, more sophisticated Inception-ResNet-V2 (plotted in red) to be the better choice, but it trains about 6 times slower. On the left plot below, the training loss and accuracy curves for both models reach the same final values, but the red Inception-ResNet-V2 curves take about 6 times longer to get there.

Inception V3 parallelizes better than Inception-ResNet-V2

Plotting GPU usage across the two models explains what's happening. The right plot below shows the utilization percentage of each of the 8 GPUs, in shades of red for the 2016 Inception-ResNet-V2 model and shades of blue for the 2015 Inception V3 model. With Inception-ResNet-V2, GPU 0 (top orange line) does far more work, roughly 6-8X that of the other 7 GPUs (its utilization drops at the end of the first epoch, around 46 minutes). With Inception V3, all 8 GPUs share the work much more evenly. Using Inception V3 instead of Inception-ResNet-V2 for this task will let me iterate much faster.
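The per-GPU utilization numbers behind a plot like this can be collected by polling nvidia-smi while training runs. The sketch below is one hypothetical way to log them to a CSV; the file name, polling interval, and duration are placeholders.

```python
# Hypothetical logger for per-GPU utilization, polling nvidia-smi at a fixed
# interval and appending one row per GPU per sample to a CSV file.
import csv
import subprocess
import time


def log_gpu_utilization(path="gpu_util.csv", interval_s=5, duration_s=3600):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["time_s", "gpu_index", "utilization_pct"])
        start = time.time()
        while time.time() - start < duration_s:
            # Query index and utilization for every GPU in plain CSV form.
            out = subprocess.check_output(
                ["nvidia-smi",
                 "--query-gpu=index,utilization.gpu",
                 "--format=csv,noheader,nounits"]).decode()
            t = round(time.time() - start, 1)
            for line in out.strip().splitlines():
                idx, util = [x.strip() for x in line.split(",")]
                writer.writerow([t, idx, util])
            time.sleep(interval_s)
```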