In this work log, I explore data-parallel distributed training in Keras. I try different configurations of GPU count (1, 2, 4 or 8 GPUs) and total (original)/effective (per GPU) batch size, increase the dataset size, and compare evaluation methods. I include my notes and ideas for next steps from different experiments to show a realistic research process.
- training acceleration is linear-ish: compared to 1 GPU, training runs 1.6 times faster on 2 GPUs and 2.5 times faster on 4 GPUs
- this is very easy to accomplish in Keras with minimal code effort and no fine-tuning: see multi_gpu_model() in Keras utils
- tradeoff between batch size, stability, and accuracy: larger batches generally increase training speed and stabilize the training (fewer lags/stalls observed when training in the cloud). Smaller batches lead to slightly higher validation accuracy on average.
- batch size affects validation accuracy more than GPU count: 4-5% versus 1% — this is reassuring for trusting this data-parallel training paradigm to not adversely affect the final model
- if time-constrained, train on less data with bigger batches: training on less data (10% of full size) with larger batch sizes yields comparable/slightly higher validation accuracy than training on the full dataset with lower batch size
- the optimal configuration, especially the effective batch size per GPU, is more subtle to tune and likely depends on the particular model
Data and model
The core experiment trains a basic 7-layer convolutional network to predict one of 10 animal classes (bird, mammal, reptile, etc.) on class-balanced subsets of iNaturalist 2017, typically 5000 train / 800 test images for fast iteration. You can read more about this task with more powerful models or in the context of curriculum learning.
The code to train a data-parallel model is in this example gist.
Basic Multi-GPU in Keras
Smaller batches for accuracy, larger for speed on 2 GPUs
Keras has a built-in function for data-parallel training: multi_gpu_model in utils. This is trivial to enable—here are examples on a local machine and GCP with 2 GPUs.
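A minimal sketch of what this looks like in code, assuming Keras 2.x (where multi_gpu_model lives in keras.utils; it was removed in later versions) and a machine with 2 GPUs. The small convnet here is a stand-in, not the exact 7-layer model from these experiments:

```python
# Data-parallel training sketch for Keras 2.x. Assumptions: the legacy
# keras.utils.multi_gpu_model is available, the machine has 2 GPUs, and
# the convnet below is a simplified stand-in for the 7-layer model.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import multi_gpu_model

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 3)),
    MaxPooling2D(),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(),
    Flatten(),
    Dense(10, activation='softmax'),  # 10 animal classes
])

# Replicates the model on 2 GPUs; each "original" batch is split evenly,
# so batch_size=64 below means a sub-batch of 32 per GPU.
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer='adam',
                       loss='categorical_crossentropy',
                       metrics=['accuracy'])
# parallel_model.fit(x_train, y_train, batch_size=64, epochs=50,
#                    validation_data=(x_val, y_val))
```

Note that running this requires a multi-GPU machine; on a single-GPU or CPU-only box, multi_gpu_model(gpus=2) raises an error.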
- The "original" batch size shown is the batch size of the non-parallel version of the model. The parallel version splits each of these batches evenly across the GPUs, so the effective sub-batch size seen by each parallel copy of the model is the original batch size / 2 in this case.
- smaller batch sizes (original 32/64, sub-batch 16/32) reach higher accuracy for a fixed epoch count than larger batch sizes (original 128/256, sub-batch 64/128)
- local mode reaches slightly higher accuracy than cloud (GCP), likely because it uses more data (6400 vs 5000 examples) and possibly because of the physical co-location of the GPUs
- larger batches generally lead to faster training, with a few confounding factors as seen in the bottom graph
- local runs (purples) train the fastest
- not much difference between 32, 64, and 128
- long stuck period for "batch 64 (V2, 5K train)" (orange)—generally observing high variability in runtimes on GCP
- I initially attempted to use a clever early version of multi_gpu_model here, which requires substantially more complicated batch size adjustments in the training/validation generators
- increase number of GPUs, amount of training data
- explore effect of batch size
- increase complexity of model (especially ResNet)
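The batch-splitting arithmetic described above is simple enough to sanity-check in a few lines of plain Python (independent of Keras):

```python
def sub_batch_size(original_batch, gpus):
    """Per-GPU sub-batch size when multi_gpu_model splits a batch evenly."""
    assert original_batch % gpus == 0, "batch must divide evenly across GPUs"
    return original_batch // gpus

# Original batch sizes from these runs, split across 2 GPUs:
for b in (32, 64, 128, 256):
    print(b, '->', sub_batch_size(b, gpus=2))  # 16, 32, 64, 128
```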
Scaling to 4 GPUs
Speedup: 4 GPUs: 2.5X, 2 GPUs: 1.6X
2.5X faster training on 4 GPUs vs 1 GPU
- model reaches a slightly higher validation accuracy 2.5X as fast when using 4 versus 1 GPU—this is the main advantage
- train/val acc/loss are not significantly affected by parallelizing the job across 1, 2, or 4 GPUs—this is expected and reassuring
- could continue tuning to improve relative speed-up—improvement is less than linear
- need better metrics (data throughput, batches per unit time, time to convergence) to quantify the added value of distributed training
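A hedged sketch of how such metrics could be defined (plain Python; the numbers in the comments come from the speedups reported in this log):

```python
def throughput(num_examples, epochs, total_seconds):
    # Examples processed per second over a whole training run.
    return num_examples * epochs / total_seconds

def scaling_efficiency(speedup, gpus):
    # Fraction of ideal (linear) speedup actually achieved.
    return speedup / gpus

# From the observations above: 2 GPUs give a 1.6X speedup (80% efficiency),
# 4 GPUs give 2.5X (62.5% efficiency).
print(scaling_efficiency(1.6, 2))  # 0.8
print(scaling_efficiency(2.5, 4))  # 0.625
```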
Train a 7-layer convnet on main iNaturalist dataset (5000 train / 800 val) as a proof of concept for the Keras multi_gpu_model function.
- initially no noticeable difference between 1 GPU and 2 GPUs—masked by one extremely slow run, batch 64_2, which stalled for 2 hours during training for unknown reasons. Leaving it out of the average shows that 2 GPUs yield a 1.6x acceleration
- accuracy vs batch size: consider effect of both batch size and number of GPUs
- for 2 GPUs, batch 64 > batch 32 / 128 > batch 256, but the effect is not super clear. 64 seems to be the optimal choice for batch size
- some combinations might be slower—resource sharing on the CPU?
- run with the same settings on 1, 2, and 3(?) GPUs
- try with various batch sizes on 4 GPUs (64, 128)
- log more details of training time (e.g., time per epoch) so we can compare throughput/speed to a fixed accuracy level
- try running with more/fixed amount of data? Other experiments used 6400/1280
- consider ensembling models
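One way to log time per epoch is a small Keras callback — a sketch assuming tf.keras (the EpochTimer class name is mine, not part of the Keras API):

```python
import time
import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
    """Records wall-clock seconds per epoch in self.epoch_times."""
    def on_train_begin(self, logs=None):
        self.epoch_times = []
    def on_epoch_begin(self, epoch, logs=None):
        self._epoch_start = time.time()
    def on_epoch_end(self, epoch, logs=None):
        self.epoch_times.append(time.time() - self._epoch_start)

# Usage: model.fit(x, y, epochs=50, callbacks=[EpochTimer()]), then inspect
# the callback's epoch_times to compare throughput across configurations.
```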
Batch size matters more than GPU count
Use smaller batches when acc matters
- batch size 64 is still best across 1/2/4 GPUs, closely followed by 128/32, 256 worst. Note that the differences are very small, and averaging by training step versus time shuffles the ordering.
- batch size has a bigger impact than GPU count: a 4-5% difference in train/val accuracy between batch sizes 64 and 256, versus ~1% difference between 1 and 4 GPUs (see previous section), when training a 7-layer CNN on 5000 images to predict 1 of 10 labels. Note that the learning rate stays constant; adjusting it may equalize the disparity across batch sizes.
- batch size 32 is smaller but performs worse than 64: perhaps subbatches of only 8 items are inefficient when split across 4 GPUs?
- test effects of 1) more data, 2) bigger model (simply larger layers/deeper net, optionally Inception-ResNet V2/resnet)
- sudden jump in training loss for batch size 256 and 64, around 125 minutes in — side effect of how run averaging works? need to run more trials to average over shifting training dynamics or different clusters?
- check whether this is hitting CPU limits
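Since the learning rate was held constant across batch sizes here, one common heuristic worth trying (an assumption for future experiments, not something tested in this log) is the linear scaling rule — scale the learning rate proportionally with the batch size:

```python
def scaled_lr(base_lr, batch_size, base_batch_size=64):
    # Linear scaling rule heuristic: grow the learning rate in proportion
    # to the batch size, relative to a reference (base) batch size.
    return base_lr * batch_size / base_batch_size

print(scaled_lr(0.001, 64))   # 0.001 (reference)
print(scaled_lr(0.001, 256))  # 0.004
```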
Training on 2 GPUs with 10X the data
If time-constrained, train with larger batches on less data
Compare performance when training with 5K (blues) vs 50K (red) images on 2 GPUs. For a fixed 50 epochs, the increase in training time is linear with the increase in dataset size and doesn't significantly improve this particular model. The 50K version plateaus in the same amount of time it takes the 5K version to finish training (and barely start to plateau). The 50K version does reach a slightly higher max validation accuracy (49% vs 45% for the 5K case), though this decays with further training. The effect of parallelization and of using 10X more data is much more obvious when looking at the time taken to train than at epochs seen.
Note that for a given amount of training time (up to 3.5 hours), training on 5K examples with larger batch sizes outperforms training on 50K examples on validation accuracy. The 50K case eventually surpasses the 5K cases, but the difference in max validation accuracy reached is only about 4-5%.
- repeat with 4 GPUs
- consider further exploration of training dataset size: tradeoff between more data, accuracy, and more epochs
- the boldest blue run has an unexplained lag (straight segment)—resource competition?
Train 2.5X faster on 4 GPUs, 3X on 8 GPUs?
Linear-ish speedup with distributed training
Distributing over 4 GPUs, even for such a small network and dataset, gives a 2.5X speed-up relative to 1 GPU. 2 GPUs gives a 1.6X speed-up relative to 1 GPU. Overall, the improvement is not linear with GPU count but still substantial. Increasing the compute to 8 GPUs doesn't improve the runtime with the batch sizes tried so far and goes against the overall trend of reducing compute time.
Note that the 8-GPU runs are not directly comparable, as they use double the training data and more than double the validation data.
- The overall accuracy is relatively low in this proof-of-concept. What happens if we train a deeper net?
- Run distributed training for longer / more iterations to get more reliable estimates of acceleration (though this may be task-specific)
- Get more comfortable with distributed training approaches in Tensorflow as opposed to Keras—specific strategies may also show a greater speedup
- GPU usage is uneven with basic Keras distribution: e.g. 11,000 MiB on GPU 0 and 60 MiB each on 1, 2, and 3.
- how to quantify throughput more meaningfully?
Optimal batch size & GPU count? It depends.
Find an optimal batch size for the problem
An original batch size of 64 still does best on average, compared to scaling it up for 8 GPUs. E.g., it may be better to run on 4 GPUs with original batch size 64 (subbatch size 16) than on 8 GPUs with batch size 64 (subbatch size 8), or on 4 GPUs with batch size 256 (subbatch size 64, which is what Keras would recommend).
8 GPUs with the max batch size 512, subbatch size 64 is still best overall (assuming one has access to this extra compute and is willing to explore the best configuration for a particular problem).
Tentative: Not much impact from fixed subbatch size & scaled batch size
Subbatch size matters at 8 GPUs, not before?
Keras recommends increasing the original batch size proportionally with GPU count. In this scenario, the subbatch size is fixed at 64 and the original model's batch size scaled up accordingly from 1 to 8 GPUs.
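That scaling convention, in a couple of lines (plain Python arithmetic, not a Keras API):

```python
def scaled_original_batch(sub_batch, gpus):
    # Keras's recommendation: keep the per-GPU subbatch size fixed and grow
    # the original batch size proportionally with GPU count.
    return sub_batch * gpus

# Fixed subbatch of 64 across 1, 2, 4, and 8 GPUs:
print([scaled_original_batch(64, g) for g in (1, 2, 4, 8)])  # [64, 128, 256, 512]
```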
On 8 GPUs, training is ~3X faster. There is no noticeable speedup between 1, 2, and 4 GPUs.
This is surprising—perhaps the model needs to be more complex or the data load heavier.
Tentative: Vary subbatch size on 8 GPUs
Maximize batch size to be safe?
Larger batch sizes appear more reliable. Note the long, relatively flat stretches of the maroon and light blue lines, as if computation temporarily slowed down (issues with the cloud, perhaps?). Need to run more experiments before drawing solid conclusions.