This report is not going to talk about the nitty-gritty of the EfficientNet family of models. If you're interested in learning about the details of those models, you should absolutely check out this amazing report.
This report is accompanied by a Colab Notebook so that you are able to reproduce the results.
In this report, I am going to show how to use the EfficientNet family of models for transfer learning on image classification tasks. We will be using the EfficientNet models ranging from b0 to b3. For comparison purposes, we will also be using the MobileNetV2 model.
We will be using the Cats vs. Dogs dataset. It is already included in TensorFlow Datasets, so much of the hard work is already done for us. The code listing below downloads the dataset (if it is not already cached) and loads it, split into training and validation sets as per our choice.
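import tensorflow_datasets as tfds

# Use the first 80% of the original "train" split for training
# and the remaining 20% for validation
(raw_train, raw_validation), metadata = tfds.load(
    'cats_vs_dogs',
    split=['train[:80%]', 'train[80%:]'],
    with_info=True,
    as_supervised=True
)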
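Most of the image classification models on TF Hub come in the following two variants: models that include the classification head, and feature extractor (feature-vector) models that exclude it.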
All of these models are pre-trained on the ImageNet dataset. As we will be using transfer learning, we will be going with the second variant of models. One very important thing to note here is that not all of these models can be fine-tuned, especially the ones based on TensorFlow 1.
Unfortunately, the EfficientNet family of models cannot be fine-tuned in this experimental configuration. The code listing below provides a utility function that downloads the respective feature extraction model, adds a classification head on top, compiles the final model, and returns it.
import tensorflow as tf
import tensorflow_hub as hub

def get_training_model(url, trainable=False):
    # Load the respective EfficientNet model but exclude the classification layers
    # (IMG_SIZE is assumed to be defined earlier in the notebook)
    extractor = hub.KerasLayer(url, input_shape=(IMG_SIZE, IMG_SIZE, 3),
                               trainable=trainable)

    # Construct the head of the model that will be placed on top of
    # the base model
    model = tf.keras.models.Sequential([
        extractor,
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1)
    ])

    # Compile and return the model
    model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  optimizer="adam",
                  metrics=["accuracy"])
    return model
The utility function takes a url argument. For feature extractor networks based on EfficientNets, this generally looks like https://tfhub.dev/google/efficientnet/<variant>/feature-vector/1, where <variant> can be anything from b0 to b7. Although the utility function has a trainable argument, for EfficientNet models in TF Hub, specifying trainable=True gives you the following error:
ValueError: in user code:

    /usr/local/lib/python3.6/dist-packages/tensorflow_hub/keras_layer.py:206 call  *
        self._check_trainability()
    /usr/local/lib/python3.6/dist-packages/tensorflow_hub/keras_layer.py:265 _check_trainability  *
        raise ValueError(

    ValueError: Setting hub.KerasLayer.trainable = True is unsupported when loading from the hub.Module format of TensorFlow 1.
In the next few sections, we will perform transfer learning with four different variants (b0, b1, b2, and b3) of the EfficientNet family of models, and we will also analyze how those models perform.
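The URL pattern quoted earlier can be turned into a small helper so that the four variants do not have to be typed out by hand. This is only a sketch: the names `URL_TEMPLATE`, `VALID_VARIANTS`, and `efficientnet_url` are hypothetical and do not come from the accompanying notebook; only the URL pattern itself is taken from this report.

```python
# Hypothetical helper (not from the original notebook): fills the <variant>
# slot of the TF Hub URL pattern quoted in this report.
URL_TEMPLATE = "https://tfhub.dev/google/efficientnet/{variant}/feature-vector/1"

# EfficientNet variants range from b0 to b7
VALID_VARIANTS = tuple(f"b{i}" for i in range(8))


def efficientnet_url(variant):
    """Return the feature-vector URL for a given EfficientNet variant."""
    if variant not in VALID_VARIANTS:
        raise ValueError(f"Unknown variant {variant!r}; expected one of {VALID_VARIANTS}")
    return URL_TEMPLATE.format(variant=variant)


# The four variants used in this report
urls = {variant: efficientnet_url(variant) for variant in ("b0", "b1", "b2", "b3")}

for variant, url in urls.items():
    print(variant, url)
```

Each of these URLs could then be passed as the url argument of get_training_model.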
All the models we will be using for the experiments come from TensorFlow Hub. TensorFlow Hub provides a comprehensive collection of pre-trained models that can be used for transfer learning, and many of those models support fine-tuning as well. TensorFlow Hub has models for a number of different domains including image, text, video, and audio. Models are also available in different TensorFlow product formats including TensorFlow Lite, TensorFlow.js, and so on.
As we can see, the network does not show overly unstable training behavior. The following shows the memory footprint of this model:
ls -lh b0.h5
-rw-r--r-- 1 root root 18M Apr 11 14:19 b0.h5
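If you would rather check the file size from Python than from the shell, a minimal sketch could look like the following. The helper name `file_size_mib` is hypothetical, and the file names are simply the ones used in this report; the loop skips any file that does not exist on your machine.

```python
import os


def file_size_mib(path):
    """Return the on-disk size of `path` in mebibytes (units of 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 ** 2)


# Report the size of each saved model that is present in the working directory
for name in ("b0.h5", "b1.h5", "b2.h5", "b3.h5"):
    if os.path.exists(name):
        print(f"{name}: {file_size_mib(name):.1f} MiB")
```

Like `ls -lh`, this measures the saved file on disk, not the memory used while the model is running.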
To maintain brevity, let's visualize the training behavior of the remaining three models, based on the b1, b2, and b3 models respectively.
We can see that all the model variants show good training behavior. So far, the b0 model has shown the best performance in terms of validation accuracy. In terms of memory footprint, the b3-based model is the heaviest, coming in at 44 MB.
ls -lh b1.h5
-rw-r--r-- 1 root root 28M Apr 11 14:45 b1.h5
ls -lh b2.h5
-rw-r--r-- 1 root root 33M Apr 11 15:14 b2.h5
ls -lh b3.h5
-rw-r--r-- 1 root root 44M Apr 11 15:54 b3.h5
It's important to note that a model's size is determined by the number of parameters it contains along with the precision format those parameters are stored in. In our case, the precision is float32. Below I present a table from the original EfficientNet paper which shows the FLOPs (floating-point operations) of each of the different EfficientNet variants (FLOPs are a direct measure of computation):
We can see that as we scale the model up from b0 to b7, the FLOPs increase. We will come back to this point in a moment.
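To make the earlier point about parameter count and precision concrete, here is a back-of-the-envelope sketch. The helper name `estimate_weights_size_mb` is hypothetical. The parameter counts are the ones reported in the EfficientNet paper for the full b0 and b7 models (5.3M and 66M), so the estimates will not match the .h5 files in this report exactly: those files hold a feature extractor plus our own classification head.

```python
def estimate_weights_size_mb(num_params, bytes_per_param=4):
    """Estimate the storage needed for the weights alone, in megabytes (10**6 bytes).

    bytes_per_param is 4 for float32 and 2 for float16.
    """
    return num_params * bytes_per_param / 1e6


# Parameter counts as reported in the EfficientNet paper (full models with
# the ImageNet classification head), used here purely for illustration
PAPER_PARAM_COUNTS = {"b0": 5_300_000, "b7": 66_000_000}

for variant, num_params in PAPER_PARAM_COUNTS.items():
    fp32 = estimate_weights_size_mb(num_params, bytes_per_param=4)
    fp16 = estimate_weights_size_mb(num_params, bytes_per_param=2)
    print(f"{variant}: ~{fp32:.1f} MB in float32, ~{fp16:.1f} MB in float16")
```

The estimate only covers the weights; a saved model file can also carry extra data such as the optimizer state.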
As we can see in the figure on the left, although the b0 model takes the least amount of time, it also produces the highest val_accuracy. We'll see why in just a second.
Given all these observations, here are a few guidelines for picking which variant of EfficientNet to use for a custom dataset. If your dataset is fairly small, it's prudent to start with b0, but without force-fitting it. A simpler model might work just as well for your custom dataset, so always rule that out before settling on an EfficientNet.
You might have already noticed that as we increased the model capacity (b0 being the lightest model and b3 the heaviest in our case), the performance kept degrading. Quoting Ajay's aforementioned report:
Ok, so you probably have a fairly good idea of the computational cost of different EfficientNets by now.
But we still haven't addressed the most disturbing question: why didn't compound scaling work?
Specifically, we should have seen at least consistent performance across models, even if there wasn't an accuracy increase. So why did the bigger models perform worse? Here are a few possible reasons:
- Hyperparameters: it's well known that the same hyperparameters don't work for all models, otherwise we'd all just use the globally "optimal" learning rate, batch size, etc. It could be the case that the larger models require higher/lower learning rates to perform well.
- Overparameterization: The largest EfficientNet we used, EfficientNetb7, has over 60 million parameters. That's a lot for a small dataset like ImageNette, and it's likely that the larger models had many more parameters than necessary.
- Regularization: would have probably helped control the overparameterization issue. But adding regularization only to the large models would lead to unfair comparisons.
I really couldn't have explained the phenomenon better than Ajay already did in the report. Now, to gauge how good these models are, we will compare them with MobileNetV2 in the next sections.
As we can see, in terms of the losses our MobileNetV2-based model is doing way better than the b0-based model. Note that we did not fine-tune in this case. We compared only with the b0-based model because it was the best-performing one in our previous experiments.
In terms of memory footprint as well, this MobileNetV2-based network wins:
ls -lh mobilenet_v2_no_ft.h5
-rw-r--r-- 1 root root 11M Apr 11 16:09 mobilenet_v2_no_ft.h5
Only 11 MB, whereas the b0-based network is 18 MB in size. In terms of accuracy scores and model training times, here's the scene.
Everything is crystal clear, isn't it? 😉
Everything remains the same for this case, except we can now make use of fine-tuning as well. For comparison's sake, we will only be using transfer learning in this case.
As we can see, the MobileNetV2-based model clearly outperforms all the variants of the EfficientNet-based models we have tried so far. It not only performs better but is also better in terms of memory footprint and training time. The memory footprint can be reduced further with the help of quantization.
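To give a rough feel for why quantization shrinks a model, here is a toy sketch in plain Python. It is not the TensorFlow Lite implementation, and the function names `quantize_int8` and `dequantize` are hypothetical. It maps float weights to 8-bit integers using a scale and a zero point, so each weight needs 1 byte instead of the 4 bytes of float32, at the cost of a small rounding error.

```python
def quantize_int8(weights):
    """Affine-quantize a list of floats to integers in [-128, 127].

    Returns (quantized_values, scale, zero_point).
    """
    # Extend the range so that 0.0 is always representable
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / 255
    if scale == 0:
        scale = 1.0  # all weights are zero; any scale works
    zero_point = round(-128 - lo / scale)
    quantized = [
        max(-128, min(127, round(w / scale) + zero_point)) for w in weights
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map the 8-bit integers back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


# Toy weights, chosen only for illustration
weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
quantized, scale, zero_point = quantize_int8(weights)
restored = dequantize(quantized, scale, zero_point)

max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("quantized:", quantized)
print("max round-trip error:", max_error)
```

In practice you would rely on a framework's own quantization tooling instead of hand-rolling this; the sketch is only meant to show where the size reduction comes from.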
So for our dataset, the EfficientNet family of models did not perform particularly well, but that does not in any way diminish their significance.
If you have a relatively large dataset, you should definitely give those models a try. But at the same time, we should keep in mind we don't need a hammer to kill a rat.
Let me know your thoughts on this report via Twitter (@RisingSayak).