In this report, I am going to present my experiments with the EvoNorm layers proposed in Evolving Normalization-Activation Layers. In the paper, the authors attempt to unify normalization layers and activation functions into a single computation graph. The authors claim:

Several of these layers enjoy the property of being independent from the batch statistics.

Experimental setup

I am going to compare the EvoNorm B0 and S0 layers on a Mini Inception architecture (BN refers to Batch Normalization).

The EvoNorm authors refer to layers that involve batch aggregations as the EvoNorm-B series; these require maintaining moving-average statistics for inference. The EvoNorm-S series refers to batch-independent layers that rely on individual samples only, a desirable property that simplifies deployment and stabilizes training with small batch sizes.
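To make the distinction concrete, here is a minimal sketch of the EvoNorm-B0 transform based on the formula in the paper (training-mode statistics only; at inference, the batch variance is replaced by its moving average). The variable names are mine, and the learnable parameters gamma, beta, and v are passed in as plain tensors:

import tensorflow as tf

def evonorm_b0(x, gamma, beta, v, eps=1e-5):
    # Batch std: variance over batch, height, and width, per channel (NHWC).
    batch_std = tf.sqrt(tf.math.reduce_variance(x, axis=[0, 1, 2], keepdims=True) + eps)
    # Instance std: variance over the spatial dims of each individual sample.
    instance_std = tf.sqrt(tf.math.reduce_variance(x, axis=[1, 2], keepdims=True) + eps)
    # B0 divides x by the element-wise larger of the two candidate denominators.
    return x / tf.maximum(batch_std, v * x + instance_std) * gamma + beta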

It should also be noted that, as reported in the paper, the EvoNorm layers perform quite well in tasks like instance segmentation with Mask R-CNN and image synthesis with BigGAN.

In the paper, the authors tested the EvoNorm layers on MobileNetV2, ResNets, MnasNet, and EfficientNets. For my experiments, which I ran on Colab, I decided to try the layers on a Mini Inception architecture, as shown in this blog post, trained on the CIFAR10 dataset.
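For context, here is a rough sketch of the building blocks in such a Mini Inception (MiniGoogLeNet-style) architecture; the wiring and filter arguments are assumptions, not the blog post's exact code:

from tensorflow.keras import layers

def conv_module(x, filters, kernel_size, strides=(1, 1)):
    # Conv -> BN -> ReLU; this is the BN-ReLU pair that the EvoNorm layers replace.
    x = layers.Conv2D(filters, kernel_size, strides=strides, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

def inception_module(x, filters_1x1, filters_3x3):
    # Two parallel branches whose outputs are concatenated channel-wise.
    branch_1x1 = conv_module(x, filters_1x1, (1, 1))
    branch_3x3 = conv_module(x, filters_3x3, (3, 3))
    return layers.concatenate([branch_1x1, branch_3x3])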

👉 GitHub repo to reproduce results.

Adam + BN-ReLU + No Data Augmentation

We see that with the Adam optimizer and no data augmentation, the validation loss fluctuates a lot and the network shows no signs of generalization; it is clearly overfitting.

SGD + BN-ReLU + No Data Augmentation

We see that with the SGD optimizer and no data augmentation, we still do not see any signs of generalization.

SGD params:

# Inverse time decay: the learning rate shrinks as 1 / (1 + decay * iterations).
opt = tf.keras.optimizers.SGD(learning_rate=1e-2, momentum=0.9, decay=1e-2 / EPOCHS)

SGD + BN-ReLU + Data Augmentation

When using the SGD optimizer and some data augmentation, we see a much more balanced training behavior, with the training and validation curves tracking each other.
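The exact augmentation settings used in these runs are not shown here, so treat the following as a sketch of a typical light CIFAR10 pipeline rather than the actual configuration:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Light geometric augmentation; the specific ranges below are assumptions.
aug = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
)
# model.fit(aug.flow(x_train, y_train, batch_size=64), epochs=EPOCHS, ...)

Let's now try the EvoNorm layers.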

EvoNorm B0 + No Data Augmentation

With EvoNorm B0 and no data augmentation, the validation loss is higher than what we saw in the previous experiment, and the training and validation accuracies diverge from each other. The network is not generalizing well in this case.

EvoNorm B0 + Data Augmentation

With EvoNorm B0 and some data augmentation, we see NaN losses.
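Not something from the original runs, but a useful guard for experiments like this: Keras ships a callback that aborts training as soon as the loss turns NaN, instead of wasting the remaining epochs.

import tensorflow as tf

# Abort the run the moment a NaN loss is encountered.
nan_guard = tf.keras.callbacks.TerminateOnNaN()
# model.fit(x_train, y_train, validation_data=(x_test, y_test), callbacks=[nan_guard])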

EvoNorm S0 + No Data Augmentation + Groups8

With EvoNorm S0, no data augmentation, and 8 groups, we again see that the validation loss is higher than that of the previous experiment, and the training and validation accuracies diverge. The network is not generalizing well in this case either.

A note on the groups hyperparameter in the EvoNorm layers:

groups controls how many channels are aggregated together when computing the group statistics, similar to what is done in Group Normalization. In the paper, the authors show which group settings work well as the task changes.
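For reference, here is a minimal sketch of the EvoNorm-S0 transform based on the formula in the paper, showing where groups enters the computation; as above, the variable names are mine and gamma, beta, and v are passed in as plain tensors:

import tensorflow as tf

def evonorm_s0(x, gamma, beta, v, groups=8, eps=1e-5):
    # Split the channels into `groups` groups and compute the std over height,
    # width, and the channels within each group (NHWC; c divisible by groups).
    _, h, w, c = x.shape
    grouped = tf.reshape(x, [-1, h, w, groups, c // groups])
    variance = tf.math.reduce_variance(grouped, axis=[1, 2, 4], keepdims=True)
    std = tf.broadcast_to(tf.sqrt(variance + eps), tf.shape(grouped))
    std = tf.reshape(std, [-1, h, w, c])
    # S0 fuses a Swish-like nonlinearity (x * sigmoid(v * x)) with the group stats.
    return x * tf.sigmoid(v * x) / std * gamma + beta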

EvoNorm S0 + No Data Augmentation + Groups16

We see the same behavior with groups of 16.

EvoNorm S0 + No Data Augmentation + Groups32

With EvoNorm S0, no data augmentation, and groups of 32, we see NaN losses.

Observations on EvoNorm S0 layers without data augmentation

sweep_config = {
    "method": "random",
    "metric": {
        "name": "accuracy",
        "goal": "maximize"
    },
    "parameters": {
        "groups": {
            "values": [4, 8, 12, 16, 32]
        },
        "epochs": {
            "values": [10, 20, 30, 40, 50, 60]
        },
        "learning_rate": {
            "values": [1e-2, 1e-3, 1e-4, 3e-4, 3e-5, 1e-5]
        },
        "optimizer": {
            "values": ["adam", "sgd"]
        }
    }
}
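Given the config above, the sweep is launched with the W&B client roughly as follows; the project name and train_fn are placeholders for the actual values:

import wandb

# Register the sweep, then let an agent execute runs sampled from the search space.
sweep_id = wandb.sweep(sweep_config, project="evonorm-experiments")
wandb.agent(sweep_id, function=train_fn)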

SGD + BN-ReLU + Data Augmentation shows the most stable training behavior so far.

If we look closely, all EvoNorm S0 experiments (except groups of 32) without data augmentation show stable training behavior up until ~12 epochs.

This is the case for EvoNorm B0 + No Data Augmentation as well.

One thing that might help here is tuning the learning rate and groups hyperparameters further.

This is why I decided to run a hyperparameter sweep with the search space shown above.

Hyperparameter sweep on EvoNorm S0 layers without data augmentation

The following plot shows the effect of the different hyperparameters on val_accuracy.

We can see that Adam seems to be the most promising optimizer.

Learning rate seems to have a strong positive correlation with val_accuracy, whereas groups has a weak one.

EvoNorm S0 + Data Augmentation + Groups8

All the experiments with the EvoNorm S0 layer + data augmentation showed NaN loss values, not just this one.

Final remarks

As we saw in this quick experimental setup, the EvoNorm layers fail to match the performance of BN-ReLU. But this should not be treated as a foregone conclusion. I encourage you to try the EvoNorm layers out in your own experiments and let me know via Twitter (@RisingSayak) what you find.

👉 Colab notebook to reproduce results.

Acknowledgement

Thanks to Hanxiao Liu (first author of the paper) for helping me correct the implementation.