There are different levels of stochasticity in machine learning. Sometimes it arises in the way the dataset is sampled, and other times in the machine learning models (specifically neural networks) themselves. While stochasticity brings a number of advantages in model training, it also introduces some gnarly problems with reproducibility.

GitHub repository →

In this report, we'll go over some of the methods that promise to make our machine learning experiments more reproducible. Before we jump into the nitty-gritty, we'll discuss some of the motivation behind making our machine learning experiments reproducible.

Let's get started!

(Image comes from here)

Why do we care about reproducibility?

To start this section, I will borrow something from Joel Grus's talk Reproducibility as a Vehicle for Engineering Best Practices -

Joel presented a number of very important points as to why reproducibility in ML is necessary. Here are some of them -

Just to top it all (from Joel's afore-mentioned talk) -

[...] software engineering best practices will make you a better researcher.

Honestly, although I knew about reproducibility, it was only after I went through Joel's deck that I could truly understand the urgent need for reproducibility.

This report focuses on developing reproducible models, which in turn takes care of most of the issues that arise from non-reproducibility.

Developing reproducible models

Battling non-reproducibility in data

A lot of randomness can come from the better half of machine learning models - data! Often while training models, we supply different training and validation splits on each run. This, of course, can lead to different model performance results each time. A better approach is to fix the train and validation splits before we train our models.

If serializing the data splits is difficult, we can still supply a seed each time we split the data. For example, when using the train_test_split function of scikit-learn, we can specify the random_state argument. The idea here is to fix all the variables that can produce different splits each time a function is run.
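As a minimal sketch of a seeded split, assuming scikit-learn and NumPy are installed: the toy arrays, the helper name make_split, and the seed value 42 are illustrative choices, not part of the original report.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, purely illustrative: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

SEED = 42  # arbitrary value; what matters is that it never changes


def make_split(seed=SEED):
    # random_state fixes the shuffle, so the same rows land in each split
    return train_test_split(X, y, test_size=0.2, random_state=seed)


X_train_a, X_val_a, y_train_a, y_val_a = make_split()
X_train_b, X_val_b, y_train_b, y_val_b = make_split()

# Compare the validation parts of two independent calls
print(np.array_equal(X_val_a, X_val_b) and np.array_equal(y_val_a, y_val_b))
```

Passing a different seed would generally give a different, but equally repeatable, split.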

Common examples include -

As a general tip, be very careful when shuffling the training data points. You wouldn't want to shuffle the features and their labels independently, since that breaks the pairing between them. There will also always be some amount of randomness when you are performing data augmentation. In that case, it is recommended to specify the seeds whenever possible.
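One way to keep features and labels paired while shuffling is to draw a single seeded permutation and apply it to both. The sketch below assumes NumPy arrays; shuffle_together and the toy data are hypothetical names used only for illustration.

```python
import numpy as np


def shuffle_together(features, labels, seed=0):
    """Shuffle features and labels with one shared, seeded permutation."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    rng = np.random.default_rng(seed)  # seeded generator, so the order is repeatable
    perm = rng.permutation(len(features))
    return features[perm], labels[perm]


# Toy data where each label equals the first feature, so the pairing is easy to check
X = np.stack([np.arange(6), np.arange(6) * 10], axis=1)
y = np.arange(6)

X_shuf, y_shuf = shuffle_together(X, y, seed=123)

# The pairing survives the shuffle: first feature column still matches the labels
print(np.array_equal(X_shuf[:, 0], y_shuf))
```

Calling the helper again with the same seed reproduces the same order, which is the property we want for reproducible runs.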

Hyperparameters, hyperparameters everywhere!

(Base image courtesy)

Hyperparameters remain at the heart of a neural network and hyperparameter tuning is quite an involved process. So, as you run different experiments with the same network architecture, but with different hyperparameter configurations, it might get difficult to track the values set in those configurations.

Things get even more complicated with different network architectures, each with a different set of hyperparameter settings. This is exactly where Weights and Biases can really shine. Whether you are running hyperparameter optimization or you simply want to store the hyperparameter configurations in a safe place, W&B has got you covered.

For hyperparameter tuning, simply define values you'd like to test in the following way and let W&B sweep it away (see this notebook for full code) -

sweep_config = {
    "method": "random",  # "grid" or "random"
    "metric": {
        "name": "accuracy",
        "goal": "maximize"
    },
    "parameters": {
        "epochs": {
            "values": [10, 15, 20]
        },
        "learning_rate": {
            "values": [1e-2, 1e-3, 1e-4, 3e-4, 3e-5, 1e-5]
        },
        "optimizer": {
            "values": ["adam", "sgd"]
        }
    }
}

Within just a few lines of code, we can generate sweep reports as shown below (just expand the "Run set" below).

After specifying the hyperparameters for defining your model, you simply access them in your code via the config dictionary (config.epochs, for example). An end-to-end example is available here. Even if you are not doing any hyperparameter tuning, it's always good practice to log your hyperparameter values, and you can easily log them in Weights and Biases by specifying the config argument when calling wandb.init(). See an example here (check the config_defaults variable).

Version control for keeping sane

Imagine a scenario where you modified the current data input pipeline, and while doing so you realize you introduced a bug in the model. You'd like to revert to the previous version, but you don't have it anymore. There can be many similar, potentially disastrous situations of different flavors that you might find yourself in.

The answer is simple - "Use a version control system for everything that incorporates code!"

When you are syncing up your ML experiments with W&B, it picks up the SHA of the latest git commit and gives you a link to that version of the code from your GitHub repo. See an example here.

In machine learning, model and data versioning are equally important. Data versioning is a lot more involved than model versioning. Look here to version your data better.

To do model versioning, we can follow some simple steps:

Tests that ensure correctness

It's incredibly hard to debug machine learning systems and here's why. So, how can you make sure that your models are error-free? As both Joel Grus and Jeremy Howard opine - the best way to ensure that is to not make any mistakes in the first place. This is where writing good tests can really help reassure you that your model is working in the ways you expected.

Joel, in the talk mentioned above, put together a common set of test scenarios for an ML model -

Although the test cases will vary from scenario to scenario, the ones above definitely give you a very good starting point. So, to cut a long story short, unit tests help ensure the correctness of your models, thereby making them more reproducible.
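As a rough illustration of what such tests can look like, here is a sketch written as pytest-style functions. The checks themselves (output shape, valid probabilities, same seed giving the same output) are illustrative assumptions and not necessarily the ones from the talk, and tiny_model is a hypothetical NumPy stand-in for a real network.

```python
import numpy as np


def tiny_model(x, seed=0):
    """Hypothetical stand-in for a real model: one seeded linear layer + softmax."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(x.shape[1], 3))  # 3 output classes
    logits = x @ w
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


def test_output_shape():
    x = np.ones((4, 5))
    assert tiny_model(x).shape == (4, 3)


def test_rows_are_probabilities():
    x = np.ones((4, 5))
    probs = tiny_model(x)
    assert np.all(probs >= 0)
    np.testing.assert_allclose(probs.sum(axis=1), 1.0)


def test_same_seed_same_output():
    x = np.ones((4, 5))
    np.testing.assert_array_equal(tiny_model(x, seed=7), tiny_model(x, seed=7))
```

Saved in a file such as test_model.py (a hypothetical name), these can be run with pytest; the same pattern carries over to a real tf.keras model by swapping out tiny_model.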

Model checkpointing and beyond

Imagine that while training your neural network, the network showed good generalization behavior at a particular epoch, and just after that epoch it started to diverge again. Wouldn't it be better if you could set up checkpoints to either save network snapshots after every epoch or save the best snapshot of the network within a range of epochs?

It's even worse if training somehow crashes midway or you lose the weights. The good news is that Weights and Biases can do this for you automatically – i.e., it will save the best version of your network to the runs page for that model.

Currently, for tf.keras models, W&B can serialize and sync the best model in an .h5 format. For custom models (that do not support serialization to an .h5 format), you need to do this manually. You can see how in this tutorial. Here's how I would set up model checkpointing and saving with W&B -

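import tensorflow as tf
import wandb

# Set up model checkpoint callback
# Assumption: the directory prefix was lost in the original; wandb.run.dir is used here
filepath = wandb.run.dir + "/{epoch:02d}-{val_accuracy:.2f}.ckpt"
checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath,
    monitor="val_accuracy", save_best_only=True, mode="max")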

After the model finishes training, the checkpoint files will automatically get uploaded to the corresponding W&B run page.

This makes collaborating across distributed teams a lot easier, since anyone on your team who wants to reproduce your model can do so – they have access to the weights and the code (as W&B links your GitHub commits to your training runs).

Each run here refers to an individual experiment and W&B shows you the performance metrics associated with each run.

Battling non-reproducibility at code-level

I am again going to borrow some of Joel's ideas from the talk mentioned above and also some findings from my own experiences. Many ML algorithms are stochastic in nature, and most of this stochasticity is introduced when there are configurations in the algorithms that are not constant in nature. For example, the way we initialize the weights of a neural network should be random by definition.

Random seeds + fixed (initial) weights

There are two obvious solutions here -

Now, of course, there are other parts of a neural network that can introduce non-determinism – dropout layers, sampling layers (remember VAEs?), latent vectors to name a few. The non-determinism they introduce is of the good kind, as it often helps neural networks perform better. We can run our training data through these layers a number of times and measure the average deviation in each of the results. If there isn't anything wrong, the deviations won't be very high.
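The two ideas in this section's heading, random seeds and fixed initial weights, can be sketched roughly as follows. This version uses only the standard library and NumPy; set_seeds and init_weights are hypothetical helpers, and the .npy file is just one possible way to store the initial weights. In TensorFlow you would additionally call tf.random.set_seed(seed), and with tf.keras the weights could be stored and restored via model.save_weights() and model.load_weights().

```python
import os
import random
import tempfile

import numpy as np


def set_seeds(seed=42):
    """Seed the Python and NumPy random number generators."""
    random.seed(seed)
    np.random.seed(seed)


def init_weights(shape):
    # Hypothetical initializer: small random normal values
    return np.random.normal(scale=0.05, size=shape)


# 1. Random seeds: the same seed gives the same initial weights
set_seeds(42)
w_first = init_weights((3, 2))
set_seeds(42)
w_second = init_weights((3, 2))
print(np.array_equal(w_first, w_second))

# 2. Fixed initial weights: save them once, reload them in every later experiment
weights_path = os.path.join(tempfile.mkdtemp(), "initial_weights.npy")
np.save(weights_path, w_first)
w_reloaded = np.load(weights_path)
print(np.array_equal(w_first, w_reloaded))
```

Seeding is the lighter option, while saving the initial weights also protects you if the initializer's implementation changes between library versions.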

Ground setup and CUDA-cuDNN

For the sake of "reproducibility", I am going to use the following configuration for my machine to host a Google Cloud Platform AI Platform Notebook -

Another very important consideration that follows from infrastructural uniformity is the behavior of cuDNN and CUDA (you are not going to want to train your large neural networks on CPUs). There are many efficient ways to compute the operations involved in a neural network, and they do not always produce the same results every time because these results are approximations.

ML libraries generally make use of the efficient implementations that come with CUDA and cuDNN. In doing so, they introduce randomness (those implementations are approximations, as mentioned above). Another reason is that the implementation to be used is chosen by cuDNN at runtime. So, when using TensorFlow (2.1) with a compatible NVIDIA GPU, to save yourself from a potential reproducibility crisis, it's always better to do the following before you do anything else in the code -

import os
# Set the flag before importing TensorFlow so it is in place from the start
os.environ["TF_DETERMINISTIC_OPS"] = "1"
import tensorflow as tf

This practice comes from the tensorflow-determinism repository. Thanks to Sebastian Raschka's L13 Intro to Convolutional Neural Networks (Part 2) 1/2 lecture, which inspired the part on CUDA and cuDNN. Be sure to check out this amazing presentation by Duncan Riach to learn more about determinism in deep learning in general.

We can now move towards reproducibility concerns with code.

It's almost impossible to cater to all the ML models and frameworks out there and talk about reproducibility in one single report. So, we are just going to focus on one pair - neural networks and TensorFlow. Note that most of these concepts still apply to other frameworks.

Before we write any code, we need to make sure our hardware/software infrastructure is unified. This is especially useful when you are working on a team.

Overview of the methods covered


This report was an effort to provide some simple but useful methods that can help you to build reproducible models. This is in no way an exhaustive list. I polled a number of Machine Learning GDEs about their thoughts on reproducibility and here's what they said:

Thanks to Mat, Aakash, and Souradip for their contributions. As ML practitioners, maximum reproducibility should always be our goal, in addition to SOTA results.

I would love to know what reproducibility tools/methods you use. If you have any feedback on the report, don't hesitate to tweet me at @RisingSayak.