Transformer models and transfer learning methods continue to propel the field of Natural Language Processing forwards at a tremendous pace. However, state-of-the-art performance too often comes at the cost of (a lot of) complex code.

Simple Transformers avoids all the complexity and lets you get down to what matters – model training and experimenting with the Transformer model architecture. It helps you bypass all the complicated setups, boilerplate code, and all the other general unpleasantness by initializing a model in one line, training in the next, and evaluating with the third.

In this report, I build on the simpletransformers repo and explore some of the most common applications of deep NLP – including tasks from the GLUE benchmark – along with recipes for training SOTA transformer models to perform these tasks. I've used the DistilBERT transformer model for all the tasks, as it is computationally less expensive. I also extensively explore optimizing DistilBERT hyperparameters with Sweeps.

Simpletransformers comes with native support for model performance tracking, using Weights & Biases.

Full code walkthrough on Colab →

Language Modeling

Training and Evaluation losses

Comparing the training and evaluation losses of all the runs, we can observe that the rate of optimization (the rate at which the loss decreases) increases as we increase the learning_rate, up until the 5th run, where it's set to 3e-3.

On further increasing the learning_rate to 3e-2, we get an irregular graph that remains almost parallel to the X-axis, suggesting that no learning is taking place as the learning_rate is too high. This is more evident in the eval_loss graph.
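The effect of an overly large learning rate can be reproduced with a toy gradient-descent example (a sketch on a simple quadratic, not the actual language model; the step counts and rates here are made up for illustration):

```python
# Gradient descent on f(w) = w**2, whose minimum is at w = 0.
def descend(lr, steps=20, w=1.0):
    for _ in range(steps):
        w = w - lr * 2 * w   # gradient of w**2 is 2w
    return abs(w)

small = descend(0.03)   # shrinks toward the minimum each step
large = descend(1.05)   # overshoots: |w| grows every step, so no learning happens
```

With the small rate, `|w|` steadily decays; with the large rate, each update overshoots the minimum and the iterate diverges, which is the flat-or-erratic loss curve seen in the 3e-2 run.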


The Weights & Biases dashboard has ways to analyze the performance of a model based on a target metric, as well as its level of resource utilization. We'll focus on the following visualizations to compare our models:

Parallel Coordinates: This chart compares all the hyperparameters with respect to the metric being optimized. In this case, the objective is to minimize the logged training_loss.

Line Plot: We'll use the line plot to compare the training_loss and eval_loss of all the logged runs.

Training the language model using simpletransformers

def trainLM():
  import wandb
  from simpletransformers.language_modeling import LanguageModelingModel

  # configure your model
  train_args = {
      "reprocess_input_data": False,
      "overwrite_output_dir": True,
      "num_train_epochs": 2,
      "save_eval_checkpoints": True,
      "save_model_every_epoch": False,
      "learning_rate": 3e-2,
      "warmup_steps": 1000,
      "train_batch_size": 64,
      "eval_batch_size": 128,
      "fp16": False,
      "gradient_accumulation_steps": 1,
      "block_size": 128,
      "max_seq_length": 128,
      "dataset_type": "simple",
      "wandb_project": "simpletransformers",
      "wandb_kwargs": {"name": "LM3e-2"},
      "logging_steps": 100,
      "evaluate_during_training": True,
      "evaluate_during_training_steps": 50000,
      "evaluate_during_training_verbose": True,
      "use_cached_eval_features": True,
      "sliding_window": True,
      "vocab_size": 20000,
      "generator_config": {
          "embedding_size": 128,
          "hidden_size": 256,
          "num_hidden_layers": 3,
      },
      "discriminator_config": {
          "embedding_size": 128,
          "hidden_size": 256,
      },
  }

  train_file = "train.txt"
  test_file = "test.txt"

  # Initialize a LanguageModelingModel (here, an ELECTRA model trained from scratch)
  model = LanguageModelingModel("electra", None, args=train_args, train_files=train_file)

  # Train the model
  model.train_model(train_file, eval_file=test_file)


ELECTRA is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the SQuAD 2.0 dataset.
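The replaced-token-detection idea can be sketched in plain Python, with no neural networks involved (the tokens, vocabulary, and replacement probability below are invented for illustration):

```python
import random

random.seed(0)

def corrupt(tokens, vocab, mask_prob=0.3):
    """Toy 'generator': randomly replace some tokens with fakes from a vocabulary."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            corrupted.append(random.choice(vocab))  # fake token
            labels.append(1)                        # 1 = replaced
        else:
            corrupted.append(tok)                   # original token
            labels.append(0)                        # 0 = real
    return corrupted, labels

tokens = ["the", "chef", "cooked", "the", "meal"]
vocab = ["ate", "dog", "ran", "house"]
corrupted, labels = corrupt(tokens, vocab)
# ELECTRA's discriminator is trained to recover `labels` from `corrupted`,
# i.e. to predict, per position, whether the token was replaced.
```

In the real model the fakes come from a small generator network rather than random sampling, which is what makes the discrimination task hard enough to learn good representations from.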


The major advantage of the ELECTRA training process is that it not only enables training large models on a single GPU, but is also more accurate than traditional pre-training methods at comparable compute.


MultiLabel Classification


It's quite straightforward to visually demonstrate the effect of each hyperparameter on the metric being optimized by using the parallel coordinates chart.

We can also use the parameter importance plot to find the most important parameters with respect to the desired metric (training loss). For example, here we can see that our loss is negatively correlated with the number of epochs, so the longer we train, the lower our loss goes.
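As a toy illustration of what such a negative correlation means, here is a Pearson correlation computed over made-up (epochs, final loss) pairs – these numbers are invented for illustration, not taken from the actual sweep:

```python
# Hypothetical sweep results: (num_train_epochs, final training_loss) per run.
runs = [(1, 0.42), (2, 0.31), (3, 0.24), (4, 0.19), (5, 0.17)]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

epochs, losses = zip(*runs)
r = pearson(epochs, losses)
# r is strongly negative: more epochs, lower loss.
```

The parameter importance panel reports essentially this kind of correlation (plus a feature-importance score) for every swept hyperparameter at once.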


To demonstrate Multilabel Classification, we will use the Jigsaw Toxic Comments dataset from Kaggle. Simple Transformers requires a labels column containing a multi-hot encoded list of labels, as well as a text column containing all the text.
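A minimal sketch of the expected DataFrame shape (the two comments and their label vectors below are invented examples, not rows from the actual Jigsaw data):

```python
import pandas as pd

# The six Jigsaw labels: toxic, severe_toxic, obscene, threat, insult, identity_hate.
# Each entry in `labels` is a multi-hot list: one 0/1 flag per label.
train_df = pd.DataFrame(
    {
        "text": ["you are wonderful", "I will find you"],
        "labels": [[0, 0, 0, 0, 0, 0], [1, 0, 0, 1, 0, 0]],
    }
)
```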

from simpletransformers.classification import MultiLabelClassificationModel

model = MultiLabelClassificationModel(
    'distilbert', 'distilbert-base-uncased', num_labels=6,
    args={'train_batch_size': 2, 'gradient_accumulation_steps': 16, 'learning_rate': 3e-5,
          'num_train_epochs': 3, 'max_seq_length': 512})

This creates a MultiLabelClassificationModel that can be used for training, evaluating, and predicting on multilabel classification tasks. The first parameter is the model_type, the second is the model_name, and the third is the number of labels in the data.

Training the transformer model

from simpletransformers.classification import MultiLabelClassificationModel
import pandas as pd

def trainMultiLabel():
  import wandb

  print("HyperParams=>>", wandb.config.epochs)
  # Create a MultiLabelClassificationModel
  model = MultiLabelClassificationModel(
      'distilbert', 'distilbert-base-uncased', num_labels=6,
      args={"reprocess_input_data": True,
            "overwrite_output_dir": True,
            "num_train_epochs": wandb.config.epochs,
            "learning_rate": wandb.config.learning_rate,
            "wandb_project": "simpletransformers",
            "fp16": False,
            "max_seq_length": 64})

  # You can set class weights by using the optional weight argument

  # Train the model on the Jigsaw DataFrame (text + multi-hot labels columns)
  model.train_model(train_df)

  # Evaluate the model
  result, model_outputs, wrong_predictions = model.eval_model(eval_df)
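For multilabel tasks, model_outputs holds per-label probabilities, which can be turned into hard predictions with a threshold. A minimal sketch with made-up probabilities (0.5 is an assumed cut-off, not a tuned value):

```python
# Hypothetical per-label probabilities for two comments (six Jigsaw labels each).
model_outputs = [
    [0.91, 0.12, 0.40, 0.03, 0.77, 0.08],
    [0.10, 0.02, 0.05, 0.01, 0.09, 0.04],
]

threshold = 0.5
preds = [[1 if p > threshold else 0 for p in row] for row in model_outputs]
# preds -> [[1, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0]]
```

Raising or lowering the threshold trades precision against recall per label; 0.5 is just the neutral default.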

Named Entity Recognition


We'll again go through the visualizations used in the previous sections to gain some valuable insights.

Training Loss: The trend observed here is that on this particular task, the model optimizes faster at a lower learning rate. We've used learning rates ranging from 2e-3 to 2e-5, and the models with lower learning rates have resulted in lower losses and reached their optimal point much faster. The same hypothesis can be confirmed using the parallel coordinates chart.



Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a sub-task of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. To demonstrate Named Entity Recognition, we’ll be using the CoNLL Dataset.
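The NER models in simpletransformers can consume a CoNLL-style text file: one word tag pair per line, with a blank line separating sentences. A minimal sketch that writes such a file (the sentences below are invented examples, not CoNLL data):

```python
# Each sentence is a list of (word, tag) pairs using BIO-style entity tags.
sentences = [
    [("Harry", "B-PER"), ("Potter", "I-PER"), ("lives", "O"), ("in", "O"), ("London", "B-LOC")],
    [("He", "O"), ("attends", "O"), ("Hogwarts", "B-ORG")],
]

with open("train.txt", "w") as f:
    for sentence in sentences:
        for word, tag in sentence:
            f.write(f"{word} {tag}\n")  # one token and its tag per line
        f.write("\n")                   # blank line ends the sentence
```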


We'll create a NERModel that can be used for training, evaluation, and prediction in NER tasks. Like the other model classes, the NERModel object takes in a model_type, a model_name, and an optional args dict.

We use the following default args for the simpletransformers NERModel:

    "output_dir": "outputs/","cache_dir": "cache_dir/","fp16": True,
    "fp16_opt_level": "O1","max_seq_length": 128,"train_batch_size": 8,
    "gradient_accumulation_steps": 1,"eval_batch_size": 8, "num_train_epochs": 1,
    "weight_decay": 0, "learning_rate": 4e-5, "adam_epsilon": 1e-8,
    "warmup_ratio": 0.06, "warmup_steps": 0,"max_grad_norm": 1.0,
    "logging_steps": 50,"save_steps": 2000,"overwrite_output_dir": False,
    "reprocess_input_data": False,"evaluate_during_training": False,
    "process_count": cpu_count() - 2 if cpu_count() > 2 else 1,
    "n_gpu": 1,

Training the transformer model

def trainNER():
  import wandb
  from simpletransformers.ner import NERModel

  print("HyperParam=>>", wandb.config.epochs, wandb.config.learning_rate)
  # Create a NERModel
  model = NERModel('distilbert', 'distilbert-base-cased',
                   args={"reprocess_input_data": True,
                         "overwrite_output_dir": True,
                         "num_train_epochs": wandb.config.epochs,
                         "learning_rate": wandb.config.learning_rate,
                         "wandb_project": "simpletransformers",
                         "fp16": False,
                         "max_seq_length": 64})

  # Train the model on the CoNLL-formatted training file
  model.train_model('train.txt')

  # Evaluate the model
  result, model_outputs, predictions = model.eval_model('test.txt')

  # Check predictions
  print(predictions[:5])

Question Answering

Here, I've fixed the number of epochs to 2, as training a QA model on the SQuAD dataset requires a lot of computational power.

Thus, the sweep iterates across multiple learning rates, while the number of training epochs remains constant.
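A sketch of what such a sweep configuration might look like (the learning-rate values below are illustrative assumptions, not the exact grid used in this report):

```python
# A grid sweep over learning rates only; num_train_epochs stays fixed at 2
# inside the training function itself.
sweep_config = {
    "method": "grid",
    "metric": {"name": "train_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"values": [2e-5, 3e-5, 5e-5, 1e-4]},
    },
}

# Launching the sweep (requires a wandb login, so left commented here):
# import wandb
# sweep_id = wandb.sweep(sweep_config, project="simpletransformers")
# wandb.agent(sweep_id, function=trainQA)
```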


Question answering (QA) is a computer science discipline in the field of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language.


We'll use the Stanford Question Answering Dataset (SQuAD 2.0) for training and evaluating our model. SQuAD is a reading comprehension dataset and a standard benchmark for QA models. The dataset is publicly available, and a sentence-pair task derived from it (QNLI) is included in the GLUE benchmark.

The dataset consists of multiple dictionaries. Each such dictionary contains two attributes –

- context: The paragraph or passage from which the answers are drawn.
- qas: A list of questions and answers.

Questions and answers are represented as dictionaries. Each dictionary in qas has the following components.

- id: A unique ID for the question.
- question: The question text.
- is_impossible: A flag indicating whether the question can be answered from the context (SQuAD 2.0 includes unanswerable questions).
- answers: A list of answers to the question.

A single answer is represented by a dictionary with the following attributes.

- text: The answer text.
- answer_start: The character index of the starting position of the answer in the context.
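A minimal sketch of one such training example (the context, question, and ID below are invented for illustration):

```python
context = "Simple Transformers is built on top of the Hugging Face Transformers library."
answer_text = "the Hugging Face Transformers library"

example = {
    "context": context,
    "qas": [
        {
            "id": "00001",
            "question": "What is Simple Transformers built on top of?",
            "is_impossible": False,
            # answer_start is the character offset of the answer within the context
            "answers": [{"text": answer_text, "answer_start": context.index(answer_text)}],
        }
    ],
}
```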

The Question Answering Model

Next we'll create a QuestionAnsweringModel object and set the hyperparameters for fine tuning the model. Just as before, the first parameter is the model_type and the second is the model_name.

Training the model

import json
with open('train-v2.0.json', 'r') as f:
    train_data = json.load(f)

# Flatten the nested SQuAD structure into a flat list of paragraph-level entries
train_data = [item for topic in train_data['data'] for item in topic['paragraphs']]

# Use a 5,000-example subset to keep training time manageable
train_data = train_data[:5000]

def trainQA():
  import wandb
  from simpletransformers.question_answering import QuestionAnsweringModel

  train_args = {
      'learning_rate': wandb.config.learning_rate,
      'num_train_epochs': 2,
      'max_seq_length': 128,
      'doc_stride': 64,
      'overwrite_output_dir': True,
      'reprocess_input_data': False,
      'train_batch_size': 2,
      'fp16': False,
      'wandb_project': "simpletransformers",
  }

  model = QuestionAnsweringModel('distilbert', 'distilbert-base-cased', args=train_args)

  # Train the model
  model.train_model(train_data)



In this report, we've trained and visualized models performing some of the most important deep NLP tasks using simpletransformers, a high-level wrapper around the popular Hugging Face library. Simple Transformers combines the accessible transformer models provided by Hugging Face with its own powerful training scripts, which makes training a SOTA model a piece of cake.