# Introduction

## Protein Structure Background

Despite recent advancements in the fields of biology, computer science, and machine learning, protein structure prediction remains one of the "holy grails" of molecular biology. Since the Nobel Prize-winning discovery of the structure of sperm whale myoglobin in the late 1950s, researchers have worked to uncover the 3D structure of thousands of other proteins in an effort to better understand the molecular basis of life.

Proteins, long chains of amino acids that naturally fold up into unique, “globular” structures when produced, are essential to many of the life-sustaining chemical reactions in the cell. Even though proteins are composed of only about 20 different types of amino acid residues, each residue has a unique molecular “side chain” with chemical properties that determine the protein’s function. In fact, knowing the precise orientation of each amino acid is key to structure-based drug discovery, a method of developing new medicines by rational examination of where a drug molecule might bind to its protein target.

Although protein structure information is undeniably significant, it is difficult and expensive to produce. For instance, scientists may spend tens of thousands of dollars to produce a protein structure via X-ray crystallography, with many failed experiments along the way. Because protein structures are costly, impactful, and scarce, the scientific community has spent significant time and effort on methods that predict the shape of a protein from its primary amino acid sequence, which is much more readily available.

To put this in perspective, as of March 2020, there were more than 175 million protein sequences available in the UniProtKB database, but only about 162 thousand protein structures in the Protein Data Bank (PDB). Clearly, there is a great need (and opportunity) to predict protein structures from their underlying amino acid sequences!

The availability of protein structure data (red) pales in comparison with that of protein sequence data (blue). Source: GORBI: Gene Ontology at Ruđer Bošković Institute (C) 2010.

Proteins are a linear chain of amino acids, each of which contains a unique sidechain. Source: Nature Education (C) 2010. Reproduced for educational purposes only.

## My work

Since August 2017, I have been working on deep learning methods for protein structure prediction. My latest attempt is to harness the power of the "Neural Machine Translation" methods that have been so successful over the last few years. If I can formulate the problem of protein structure prediction as one of language translation, then I can take advantage of these high-performing models!

In my case, I am translating from the "language of amino acid residues" (lysine, arginine, etc.) into the "language of angles" that define how each atom is placed with respect to its predecessors. The angles are then converted into Cartesian coordinates, which can be directly compared to the true protein structure. One of the main contributions of my work is that my models predict both the protein backbone and the side chain atoms, which is imperative for research such as structure-based drug discovery.
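To make the "angles to coordinates" step concrete, here is a minimal sketch of one common way to place an atom from a bond length, a bond angle, and a torsion angle, given the three atoms before it (often called the Natural Extension Reference Frame, or NeRF, method). This is an illustration under my own assumptions, not necessarily the exact routine used in this project. The function names (`place_atom`, `dihedral`) and the example geometry are hypothetical.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def unit(a):
    n = norm(a)
    return tuple(x / n for x in a)

def place_atom(a, b, c, bond_length, bond_angle, torsion):
    """Place atom d after atoms a, b, c (angles in radians).

    bond_length is |c-d|, bond_angle is the angle b-c-d, and torsion is
    the dihedral angle a-b-c-d. Hypothetical helper for illustration.
    """
    bc = unit(sub(c, b))                 # direction of the last bond
    n = unit(cross(sub(b, a), bc))       # normal of the a-b-c plane
    m = cross(n, bc)                     # completes the local frame
    # Coordinates of d in the local (bc, m, n) frame centred on c.
    dx = -bond_length * math.cos(bond_angle)
    dy = bond_length * math.sin(bond_angle) * math.cos(torsion)
    dz = bond_length * math.sin(bond_angle) * math.sin(torsion)
    return tuple(c[i] + dx * bc[i] + dy * m[i] + dz * n[i] for i in range(3))

def dihedral(a, b, c, d):
    """Signed dihedral angle a-b-c-d in radians, used to check placement."""
    b1, b2, b3 = sub(b, a), sub(c, b), sub(d, c)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.atan2(norm(b2) * dot(b1, n2), dot(n1, n2))

# Example: extend a three-atom fragment by one atom.
a, b, c = (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.5, 0.0, 0.0)
d = place_atom(a, b, c, 1.5, math.radians(110.0), math.radians(60.0))
print(d)
```

Repeating this step along the chain turns a sequence of predicted angles into a full set of coordinates. Feeding the placed atom back through `dihedral` recovers the torsion that was requested, which is a handy sanity check.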

The model I am using is based on the now-ubiquitous Transformer model, although my current model dispenses with the decoder half for simplicity. The training data is based on ProteinNet by Mohammed AlQuraishi, but has been modified to include sidechain information.

You can find my current work-in-progress on the ProteinTransformer here.

# Models trained with MSE loss improve by adding convolution layers

## Motivation

As much as we've tried, we've had trouble getting our ProteinTransformer to make reasonable predictions. Many of the proteins predicted so far look like big balls of spaghetti! One possibility is that there just isn't much signal to work with, since our training data is based on amino acid sequences alone and omits information common to other prediction methods (multiple sequence alignments, for example). Another hypothesis is that while Transformer layers are great at capturing long-range interactions between amino acid residues, they may not capture local interactions as well. Adding local representations of the sequence via 1D convolutions may be enough to help the model learn simple things, such as the location of alpha-helices, that depend largely on nearby residues.

## Experiment

In this experiment, I modified my base ProteinTransformer model by adding 1-dimensional sequence convolution layers after the embedding layer, but before the Transformer layers. In essence, the predictions from this model look something like this: $\text{Pred} = \text{TransformerAttention}_{1..L}(\text{Conv}_{1..n}(\text{Embedding}(X)))$.
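The ordering of the layers can be sketched in PyTorch as follows. This is a rough illustration of the embedding, then convolution, then Transformer encoder pipeline, not the actual ProteinTransformer code. The class name `ConvEncoderSketch`, every layer size, the ReLU activations, and the number of output angles (`n_angles=12`) are assumptions made for the example, and positional encodings and padding masks are left out for brevity.

```python
import torch
from torch import nn

class ConvEncoderSketch(nn.Module):
    """Embedding -> 1D convolutions -> Transformer encoder -> angles.

    Hypothetical sketch; all sizes are illustrative assumptions.
    """

    def __init__(self, n_residue_types=20, d_embed=32, conv_channels=(64, 128),
                 kernel_size=3, n_heads=8, n_layers=2, n_angles=12):
        super().__init__()
        self.embed = nn.Embedding(n_residue_types, d_embed)
        convs, in_ch = [], d_embed
        for out_ch in conv_channels:
            # padding = kernel_size // 2 keeps the sequence length unchanged
            # for odd kernel sizes such as 3 or 11.
            convs += [nn.Conv1d(in_ch, out_ch, kernel_size,
                                padding=kernel_size // 2), nn.ReLU()]
            in_ch = out_ch
        self.convs = nn.Sequential(*convs)
        # The last convolution's channel count becomes the Transformer width.
        layer = nn.TransformerEncoderLayer(d_model=in_ch, nhead=n_heads,
                                           dim_feedforward=4 * in_ch,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_angles = nn.Linear(in_ch, n_angles)

    def forward(self, residues):
        x = self.embed(residues)             # (batch, length, d_embed)
        # Conv1d expects (batch, channels, length), so swap axes around it.
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)
        x = self.encoder(x)                  # (batch, length, d_model)
        return self.to_angles(x)             # (batch, length, n_angles)

model = ConvEncoderSketch().eval()
tokens = torch.randint(0, 20, (2, 50))       # 2 sequences of 50 residues
with torch.no_grad():
    angles = model(tokens)
print(tuple(angles.shape))
```

The convolutions see only a small window around each residue, while the attention layers that follow can relate any two positions in the sequence.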

I tested two kernel sizes, 3 and 11. I also experimented with the number of output channels from each convolution layer. In the models titled `conv-enc-3/2` and `conv-enc-11/2`, each convolution layer had twice as many output channels as input channels. This is common practice in convolutional neural networks for image processing, where each layer builds up a richer representation of the underlying data. Models ending with `3/1` or `11/1` had the same number of input and output channels.
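As a small illustration of the two channel schemes, the helper below lists the (input, output) channel counts for a stack of convolution layers. The function name `conv_channel_plan`, the starting width of 32, and the depth of 3 layers are hypothetical values chosen for the example, not the settings used in these runs.

```python
def conv_channel_plan(d_in, n_layers, growth):
    """Return (in_channels, out_channels) for each convolution layer.

    growth=2 doubles the channels at every layer (the "/2" models);
    growth=1 keeps them constant (the "/1" models).
    """
    plan, channels = [], d_in
    for _ in range(n_layers):
        out_channels = channels * growth
        plan.append((channels, out_channels))
        channels = out_channels
    return plan

print(conv_channel_plan(32, 3, growth=2))
print(conv_channel_plan(32, 3, growth=1))
```

With doubling, the representation gets wider at every layer, so the width of the final layer depends on both the starting width and the depth.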

Each model was trained for 10 epochs with the same hyperparameters and using the Mean Squared Error (MSE) loss between the predicted angles and the true angles that represent the protein's structure in 3D space. For reference, lower loss values mean better performance.

This is good news: the results support our hypothesis! It looks like convolution layers (all lines except blue) help the model perform slightly better, especially when the convolution windows are larger (yellow, orange). Not all of the predictions below look that great, but the point of this experiment is to find any improvement, and for right now, I'm satisfied!

For all structure visualizations, red is the predicted structure and blue is the true structure.

Feel free to move the "Step" slider below to view different predictions. You can see both the backbone and sidechain structure elements in the images labeled "Validation Set Predictions" below.

# Models trained with DRMSD loss also improve with convolution layers

## Motivation

Protein structures are more than just angles, though. From Mohammed AlQuraishi's work, we know that the "distance-based root mean square deviation", or DRMSD, is one differentiable way to compare two protein structures and train a model. Here, I am repeating the same experiment as above, but instead of training the models only on the error between the true and predicted angles, I am also comparing the complete protein structures through a "Combined Loss" that combines the angle-based loss with the DRMSD.
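For intuition, here is a plain-Python sketch of the DRMSD between two sets of coordinates: compute every pairwise distance within each structure, then take the root mean square of the differences. Averaging over unique atom pairs is one common normalization and is an assumption here, as are the function names and the toy coordinates. A training version would use tensor operations so that gradients can flow, and how the DRMSD is weighted against the angle loss in the combined loss is not shown.

```python
import math
from itertools import combinations

def pairwise_distances(coords):
    """Distances between all unique pairs of points."""
    return [math.dist(p, q) for p, q in combinations(coords, 2)]

def drmsd(pred, true):
    """Root mean square difference between the two distance lists."""
    if len(pred) != len(true) or len(pred) < 2:
        raise ValueError("need two structures with the same number (>= 2) of atoms")
    d_pred, d_true = pairwise_distances(pred), pairwise_distances(true)
    squared = sum((a - b) ** 2 for a, b in zip(d_pred, d_true))
    return math.sqrt(squared / len(d_pred))

true = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
# The same shape, rotated 90 degrees about z and translated.
moved = [(5.0, -2.0, 1.0), (5.0, -1.0, 1.0), (4.0, -1.0, 1.0)]
# A copy with every coordinate doubled.
stretched = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 2.0, 0.0)]

print(drmsd(moved, true))
print(drmsd(stretched, true))
```

Because only internal distances are compared, the rotated and translated copy scores zero without any structural alignment, which is what makes this measure convenient as a loss. One known trade-off is that a mirror image also scores zero.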

Great! We've now verified that convolution layers are helpful, regardless of whether we are optimizing the angle-based loss alone or the combined loss. Once again, red structures below are predictions while blue represents the "true" structures.

# Embedding layers appear to improve overall performance

## Motivation

OK, now we know that adding convolution layers to the mix can help our model overall! However, we started wondering whether the embedding layer was really necessary for our model.

You see, in language translation and other language processing tasks, embedding layers are used to take a high-dimensional representation of a word ($d\approx 10^4$) and turn it into a lower-dimensional representation ($d\approx 10^3$) that incorporates the "meaning" of the word. However, we are not dealing with a very large input vocabulary! In fact, since there are only 20 amino acids, we have $d=20$. So, what's the point of the embedding? Well, maybe our embedding layer is still learning something important about each residue and incorporating this information into its embedding during training. Let's see!
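To see what an embedding layer does with such a small vocabulary, the sketch below shows that an embedding lookup is equivalent to multiplying a one-hot vector by a table of weights. The table here is random, whereas in a real model it is learned during training. The names (`make_embedding_table`, `embed_by_lookup`, `embed_by_matmul`) and the embedding size of 8 are assumptions for illustration only.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard one-letter codes
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def make_embedding_table(vocab_size, d_embed, seed=0):
    """Random stand-in for a learned (vocab_size x d_embed) weight table."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d_embed)]
            for _ in range(vocab_size)]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_by_lookup(table, sequence):
    """Pick the table row for each residue."""
    return [table[INDEX[aa]] for aa in sequence]

def embed_by_matmul(table, sequence):
    """Multiply each residue's one-hot vector by the table."""
    out = []
    for aa in sequence:
        v = one_hot(INDEX[aa], len(table))
        out.append([sum(v[i] * table[i][j] for i in range(len(table)))
                    for j in range(len(table[0]))])
    return out

table = make_embedding_table(len(AMINO_ACIDS), d_embed=8)
by_lookup = embed_by_lookup(table, "MKV")
by_matmul = embed_by_matmul(table, "MKV")
print(len(by_lookup), len(by_lookup[0]))
```

Either way, each residue ends up as a dense vector whose values can be tuned by training, so residues that behave similarly can end up with similar vectors.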

## Experiment

The following runs compare a range of convolution layer configurations, each trained with and without an embedding layer. These models are set up so that the last convolution layer has the same number of filters as the dimensionality of the Transformer layers. This allows more flexibility when selecting the number of attention heads for the Transformer layers, since `d_model` must be evenly divisible by `n_heads`.
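The divisibility constraint is easy to check up front. The short sketch below lists the head counts that a given model width allows and computes the per-head dimension. The helper names are hypothetical, and the widths 256 and 658 are simply the Transformer layer sizes that appear later in this report.

```python
def valid_head_counts(d_model):
    """All numbers of attention heads that divide d_model evenly."""
    return [h for h in range(1, d_model + 1) if d_model % h == 0]

def head_dim(d_model, n_heads):
    """Width of each attention head, or an error if the split is uneven."""
    if d_model % n_heads != 0:
        raise ValueError(f"d_model={d_model} is not divisible by n_heads={n_heads}")
    return d_model // n_heads

print(valid_head_counts(256))
print(valid_head_counts(658))
print(head_dim(256, 8))
```

A width with many divisors, such as 256, leaves plenty of choices for the number of heads, while an awkward width such as 658 only permits a handful.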

Interesting! Despite the wide array of different model configurations I used, the models that used the embedding layer always performed better (see the two clusters of runs on the right). The aggregated chart above left (in blue and orange) makes this pretty clear.

You can see below that, according to the "Parameter Importance" measurement, the number of trainable parameters seems to be important. Maybe performance doesn't depend on whether an embedding is used, but rather on how many parameters the model has! I'll investigate this in the next section.

# Embedding layers, not just larger models, improve performance

## Motivation

As I mentioned in the previous section, it seemed like the number of parameters in each model was more important than whether or not the model had an embedding layer. Let's try another experiment to clarify whether or not this is true.

## Experiment

To test this hypothesis, I ran 3 models.

- The first, `embedding-control`, has convolution and embedding layers with a Transformer layer size of 256.
- The second, `no-emb-less-params`, forgoes the embedding layer but keeps the same Transformer layer size. As a result, there is an overall decrease in the number of parameters (from 13 million to 4 million).
- The third, `no-emb-same-params`, also forgoes the embedding layer but increases the size of the Transformer layer to 658 in an attempt to approximately match the number of parameters in the control (~13 million).

Again, we see that models with embedding layers (purple) perform better than other comparable models, even when we control for the number of model parameters. The Parameter Importance chart below supports this conclusion as well.

This is interesting because the amino acid "vocabulary" is so much smaller than that of natural languages, so perhaps the embedding layer is doing something else to incorporate protein structure information into its amino acid representations!

I'll perform a quick follow-up experiment soon to visualize these embeddings. Maybe they learned something about amino acids such as their hydrophobicity or chemical properties. Who knows? Either way, I'll be sure to continue using models with embedding layers.

# Conclusion

Thanks for following along with me on a typical set of experiments that I do for my research project! I hope I've been able to teach you a thing or two about protein structure as well as some of the deep learning methods researchers are using to make predictions.

## Acknowledgments

Thank you to my advisor, David Koes, for the support and guidance.

Thank you, also, to Nicholas Bardy, who has developed some truly amazing ways to interactively visualize protein structure predictions on wandb! Check out his new feature, wandb.Molecule (demonstrated below), which lets you log various kinds of molecular data directly to wandb for visualization (.pdb, .mol, .sdf ...). Make them full screen and you can even rotate and zoom! The other visualizations you have seen on this page were made by exporting PyMOL data to PNG and GLTF (3D object) files, which can then be logged.

This work is supported by NIH T32 training grant T32 EB009403 as part of the HHMI-NIBIB Interfaces Initiative.