In a technical tour de force, researchers at the Stowers Institute for Medical Research, working with colleagues from Stanford University and the Technical University of Munich, have developed advanced explainable artificial intelligence (AI) to decipher regulatory instructions encoded in DNA. In a report published online on February 18, 2021 in Nature Genetics, the team found that a neural network trained on high-resolution maps of protein-DNA interactions can reveal subtle DNA sequence patterns throughout the genome and provide a deeper understanding of how these sequences are organized to regulate genes.
Neural networks are powerful AI models that can learn complex patterns from various types of data such as images, speech signals or text in order to predict the associated properties with impressively high accuracy. However, many view these models as uninterpretable because the prediction patterns learned are difficult to extract from the model. This black box nature has hindered the widespread application of neural networks to biology, where the interpretation of predictive patterns is of paramount importance.
One of the great unsolved problems in biology is the genome's second code: its regulatory code. DNA bases (usually represented by the letters A, C, G and T) encode not only the instructions for building proteins, but also when and where these proteins should be made in an organism. The regulatory code is read by proteins called transcription factors that bind to short stretches of DNA called motifs. How particular combinations and arrangements of motifs specify regulatory activity, however, is an extremely complex problem that has been difficult to decipher.
Now, an interdisciplinary team of biologists and computational researchers led by Stowers investigator Julia Zeitlinger, PhD, and Anshul Kundaje, PhD, of Stanford University has designed an interpretable neural network called BPNet (for Base Pair Network) that uncovers the regulatory code by predicting transcription factor binding from DNA sequence with unprecedented accuracy. The key was to perform transcription factor-DNA binding experiments and computational modeling at the highest possible resolution, down to the level of individual DNA bases. This increased resolution enabled them to develop new interpretation tools to extract the key elementary sequence patterns, such as transcription factor binding motifs, along with the combinatorial rules by which motifs act together as a regulatory code.
"This was extremely satisfying," says Zeitlinger, "because the results fit beautifully with existing experimental results, while also revealing novel insights that surprised us."
For example, the neural network models enabled the researchers to discover a striking rule that governs binding of the well-studied transcription factor Nanog. They found that Nanog binds cooperatively to DNA when multiple copies of its motif occur with a regular periodicity such that they appear on the same side of the spiraling DNA helix.
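The geometry behind this rule is simple to illustrate: B-form DNA completes one helical turn roughly every 10.5 base pairs, so motifs spaced at near-multiples of that period face the same side of the helix. The sketch below (an illustration of the concept, not code from the study; the function name, tolerance, and example positions are invented) checks whether two motif positions share a helical face.

```python
# Illustrative only: motifs spaced at multiples of the DNA helical
# period (~10.5 bp for B-form DNA) sit on the same face of the helix.
HELICAL_PERIOD = 10.5  # base pairs per helical turn

def same_helical_face(pos_a, pos_b, tolerance=1.0):
    """Return True if two motif positions (in bp) fall on roughly
    the same side of the DNA double helix."""
    spacing = abs(pos_b - pos_a)
    # How far the spacing deviates from a whole number of turns
    phase = spacing % HELICAL_PERIOD
    return min(phase, HELICAL_PERIOD - phase) <= tolerance

# A 21 bp spacing (~2 full turns) keeps motifs on the same face;
# a 26 bp spacing (~2.5 turns) places them on opposite sides.
print(same_helical_face(100, 121))  # True
print(same_helical_face(100, 126))  # False
```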
"There is a long trail of experimental evidence that such motif periodicity sometimes occurs in the regulatory code," says Zeitlinger. "The exact circumstances were difficult to pin down, however, and Nanog was not a suspect. Discovering that Nanog exhibits such a pattern, and seeing additional details of its interactions, was surprising because we did not specifically search for this pattern."
"This is the main advantage of using neural networks for this task," says Dr. Žiga Avsec, first author of the paper. Avsec and Kundaje created the first version of the model when Avsec visited Stanford while working in the laboratory of Dr. Julien Gagneur at the Technical University of Munich.
"More traditional bioinformatics approaches model data using predefined, rigid rules based on existing knowledge. However, biology is extremely rich and complex," says Avsec. "By using neural networks, we can train much more flexible and nuanced models that learn complex patterns from scratch, without prior knowledge, thereby enabling new discoveries."
BPNet's network architecture is similar to that of neural networks used for facial recognition in images. There, the neural network first detects edges in the pixels, then learns how edges form facial elements such as an eye, nose, or mouth, and finally detects how facial elements together form a face. Instead of learning from pixels, BPNet learns from raw DNA sequence: it learns to detect sequence motifs and, ultimately, the higher-order rules by which those elements predict the binding data at base resolution.
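One way such sequence models see long stretches of DNA is by stacking convolutional layers with exponentially growing dilation, so each output base aggregates context from hundreds of bases around it. The sketch below (a generic illustration of this architectural idea; the kernel size and layer count are assumptions, not the paper's exact hyperparameters) computes the receptive field of such a stack.

```python
# Hedged sketch: receptive field of a stack of dilated 1-D
# convolutions, the kind of architecture used for base-resolution
# sequence models. Hyperparameters here are illustrative.
def receptive_field(kernel_size, dilations):
    """Receptive field (in bp) seen by one output position after a
    stack of dilated 1-D convolutions with the given dilation rates."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer widens the context
    return rf

# Kernel size 3 with dilations doubling across 9 layers (1, 2, 4, ... 256)
dilations = [2 ** i for i in range(9)]
print(receptive_field(3, dilations))  # 1023 bp of context per output base
```

Doubling the dilation at each layer grows the context exponentially with depth while keeping the parameter count small, which is why a few layers suffice to cover a kilobase-scale window.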
Once the model has been trained to high accuracy, the learned patterns are extracted with interpretation tools. The output signal is traced back to the input sequences to reveal sequence motifs. The final step is to use the model as an oracle, systematically querying it with specific DNA sequence designs, much like testing hypotheses experimentally, to uncover the rules by which sequence motifs function in combination.
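The "model as oracle" step can be sketched as follows: design synthetic sequences, insert motifs at chosen positions, and compare the model's predicted binding across designs. Everything below is a toy illustration, not the study's code: `predict_binding` is a stand-in for a trained model, and the motif string and positions are invented.

```python
# Toy sketch of in-silico sequence design: insert a motif into a
# random background and query a (stand-in) model for binding.
import random

random.seed(0)

def random_sequence(length):
    """Generate a random DNA background sequence."""
    return "".join(random.choice("ACGT") for _ in range(length))

def insert_motif(seq, motif, pos):
    """Overwrite the sequence at `pos` with `motif`, keeping length."""
    return seq[:pos] + motif + seq[pos + len(motif):]

def predict_binding(seq):
    # Stand-in for a trained model such as BPNet, which would return
    # per-base binding profiles; here we just count motif occurrences.
    return seq.count("TTTGCAT")  # hypothetical motif

background = random_sequence(200)
single = insert_motif(background, "TTTGCAT", 50)
paired = insert_motif(single, "TTTGCAT", 71)  # 21 bp apart, ~2 helical turns

print(predict_binding(background), predict_binding(single), predict_binding(paired))
```

A real analysis would sweep the spacing between the two motifs and look for periodic peaks in predicted cooperative binding, which is how a rule like Nanog's helical periodicity can be probed systematically.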
"The nice thing is that the model can evaluate many more sequence designs than we could ever test experimentally," says Zeitlinger. "By predicting the outcome of experimental perturbations, we can also identify the experiments that are most informative for validating the model." Indeed, using CRISPR gene editing techniques, the researchers experimentally confirmed that the model's predictions were highly accurate.
Because the approach is flexible and applicable to a wide variety of different data types and cell types, it holds the promise of a rapidly growing understanding of the regulatory code and the effects of genetic variation on gene regulation. Both the Zeitlinger Lab and the Kundaje Lab are already using BPNet to reliably identify binding motifs for other cell types, to relate motifs to biophysical parameters and to learn other structural features in the genome, such as those associated with DNA packaging. So that other scientists can use BPNet and adapt it to their own needs, the researchers have made the entire software framework available with documentation and tutorials.