Salihovic - Boyce Thompson Institute

Heidi Salihovic

Year: 2019

De Novo Assembly of Rubisco Genes with the Help of Machine Learning

Rubisco is a key enzyme in carbon fixation and the most abundant protein in the world. However, Rubisco gene sequences are not available for many plants. Increasing the number of assembled subunit sequences would allow identification of natural variation and could reveal the roles of amino acids important for the enzyme’s function. My goal was therefore to write a program to obtain more Rubisco sequences. Trinity, a de novo assembler, was used to assemble transcripts from RNA-Seq data available in the Sequence Read Archive (SRA) database. Trinity assembled large subunit transcripts accurately but produced a mix of correct and “chimeric” (containing parts of different sequences) small subunit transcripts. This is because each plant species has multiple homologs of the small subunit gene, which are easily misassembled. Machine learning was used to differentiate between correctly and incorrectly assembled transcripts. Two neural networks were built: a simpler artificial neural network and a convolutional neural network. The networks were trained on coverage because incorrectly assembled transcripts tended to have little to no coverage in some region of the transcript. In every case tested, the machine learning algorithm identified the correct Trinity outputs more accurately than assuming that all of Trinity’s outputs were correct. For some species the model predicted the correctness of the Trinity transcripts with a high level of accuracy, but more data is needed to improve accuracy for others. A future goal is therefore to increase the amount of data by running Trinity on more species and then to train the same machine learning algorithm on the larger dataset. Once the model achieves satisfactory accuracy, it can be used together with Trinity to identify correctly assembled transcripts for plants without sequenced Rubisco genes.
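
To illustrate the coverage idea described above, the following is a minimal sketch of how per-base coverage along a transcript might be turned into a fixed-length input for a classifier. It is not the code used in this project: the function names, the number of bins, and the low-coverage threshold are all assumptions chosen for illustration, and the coverage values are made-up toy numbers.

```python
from typing import List, Sequence


def bin_coverage(coverage: Sequence[int], n_bins: int = 10) -> List[float]:
    """Average per-base coverage into n_bins windows.

    Binning gives transcripts of different lengths a vector of the same
    size, which a neural network input layer requires. n_bins is an
    assumed value, not one taken from the project.
    """
    if not coverage:
        raise ValueError("coverage must not be empty")
    length = len(coverage)
    bins = []
    for i in range(n_bins):
        start = i * length // n_bins
        # Guarantee at least one base per window for very short inputs.
        end = max((i + 1) * length // n_bins, start + 1)
        window = coverage[start:end]
        bins.append(sum(window) / len(window))
    return bins


def low_coverage_fraction(coverage: Sequence[int], threshold: int = 1) -> float:
    """Fraction of positions with coverage below an assumed threshold."""
    return sum(1 for c in coverage if c < threshold) / len(coverage)


def coverage_features(coverage: Sequence[int], n_bins: int = 10) -> List[float]:
    """Build a feature vector: peak-normalized bins plus low-coverage fraction."""
    bins = bin_coverage(coverage, n_bins)
    peak = max(bins)
    normalized = [b / peak if peak > 0 else 0.0 for b in bins]
    return normalized + [low_coverage_fraction(coverage)]


# Toy data: an evenly covered transcript and one with an uncovered
# stretch in the middle, as might occur where two homologs were joined.
well_covered = [30] * 100
gapped = [30] * 40 + [0] * 20 + [30] * 40

print(coverage_features(well_covered))
print(coverage_features(gapped))
```

In this sketch the gapped transcript yields zeros in the bins spanning the uncovered stretch and a nonzero low-coverage fraction, which is the kind of signal a classifier could learn to associate with chimeric assemblies. Vectors like these, paired with correct/incorrect labels, could serve as training input for either type of network.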

My Experience

I gained several important skills from this research experience, both professionally and academically. Professionally, I learned how to collaborate better with others. My lab meeting presentation exemplified this, because I had to think about how to convey my findings more concisely to an audience that knew little about my subject area. The comments and questions I received afterwards affirmed the importance of this collaboration: they led me to a better understanding of my own research by reshaping the way I thought about my project. I will now use what I learned about communication and collaboration to improve future discussions of my research. Academically, I learned how to build a machine learning algorithm to answer a biological question. I can use this way of thinking, applying algorithms typically used for other kinds of problems to biological ones, as I pursue a career in computational biology.