Exploring and Benchmarking Solutions for Storing Polyploids and Indels on the Genomic Open-source Breeding Informatics Initiative (GOBii)
Today, there is high demand for technologies that allow fast, inexpensive, and accurate sequencing of genomic data. This demand has driven the development of next-generation sequencing technologies, which have allowed great advances in the biological and plant breeding sciences. Genomic selection is a breeding method that increases genetic gain in less time by predicting the performance of new plant varieties based on historical genomic data. Unfortunately, the mainstream applications that provide this service require complex computational infrastructure for managing the sequencing data. Often, such infrastructure is not accessible to public plant breeding programs. The Genomic Open-source Breeding Informatics Initiative (GOBii) aims to provide access to genomic selection and increase the genetic gain in crop breeding programs in Africa and South Asia. To achieve this goal, the initiative must address major technical challenges, including efficient storage of huge amounts of genomic data, fast and accurate data extraction, and computation on the sequencing data. The system built by the GOBii project is planned to be open-source and highly scalable for large breeding programs. One of the main priorities is to find the database management system that best suits the purpose of the project. Although numerous open-source technologies allow fast and accurate data storage and extraction, it is unknown which of them would best handle the genomic data that the project focuses on. The current study analyzes PostgreSQL, a free and open-source relational database management system, and HDF5, a family of file formats designed to store and organize large volumes of data. Although HDF5 cannot easily handle variable-length polymorphisms such as indels, or different ploidy scenarios, two solutions using HDF5 have been implemented. A solution using PostgreSQL 11 has also been developed for the purpose of the project. To determine which management system performs best for these polymorphisms, testing experiments were conducted on three genotyping matrices representing plant samples and markers (positions in the DNA that are polymorphic). The solutions were tested on various user queries and their performance was recorded. According to the results, the PostgreSQL solution substantially outperforms the two HDF5 solutions.
I am an undergraduate student at Ramapo College of New Jersey majoring in Computer Science. Since high school, I have been interested in the development and integration of software tools for addressing fundamental biological questions. For this reason, during my second year of college, I challenged myself and declared a minor in Bioinformatics. Later, I applied for the BTI 2019 Summer Internship with the goal of broadening my knowledge and gaining valuable experience. This internship has given me the opportunity to (1) develop a benchmarking research project with the Genomic Open-source Breeding Informatics Initiative (GOBii) team, (2) generate large volumes of genomic data and develop tools for their fast and accurate storage and extraction, (3) conduct testing experiments on a multi-core processor server, and, last but not least, (4) attend numerous events and seminars that allowed me to communicate science and learn a great deal about other plant biology research projects. I am sincerely grateful for these opportunities and I am keen to further enhance my knowledge and skills in the field of Bioinformatics.