Comparing Database Management Systems in Order to Store Cassava Genetic Data
Project Summary
As faster and cheaper DNA-sequencing technologies develop, so do the challenges of handling the immense amount of genetic data produced. The challenges for database systems include developing efficient methods to organize data and optimizing performance for storing and retrieving data. The benefits of using a database system to store genetic data include multi-user access, online data distribution, and performance scaling. The overall goal of this project was to compare methodologies for storing and querying the latest set of cassava SNP genotypic data in an efficient manner. We compared PostgreSQL, a relational database management system (DBMS) and MongoDB, a non-relational DBMS, by measuring how quickly each database ran a query. We designed three queries that performed different functions including counting the number of mutations in each accession, counting the number of deletions in each accession, and selecting accessions with specific mutations. We also measured the time it took to run each query on different data storing formats used in PostgreSQL including text, JSON, JSONB, and bytea data formats. Development of the loading scripts and query scripts was an iterative process, where the performance was profiled and tuned using various tools, and further development can continue to improve the system used in this project. Data from next-generation sequencing methods are being produced at a rapid rate from many experiments across the world. Therefore, it is necessary to research different methods to store data and to compare databases especially in the case of genetic data.