“Improving Genotypic Data Storage Using The Hadoop Cluster”
Project Summary:
Genotypic data is a prime example of a data type that can grow so large that it becomes difficult to store and query efficiently and reliably. It is becoming increasingly important for the identification and analysis of genes in research, breeding and medicine, so optimizing the infrastructure that supports such analysis has become crucial for these endeavors. In this work, we used Hadoop, a distributed storage and processing framework that can scale with growth in data and computation. Hadoop is used in industry to facilitate access to “Big data” and has also been proposed as an efficient storage solution for genotyping data. In addition, the Hadoop Distributed File System (HDFS) offers built-in redundancy, which increases reliability compared to a traditional file system. In our implementation, we chose the Parquet file format to store genotypic data, which has the benefit of fast columnar access and better compression. The exact structure of the Parquet file had a strong effect on performance: for example, query speed depended on whether the genotype matrix was stored in its original orientation or transposed. Spark and its SparkSQL module were used to query the data and compute some statistics on it. To make the data accessible to the outside world, an Application Programming Interface (API) was developed.
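As a rough illustration of why the orientation of the storage matrix matters in a columnar format, the sketch below uses a toy in-memory model rather than Parquet itself. It assumes that a column is always read in full and that a query touches only the columns it needs. The matrix, the sample and marker names, and the helper functions are all hypothetical and are not taken from the project.

```python
# Toy model of a columnar file (NOT real Parquet): each column is stored
# contiguously, and we assume a query reads every value of each column it touches.

def to_columnar(rows, column_names):
    """Store a row-major matrix as {column_name: [values, ...]}."""
    return {name: [row[i] for row in rows] for i, name in enumerate(column_names)}

def transpose(rows):
    """Swap the rows and columns of a row-major matrix."""
    return [list(col) for col in zip(*rows)]

# Hypothetical genotype matrix: rows are samples, columns are markers,
# values are allele dosages (0, 1 or 2).
samples = ["s1", "s2", "s3"]
markers = ["m1", "m2", "m3", "m4"]
dosages = [
    [0, 1, 2, 0],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
]

# Layout A: one column per marker (one row per sample).
by_marker = to_columnar(dosages, markers)
# Layout B: one column per sample (one row per marker), i.e. the transpose.
by_sample = to_columnar(transpose(dosages), samples)

def marker_query_layout_a(layout, marker):
    """Genotypes of one marker in layout A: only that marker's column is read."""
    column = layout[marker]
    return column, len(column)

def marker_query_layout_b(layout, sample_names, marker_index):
    """Genotypes of one marker in layout B: every sample column is read."""
    genotypes = [layout[name][marker_index] for name in sample_names]
    values_read = sum(len(layout[name]) for name in sample_names)
    return genotypes, values_read

genotypes_a, cost_a = marker_query_layout_a(by_marker, "m2")
genotypes_b, cost_b = marker_query_layout_b(by_sample, samples, markers.index("m2"))
print(genotypes_a, cost_a)
print(genotypes_b, cost_b)
```

Both layouts return the same genotypes for the marker, but in this toy model the transposed layout has to read the whole matrix to answer a per-marker query, while a per-sample query would favor it instead. Real Parquet files add row groups, compression and column statistics, so the actual difference has to be measured on the real data.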
My Experience:
These past two months have been very interesting. I have learned a lot about the research world, and it makes me want to pursue a research career. BTI gave us the opportunity to work on very interesting research projects. I enjoyed the freedom, and it was the first time I had worked on a single project for that long. I learned a lot by trying to figure out how to make my project work. I’m glad to have learned technologies like Hadoop and Spark, which are at the cutting edge of today’s computer science. I would like to thank Lukas Mueller and Nick Morales for all the help they provided. This was also an interesting personal experience because I got to live in the US for two months, which opened my eyes to a lot of cultural differences.