Exploring the benefits of distributed parallel computing on large biological datasets
The scientific community is accumulating biological data at an exponential rate. Next-generation sequencing technologies can generate many terabytes of genomic data per week, and investigators require new methods to efficiently query and analyze data at this scale. Apache Hadoop is open source software designed to replicate the features of Google MapReduce and the Google File System (GFS) [1]. The basic principle behind the MapReduce paradigm is to divide a computationally demanding task into smaller sub-tasks that can be distributed among a cluster of nodes and then to combine their results. The Hadoop Distributed File System (HDFS) is the second major component of the Hadoop package. HDFS is designed to be a highly redundant and scalable storage medium, which makes it well suited to a scientific setting where data integrity requirements and large data volumes are common. Our preliminary investigation of using Hadoop in a scientific setting involved two key steps. First, a thorough study of the benefits of distributed parallel computing versus serial (linear) computing on relevant datasets was required. This encompassed benchmarking the physical hardware, testing MapReduce applications, and comparing the results to their linear counterparts. The results indicated that Hadoop offered significant improvements in computation time in specific cases, such as filtering large datasets. Second, a MapReduce application was designed to query Genotyping-by-Sequencing (GBS) data. Normally, querying these datasets with linear processing takes extensive time and computational power; however, using parallel computing, query times were reduced by several orders of magnitude. This project will enable scientists to make novel discoveries using large biological datasets.
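To make the MapReduce principle described above concrete, the following is a minimal single-machine sketch in Python, not the Hadoop application built for this project. It filters records and counts the survivors per key, the kind of filtering task where the abstract reports a benefit. The record layout (sample, marker, read depth), the `min_depth` threshold, and all function names are hypothetical and chosen only for illustration. On a real cluster the map calls would run on separate nodes.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical records: (sample_id, marker_id, read_depth). Not real GBS data.
RECORDS = [
    ("s1", "m1", 12),
    ("s1", "m2", 3),
    ("s2", "m1", 8),
    ("s2", "m2", 15),
    ("s3", "m1", 1),
]


def split(records, n_chunks):
    """Divide the input into roughly equal chunks, one per worker."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk, min_depth):
    """Map step: keep records that pass the filter and emit (marker_id, 1)."""
    return [(marker, 1) for _, marker, depth in chunk if depth >= min_depth]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(records, n_chunks=2, min_depth=5):
    """Run the whole job serially; each map call is independent of the others."""
    chunks = split(records, n_chunks)
    mapped = [map_chunk(chunk, min_depth) for chunk in chunks]
    return reduce_counts(shuffle(chain.from_iterable(mapped)))


if __name__ == "__main__":
    print(run_job(RECORDS))
```

Because each chunk is mapped independently, the result does not depend on how many chunks the input is split into, which is what allows the work to be spread across nodes.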
My Experience
The Bioinformatics Internship at the Boyce Thompson Institute (BTI) has been an eye-opening experience into science as a whole. From attending weekly seminars to working in a scientific environment every day, the internship has led me to appreciate and gain insight into the lifestyle of a scientist. I became interested in the program offered at BTI because of my innate interest in computing and biology. Throughout the past ten weeks I have thoroughly explored biology in a computational context and gained an understanding of the many benefits that computation can offer to life scientists. The background and skills I have gained during this internship will be a valuable resource as I pursue future research endeavors in graduate school.