Begum - Boyce Thompson Institute

Dil Begum

Year: 2011

Filling of Reference Genome Gaps Using Next Generation Sequencing

From famous entrées such as spaghetti and other pasta dishes to mouthwatering salsa, tomato has a wide variety of uses in the world's cuisines. Tomato is also very important in our diet. According to a magazine article (Scott-Dixon, Krista), tomato is an antioxidant powerhouse. So where do tomatoes come from? What makes them so different in taste, color and appearance? Scientists are working day by day to improve tomato quality. Now that the tomato genome has been sequenced, this information can be used to help us understand how genes work together to affect growth, development and the functionality of the entire genome. However, when the tomato genome was sequenced, the assembly process left gaps, creating areas of missing sequence. Therefore, the purpose of this study is to fill as many of these gaps as possible, and so reduce their number, using de novo contigs assembled from Illumina short reads. The results were produced using tools such as BWA, Novoalign, SAMtools, Picard, BLAST, SOAPdenovo and BEDTools, as well as Perl scripts and some regular expressions on Linux.

My Experience

The purpose of the project was twofold: to use the reference genome together with information about where short reads map to it, and to generate contigs that can be mapped to the reference genome in order to identify contigs located near gaps. The reads were obtained using Illumina next-generation sequencing technology. Several tools were used to generate and analyze contigs from the reads, including SAMtools, Picard, the SOAPdenovo assembler, BLAST and BEDTools, along with the Perl programming language and some regular expressions on Linux. After the contigs were generated with SOAPdenovo, they were run through BLAST, which reported the contig ID, chromosome ID, e-value, percentage match, etc. for each hit. I was only interested in three fields: chromosome ID, contig start and contig end. With this information, I wrote a script that took the two best BLAST hits and wrote them out in a format that BEDTools can read. I also wrote a second script that generated the location of each gap, from start to finish, and wrote those out in the same format. Then I used BEDTools with the output from both scripts to show where each contig lined up on the chromosome, which chromosome it lined up to, and how far the contigs and gaps were from each other in base pairs. For this project I chose regions less than 20 bp apart.
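The steps described above can be sketched roughly as follows. This is only an illustration in Python, not the original Perl scripts, and it rests on several assumptions that the text does not confirm: that BLAST was run with the 12-column tabular output, that "best" hits means the highest bit scores per contig, that gaps appear in the reference as runs of N, and that "less than 20bp" refers to the distance between a contig and a gap. The function names are hypothetical, and the last function is a plain-Python stand-in for the distance step that was done with BEDTools.

```python
import re
from collections import defaultdict


def blast_hits_to_bed(blast_lines, top_n=2):
    """Keep the top_n highest-scoring hits per contig as BED-style tuples.

    Assumes the standard 12-column BLAST tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore
    """
    hits = defaultdict(list)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        contig, chrom = fields[0], fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        bitscore = float(fields[11])
        # BLAST coordinates are 1-based and inclusive, and sstart > send
        # for minus-strand hits; BED is 0-based and half-open.
        start, end = min(sstart, send) - 1, max(sstart, send)
        hits[contig].append((bitscore, chrom, start, end))

    bed = []
    for contig, rows in hits.items():
        rows.sort(key=lambda r: r[0], reverse=True)  # best score first
        for _, chrom, start, end in rows[:top_n]:
            bed.append((chrom, start, end, contig))
    return sorted(bed)


def gaps_to_bed(chrom, sequence):
    """Report each run of N (assumed to mark a gap) as a BED-style tuple."""
    return [(chrom, m.start(), m.end()) for m in re.finditer(r"[Nn]+", sequence)]


def contigs_near_gaps(contig_bed, gap_bed, max_dist=20):
    """Pair each contig with gaps on the same chromosome < max_dist bp away."""
    near = []
    for chrom, c_start, c_end, name in contig_bed:
        for g_chrom, g_start, g_end in gap_bed:
            if chrom != g_chrom:
                continue
            # Bases between the two intervals; 0 if they touch or overlap.
            dist = max(0, g_start - c_end, c_start - g_end)
            if dist < max_dist:
                near.append((name, chrom, g_start, g_end, dist))
    return near


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    blast = [
        "contig1\tchr1\t99.0\t100\t1\t0\t1\t100\t901\t1000\t1e-50\t180",
        "contig2\tchr1\t98.0\t100\t2\t0\t1\t100\t1300\t1201\t1e-48\t170",
    ]
    reference = "A" * 1010 + "N" * 180 + "C" * 100
    contigs = blast_hits_to_bed(blast)
    gaps = gaps_to_bed("chr1", reference)
    for row in contigs_near_gaps(contigs, gaps):
        print("\t".join(str(x) for x in row))
```

In practice the two tuple lists would be written to tab-separated files and handed to BEDTools, as described above; the distance convention used here (bases between the two intervals) may differ by one from what a given BEDTools command reports.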