Identification of structural variations between genomes of cultivated tomato Solanum lycopersicum and its wild progenitor S. pimpinellifolium
Cultivated tomato, Solanum lycopersicum, is widely consumed worldwide and is a substantial source of nutrients. Selective breeding has improved multiple traits, such as size and production, in cultivated varieties; however, it has also resulted in allele loss, narrowing genetic diversity. S. pimpinellifolium, the wild progenitor of S. lycopersicum, possesses several favorable traits missing from the cultivar, prompting breeders to use this species as a new source of alleles for the domesticated species. To aid these breeding efforts, we sought to identify large structural variations (SVs; >10 bp) between the genomes of S. lycopersicum cultivar ‘Heinz 1706’ and S. pimpinellifolium accession LA2093 that lie near genes involved in important biological or agronomic processes. Minimap2 was used to align the two genomes, and Assemblytics and in-house scripts were used to extract the SVs. We then used Python scripts to validate and output SVs that overlapped with or were very close to a gene sequence (in promoter, CDS, or non-CDS regions). Each protein sequence from both genomes was analyzed to identify protein functional domains. The genes underwent Gene Ontology (GO) mapping and annotation, and GOATOOLS was used to perform GO enrichment analysis of the resulting data. Through this analysis, breeders can begin to identify favorable alleles that are present in the progenitor yet absent from the cultivar, and use that information to breed improved tomato varieties expressing beneficial phenotypes.
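The step of keeping SVs that overlap or lie close to a gene can be illustrated with a minimal Python sketch. This is not the in-house script used in the project: the record types `Gene` and `SV`, the function `svs_near_genes`, the 2,000 bp flanking window, and the example coordinates are all hypothetical and chosen only for illustration. Only the >10 bp size threshold comes from the text above.

```python
from typing import NamedTuple, List, Tuple


class Gene(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str


class SV(NamedTuple):
    chrom: str
    start: int  # reference start, 0-based, inclusive
    end: int    # reference end, 0-based, exclusive (equals start for an insertion)
    size: int   # SV length in bp as reported by the SV caller
    name: str


def svs_near_genes(svs: List[SV], genes: List[Gene],
                   flank: int = 2000, min_size: int = 10) -> List[Tuple[str, str]]:
    """Return (sv_name, gene_name) pairs for SVs larger than min_size that
    overlap a gene or fall within `flank` bp of it (flank is an assumed value)."""
    hits = []
    for sv in svs:
        if sv.size <= min_size:  # keep only SVs > 10 bp, as in the text
            continue
        for gene in genes:
            if sv.chrom != gene.chrom:
                continue
            # Interval test against the gene extended by the flank on both sides.
            if sv.start < gene.end + flank and sv.end > gene.start - flank:
                hits.append((sv.name, gene.name))
    return hits


# Hypothetical example data, not taken from either genome.
genes = [Gene("ch01", 10000, 12000, "geneA"), Gene("ch02", 5000, 6000, "geneB")]
svs = [
    SV("ch01", 9000, 9050, 50, "sv1"),     # upstream of geneA, inside the flank
    SV("ch01", 20000, 20100, 100, "sv2"),  # too far from any gene
    SV("ch02", 5500, 5505, 5, "sv3"),      # below the size threshold
    SV("ch02", 5900, 5900, 40, "sv4"),     # insertion inside geneB
]
hits = svs_near_genes(svs, genes)
```

The nested loop is adequate for a small example; for whole-genome SV and gene sets, an interval tree or sorted sweep would likely be a better fit. Classifying a hit as promoter, CDS, or non-CDS would additionally require the gene model annotation, which this sketch does not model.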
As an intern in the NSF Plant Genome Research Program, I had a wonderful opportunity to hone my bioinformatics skills while analyzing a fascinating research topic. I am grateful for the opportunity given to me by the Fei lab and my mentor, Dr. Gao. Although I was already familiar with conducting research in the lab and on the computer (though not specifically with plant systems), I gained an important understanding of how a variety of approaches can be taken to extract and analyze information from genomic data. I am confident that I can use these new techniques and tools to further my own research, whether during the remainder of my undergraduate studies or in my future graduate research.