Zhangjun Fei

Professor

Developing and curating powerful genomic resources and computational tools and applying integrative bioinformatics approaches to harness vast ‘omics’ datasets for a deeper understanding of crop origin, domestication, and key traits.

Google Scholar

Fei Lab Website

Research Focus

How can large-scale plant genomics datasets be efficiently integrated to advance biological discovery and crop improvement?

Email: zf25@cornell.edu

Office/Lab: Room 223

Adjunct Professor
Section of Plant Pathology & Plant-Microbe Biology
School of Integrative Plant Science
Cornell University

Graduate Fields: Plant Pathology & Plant-Microbe Biology; Plant Biology

Wu S^#, Sun H^#, Zhao X^#, Hamilton JP, Mollinari M, Gesteira GS, Kitavi M, Yan M, Wang H, Yang J, Yencho GC, Buell CR, Fei Z* (2025) Phased chromosome-level assembly provides insight into the genome architecture of hexaploid sweetpotato. Nature Plants 11:1951-1959

Research Briefing Decoding the complexity of the hexaploid sweet potato genome Nature Plants 11:1712-1713

Zhang X, Tang C, Jiang B, Zhang R, Li M, Wu Y, Yao Z, Huang L, Luo Z, Zou H, Yang Y, Wu M, Chen A, Wu S, Hou X, Liu X*, Fei Z*, Fu J*, Wang Z* (2025) Refining polyploid breeding in sweetpotato through allele dosage enhancement. Nature Plants 11:36-48

Research Briefing Understanding the genomic basis to empower sweet potato breeding Nature Plants 11:14-15

Chen W, Xie Q, Fu J, Li S, Shi Y, Lu J, Zhang Y, Zhao Y, Ma R, Li B, Zhang B, Grierson D, Yu M*, Fei Z*, Chen K* (2025) Graph pangenome reveals the regulation of malate accumulation in blood-fleshed peach by NAC transcription factors. Genome Biology 26:7

Hu X, Xu C, Li X, Li L, Bao Y, Gu M, Li X, Huo L, Gong J, Li X, Wang M, Xu K, Yin X, Fei Z*, Sun X* (2025) Subgenome dominance in allotetraploid Actinidia valvata regulates RNA m⁶A modification for waterlogging tolerance. Advanced Science 12:e03974

Lai E, Guo S, Wu P, Qu M, Yu X, Hao C, Li S, Peng H, Yi Y, Zhou M, Fu G, Li X, Liu H, Zheng Y*, Wang X*, Fei Z*, Gao L* (2025) Genome of root celery and population genomic analysis reveal the complex breeding history of celery. Plant Biotechnology Journal 23:946-959

Research Overview

The advance of high-throughput technologies has given rise to a wealth of genome-wide data encompassing environmental, genetic, and evolutionary diversity. This has revolutionized agricultural research and crop breeding, yielding abundant resources and insights that drive key innovations. However, it remains a major challenge to effectively digest these massive datasets to formulate hypotheses, explore genome evolution, and elucidate regulatory mechanisms underlying critical biological processes. To address this challenge, my group has focused on developing genomic tools, databases, resources, and novel algorithms to analyze and integrate large-scale ‘omics’ datasets, with the goal of uncovering and understanding important biological phenomena.

Research in my lab focuses on:

Developing biological databases and computational tools for efficient storage, management, dissemination, and mining of diverse ‘omics’ datasets
Building large-scale genomic resources to advance research and crop improvement
Applying integrated bioinformatics and genomics approaches for trait discovery, crop improvement, and knowledge advancement

Databases

Bioinformatics

iTAK – A package to identify and classify plant transcription factors and protein kinases.
VirusDetect – An automated pipeline for efficient virus discovery using deep sequencing of small RNAs.
Plant MetGenMAP – A web-based tool for comprehensive mining and integration of gene expression and metabolite changes in the context of biochemical pathways.
iAssembler – A de novo assembly package for transcriptome sequences generated using 454 or Sanger platforms.

Lab Members

In the News

May 5, 2026

Super-powered population genomics: Watermelon super-pangenome paves the way for precision breeding

Watermelon is a quintessential summertime fruit, evoking images of warm, sunny afternoons and cookouts with friends and family. You can easily picture its striped, green rind and pink flesh, imagine...

February 10, 2026

Breeding a better cucumber: new genetic map reveals 171,892 structural variants

Cucumber is an economically important crop worldwide, ranking as the third most-produced vegetable after tomatoes and onions. Yet breeding improved varieties—plants that are more resilient, produce better-shaped fruit, or are...

August 12, 2025

BTI, Meiogenix, and FFAR Announce $2 Million Breakthrough Tomato Genetics Collaboration

Research Lays the Foundation for Breakthroughs in Global Food Security. In a landmark $2 million initiative, the Boyce Thompson Institute (BTI) and biotechnology company Meiogenix have launched a collaboration to develop drought- and disease-resistant tomatoes by tapping...

August 8, 2025

Decoding Sweetpotato DNA: New Research Reveals Surprising Ancestry

The sweetpotato feeds millions worldwide, especially in sub-Saharan Africa, where its natural resilience to climate extremes makes it crucial for food security. But this humble root vegetable has guarded its...

December 12, 2024

Study Reveals Role of Allele Dosage in Improving Sweetpotato Traits

Sweetpotatoes are an agricultural powerhouse that feeds millions globally. However, their complex genetics make it challenging for breeders to understand and improve traits like yield, disease resistance, and nutritional content....

November 25, 2024

Study Finds Genetic Mechanisms Behind High-Yield Apple Trees

Apples rank among the world’s most valuable fruit crops, with production spanning more than 100 countries. Some apple trees naturally develop into what farmers call “spur-type” varieties—compact trees that are...

Research Experience

Internships

BTI offers a summer research experience program for undergraduate and high school students.

Intern Projects in the Fei Lab

Genomics and bioinformatics have revolutionized plant research and crop breeding. Reference genomes have played a central role in advancing basic research, gene/QTL cloning, molecular marker discovery, marker-assisted breeding, and our understanding of genome evolution and crop domestication. However, reference genomes derived from only one or a few accessions cannot fully capture the genetic diversity within a crop species, leading to the loss of significant and valuable genetic information. To address this limitation, Dr. Fei’s group has focused on comprehensive investigations into the pangenome of horticultural crops to better understand the genetic basis of their origin, domestication and key agronomic traits.

Previous Interns

Aaron Alexander

Generating a Phased Genome Assembly of the Hexaploid Sweetpotato Cultivar, ‘New Kawogo’

Sweet potato is among the most important staple crops. Sweet potato cultivars rich in Vitamin A have been produced and promoted in areas where childhood Vitamin A deficiency is common. Sweet potato improvement is challenged by a lack of knowledge of the genetic and molecular basis of key agronomic traits. This research project aims to generate a phased genome assembly of the hexaploid sweet potato cultivar, ‘New Kawogo’. PacBio HiFi sequencing was used to produce highly accurate long reads and the reads were assembled into phased contigs with Hifiasm, a de novo assembler that produces haplotype-resolved genome assemblies by integrating chromatin conformation capture (Hi-C) sequencing data. Due to the complexity of the hexaploid sweet potato genome, chimeric contigs resulting from erroneously connected sequences from different haplotypes were present in the initial assembly. By taking advantage of the phased genetic maps, Hi-C contact maps and genome synteny, misassemblies were corrected in the ‘New Kawogo’ assembly. These corrected contigs have been used to produce a haplotype-resolved chromosome-level genome assembly of ‘New Kawogo’, which provides a valuable resource for the discovery of genetic controls of important traits and genomics-assisted improvement of sweet potato. This assembly serves as a foundation for the genetics and biology of ‘New Kawogo’ and will accelerate sweet potato breeding.

I am grateful for the opportunity to conduct summer research in the Fei lab at BTI under Shan Wu’s mentorship. Being part of the team working on the phased genome assembly of the hexaploid sweet potato cultivar ‘New Kawogo’ has been immensely rewarding, as this research supports biofortification efforts to promote Vitamin A-rich cultivars in regions of the world affected by childhood Vitamin A deficiency. I expanded my knowledge of genomics, including genetic markers, Hi-C contact signals, and genome synteny. The guidance and collaborative environment have greatly enriched my learning and research skills. With the help of my mentor and the BTI BCBC bioinformatics course, I learned the basics of UNIX, coding in R, and bioinformatics tools like BLAST, Seqkit, and BUSCO. I am also grateful for the REU programs, such as weekly seminars and the DGS Graduate School Panel, which helped me learn about and decide to pursue higher studies.

Intern Info

Year 2024

School Virginia Commonwealth University

Faculty Advisor Zhangjun Fei

Mentor Shan Wu

Grace Coppinger

Identification of structural variants affecting fruit quality traits between wild and cultivated watermelons

Watermelon, Citrullus lanatus, is among the top five most consumed fruits globally. Its domestication has led to significant changes in fruit quality traits, including higher sugar content compared to its wild relatives such as Citrullus amarus. Genomic structural variants (SVs) have been reported to contribute to domestication traits. However, our knowledge about SVs between cultivated and wild watermelons and their phenotypic effects remains incomplete. To address this, a comparative study was conducted between a representative accession of C. lanatus subsp. vulgaris and a representative accession of C. amarus. Alignment of the high-quality genomes of these two accessions identified 111,738 SVs larger than 20 bp, which affected 19,482 and 19,096 genes in the cultivated and wild genomes, respectively. We found that 1,107 genes in the cultivated genome and 838 genes in the wild genome carried SVs affecting coding sequences, among which 25 had annotated functions related to disease resistance. RNA-seq data from cultivated and wild fruit flesh tissues were used for differential gene expression analysis. As a result, 2,732 genes in the cultivated genome and 4,884 genes in the wild genome were differentially expressed during fruit development. Through integrating SV and differential gene expression information, we identified candidate genes potentially crucial for fruit development and sugar content. This study has advanced our understanding of SVs and genes potentially affecting fruit quality traits and disease resistance. The SVs identified here serve as useful resources to facilitate future watermelon breeding efforts.
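
The integration step described above can be pictured as a set intersection: genes that carry a coding-sequence SV and are also differentially expressed become candidates. The following is only a minimal sketch under that assumption; the function name and gene IDs are hypothetical and the project's actual criteria are not given here.

```python
def candidate_genes(sv_genes, de_genes):
    """Return genes that carry an SV and are also differentially expressed."""
    return sorted(set(sv_genes) & set(de_genes))


# Hypothetical gene IDs, for illustration only.
sv_affected = {"geneA", "geneB", "geneC"}
diff_expressed = {"geneB", "geneC", "geneD"}

candidates = candidate_genes(sv_affected, diff_expressed)
print(candidates)
```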

My experience at Boyce Thompson Institute this summer revolves around the incredible people I met. From engaging in lab work throughout the day to having fun playing sand volleyball, this has been a memorable summer. I am so grateful for all the friendships I’ve made during this time, and I look forward to witnessing all the remarkable achievements my peers will accomplish in the future. During my time at the Fei lab, I improved my bioinformatics skills and expanded my scientific knowledge and experience. I have a newfound appreciation for those practicing computer science and analyzing large data sets. This rewarding and challenging experience has amplified my desire to pursue a Ph.D. I am excited to continue my scientific career and put my new skills to use.

Intern Info

Year 2023

School University of South Alabama

Faculty Advisor Zhangjun Fei

Mentor Honghe Sun

Mukund Gaur

Identification of Differentially Expressed Genes in F1 from the Cross between S. lycopersicum M82 and S. pennellii LA0716

The modern cultivated tomato is characterized by a lack of genetic diversity due to extensive selective breeding. In contrast, wild tomatoes continue to display broad morphological and metabolic diversity, including in sugars, organics, and volatiles. Identifying genes with allele-specific expression (ASE) related to key agronomic traits in the F1 hybrid derived from the cross between the cultivated S. lycopersicum M82 and the wild S. pennellii LA0716 could deepen our understanding of the regulation of tomato nutritional quality and flavor. This project used RNA sequencing (RNA-Seq) data from different fruit tissues at different developmental stages of the F1 hybrid. RNA-Seq reads were processed and mapped to the genomes of the two parents separately. Based on the mapping quality to each parent genome, the reads were assigned to one of the parents, and raw and normalized (FPKM) read counts for each of the two alleles were calculated for all genes. ASE genes were then identified and weighted gene co-expression analysis was performed. Gene ontology analysis for genes within the module with highly differential allele expression identified enriched biological processes, and genes with functions relevant to fruit flavor and nutrition were identified. The identification of these candidate genes could allow for future functional analysis that will deepen our understanding of how cross breeding with wild species could improve flavor and nutritional quality in cultivated tomatoes.
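
As a rough illustration of the read-assignment and normalization steps, the sketch below assigns each read to the parent genome with the higher mapping quality and computes FPKM with the standard formula (reads × 10^9 / (gene length in bp × total mapped reads)). The tie-handling rule, function names, and example values are assumptions made for illustration; the project's exact criteria are not stated here.

```python
def assign_read(mapq_m82, mapq_la0716):
    """Assign a read to the parent with the higher mapping quality.

    Assumption: ties are treated as ambiguous and left unassigned (None).
    """
    if mapq_m82 > mapq_la0716:
        return "M82"
    if mapq_la0716 > mapq_m82:
        return "LA0716"
    return None


def fpkm(read_count, gene_length_bp, total_mapped_reads):
    """Fragments per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical (MAPQ vs. M82, MAPQ vs. LA0716) pairs for reads from one gene.
reads = [(60, 20), (10, 60), (60, 60), (60, 0)]

counts = {"M82": 0, "LA0716": 0}
for mapq_a, mapq_b in reads:
    parent = assign_read(mapq_a, mapq_b)
    if parent is not None:
        counts[parent] += 1

# Normalize the M82 allele count for a hypothetical 2,000 bp gene
# in a library of 1,000,000 mapped reads.
m82_fpkm = fpkm(counts["M82"], 2000, 1_000_000)
print(counts, m82_fpkm)
```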

Through this experience at the BTI lab, I was able to gain invaluable knowledge not only about the tomato genome and RNA-seq technology, but also about data analysis techniques and strategies. Throughout this process, I learned how to use the R, Linux, and Python programming languages for data analysis, and also used code management tools such as RPubs and GitHub. It was amazing to experience working in a laboratory environment, have access to the tools offered, and conduct an experiment with real-world applications. I did not come into this program with a lot of background in biology or plant science, so I was able to learn about both fields through my own research and the weekly seminars, which were a really great way for me to learn about current research in plant science.

Intern Info

Year 2023

School Ithaca High School

Faculty Advisor Zhangjun Fei

Mentor Jiantao Zhao

Benjamin Beer

Project Summary:

Watermelon (Citrullus lanatus) is a prevalent crop in many countries thanks to its high nutritional value and refreshing taste. Despite this popularity, it is still challenging to successfully cultivate watermelons with negligible crop loss. This is due in part to watermelon’s narrow genetic diversity caused by a domestication bottleneck which leaves it vulnerable to various diseases. Wild species within the Citrullus genus have exhibited resistances to some of the most important diseases. Effectively hybridizing cultivated watermelons with the wild forms could offer opportunities to create a more resistant crop. To successfully utilize the genetic diversity preserved in the wild watermelons and guide efficient selection of breeding materials that carry beneficial traits, analysis on gene presence/absence variations (PAVs) was performed in 480 watermelon accessions belonging to cultivated watermelon and three wild relatives, C. mucosospermus, C. amarus and C. colocynthis, with a super pan-genome capturing genes existing in different watermelon species. Characterizing gene PAVs in the watermelon super pan-genome demonstrated the divergence among the wild and cultivated watermelons. Through comparative analysis, functionally important genes with significantly changed occurrence frequencies between the wild and cultivated watermelons were identified. These included disease resistance genes that were lost in the cultivated watermelon and could be brought back from the wild watermelons.
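
One way to flag genes whose occurrence frequency differs between wild and cultivated accessions is a two-sided Fisher's exact test on presence/absence counts. The test actually used in the project is not stated, so the choice of test, the function name, and the counts below are assumptions for illustration only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: wild, cultivated. Columns: gene present, gene absent.
    Returns the summed probability of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x "present" calls in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical gene: present in 10 of 10 wild accessions, 0 of 10 cultivated.
p_lost = fisher_exact_two_sided(10, 0, 0, 10)
# Hypothetical gene with identical frequencies in both groups.
p_same = fisher_exact_two_sided(5, 5, 5, 5)
print(p_lost, p_same)
```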

My Experience:

I learned a lot during my time working in the Fei lab. I acquired a variety of technical skills such as working in Linux environments and writing scripts in R, while also being able to expand upon previously developed skills, namely Python and Bash. Learning how to use bioinformatics tools like samtools and Blast2GO was interesting and gave me further insights into the nuances of genetics. This internship also served as an introduction to the scientific community. The symposiums, other lectures, and lab work all showed me common practices and skills used in research while also greatly broadening my knowledge of plants. Overall, this was a very positive and informative experience that served as a great introduction to professional research. Lastly, I would like to thank my mentor, Shan Wu, for teaching me many of the skills I’ve acquired while at BTI and Dr. Fei for giving me this opportunity.

Intern Info

Year 2022

School Ithaca High School

Faculty Advisor Zhangjun Fei

Mentor Shan Wu

Adam Cason

“SpinachBasev2: An updated central portal for Spinach genomics tools and information”

Project Summary:

Spinach (Spinacia oleracea) is an important agricultural crop because of its nutritional value, popularity, and other potential applications to human health such as being a vector for edible vaccines. In the summer of 2018, a project was undertaken by the Fei Lab to build a central database for spinach genomic information and a site, SpinachBase, was created. This site houses data and resources related to the first published spinach genome of cultivar Sp75. In the following years, the genomes of four more spinach cultivars, Monoe-Viroflay, Viroflay, 03-009, and Cornell No.9, were sequenced, creating a need to update the public store of spinach genomic knowledge. The release of these additional genome sequences, along with updates made to the Drupal/Tripal web system upon which the original database is built, necessitated the creation of a new spinach genomic database. Thus, SpinachBasev2 has been constructed and is intended to update and replace the existing SpinachBase site, while continuing to provide helpful tools and analyses relating to spinach genetic and genomic data. These tools include a BLAST similarity search function, a keyword search tool, a synteny viewer, and a genome browser. Various bioinformatics programs, such as NCBI BLAST, Blast2GO, AHRD, InterProScan, and MCScanX, were used to format and annotate the genomic data for the database. This database will allow spinach researchers and breeders to more easily and efficiently find and investigate data related to spinach genetics and genomics.

My Experience:

Attending a small, liberal arts university has been great, but has not allowed me to work or research in a large laboratory setting during the regular school year. Here at BTI, I have been able to gain a whole new experience of interacting with a PI along with other undergraduates, graduate students, and postdocs. Being able to work on my own bioinformatics project with my mentor has taught me how to use a myriad of different data annotation programs and coding languages, and has helped me to see what a career in bioinformatics really looks like. I have also become a more well-rounded scientist in general through the weekly lectures from diverse plant science researchers and classes dedicated to science ethics and science communication. All in all, this experience has prepared me in many different areas to confidently pursue graduate school and a career in the plant sciences.

Intern Info

Year 2022

School Samford University

Faculty Advisor Zhangjun Fei

Mentor Jingyin Yu

Charis Qi

“Construction of a graph-based watermelon pan-genome and investigation of genetic variation in cultivated watermelon and its wild relatives”

Project Summary:

Watermelon is one of the most popular and economically important fruit crops worldwide. The cultivated watermelon, Citrullus lanatus subsp. vulgaris, is the product of over 4,000 years of domestication. However, the cultivated watermelon is vulnerable to various diseases, while wild watermelons display resistance to many of these diseases. Wild watermelons have been widely used in modern breeding to introduce disease resistance. However, this process is slow, as it is unclear which specific genes and variants confer disease resistance in wild watermelons. The goal of the project was to identify potentially beneficial genes, including resistance genes, to be introduced into breeding. First, nine genome assemblies from the cultivated watermelon and its direct wild progenitor (C. lanatus subsp. cordophanus) and close wild relative (C. mucosospermus) were aligned to the reference genome ‘97103’ to identify SVs, which were used to construct a graph-based pan-genome. Seventy-five representative accessions were genotyped for the SVs through mapping Illumina short reads to the constructed pan-genome. SVs with significantly higher frequencies in the two wild populations compared to the cultivated population were identified. Genes affected by these SVs that have disease resistance-related functions were further identified, including those encoding NBS-LRR resistance proteins. Selective sweep analysis was also performed to detect genes related to fruit quality that were under selection during domestication. One of the genes identified encodes a RING-type E3 ubiquitin transferase and is located in a sugar content QTL, QBRX2-1, suggesting its potential role in fruit flesh sugar accumulation during watermelon domestication. The genes and variants discovered here could be candidates for further functional characterization and for watermelon breeding.

My Experience:

My time in the Fei lab was very insightful in many aspects and gave me my first immersive research experience in a field that I am interested in. Throughout my project, I was exposed to different programs and coding languages commonly used in bioinformatics, many of which were previously unfamiliar to me. I also have a better understanding of what it means to continue down a career path in this field. On top of all of this, working with highly experienced professors, graduate students, and postdocs in this field was a very eye-opening experience. I also loved meeting other interns with interests and career passions similar to mine. Overall, I had a great time here. This summer at BTI has taught me a lot and given me a better sense of direction for my post-undergraduate years.

Intern Info

Year 2022

School Davidson College

Faculty Advisor Zhangjun Fei

Mentor Honghe Sun

Stefanos Stravoravdis

Identification of structural variations between genomes of cultivated tomato Solanum lycopersicum and its wild progenitor S. pimpinellifolium

Cultivated tomato, Solanum lycopersicum, is an abundantly consumed crop worldwide, acting as a substantial source of nutrients and nourishment. Through selective breeding, multiple traits, such as size and production, have improved within these cultivated varieties; however, such breeding has resulted in allele loss, thereby narrowing genetic diversity. S. pimpinellifolium, the wild progenitor species of S. lycopersicum, possesses several favorable traits missing from the cultivar, prompting breeders to use this species as a new source of alleles for the domesticated species. To aid these breeding efforts, we sought to identify large structural variations (SVs; > 10 bp) near genes involved in important biological or agronomic processes between the genomes of S. lycopersicum cultivar ‘Heinz 1706’ and S. pimpinellifolium accession LA2093. Minimap2 was employed to align the two genomes. Assemblytics and in-house scripts were used to extract the SVs. We then used Python scripts to validate and output SVs that overlapped with or were very close to a gene sequence (in promoter, CDS, or non-CDS regions). Each protein sequence from both genomes was analyzed to identify protein functional domains. The genes underwent Gene Ontology mapping and annotation, and GOATOOLS was used to perform GO enrichment analysis of the resulting data. Through this analysis, breeders can begin to identify favorable alleles present in the progenitor yet absent within the cultivar, and use that information to breed improved tomato varieties expressing beneficial phenotypes.
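
The step of keeping SVs that overlap or lie close to a gene can be sketched as an interval check. This is not the project's script: the 2,000 bp upstream promoter window, the function names, and the coordinates are hypothetical, and CDS versus non-CDS regions are collapsed into a single "genic" label for brevity.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True if two 1-based, inclusive intervals share at least one base."""
    return start_a <= end_b and start_b <= end_a


def classify_sv(sv_start, sv_end, gene_start, gene_end, strand, promoter_bp=2000):
    """Label an SV relative to one gene: 'genic', 'promoter', or None.

    Assumption: the promoter is the promoter_bp bases immediately upstream
    of the gene, which depends on the gene's strand.
    """
    if overlaps(sv_start, sv_end, gene_start, gene_end):
        return "genic"
    if strand == "+":
        prom_start, prom_end = gene_start - promoter_bp, gene_start - 1
    else:
        prom_start, prom_end = gene_end + 1, gene_end + promoter_bp
    if overlaps(sv_start, sv_end, prom_start, prom_end):
        return "promoter"
    return None


# Hypothetical gene at 10,000-12,000 bp on the plus strand.
print(classify_sv(10500, 10600, 10000, 12000, "+"))
print(classify_sv(9000, 9100, 10000, 12000, "+"))
print(classify_sv(5000, 5100, 10000, 12000, "+"))
```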

My Experience

As an intern in the NSF Plant Genome Research Program, I gained a wonderful opportunity to hone my bioinformatics skills in order to analyze a fascinating research topic. I am grateful for the opportunity given to me by the Fei lab and my mentor, Dr. Gao. Although I was already familiar with how to conduct research in the lab or on the computer (though not specifically with plant systems), I gained an important understanding of how a variety of different approaches can be taken to extract and analyze information from genomic data. I have confidence that I can use these new techniques and tools to further my own research endeavors, whether for the remainder of my undergraduate studies or for my future graduate research.

Intern Info

Year 2019

School Eastern Connecticut State University

Faculty Advisor Zhangjun Fei

Mentor Lei Gao

Charles Wang

Identifying Bottle Gourd Genes Responsive to the Infection of Papaya Ringspot Virus

Bottle gourd is an important crop that has helped both ancient and modern civilizations thrive. Besides its applications in medicine, musical instruments, and containers, bottle gourd fruit has relatively high nutritional value in its early stages of development, establishing it as a major food staple in developing countries. Another essential property of bottle gourd lies in its ability to be used as rootstock for grafting to other cucurbit crops. By exploiting this property, farmers can drastically increase their annual yield because grafting increases the scion’s tolerance to abiotic and biotic stresses. That being said, it is necessary that we develop novel methods of increasing pathogen resistance in bottle gourd, as doing so will not only improve the uses that we already have for this crop, but it will also magnify the benefit to other cucurbit crops. On the other hand, Papaya ringspot virus (PRSV) is a major limiting factor of cucurbit production, and it effectively thwarts the benefits of bottle gourd by inhibiting development in young plants and deforming the fruit of mature specimens. To gain insight into the molecular mechanisms underlying bottle gourd resistance to PRSV, we assessed global transcriptome changes in both resistant (USVL5) and susceptible (USVL10) bottle gourd accessions upon infection at 7, 14, and 21 DPI (days post infection). At the beginning of our study, we obtained RNA-Seq transcriptome data, which were subsequently cleaned through the removal of adaptors, low-quality sequences, virus contamination, and rRNA sequences. We then mapped each read to the reference bottle gourd genome, and differentially expressed genes (DEGs) were then identified at each time point for each accession based on the read counts obtained from alignments to the reference genome. Ultimately, we found more DEGs in the susceptible accession upon the PRSV infection than in the resistant accession, especially at 7 DPI. 
Furthermore, by analyzing the expression patterns of DEGs in the two genotypes and at different days post infection, we identified four clusters of genes with different expression characteristics. Gene Ontology (GO) term enrichment analyses were carried out on these clusters, and significantly overrepresented biological processes and molecular functions were identified. Our results indicated that while genes involved in DNA replication, protein refolding, and stress responses were upregulated in Clusters 1 and 2, genes related to photosynthesis and hormone production were downregulated in Clusters 3 and 4. Interestingly, two genes encoding Argonaute (AGO) proteins, which are essential to antiviral RNA silencing, were upregulated in Clusters 1 and 2, respectively. RNA silencing is one of the known mechanisms underlying plant resistance to virus, and further investigation into the genes involved in the antiviral RNA silencing pathway, such as those encoding RNA-dependent RNA polymerase (RDR), Dicer-like (DCL) proteins, and AGO proteins, showed that their expression levels all started out high in the USVL10 accession at 7 DPI. Overall, our study provides greater insight into the responses of bottle gourd accessions with different levels of resistance to PRSV, which could be applied to future crop improvement through breeding and genetic modification.

My Experience

At the beginning of the PGRP internship, I walked into BTI’s facility with the hope that I could successfully combine my interests in computer science and biology while also learning more about research, data analysis, and lab etiquette. Now, as my internship is drawing to a close, I can state – with a degree of certainty – that my experience at BTI has taken my understanding of all these areas to another level. Although I had conducted many research projects at school, my work at BTI exposed me to a professional lab setting unlike any that I had worked in before. Challenges that I faced, such as adapting to programming languages that I had no prior experience with, were admittedly daunting, but overcoming them was truly the biggest step in my growth as a member of the Fei Lab. All in all, working at BTI has given me a higher level of appreciation for conducting lab research, and in the future, I hope to apply the skills that I have learned this summer in a manner that benefits not only myself but also the community as a whole.

As a final note, I would like to thank my mentor, Shan Wu, for guiding me through the research process, as well as Dr. Fei for giving me such an incredible opportunity to work at BTI. In addition, I would also like to thank Mr. Dempsey, Mrs. Pawlowska, and Mrs. McDonald, for they are the individuals who continually motivate me to work hard and follow my dreams.

Intern Info

Year 2019

Faculty Advisor Zhangjun Fei

Mentor Shan Wu

Patrick Yuan

Genetic Characterization of Cucumber (Cucumis sativus) Using Whole Genome Re-sequencing

The cucumber (Cucumis sativus), a member of the Cucurbitaceae botanical family, is a widely grown creeping vine plant that produces fruits which are commonly used as vegetables. Important traits in crops, such as yield, resistance to diseases and insects, ease of harvest, and nutritional value, depend on genetic variation. The cucumber germplasm in the National Plant Germplasm System (NPGS), which consists of approximately 1300 accessions, has been previously characterized using Genotyping by Sequencing (GBS) technology, which covers only a small portion of the genome. A core collection of 395 cucumbers was inferred from this cucumber germplasm, among which the genomes of a total of 149 cultivated cucumbers have been resequenced to date. Analysis of the genome resequencing data, together with that from 5 wild cucumbers (C. sativus var. hardwickii) representing the outgroup, resulted in a variation map of more than 1.6 million high-quality single-nucleotide polymorphisms (SNPs) distributed across the cucumber genome at approximately 1 SNP per 153 bp. Principal component analysis (PCA), population structure, and phylogenetic analyses using this variation map, as well as a heatmap of a Hamming distance matrix, supported three major clades of these cucumber accessions: one with origins in India/South Asia, a second with origins in East Asia, and a third with origins in Central/West Asia, Turkey, Europe, Africa, and North America. Additionally, the neighbor-joining phylogenetic tree identified cucumbers from India/South Asia as the closest relatives to the wild cucumbers, and the heatmap revealed a relatively high level of genetic diversity within the India/South Asia accessions; both of these findings are consistent with the current understanding of India as the center of origin for cultivated cucumbers. Population structure analysis further identified North American and African groups as subclades of the third clade. 
The variation map produced in this project, combined with that from the rest of the core cucumber accessions, provides a valuable resource that can help identify variants significantly associated with important traits.
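As a rough illustration of the Hamming distance matrix mentioned in the summary, the following Python sketch counts, for each pair of accessions, the SNP sites at which their genotype calls differ. The accession names, the 0/1/2 genotype coding, and the handling of missing calls are assumptions made for this example and are not taken from the project.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Count SNP sites where two accessions carry different genotype calls.

    Sites with a missing call (None) in either accession are skipped.
    """
    if len(a) != len(b):
        raise ValueError("genotype vectors must cover the same SNP sites")
    return sum(
        1 for x, y in zip(a, b)
        if x is not None and y is not None and x != y
    )


def distance_matrix(genotypes):
    """Build a symmetric accession-by-accession matrix as nested dicts."""
    names = sorted(genotypes)
    matrix = {n: {m: 0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = hamming_distance(genotypes[n], genotypes[m])
        matrix[n][m] = matrix[m][n] = d
    return matrix


# Hypothetical toy data: 0 = homozygous reference, 1 = heterozygous,
# 2 = homozygous alternate, None = missing call.
toy = {
    "acc_A": [0, 0, 2, 1, 0],
    "acc_B": [0, 2, 2, 1, None],
    "acc_C": [2, 2, 0, 1, 0],
}
matrix = distance_matrix(toy)
for name in sorted(matrix):
    print(name, [matrix[name][other] for other in sorted(matrix)])
```

A matrix of this kind can be passed to any heatmap plotting routine; with millions of SNPs an array-based implementation would be more practical than nested dictionaries.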

My Experience

My experience in the Fei lab this summer has allowed me to explore how computer science is applied in bioinformatics. With the help of my mentor and fellow BTI interns, I learned about the steps and tools that are used when collecting, processing, and analyzing genetic data, as well as various biological concepts such as genetic diversity and population structure. This internship has also improved my programming skills and helped me learn new languages, such as Perl and R. Overall, my experience in the Fei lab was my first in academic research and has helped me understand biological applications of computer science, as well as potential careers that I could pursue in the future.

Intern Info

Year 2019

Faculty Advisor Zhangjun Fei

Mentor Xin Wang

Christopher Neely

“Pan-genomic analysis of Solanum habrochaites, a wild tomato plant”

Project Summary:

Solanum habrochaites is a diploid, wild tomato species that grows on the slopes of the Andes Mountains. Its unique phenotype includes glandular trichomes on the fruit, and these trichomes have been shown to be related to sesquiterpenes and other chemicals that repel insects. Because of these and other distinctive traits, it has been commonly used as an important source of novel genes for tomato breeding. Therefore, we are interested in better understanding the genomic differences between S. habrochaites and the cultivated tomato, S. lycopersicum. To accomplish this, we constructed a pan-genome based on recent sequencing data from seven available S. habrochaites accessions. For each accession, we de novo assembled quality-filtered reads, aligned the assembled contigs to the reference genome of S. lycopersicum, and then extracted unaligned sequences. A total of 354.4 Mbp of non-reference sequences were obtained and annotated, yielding 4,002 protein-coding genes, of which 3,736 were functionally annotated. Enrichment analysis showed that Gene Ontology (GO) terms related to protein binding, ligase activity, and DNA helicase activity were significantly overrepresented in the non-reference portion of the pan-genome. The presence/absence variation (PAV) analysis showed that the core genome comprises over 27,000 genes, and that most genes are shared by all the accessions, with few genes specific to 5 or fewer accessions. Further analysis of genes specific to S. habrochaites will facilitate interpretation of its distinctive traits and provide useful information for future tomato breeding.
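The presence/absence variation analysis described above can be sketched in a few lines of Python. This is only an illustration: the gene and accession identifiers are hypothetical, and the rule used here (a gene is "core" when it is present in every accession) is an assumed definition rather than the project's documented criterion.

```python
def classify_pav(pav, n_accessions):
    """Split genes into core (present in all accessions) and dispensable.

    pav maps gene ID -> set of accession IDs in which the gene is present.
    """
    core, dispensable = [], []
    for gene, present_in in sorted(pav.items()):
        if len(present_in) == n_accessions:
            core.append(gene)
        else:
            dispensable.append(gene)
    return core, dispensable


# Hypothetical toy data for seven accessions named acc1..acc7.
accessions = {f"acc{i}" for i in range(1, 8)}
toy_pav = {
    "geneA": set(accessions),                # present everywhere
    "geneB": accessions - {"acc3"},          # missing from one accession
    "geneC": {"acc1", "acc2"},               # present in only two accessions
    "geneD": set(accessions),
}
core, dispensable = classify_pav(toy_pav, len(accessions))
print("core:", core)
print("dispensable:", dispensable)
```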

My Experience:

Working at the Boyce Thompson Institute in the Fei Bioinformatics Laboratory has given me a wonderful glimpse of the life of a researcher. Completing this project required a lot of studying and planning. I learned that the answer for how to do something often comes from the community of researchers seeking to complete similar tasks or to answer similar questions. I became a lot more comfortable with finding information on my own and with relying on my mentor for larger picture issues. I also learned the importance of effective communication with my mentor during the project. Working at BTI has strengthened my resolve to be a professional researcher, and I know that I want to continue my career in the field of biology. Once I return to the West Coast, I plan to continue to expand my skill set in analyzing big data.

Intern Info

Year 2018

School Rio Hondo Community College

Faculty Advisor Zhangjun Fei

Mentor Lei Gao

Sophia Hu

“Identifying long non-coding RNA, misannotated and novel genes in the watermelon genome using PacBio Iso-Seq”

Project Summary:

Watermelon (Citrullus lanatus) is an economically important and widely cultivated vegetable crop in the cucurbit family, which also includes cucumber, pumpkin, squash and muskmelon. An improved watermelon genome would be an important resource for research on watermelon and its close relatives. In this project, to improve the watermelon genome annotation and to identify long non-coding RNAs (lncRNAs), we generated large-scale transcriptome sequences using PacBio Iso-Seq technology from mixed watermelon tissues. Errors in the transcriptome sequences were corrected using Illumina RNA-Seq data and then full-length transcript isoforms were extracted. A total of 96.5% of the isoforms could be aligned to the watermelon reference genome.

Based on the alignments, we identified a total of 1,326 lncRNAs in the watermelon genome, including 49 intronic, 845 intergenic and 432 antisense. We also found 350 novel genes that were not previously annotated in the reference genome, which could code for proteins such as a defensin-like protein and a Mads1 protein. We also identified 851 potential errors in the previous annotations, where genes annotated as separate in the reference genome should be combined because multiple full-length reads spanned those genes. The improved gene predictions in the watermelon genome, as well as the newly identified lncRNAs, are valuable resources for research on watermelon and for an overall better understanding of the cucurbit family.
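The idea of flagging separately annotated genes that are spanned by multiple full-length reads can be illustrated with a small Python sketch. The coordinates, gene IDs, and the minimum of two supporting reads are hypothetical choices for this example; an actual analysis would also need to consider strand and exon structure, which are ignored here.

```python
from collections import Counter


def overlapped_genes(read, genes):
    """Return the IDs of annotated genes that a read alignment overlaps.

    read is (chrom, start, end); genes is a list of (gene_id, chrom, start, end).
    """
    chrom, start, end = read
    return tuple(sorted(
        gene_id for gene_id, g_chrom, g_start, g_end in genes
        if g_chrom == chrom and g_start <= end and start <= g_end
    ))


def merge_candidates(reads, genes, min_reads=2):
    """Find gene groups overlapped together by at least min_reads reads."""
    support = Counter()
    for read in reads:
        hit = overlapped_genes(read, genes)
        if len(hit) >= 2:
            support[hit] += 1
    return {group: n for group, n in support.items() if n >= min_reads}


# Hypothetical toy annotation and full-length read alignments.
toy_genes = [
    ("g1", "chr1", 100, 500),
    ("g2", "chr1", 600, 900),
    ("g3", "chr1", 2000, 2500),
]
toy_reads = [
    ("chr1", 120, 880),    # spans g1 and g2
    ("chr1", 150, 870),    # spans g1 and g2
    ("chr1", 2010, 2400),  # lies within g3 only
]
candidates = merge_candidates(toy_reads, toy_genes)
print(candidates)
```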

My Experience:

My internship at BTI has been a very valuable and memorable experience. Prior to BTI, I had taken both computer science and biology classes; however, I had never combined the two for research. Through this experience, I have gained a better grasp of the command line, received exposure to a multitude of pipelines and software commonly used in the field of bioinformatics, and experienced a research project in its beginning, middle and end stages. My mentor, Xin Wang, was very supportive and both guided and challenged me throughout the project. After listening to BTI researchers talk about their work and its real-world applications, my interest in and curiosity about plants and bioinformatics have increased significantly.

Intern Info

Year 2017

School University of Maryland, Baltimore County

Faculty Advisor Zhangjun Fei

Peter Kohler

“Ethylene mediated epigenome changes in ripening climacteric melon”

Project Summary:

Ripening is a complex process that dramatically alters fruit color, flavor, firmness, and nutritional content. Improving our understanding of this process promotes the ability to optimize crop yields, disease resistance, and fruit quality. Ethylene is a natural plant hormone and is best known as a crucial regulatory component for climacteric fruit ripening. However, how ethylene controls ripening at the epigenome level is poorly understood.

To investigate whether ethylene mediates gene expression through DNA methylation, a climacteric melon and an ACC oxidase RNAi mutant with inhibited ethylene production were selected for bisulfite sequencing (BS-Seq) analysis. By comparing the methylation level between wild-type and mutant melon, an overall demethylation pattern was observed in the mutant fruit. We identified 10,243 and 52,829 differentially methylated regions (DMRs) for the CG and CHH cytosine contexts, representing 0.32% and 1.79% of the melon genome, respectively.

Furthermore, 670 CG and 5,194 CHH DMRs intersect with 570 and 2,574 genes, respectively, of which 73% and 75% were differentially expressed based on a previous transcriptome study using the same samples. Hypo-methylation of promoters of ethylene signal transduction and biosynthesis pathway genes in the wild-type melon indicates a role for DNA methylation in the ethylene positive feedback loop. Hypo-methylated promoters and upregulated expression of some ripening-related genes in the wild-type melon, e.g. STAY-GREEN and alcohol acyl transferase, indicate that ethylene regulates these ripening genes through DNA methylation.

My Experience:

Coming from a heavy computational background, this internship has been a wonderful learning experience for me, as it provided a project whose core activities catered to my strengths, while also requiring me to obtain an understanding of its biological significance. I have gotten to learn Perl, deepen my knowledge of GNU make and R, and practice analyzing the interrelated facets of biological systems, which require a different sort of consideration than the systems I am used to dealing with. I have had to learn to manage large data sets, and consequently also obtained a rare opportunity to practice performance-conscious programming.

What I appreciate most, however, is how being in BTI’s environment—listening to presentations and getting into conversations—has increased my grasp of the broader biological picture, made me aware of exciting developments and discoveries, and answered some of my long-held questions and confusion. I feel that I have been given a solid introduction to the concerns, interests, tools and mindsets of this field, which is exactly what I was hoping to obtain this summer.

Intern Info

Year 2017

School Liberty University

Faculty Advisor Zhangjun Fei

Michael Morikone

Identification of Long Non-Coding RNA and Alternative Splicing Events During Tomato Fruit Development

Project Summary

Long non-coding RNAs (lncRNAs) are a class of regulatory RNA that are longer than 200 nucleotides and do not code for protein. Alternative splicing (AS) is a process in gene transcription where a multi-exon gene is spliced into two or more different mature transcripts. Solanum lycopersicum, a model plant for fleshy fruit development, was studied using RNA-Seq analysis for lncRNAs and AS events in order to better understand the molecular mechanisms of fleshy fruit development. Five distinct stages of tomato fruit with three biological replicates were used to monitor the changes in cell expansion, maturation and ripening. My project was based around generating a comprehensive list of isoforms from the RNA-Seq data in order to identify lncRNAs, AS events, and new protein-coding genes. This dataset contained 565 lncRNAs, 162 of which showed significant changes in expression during fruit development. In addition, 289 paired isoforms showed significant differential expression between stages, and 214 new protein-coding genes were identified. A new lncRNA was identified that is an example of a cis-natural antisense transcript, which showed an inverted expression pattern to a zeta-carotene isomerase gene, a key component of lycopene biosynthesis, suggesting a possible regulatory role of this lncRNA in carotenogenesis. Identification of these lncRNAs and AS events will provide valuable resources for future research on the molecular mechanisms of gene regulation in fruit development, leading to the betterment of fruiting crops.

My Experience

I am an undergraduate student at California State University San Bernardino where I am close to graduating with degrees in both bioinformatics and biology. The bioinformatics program at my home institution is a combination of discrete biology, computer science, and mathematics courses, so having this bioinformatics experience at BTI has been very valuable. While I have had other bioinformatics related research experiences, my time at BTI has proved to be the most helpful as a budding scientist. The relationship with my mentor was very informative and allowed me to learn many things that I have not had the opportunity to otherwise. Outside of my project, there were also seminars and lab meetings that were commonly held that provided a scientific forum which further fed my interests in bioinformatics. After this internship experience, I am much more confident in my pursuit of a Ph.D. in bioinformatics or computational biology.

Intern Info

Year 2016

School California State University, San Bernardino

Faculty Advisor Zhangjun Fei

Lisa Yoo

Identifying dynamically expressed genes during sweetpotato root development

Project Summary

The sweetpotato (Ipomoea batatas) is the seventh most important crop in the world. With its high nutritional quality and relatively low labor cost, it serves as a major food security crop for many sub-Saharan African nations. Though traditional breeding techniques have been implemented in order to boost the yield and quality of the sweetpotato, these have not been very effective. Therefore, it is necessary that we gain more genomic and genetic information about this crop to facilitate the development of next generation breeding tools. There are three types of sweetpotato roots: fibrous roots (FR), pencil roots, and storage roots (SR), of which the storage roots are mainly consumed. For this project, we generated transcriptome profiles of total roots at 10 and 20 DAT (days after transplant) and fibrous roots and storage roots at 30, 40, and 50 DAT, for Beauregard, one of the world’s most popular varieties of sweetpotato. The RNA-Seq reads were mapped to the reference genome of I. trifida. To examine differential gene expression, we made comparisons between samples of different developmental stages in SR, and between FR and SR samples. We further evaluated these results using gene ontology and enzyme pathway analysis. Our results allowed us to identify candidate genes in the carbohydrate metabolism pathway that are important in storage root development, which is potentially useful for increasing the yield and quality of the sweetpotato.

My Experience

My summer at BTI was such an incredible learning experience. Coming into the program, I had almost no background in plant biology, research, or computer science. However, as the internship is drawing to a close, I can now say that I have developed knowledge and skills in all of these areas. Every day, I was faced with new obstacles, whether it was writing my research proposal, figuring out how to run a script on the server, or presenting my research in front of my lab. Through dealing with these tasks, I learned responsibility and independence. Working in a real laboratory has given me a higher level of appreciation for the challenges and excitement of conducting research.

I’d like to thank my mentor, Shan Wu, for helping me through every step of my research process. I’d also like to thank Dr. Fei, the whole Fei lab, the other interns, and Tiffany Fleming for making my experience so valuable and memorable.

Intern Info

Year 2016

School Ithaca High School

Faculty Advisor Zhangjun Fei

Mentor Shan Wu

Haley Wight

De Novo Assembly and Annotation of the Glomus versiforme Genome

Project Summary

Glomus versiforme is a species of arbuscular mycorrhizal (AM) fungi. This type of fungus forms a mutualistic relationship with most vascular plants: the fungus colonizes the plant root to acquire carbon, and supplies the plant with phosphate in return. Phosphorus is added to fertilizers because of its importance to plant growth, but this practice is expensive and inefficient. Genomic information would facilitate the exploration of the underlying mechanisms involved in this ecologically important symbiosis. Although Glomus is the largest genus of AM fungi, no member of the Glomus genus has been sequenced. Our collaborators used Next Generation Sequencing (NGS) technology to gather a large collection of short sequencing reads. The focus of my project was to assemble this data into a continuous genome and annotate protein-coding genes. To prepare this data for assembly, the adapters and barcodes were removed, repeated reads were collapsed, and sequence errors were corrected. Through the analysis of the large-scale short sequences, we estimated the genome size of G. versiforme to be around 311.6 Mb, making this the largest sequenced fungal genome to date. Sequence analysis also indicated the genome is highly homozygous and highly repetitive. The high-quality cleaned short reads were then assembled de novo using SOAPdenovo2, and the resulting assembly has a total size of 151.2 Mb and an N50 of ~15 Kb. Once the genome was assembled, a set of conserved eukaryotic genes, ESTs, and protein evidence were aligned to train ab initio protein predictors. The results from these predictors were consolidated into a genome annotation of 9,546 genes using the MAKER pipeline. This genome assembly and annotation will provide a basis for future research on symbiosis mechanisms of G. versiforme.
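For readers unfamiliar with the N50 statistic quoted above, the following Python sketch shows the usual definition: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly size. The contig lengths are made-up values for illustration only.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if running * 2 >= total:
            return length


# Hypothetical contig lengths (total = 300).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

A higher N50 generally indicates a more contiguous assembly, although it says nothing about correctness.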

My Experience

I am a Bioinformatics major at Ramapo College. However, most of the courses are focused on either computer science or biology independently. Being involved in this research allowed me to completely integrate my interest in both fields. This internship had a wealth of resources: I was able to work with Next-Generation Sequencing data for the first time on about 30 cores. The most valuable resource was the guidance from my mentor, who taught me new ways to troubleshoot problems, advanced techniques within the UNIX environment, and many other skills necessary to succeed in the research environment. BTI also held professional events outside the lab environment for interns, which gave me a forum to learn about other research projects as well as seek guidance from others in my field. Overall, this internship has confirmed my interest in bioinformatics and has given me the confidence to pursue a Ph.D. in bioinformatics.

Intern Info

Year 2015

School Ramapo College of New Jersey

Faculty Advisor Zhangjun Fei

Rachel Blakely

Identification of copy number variations (CNVs) in wild and Eurasian cucumber populations

Project Summary

The Eurasian cucumber population has been domesticated in order to select for traits that improve growth and consumption. In doing so, the population has undergone a number of genetic changes. Discovering which genes vary between populations and which remain the same can help scientists understand the observable differences between the populations. It is valuable to investigate such differences in common fruits and vegetables because, as food sources, they should be bred to be as nutritious as possible. In many cases, domestication has removed or altered genes with negative effects on crop traits and in the process may have also removed beneficial genes such as those that help with disease resistance or nutritional value. Discovering these inadvertently altered genes will allow breeders to put them back into the genome. In order to investigate the differences between the wild and Eurasian cucumber populations, copy number variations (CNVs) were identified using cn.mops. The program distinguished 943 different regions in the cucumber genome that contain CNVs. Among these regions was one containing the F locus, which has been implicated in causing gynoecy in cucumbers. Many of the identified regions contain known genes, but other regions do not have any known genes. Discovering genes in the CNV regions will provide useful knowledge for future genomics-based breeding that could improve important aspects of the plant.

My Experience

I have worked in two other Bioinformatics labs as an undergraduate; however, I have learned more from this experience at BTI than from either of the others. Working with a mentor for the first time, I had to adjust from working independently to collaborating with a group of researchers. While I was unsure of what to expect from the mentor-mentee relationship, this guidance enabled me to understand the work I was doing better than I had in any previous experience. As a result, I gained confidence in my ability to successfully conduct research. As my first experience working in the field of plant biology, this internship was a great success. I have learned a lot about the dynamic of working in a research lab and about the impact and importance of plant biology research.

Intern Info

Year 2014

School Worcester Polytechnic Institute

Faculty Advisor Zhangjun Fei

Kevin Nguyen

Utilizing breakpoint detection algorithms to locate structural variations in wild and cultivated cucumbers

What makes each organism and species unique is its genetic makeup. The majority of these differences can be attributed to structural variations (SVs). Ranging in type from duplications to insertions and deletions, SVs influence which genotypes are present and thus which phenotypes will be expressed. In an effort to document which genes confer strengths or weaknesses, it is important that these SVs are identified and annotated. With this kind of information, evolution and population structures can be inferred, and the SVs can be utilized in marker-assisted breeding. In this project, the focus is on detecting deletions, a type of SV, in a cultivated cucumber, Cucumis sativus L., and a wild cucumber. Several years back, the genome of cucumber was successfully sequenced and annotated; recently the genome of the wild species was sequenced using next-generation methods. Using certain properties of the next-gen sequencing data, the two genomes can be aligned to each other. In order to detect the deletions, previously published literature was surveyed to find an appropriate algorithm. Pindel, developed by The Genome Institute at Washington University, was picked for detecting breakpoints in the alignment to locate and measure the deletions. Later on, other algorithms were implemented to extract and filter key information, such as breakpoint location and deletion length, based on certain parameters. The result is a list of genes and phenotypes that were lost in the wild type.

My experience at BTI is the first one that was more focused on the computational aspect of research. While I do have quite a bit of background in programming, it was the first time I used Perl and various tools such as Samtools and Pindel. It took some time to pick them up, but once I understood them I could see how useful and powerful they are in research. This was also my first time collaborating with a full lab. Previously, I was used to just one-on-one interactions with my mentor; this time I had a whole team to ask questions and work with. This served as a reminder that research is not a solo act but a group effort. Having the opportunity to work in Dr. Fei’s lab has strengthened my desire to commit to real world research that combines biology and computer science.

Intern Info

Year 2013

School University of Maryland

Faculty Advisor Zhangjun Fei

David Selassie Opoku

Identification of Virus Genome sequences from RNA-seq data of a field-grown tomato plant

Viral diseases in crops have a detrimental agricultural and economic impact globally, especially in developing countries. However, efforts to mitigate the impact of crop viruses are hampered by the lack of low-cost and efficient tools that can geographically detect and characterize crop viruses. With the recent advent of next generation sequencing technologies, novel methods could be developed to efficiently identify plant virus genomes by employing these technologies. In this project, we propose a novel method of deep sequencing plant transcriptomes (RNA-seq) to detect virus genomes. By de novo assembly of an RNA-seq dataset generated from fruits of an Ithaca field-grown tomato plant (cultivar M82), we were able to identify three virus genomes, although the plant did not show any visible disease symptoms: potato virus Y, southern tomato virus and tomato mosaic virus. The identified potato virus Y and southern tomato virus are the same as previously reported genomes (GenBank Acc#: X12456 and EF442780, respectively), while the tomato mosaic virus is a new isolate, which shares 86% nucleotide sequence identity with the previously reported genome (GenBank Acc#: AF332868). With this approach, it will be highly efficient to geographically identify and characterize virus genomes for major food crops, a key step towards the overall goal of reducing crop loss due to viral diseases.
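A nucleotide identity figure such as the 86% reported above is computed from a pairwise alignment. The Python sketch below shows one common way to do this, counting matching bases over alignment columns in which neither sequence has a gap. The two short aligned sequences are invented for illustration, and the choice of denominator is an assumption, since alignment tools differ on whether gap columns are included.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns where the two bases match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip columns that contain a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


# Hypothetical aligned fragments (not real viral sequences).
seq_new = "ACGT-ACGTA"
seq_ref = "ACGTTACCTA"
identity = percent_identity(seq_new, seq_ref)
print(f"{identity:.1f}% identity")
```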

My Experience

Working as a bioinformatics intern in the Fei lab at the Boyce Thompson Institute gave me the chance to finally combine my knowledge of biology and computer science in real world research, an opportunity not available at my liberal arts college. I enriched my skills in programming while learning new biological concepts and solidifying old ones through work with next generation sequencing tools. The most exciting part of this summer experience was the chance to work on the tomato virus genome project, a precursor to a bigger project on the identification of virus genomes in Pan-African sweet potato that the Fei lab will be working on. This opportunity not only strengthened my desire to study bioinformatics or computational biology at the graduate level but also, as a student from Ghana, opened up the possibility of focusing on plant and agricultural research in developing regions such as sub-Saharan Africa.

Intern Info

Year 2011

Faculty Advisor Zhangjun Fei

Research Overview

Honghe Sun

Xuebo Zhao

Xuanbo Zhang

Super-powered population genomics: Watermelon super-pangenome paves the way for precision breeding

Breeding a better cucumber: new genetic map reveals 171,892 structural variants

BTI, Meiogenix, and FFAR Announce $2 Million Breakthrough Tomato Genetics Collaboration

Decoding Sweetpotato DNA: New Research Reveals Surprising Ancestry

Study Reveals Role of Allele Dosage in Improving Sweetpotato Traits

Study Finds Genetic Mechanisms Behind High-Yield Apple Trees

Internships

Aaron Alexander

Generating a Phased Genome Assembly of the Hexaploid Sweetpotato Cultivar, ‘New Kawogo’

Grace Coppinger

Identification of structural variants affecting fruit quality traits between wild and cultivated watermelons

Mukund Gaur

Identification of Differentially Expressed Genes in F1 from the Cross between S. lycopersicum M82 and S. pennellii LA0716

Benjamin Beer

Adam Cason

“SpinachBasev2: An updated central portal for Spinach genomics tools and information”

Maddie Shaklee

“Finding Genome Regions Associated with Pepper Fruit Shape Using Genome-Wide Association Study”

Charis Qi

“Construction of a graph-based watermelon pan-genome and investigation of genetic variation in cultivated watermelon and its wild relatives”

Stefanos Stravoravdis

Identification of structural variations between genomes of cultivated tomato Solanum lycopersicum and its wild progenitor S. pimpinellifolium

Charles Wang

Identifying Bottle Gourd Genes Responsive to the Infection of Papaya Ringspot Virus

Patrick Yuan

Genetic Characterization of Cucumber (Cucumis sativus) Using Whole Genome Re-sequencing

Keeley Collins

“SpinachBase: A new database for spinach research and development”

Christopher Neely

“Pan-genomic analysis of Solanum habrochaites, a wild tomato plant”

Sophia Hu

“Identifying long non-coding RNA, misannotated and novel genes in the watermelon genome using PacBio Iso-Seq”

Peter Kohler

“Ethylene mediated epigenome changes in ripening climacteric melon”

Michael Morikone

Identification of Long Non-Coding RNA and Alternative Splicing Events During Tomato Fruit Development

Angela Taylor

Exploring Tomato Virus Diversity in China using Deep Small RNA Sequencing