Lukas Mueller
Professor
Developing databases and software tools that organize plant genetics data, helping scientists and breeders accelerate crop improvement—especially for staple crops in food-insecure regions.
How can genomics contribute to improved crop breeding?
Adjunct Professor
Section of Plant Breeding and Genetics
School of Integrative Plant Science
Cornell University
The evolutionary dynamics of genetic mutational load throughout tomato domestication history
Razifard, H., Visa, S., Menda, N., Mueller, L., Tieman, D., van der Knaap, E. and Caicedo, A.L.
Sustaining public plant breeding programs across generations
Hale, I., Koebernick, J., Hershberger, J., Rife, T., Arbelaez, J.-D., Anderson, N., Bekkerman, A., Bohn, M., Bourland, F., Burke, T., Chee, P., Evans, K., Fumia, N., Feldmann, M., Gasic, K., Hague, S., Heilman-Morales, A. M., Kemp, A. H., Iglesias, C., Mueller, L., … Kantar, M.
BrAPI v2: A unified framework for data integration and collaboration for breeding and genetic resources
P Selby, R Abbeloos, AF Adam-Blondon, FJ Agosto-Pérez, M Alaux, I Alic, ...
Research Overview
In recent years, technological advances in fields such as sequencing have transformed certain aspects of biology into an information-based discipline. To make this abundance of data—often called Big Data—useful to researchers and breeders, it needs to be organized and made accessible. Towards this goal, the Mueller lab designs and implements databases that assist scientists in their research and plant breeders in more efficient crop improvement.
Our databases and software make transcriptomic, genotypic and phenotypic data from thousands of experiments accessible to the public, often focusing on under-researched staple crops from food-insecure regions. A method called Genomic Selection that uses high-throughput genotyping technologies, such as genotyping-by-sequencing (GBS), and large phenotyping data sets allows for rapid prediction of desirable traits in new plant crosses.
Based on these tools, the Mueller laboratory collaborates on a variety of projects. With the Nextgen Cassava project, we have created Cassavabase, a database specifically designed for cassava breeders in Africa. We coordinate the Solanaceae Genomics Network, a compilation of all the genetic information known about solanaceous plants, such as tomato, petunia, and Nicotiana. We are also developing breeding databases for yam, sweet potato, and the cooking banana. Finally, the Mueller group is involved in multiple genome sequencing projects, including tomato, coffee, petunia, and Nicotiana benthamiana.
BreedBase
Breedbase is a comprehensive breeding management and analysis software. It can be used to design field layouts, collect phenotypic and other information using Phenoapps software, support the collection of genotyping samples in a field, store large amounts of high density genotypic information, and provide Genomic Selection related analyses and predictions.
Cassavabase
Access to data and tools for breeders and researchers, including genomic selection algorithms and analysis capacity, a cassava genome browser, cassava ontology tools, phenotyping tools, and social networking.
Citrus Greening Solutions
A systems-based pipeline approach for delivering commercial, grove-deployable solutions using a novel therapeutic delivery strategy and citrus transgenics.
Musabase
A breeding database designed to support advanced methods in banana breeding.
Rtbbase
A collection of Root Tubers & Banana Databases, which hold genomic and phenotypic information for next generation breeding applications.
Sol Genomics Network
A site for genome data of Solanaceae species such as tomato, potato, and pepper, including data related to the tomato genome sequencing project.
Sweetpotatobase
Part of the Genomics Tools for Sweet Potato (GT4SP) Improvement Project focused on developing a set of “next generation” breeder tools for sweetpotato breeders in Africa.
Yambase
A database of breeding data for yam (genus Dioscorea). Yam species used for breeding include Dioscorea rotundata and Dioscorea cayenensis (both native to Africa and the major cultivated species), Dioscorea alata (native to Southeast Asia), and Dioscorea praehensilis, as well as several other species.
Nicotiana benthamiana
A project to improve the assembly of the Nicotiana benthamiana genome sequence and link the sequence to the nineteen chromosomes. We are also working to improve the gene annotation.
Lab Members
Chris Costa Simoes
Senior Research Associate
Srikanth Karaikal
Benjamin Maza
Bioinformatics Analyst
Naama Menda
Postdoctoral Scientist
Christine Nyaga
Ryan Preble
Bioinformatics Analyst
Titima Tantikanjana
Senior Research Associate
In the News
Breedbase software to help speed crop improvement
To help plant breeders speed crop improvement around the world, Lukas Mueller of the Boyce Thompson Institute worked with an international team of 57 people to create Breedbase, a database software that was described...
Wild tomato genome will benefit domesticated cousins
Wild relatives of crops are becoming increasingly valuable to plant researchers and breeders. During the process of domestication, crops tend to lose many genes, but wild relatives often retain genes...
Tomato’s Wild Ancestor Is a Genomic Reservoir for Plant Breeders
Thousands of years ago, people in the region now known as South America began domesticating Solanum pimpinellifolium, a weedy plant with small, intensely flavored fruit. Over time, the plant evolved into S....
ITHACA, N.Y. – For the first time, the genome of the tomato, Solanum lycopersicum, has been decoded. It becomes an important step toward improving yield, nutrition, disease resistance, taste and color...
Internships
BTI offers a summer research experience program for undergraduate and high school students.
Intern Projects in the Mueller Lab
Interns in the Mueller Lab work on a variety of bioinformatics and genomics projects and gain experience in the following areas: genome assembly, structural and functional annotation, biochemical pathways, comparative genomics, ontology development, and data presentation and visualization.
Previous Interns
Ana Sofia Castellanos Mosquera
CRISPGET: The CRISPR Guide Evaluator Tool to support guide selection based on off-target scoring systems
The CRISPR/Cas nuclease systems are game changers among genome editing tools. They were discovered as part of the bacterial immune system, and since then many meaningful applications have been developed. In crop improvement, for example, they have been applied to cereal crops to increase the number of rice grains and to add herbicide and plant disease resistance. When designing these CRISPR/Cas systems, scientists should consider the specificity of the guide chosen for the Cas endonuclease because it may create off-targets, which are cuts made by Cas outside the desired target gene. Such off-targets can create confounders for researchers and therefore should be avoided.
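As a rough illustration of what off-target screening involves, the Python sketch below counts mismatches between a guide and candidate genomic sites and flags near-matches. The sequences, the function names, and the three-mismatch cutoff are assumptions made for illustration; the sketch does not reproduce CRISPGET or any published off-target scoring system.

```python
def count_mismatches(guide: str, site: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    return sum(1 for g, s in zip(guide.upper(), site.upper()) if g != s)


def potential_off_targets(guide, sites, max_mismatches=3):
    """Return (site, mismatches) for sites close to the guide but not identical.

    The default cutoff of 3 mismatches is an illustrative assumption.
    """
    hits = []
    for site in sites:
        mismatches = count_mismatches(guide, site)
        if 0 < mismatches <= max_mismatches:
            hits.append((site, mismatches))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical 20-nt guide and candidate sites (made-up sequences).
GUIDE = "GACGTTAGCCTAGGATCCAA"
CANDIDATES = [
    "GACGTTAGCCTAGGATCCAA",  # intended target, identical to the guide
    "GACGTTAGCCTAGGATCCAT",  # differs at the last position
    "GACCTTAGGCTAGGATCGAA",  # differs at three positions
    "T" * 20,                # unrelated sequence
]

if __name__ == "__main__":
    for site, mismatches in potential_off_targets(GUIDE, CANDIDATES):
        print(site, mismatches)
```

Real scoring systems typically weight mismatches by position and type rather than counting them equally, so a plain count like this is only a first filter.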
This summer internship was a very significant experience in my life. I gained skills in developing web applications and increased my confidence to build software and face challenges in bioinformatics. In the beginning, I thought this would be just an academic experience. However, after 10 weeks, I realized that I had both acquired skills in software development and grown a lot as a human being. This was an opportunity to get to know myself better and to strengthen my resilience and assertive communication. I also met different people who made me realize that science should be done in a collaborative and interdisciplinary way. I am so grateful for the opportunity to be part of the Mueller Lab and to my mentor Adrian Powell, who was there to support me and gave me the tools to successfully develop the project and learn more about bioinformatics.
Nistar Steinerman
Auditing Database Tables in Breedbase
In an era when genetic sequencing and phenotyping are quickly producing vast amounts of data, database systems such as Breedbase are becoming increasingly necessary in genomics research. Websites that use the Breedbase ecosystem support a range of tasks related to crop breeding, which enact operations on the database in the form of insertions, updates, and deletions of data. Breedbase users requested that a record of these changes be made available on the website as a reference, for archival purposes and correction of errors. This feature was built by running a PostgreSQL database patch, which creates audit tables and inserts audit data every time a change is made to the database. These data, including the timestamp, type of operation, username of the account that committed the change, and state before and after the change, are displayed in dedicated audit pages and sections on the website. These webpages were designed using HTML and JavaScript/jQuery, and database connections are handled via a controller written in Perl. With the advent of audit tables in Breedbase, users will be able to handle their data more efficiently and securely. This is impactful as Breedbase programs mainly focus on staple crops such as cassava, which are integral to food security and economic prosperity.
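The trigger-based auditing described above can be sketched in a few lines of Python. This is a minimal illustration, not the Breedbase patch: it uses an in-memory SQLite database as a stand-in for PostgreSQL, and the table names (`stock`, `stock_audit`, `app_session`) and columns are hypothetical. Because SQLite has no notion of a logged-in database user, a one-row `app_session` table stands in for the account name.

```python
import sqlite3

# Hypothetical schema: one audited table, one audit table, three triggers.
SCHEMA = """
CREATE TABLE stock (stock_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE app_session (username TEXT NOT NULL);
CREATE TABLE stock_audit (
    audit_id   INTEGER PRIMARY KEY,
    changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    operation  TEXT NOT NULL,
    username   TEXT,
    old_value  TEXT,
    new_value  TEXT
);
CREATE TRIGGER stock_ins AFTER INSERT ON stock BEGIN
    INSERT INTO stock_audit (operation, username, old_value, new_value)
    VALUES ('INSERT', (SELECT username FROM app_session), NULL, NEW.name);
END;
CREATE TRIGGER stock_upd AFTER UPDATE ON stock BEGIN
    INSERT INTO stock_audit (operation, username, old_value, new_value)
    VALUES ('UPDATE', (SELECT username FROM app_session), OLD.name, NEW.name);
END;
CREATE TRIGGER stock_del AFTER DELETE ON stock BEGIN
    INSERT INTO stock_audit (operation, username, old_value, new_value)
    VALUES ('DELETE', (SELECT username FROM app_session), OLD.name, NULL);
END;
"""


def build_demo_db():
    """Create the in-memory database and register a demo user."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO app_session (username) VALUES ('demo_user')")
    return conn


def audit_log(conn):
    """Return the audit trail in the order the changes were made."""
    return conn.execute(
        "SELECT operation, username, old_value, new_value "
        "FROM stock_audit ORDER BY audit_id"
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    conn.execute("INSERT INTO stock (name) VALUES ('accession_A')")
    conn.execute("UPDATE stock SET name = 'accession_B' WHERE name = 'accession_A'")
    conn.execute("DELETE FROM stock WHERE name = 'accession_B'")
    for row in audit_log(conn):
        print(row)
```

Keeping the audit logic in triggers means every insertion, update, and deletion is recorded regardless of which part of the application issued it.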
The BTI REU program has been such a wonderful, educational, and memorable experience. Coming from a university where opportunities for plant science research are limited, it has been very exciting to join a bustling, vibrant plant research community. I am still honing my academic focus, and working in the Mueller Lab, a bioinformatics research group, has been helpful with learning about how computer science and biology intersect. It was also fascinating to observe how an interdisciplinary lab operates and collaborates with other labs at BTI and with partner institutions. Other aspects of the REU program, such as the weekly seminars and science communication workshops, have been very impactful for me in the way that I think about research. I think this research experience has equipped me with skills and perspectives that will be invaluable as I continue on my academic journey. I am so grateful to my mentors, Adrian Powell and Lukas Mueller, as well as Megan Truesdail and Delanie Sickler, for their efforts in coordinating this internship.
Ariel Maroney
My Experience:
Spending my summer at the Boyce Thompson Institute taught me so much about agricultural biotechnology. I was assigned to the Mueller lab under Alex Ogbonna, from whom I learned about bioinformatics and the steps of processing big data in the field of biology. Specifically, we considered and began charting the historical origins and spread of the cassava plant through Brazil. I also spent time in other labs, in order to get a rounded experience. One of my favorites was harvesting tomato seeds in the Giovannoni lab, where I participated in processes like measuring brix values and performing gel electrophoresis. Another favorite was dissecting parasite-infected bees in the Dyce/McArt lab, in order to harvest the parasites to further ongoing studies of how to protect pollinators. It was truly incredible to get hands-on experiences in all these different labs, as it gave me a broad feel for what a future in the field would be like. My biggest takeaway from my internship is how much work goes into even the smallest scientific advancement. It takes a lot of people and resources to continue progress towards any new discovery, and every bit of research is dependent on generations of people who have dedicated their lives to further understanding and modifying our world. As I go on into college, the valuable experiences I had at BTI will certainly inform and advance my work as I pursue a biology-related degree.
Arianna Kazemi
The Sweet Potato Expression Atlas
The domesticated sweet potato (Ipomoea batatas) is a staple food, particularly in sub-Saharan Africa. It has many nutritional benefits, including high carbohydrate, fiber, and vitamin content. Ensuring a hearty and healthy sweet potato crop is therefore crucial to maintaining this important food source. Current threats to sweet potato yield include drought, certain fungal diseases including root rot, and sweet potato weevils, incredibly damaging pests. Creating a crop that is both resistant to these threats and has greater starch and protein yields could be possible by turning to the 14 crop wild relatives of sweet potato, each of which has unique traits that could improve sweet potato growth. The genomes of two of these relatives, I. trifida and I. triloba, have recently been sequenced. Using these genomes and existing transcriptomic data collected from these relatives under a variety of stress conditions, genes and networks of interest can be identified for further study.
Such study is made more accessible through the creation of a sweet potato gene expression atlas, a web-based tool that allows users to view levels of gene expression in these two species through a variety of graphical tools. With the atlas, the location and conditions under which genes are differentially expressed can be visualized while coregulated genes can be easily identified, suggesting potential relations and pathways. This tool was created with HTML and JavaScript making up the webpage and a Perl controller accessing a PostgreSQL database. The gene expression atlas will make it easier to develop hypotheses about gene function and differences between these two species of Ipomoea, information which could lead to discoveries that can increase I. batatas yields and maintain this staple crop’s productivity under conditions of biotic and abiotic stress.
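One common way to look for coregulated genes is to correlate expression profiles across conditions. The Python sketch below illustrates that idea only; it is not the atlas code. The gene names, expression values, and the 0.9 correlation threshold are made up for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-variation signal
    return cov / sqrt(var_x * var_y)


def coexpressed(profiles, query, threshold=0.9):
    """Genes whose profile correlates with the query gene at or above threshold."""
    target = profiles[query]
    return sorted(
        gene
        for gene, values in profiles.items()
        if gene != query and pearson(target, values) >= threshold
    )


# Hypothetical expression values across four conditions.
PROFILES = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.1, 5.9, 8.0],  # rises with gene_a
    "gene_c": [4.0, 3.0, 2.0, 1.0],  # falls as gene_a rises
    "gene_d": [5.0, 5.0, 5.0, 5.0],  # constant
}

if __name__ == "__main__":
    print(coexpressed(PROFILES, "gene_a"))
```

Correlation alone does not establish a shared pathway, which is why the atlas is described as a tool for developing hypotheses rather than confirming them.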
My Experience
Through my work in the Mueller Lab this summer, I gained a great appreciation for open and accessible science. Being in an environment where our work aimed to make research available to all, which could, in turn, not only improve crop production but also benefit lives, gave this project a clear purpose. Seeing the collaborative nature of the lab was also motivating, especially since the Mueller Lab is a non-traditional computational group, and it demonstrated to me that bioinformatics requires just as much communication as traditional research. This sense of working for a common goal will stay with me as I continue my career as a scientist.
At BTI, I learned new computing languages and became very familiar with the troubleshooting process. With the help of my mentor, I feel more confident in my abilities as a computational biologist and have strengthened my desire to work on similar projects that combine technology with biology. Through my experiences both in the lab and with my fellow students, I feel more assured about continuing with my educational journey in graduate school and am grateful to everyone who made this program possible.
Tosin Olayinka
Genomic Analysis of NLR and PR Genes in Coffea arabica and Its Ancestral Parents
Coffee is one of the world’s most widely consumed beverages and an economically important crop for both its growers and its processors. Commercially grown coffee is usually one of two species – Coffea arabica or Coffea canephora. These two species differ in terms of flavor, favored climates, and, most importantly, disease resistance. C. arabica is the more desired of the two species due to its sweeter, less bitter taste and its more delicate flavor. However, C. canephora is easier to take care of and has superior general disease resistance.
C. arabica is an allotetraploid – a species resulting from a hybridization event between two different diploid parental species – and is the result of an ancient cross between Coffea eugenioides and the aforementioned C. canephora. This hybridization is interesting because both the allotetraploid and one of the diploid parents are widely cultivated and, of the two, it is the diploid which has superior disease resistance. This is unusual because, in general, polyploid plants tend to be more robust than their diploid ancestors as a result of increased diversity in key gene families. Using tools from comparative genomics, we can determine the potential mechanisms behind this discrepancy.
To do this, we looked at two gene families heavily involved in plant immunity to pests and disease: nucleotide-binding leucine-rich repeat (NLR) genes and pathogenesis-related (PR) genes. Using high-quality functional gene annotations of the three species as a base, NLR genes were identified using specific DNA motifs as signals, and PR genes were identified using their associated InterPro IDs. In addition, genes in C. arabica were assigned to the subgenome derived from either C. eugenioides or C. canephora. With this information, we compared the C. arabica subgenomes with their respective ancestral species at both the genome-wide and gene-family scale, in terms of gene content, orthology, and synteny. With this, we were able to show that the C. arabica genome contains fewer NLR genes, with less diversity, than its parents.
Finally, expression data from two varieties of C. arabica were compared: one from a variety that is resistant to coffee berry disease (Catimor) and the other from a variety that is not (Caturra). These were compared across three different time points and two conditions (infected and non-infected), and genes that were differentially expressed between susceptible and resistant varieties were identified. In addition, genes were clustered based on similar changes in expression over time and graphed together using the TCseq Bioconductor package. For both the time series and differential expression analyses, gene ontology enrichment was performed using the topGO package in R and used to better interpret the data.
My Experience
This summer has been an interesting experience both inside and outside of the lab. While I do have experience working with next-generation sequencing (NGS) data, I have not studied an organism with the same breadth and depth that I did this summer with the three Coffea species. Linking different experiments and statistics to a certain hypothesis was both the most frustrating and the most satisfying part of my time here. The confusion coming from an unexpected figure and the satisfaction coming from producing elegant results both helped me grow and become a better scientist.
I was also lucky enough to be surrounded by a cohort of students, professors, and researchers who were passionate about a wide range of different topics – both inside and outside the realm of science. People who, when asked, were eager to help me and others out. Hopefully, I am able to do the same for others in the near and far future.
Thomas Chan
“Marker & Pedigree Hybrid Visualization Tool”
Project Summary:
Cassava (Manihot esculenta) is a staple crop and key source of carbohydrates in many tropical regions of the world. Inbreeding depression has occurred in most cassava populations, resulting in mutations that have reduced yield. In turn, these deleterious mutations pose a threat to food security in developing regions. The creation of a haplotype-pedigree hybrid visualization tool will greatly help cassava breeders perform marker-assisted selection and eliminate deleterious alleles.
The tool was created using JavaScript and HTML for the web page, D3 for the figure, and Perl and PostgreSQL to access the database. The first phase of the project included designing the website and writing functions to collect user input. The second phase focused on the back end, such as using user input to retrieve the genetic marker information from the database and editing the pedigree tool to support marker visualization. When using the tool, users must first specify a list of accessions (or populations) and a genotyping protocol to yield a list of the shared genetic markers. Upon selecting up to seven markers from the list, users can then display the available pedigree of the accessions. These are displayed as parents and children, with color-coded markers and dosage values displayed to the right of the respective node.
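To make the data flow concrete, the Python sketch below pairs each accession in a small pedigree with the dosage values of the selected markers, which is roughly the information the visualizer places next to each node. It is an illustration under assumptions, not the tool's code: the accession names, marker names, dosage values, and data structures are hypothetical, and only the seven-marker limit comes from the description above.

```python
MAX_MARKERS = 7  # the description above limits a selection to seven markers


def pedigree_rows(pedigree, dosages, markers):
    """Build one text row per accession: parents plus dosage of each marker.

    pedigree maps accession -> (female_parent, male_parent), with
    (None, None) for founders; dosages maps accession -> {marker: dosage}.
    """
    if not 1 <= len(markers) <= MAX_MARKERS:
        raise ValueError(f"select between 1 and {MAX_MARKERS} markers")
    rows = []
    for accession in sorted(pedigree):
        female, male = pedigree[accession]
        parents = f"{female} x {male}" if female and male else "founder"
        calls = dosages.get(accession, {})
        label = ", ".join(f"{m}={calls.get(m, 'NA')}" for m in markers)
        rows.append(f"{accession} ({parents}): {label}")
    return rows


# Hypothetical accessions and marker dosages (0, 1, or 2 copies of an allele).
PEDIGREE = {
    "parent_A": (None, None),
    "parent_B": (None, None),
    "child_1": ("parent_A", "parent_B"),
}
DOSAGES = {
    "parent_A": {"marker_1": 2, "marker_2": 0},
    "parent_B": {"marker_1": 0, "marker_2": 1},
    "child_1": {"marker_1": 1, "marker_2": 1},
}

if __name__ == "__main__":
    for row in pedigree_rows(PEDIGREE, DOSAGES, ["marker_1", "marker_2"]):
        print(row)
```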
The visualizer will help breeders follow the flow of alleles through accession crosses and help inform future breeding decisions. Future additions include genetic linkage data for the chosen markers and display of allele pairs with the marker name. The tool can also be applied to other crops to eliminate deleterious alleles or to perform marker-assisted breeding.
My Experience:
The Plant Genome Research Program Internship allowed me to explore my interest in both biology and computer science in a way that can create meaningful change. During my time at the Boyce Thompson Institute, I worked on both the user interface and back end of the application; it was especially rewarding to see the tool through from beginning to end. Furthermore, I was introduced to David Lyon and Guillaume Bauchet, my two mentors, who provided wisdom and guidance throughout the development process and proved to me that three minds are better than one. This summer research experience has opened my eyes to the possibilities of biological and computational applications to the real world and has solidified my devotion to using technology for the advancement of human health and well-being. Needless to say, I am forever thankful for the opportunities I’ve been afforded by the program.
Nicolas Dufour
“Improving Genotypic Data Storage Using The Hadoop Cluster”
Project Summary:
Genotypic data is a clear example of a data type that can grow so large that it becomes difficult to store and query efficiently and reliably. Genotypic data is becoming increasingly important for the identification and analysis of genes in research, breeding and medicine, so optimizing the infrastructure for such analysis has become crucial for these endeavors. In this work, we used Hadoop, a distributed storage and processing technology that can scale with data and computation growth. Hadoop is used in industry to facilitate access to “Big Data” and has also been proposed as an efficient genotype storage solution. In addition, the Hadoop Distributed File System (HDFS) offers built-in redundancy for increased reliability compared to a traditional file system. In our implementation, we chose the Parquet file format to store genotypic data, which has the benefit of fast columnar access and better compression. The exact structure of the Parquet file had a strong effect on performance; for example, query speed depended on whether the storage matrix was transposed. Spark and its Spark SQL module were used to query and provide some statistics on our data. To make the data accessible to the outside world, an Application Programming Interface (API) was developed.
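Why the orientation of the storage matrix matters can be shown with a toy model of a columnar store in plain Python. This is only a sketch under assumptions: it does not use Parquet, Hadoop, or Spark, the sample and marker names and dosage values are made up, and "values read" is a simplified stand-in for I/O cost that ignores real-world features such as row groups and compression.

```python
def to_columns(rows, column_names):
    """Store a row-major table column by column, as a columnar file would."""
    return {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


def transpose(rows):
    """Swap the rows and columns of a rectangular table."""
    return [list(column) for column in zip(*rows)]


def genotypes_for_marker(columns, marker, marker_names, markers_are_columns):
    """Return (genotypes, values_read) for one marker across all samples."""
    if markers_are_columns:
        values = columns[marker]      # a single column holds the whole answer
        return list(values), len(values)
    row = marker_names.index(marker)  # the marker is one entry in every column
    genotypes = [column[row] for column in columns.values()]
    values_read = sum(len(column) for column in columns.values())
    return genotypes, values_read


# Hypothetical dosage matrix: rows are samples, columns are markers.
SAMPLES = ["s1", "s2", "s3"]
MARKERS = ["m1", "m2", "m3", "m4"]
DOSAGE = [
    [0, 1, 2, 0],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
]

BY_MARKER = to_columns(DOSAGE, MARKERS)             # one column per marker
BY_SAMPLE = to_columns(transpose(DOSAGE), SAMPLES)  # one column per sample

if __name__ == "__main__":
    print(genotypes_for_marker(BY_MARKER, "m2", MARKERS, True))
    print(genotypes_for_marker(BY_SAMPLE, "m2", MARKERS, False))
```

Both layouts return the same genotypes, but in this model the per-marker query touches one column in the first layout and every column in the second. A per-sample query would show the opposite pattern, so the better orientation depends on which kind of query dominates.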
My Experience:
These past two months have been very interesting. I have learned a lot about the research world, and it makes me want to pursue a research career. BTI gave us the opportunity to work on very interesting research projects. I enjoyed the freedom, and it was the first time I have worked on a single project for that much time. I learned a lot by trying to figure out how to make my project work. I’m glad to have learned technologies like Hadoop and Spark that are cutting edge in today’s computer science. I would like to thank Lukas Mueller and Nick Morales for all the help they provided. This was also an interesting personal experience because I got to live for two months in the US, which opened my eyes to a lot of cultural differences.
Alice Hu
“Analysis of Iochroma cyaneum Gene Families”
Project Summary:
Gene duplication and loss is an important force driving the evolution of species. The resulting gene families are essential to understanding functional differences between species. We looked at the Iochroma cyaneum genome, which had not yet been well characterized, and compared it to other plant species (the genomes of Arabidopsis thaliana, cultivated tomato, coffee, a wild relative of tomato, potato, eggplant, pepper, Petunia axillaris, morning glory, grape, and rice) to clarify its position on the phylogenetic tree and identify gene families that seem to be rapidly evolving. With the assistance of multiple computer programs, we were able to incorporate the genes from the listed species into gene family analysis. I. cyaneum is an important species, since it has been used to study the evolution of floral morphology and pigment and contains various phytochemicals with potential pharmaceutical uses. Using BUSCO (Benchmarking Universal Single-Copy Orthologs), we assessed the completeness of the I. cyaneum genome and annotation, and displayed the results in a genome and a protein bar chart. The results of the OrthoFinder program were compared to previous results from an older program, OrthoMCL, to detect contractions and expansions of Iochroma gene families and assess differences in gene family detection between the two software packages. We ran KinFin on the OrthoFinder results to associate representative functions to orthogroups, and RAxML to generate the phylogenetic tree. The tree showed that Iochroma was most closely related to pepper.
My Experience:
My experience at BTI this summer has been exciting and informative. Before this internship, I had never worked in a research environment or a professional setting. Throughout these past six weeks, I’ve gained a new interest in bioinformatics and computer science and learned to use the resources I have to my advantage. I would like to thank my mentor, Suzy Strickler, for guiding me through my research and exposing me to the world of scientific research. I am very thankful for this incredible opportunity to participate in this internship.
Matthew Larson
“Developing a Data Comparison Tool for Musabase and MGIS Using the BrAPI Interface”
Project Summary:
The Mueller Lab has developed a breeding database for bananas, known as Musabase, to help breeders create new varieties for farmers. To extend its data and knowledge to farmers and breeders worldwide, the Mueller Lab is interested in establishing a data connection and comparator between Musabase and another large banana germplasm database, the Musa Germplasm Information System (MGIS). MGIS is the most extensive source of banana genetic resources and is actively used by breeders from the southern hemisphere to perform crosses and improve current varieties for smallholder farmers. For this project, I will be assisting in the development of a data comparison tool for the Musabase and MGIS databases using the BrAPI interface. Establishing communication between the Musabase and MGIS databases would help breeders access information from both websites through a single interface and compare information between MGIS and Musabase to assess data quality. My plan is to use the current BrAPI interface to create a user-friendly interface for breeders to select and compare imported information from both Musabase and MGIS. To accomplish this, I will build a command-line tool that lets breeders request and retrieve data from both sides. Once data is retrieved, I will need to visualize the data for comparison on the user interface of Musabase. The user interface will also offer the option to store information in the Musabase database.
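The comparison step could look something like the following Python sketch, which works on lists of germplasm records of the kind a BrAPI germplasm call returns. It is an illustration only: no network requests are made, the sample records are invented, and matching records by lower-cased `germplasmName` is an assumption rather than the method used in the project. The field names `germplasmName`, `accessionNumber`, and `species` follow the BrAPI germplasm object.

```python
def index_by_name(records):
    """Index germplasm records by a normalized (lower-cased) name."""
    return {rec["germplasmName"].strip().lower(): rec for rec in records}


def compare_sources(source_a, source_b, fields=("accessionNumber", "species")):
    """Report records unique to each source and field-level disagreements."""
    a, b = index_by_name(source_a), index_by_name(source_b)
    report = {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "conflicts": {},
    }
    for name in sorted(a.keys() & b.keys()):
        diffs = {
            field: (a[name].get(field), b[name].get(field))
            for field in fields
            if a[name].get(field) != b[name].get(field)
        }
        if diffs:
            report["conflicts"][name] = diffs
    return report


# Invented sample records standing in for responses from the two databases.
MUSABASE_SAMPLE = [
    {"germplasmName": "Cultivar One", "accessionNumber": "ACC001", "species": "acuminata"},
    {"germplasmName": "Cultivar Two", "accessionNumber": "ACC002", "species": "balbisiana"},
]
MGIS_SAMPLE = [
    {"germplasmName": "cultivar one", "accessionNumber": "ACC001", "species": "acuminata"},
    {"germplasmName": "Cultivar Two", "accessionNumber": "ACC099", "species": "balbisiana"},
    {"germplasmName": "Cultivar Three", "accessionNumber": "ACC003", "species": "acuminata"},
]

if __name__ == "__main__":
    print(compare_sources(MUSABASE_SAMPLE, MGIS_SAMPLE))
```

A report in this shape separates two different data-quality questions: which accessions are missing from one database, and which shared accessions carry conflicting values.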
My Experience:
As a high school intern at Boyce Thompson Institute, I am deeply grateful that I was able to partake in this opportunity to enrich myself in plant biology and research. Immersing myself in a research environment enabled me to apply information rather than just absorb it. My time at BTI also broadened my knowledge of using computational tools to understand and analyze data. This internship was monumental to my decision to pursue the field of biology in college. I would like to thank my mentors, Guillaume Bauchet and Nicolas Morales, for this enriching experience.
Celine Manigbas
“Analysis of the Asclepias syriaca Genome and Gene Families”
Project Summary:
Asclepias syriaca, known as the common milkweed, is found throughout northeastern and southeastern parts of the United States. Cardenolides are a subclass of cardiac glycosides found in Asclepias, and they contain steroidal toxins poisonous to insects and animals when consumed. The larvae of the monarch butterfly, however, utilize Asclepias as their main food source and protection. There is a lack of high-quality genomic information for A. syriaca with which to explore the cardenolide biosynthetic pathway and to make comparisons with other species in the Apocynaceae family that do not produce cardiac glycosides. The genome of A. syriaca was recently sequenced by the Jander lab using PacBio with >300x coverage, generating longer reads than in the published genome of A. syriaca, and assembled using Falcon Assembler. This assembly could provide more genomic information for annotation and gene prediction, which could in turn support further genomic research on milkweeds, their evolution, and similar plants.
The assembly was error corrected using Arrow and was repeat-masked. RNA-seq data mapped to the genome and error corrected using Mikado and Portcullis were used to train the ab initio gene predictors SNAP and AUGUSTUS. These predictors were used in the MAKER pipeline along with RNA and protein evidence to synthesize the data into structural gene annotations. Blast2GO was used for functional annotation. The gene families of the published Asclepias syriaca genome, Catharanthus roseus, Rhazya stricta, Coffea canephora, Theobroma cacao, and Solanum lycopersicum were then identified using OrthoFinder. KinFin was used to associate functions to the orthogroups. Gene family expansion was identified using CAFE.
My Experience:
This summer challenged me and taught me a lot about the field of bioinformatics research. This REU was my first exposure to working with Big Data of plant genomes. Prior to this internship, I had very limited experience in bioinformatics. Along the way, I learned a lot about this ever-changing and advancing field through the use of different programs, and it was exciting to work with cutting-edge programs on research that is relevant to the real world. Before coming here, I was not so clear on the future path I wanted to take or whether computational research was the route for me. But after this experience, I realized that I am truly interested in going into the field of bioinformatics.
Asha Duhan
“Identification of a genomic region in Solanum lycopersicoides associated with resistance to Pseudomonas syringae pv. tomato”
Project Summary:
Tomato (Solanum lycopersicum) is an economically important crop and a nutritious staple used to enrich diets. Speck is a disease caused by Pseudomonas syringae pv. tomato that specifically targets tomatoes, resulting in dark spots on the tomato fruit and leaves. Speck can be lethal to tomatoes and results in a drastic decrease in fruit yield and marketability. An abundance of natural variation, including resistance to disease, exists in wild relatives of tomato, and many species in the tomato clade can be crossed with cultivated tomato.
In this project, S. lycopersicoides LA2951, a speck-resistant accession, was analyzed to determine differences in the genome to that of tomato, such as structural differences that may
The future goals of this project are to narrow down the large resistance locus to a few genes that can later be isolated and bred into the tomato line. This research could lead to a greater understanding of plant-pathogen interactions and serve as a model for plant resistance.
My Experience:
I really enjoyed my internship at BTI. Coming into the program, I had minimal knowledge of the field of bioinformatics. As the program progressed, I gained invaluable experience and exposure in both bioinformatics and plant laboratory research. I was fortunate to be able to experience both, and both aspects of my project were extremely helpful and further enriched my knowledge of how research is conducted. I would like to thank my mentors, Suzy Strickler, Adrian Powell and Sammy Mainiero, for helping me with my research project and for introducing me to the world of scientific research. This internship has furthered my interest and curiosity in discovery-driven science, and I am now considering pursuing biology or a related field in college.
Intern Info
Alexander Ivanov
“BrAPI connecting the world’s plant breeding data through a mobile application”
Project Summary:
Breeding technology is advancing very rapidly, and it is important to stay connected between databases to share information faster. It is also important that researchers stay connected so that information can be revised and improved. My application does just that.
My application is built in Android Studio and lets the user search the different databases listed at brapi.org. First, the app checks that the user has an internet connection; if not, the user is notified immediately. Next, I added intents, abstract descriptions of an operation to be performed, to transition from one page to the next whenever the user chooses a database or category before viewing the data text file. If the user does not see the information they need, they can change the page size by clicking on the text box below to get more results, and specify the information they need on the specifications page.
After completing the project, I had learned how to build a simple Android application, how two activities are connected, how to check for or establish an internet connection, and how to design stylish pages. My mentor and I created an application that allows any user who is logged in to brapi.org to connect to and query the available plant breeding databases much faster than with standard search methods. In the near future, I plan to release the beta version of this free application on the Google Play store for all users.
My Experience:
I gained many experiences from this internship that I realize are necessary for me to start a computer programming career. This internship helped me understand that not everything can be solved right away, and that it is often necessary to communicate with others to discuss problems that need to be resolved immediately.
Before entering this program, I never fully understood the purpose of programming, and often failed to seek help from others. Now that I have created an app that has practical applications, I understand that programming requires dedication, and that there is a plethora of unknown surprises when creating an app, so asking for help is just a matter of time. I want to thank my mentor, Nicolas Morales, for always helping me find ways of solving complex problems with simple code, and for helping me understand that trying to resolve everything on my own is not always the best option.
Intern Info
Anna Yaschenko
“Characterizing long non-coding RNA in the Asian Citrus Psyllid through genome annotation and molecular biology”
Project Summary:
Huanglongbing (HLB), or citrus greening disease, is jeopardizing the citrus industry around the world. HLB is an incurable disease of citrus that results in yield loss, decline of tree health and ultimately death of the plant. It is associated with the pathogen Candidatus Liberibacter asiaticus (CLas), which is spread by Diaphorina citri, or the Asian citrus psyllid (ACP), when psyllids harboring CLas feed on healthy citrus. It is important to note that in order to be effective vectors of CLas, psyllids must acquire CLas as nymphs, the juvenile life stage of the psyllid. My project focuses on interactions between CLas and ACP nymphs, specifically investigating the role of long noncoding RNA (lncRNA). Recent work by the Heck lab identified 83 differentially expressed lncRNAs in the gut of adult ACPs, when exposed to CLas as nymphs, through multi-omics. By identifying and characterizing these lncRNA, we can better understand the relationship between the vector ACP and the pathogen CLas. I identified lncRNA in the ACP genome using multiple bioinformatics pipelines and validated the presence of lncRNA in nymphs with reverse transcription polymerase chain reaction (RT-PCR) and cloning methods. Specific lncRNA found in the adult ACP gut were confirmed to be present in nymphs. Interestingly, expression seemed to vary based on host plant. The end goal of this research is to identify the role of lncRNA in CLas transmission, which will hopefully lead to the development of methods that slow or even stop the spread of HLB.
My Experience:
My experience at the Boyce Thompson Institute has been phenomenal. From patient, understanding mentors to a friendly work environment, BTI has allowed me to grow as a scientist and gain skills I had never been exposed to before. Though this summer was challenging, I was able to overcome the difficulties I faced in my research. It would not have been possible without my wonderful mentors, Angela Kruse, Surya Saha, and Prashant Hosmani, and the endless support provided by both the Mueller and Heck Labs. Without them, I would have never considered plant genetics as a possible career track. The lab and technical skills I learned at BTI will continue to aid me as a researcher as I embark upon the adventure of my career. I encourage anyone who is considering a career in research to apply to this program and immerse themselves in the inspiring community that is BTI.
Intern Info
Kyndra Zacherl
“Pedigree Verification in Cassava through Analysis of Single Nucleotide Polymorphisms”
Project Summary:
Manihot esculenta, more commonly known as cassava, is a tropical root crop that serves as the primary food source for 500 million people around the world, particularly in Sub-Saharan Africa. Efforts to improve cassava breeding are tracked using CassavaBase (cassavabase.org), a public database developed by the Mueller Lab as part of the NEXTGEN Cassava Breeding project. CassavaBase allows breeders to store their data in a free and open format. This data includes the pedigrees of thousands of breeding lines. In this project, a pedigree verification tool based on genetic similarity was developed in Perl and implemented in CassavaBase. This tool examines genetic data, analyzes a select set of single nucleotide polymorphisms (SNPs) from the genotypes of the parents and child identified in a pedigree, and determines whether the given combination is possible (e.g. both parents having two copies of an allele and it being absent in the child would be impossible). For lines that do not appear to be a genetic match with their documented parents, it is then possible to search for the true parents through genetic comparison against a larger population of potential parents.
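The compatibility check described above can be sketched in a few lines. This is only an illustration in Python, not the Perl tool implemented in CassavaBase: it assumes biallelic SNPs coded as alternate-allele dosage (0, 1 or 2), uses `None` for missing calls, and the function names are hypothetical.

```python
from itertools import product

# Alleles a diploid parent can transmit, keyed by alternate-allele dosage
# (0 = reference allele, 1 = alternate allele). Assumed coding, for illustration.
GAMETES = {0: {0}, 1: {0, 1}, 2: {1}}


def possible_child_genotypes(mother, father):
    """Return the set of child dosages compatible with two parental dosages."""
    return {m + f for m, f in product(GAMETES[mother], GAMETES[father])}


def pedigree_conflicts(mother, father, child):
    """Count SNPs where the child's genotype is impossible given both parents.

    Each argument is a list of dosages (0, 1, 2) or None for a missing call.
    Returns (conflicts, informative), where informative is the number of SNPs
    with calls present in all three individuals.
    """
    conflicts = 0
    informative = 0
    for m, f, c in zip(mother, father, child):
        if m is None or f is None or c is None:
            continue  # skip SNPs with any missing call
        informative += 1
        if c not in possible_child_genotypes(m, f):
            conflicts += 1
    return conflicts, informative


if __name__ == "__main__":
    mother = [2, 0, 1, 2]
    father = [2, 0, 1, 0]
    child = [0, 0, 2, 1]
    # The first SNP is the impossible case from the text: both parents carry
    # two copies of the allele, yet it is absent in the child.
    print(pedigree_conflicts(mother, father, child))
```

In practice a small fraction of conflicts would be expected from genotyping error alone, so a real tool would presumably compare the conflict rate against a threshold rather than require zero conflicts.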
Breeding higher-yielding and more resilient crops will be critical as land area available for agriculture shrinks and populations increase. Studies have shown that cassava is one of the only staple crops that may resist or even benefit from climate change, an increasingly important concern as global temperatures rise. By ensuring the accuracy of pedigree records in CassavaBase, this tool can contribute to worldwide cassava breeding efforts, including the goals of the overarching NEXTGEN project: shortening the breeding cycle, improving yield, increasing genetic diversity, and increasing the exchange of cassava breeding information.
My Experience:
My internship at Boyce Thompson Institute has given me the opportunity to collaborate with extremely talented bioinformaticians and researchers from around the world. I have gained experience in a considerable number of areas, including using the Linux command line, writing and running scripts for data analysis in R, accessing and managing databases, construction of packages, controllers, and modules in Perl, website design in HTML and JavaScript, the use of virtual machines for development, and the importance of file management and backups. Much of my work required collaboration with lab members who possess unique skill sets, and having to communicate with them clearly and effectively to obtain the information that I needed was a valuable experience. This internship experience has shown me that I work very well in a research setting, and in the future I would consider returning to biological research as a potential career field.
Intern Info
Suzi Barboza-Pacheco
Comparing Database Management Systems in Order to Store Cassava Genetic Data
Project Summary
As faster and cheaper DNA-sequencing technologies develop, so do the challenges of handling the immense amount of genetic data produced. The challenges for database systems include developing efficient methods to organize data and optimizing performance for storing and retrieving data. The benefits of using a database system to store genetic data include multi-user access, online data distribution, and performance scaling. The overall goal of this project was to compare methodologies for storing and querying the latest set of cassava SNP genotypic data in an efficient manner. We compared PostgreSQL, a relational database management system (DBMS) and MongoDB, a non-relational DBMS, by measuring how quickly each database ran a query. We designed three queries that performed different functions including counting the number of mutations in each accession, counting the number of deletions in each accession, and selecting accessions with specific mutations. We also measured the time it took to run each query on different data storing formats used in PostgreSQL including text, JSON, JSONB, and bytea data formats. Development of the loading scripts and query scripts was an iterative process, where the performance was profiled and tuned using various tools, and further development can continue to improve the system used in this project. Data from next-generation sequencing methods are being produced at a rapid rate from many experiments across the world. Therefore, it is necessary to research different methods to store data and to compare databases especially in the case of genetic data.
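To make the benchmarking idea concrete, here is a minimal sketch of timing a "count mutations per accession" query. It deliberately swaps in SQLite (from the Python standard library) for PostgreSQL and MongoDB so that it runs without a database server; the table layout, column names and genotype strings are assumptions for illustration, not the schema used in the project.

```python
import sqlite3
import time


def build_demo_db():
    """Create a tiny in-memory genotype table (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE genotype (accession TEXT, marker TEXT, allele_call TEXT)"
    )
    rows = [
        ("acc1", "snp1", "0/1"),
        ("acc1", "snp2", "0/0"),
        ("acc1", "snp3", "1/1"),
        ("acc2", "snp1", "0/0"),
        ("acc2", "snp2", "0/1"),
        ("acc2", "snp3", "0/0"),
    ]
    conn.executemany("INSERT INTO genotype VALUES (?, ?, ?)", rows)
    return conn


def count_mutations(conn):
    """Count non-reference calls per accession."""
    sql = (
        "SELECT accession, COUNT(*) FROM genotype "
        "WHERE allele_call != '0/0' "
        "GROUP BY accession ORDER BY accession"
    )
    return conn.execute(sql).fetchall()


def time_query(conn, query_fn, repeats=5):
    """Return the best wall-clock time (seconds) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        query_fn(conn)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    conn = build_demo_db()
    print(count_mutations(conn))
    print(f"best of 5 runs: {time_query(conn, count_mutations):.6f} s")
```

Taking the best of several runs reduces the influence of caching and background load; the same `time_query` wrapper could be pointed at any function that runs a query against another database system.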
Intern Info
Danielle Dixon
Elucidating genetic variation in ‘Candidatus Liberibacter asiaticus’ transmission between Asian citrus psyllid isofemale lines
Project Summary
Citrus greening disease, or Huanglongbing (HLB), is the most devastating citrus disease worldwide. HLB makes citrus inconsumable due to a reduction in fruit quality, a decrease in overall fruit production and the eventual death of the citrus tree. The disease is associated with the bacterium Candidatus Liberibacter asiaticus (CLas), which infects citrus hosts after being acquired and transmitted by the Asian citrus psyllid (ACP). To be successfully transmitted, CLas must infect the ACP by traveling through its saliva, gut wall and blood and then returning to the saliva. It has been found that the acquisition and transmission rates of CLas are variable. Much of the work in citrus greening is focused on the development of improved early detection methods for the disease by understanding both of these factors. My project focuses on improving the current psyllid genome through manual curation, as well as helping characterize transmission rates among different ACP isofemale lines via quantitative polymerase chain reaction (qPCR). Gene families that are involved in psyllid immunity have been identified and are curated with the objective of understanding how they contribute to CLas infection of the psyllid. Because qPCR allows us to quantify the amount of CLas present in citrus leaves after ACP feeding, we can then analyze the transmission rates. The ability to connect transmission rates with different psyllid lines may allow for further experiments to reveal the connection between phenotypic and genotypic variation that contributes to the spread of CLas. Ultimately, our goal is to slow the spread of the disease and stop losses in the citrus industry.
My Experience
My summer at the Boyce Thompson Institute and Cornell University was wonderful. I had the opportunity to explore another scientific method by gaining computational skills in the field of bioinformatics. Before this summer I was not attuned to the connections between molecular biology techniques and bioinformatics, such as how protein expression analysis may contribute to the structural change of a gene model. I am immensely grateful for the caring and open community that I experienced while interning at BTI. Working individually with my mentors and collaboratively with members of the Cilia and Mueller labs, I felt supported and capable of success while working toward our goal. My overall experience was equally challenging and rewarding. Additionally, I was privy to various perspectives about the different paths I can pursue as I continue on in the field of scientific research and eventually toward my Ph.D.
Intern Info
Matthew Guo
Building a genomic resource for a major African staple crop: the Cassava Expression Atlas
Project Summary
Cassava is a key staple crop for more than 500 million people across Africa, Asia and South America. Its ability to grow in poor and dry environmental conditions, as well as the high starch content in its roots, makes it a very desirable crop. Despite these benefits, cassava research has long been neglected in comparison to commodity crops such as wheat and rice. Additionally, its high heterozygosity level and low ability to flower are challenging biological features for research and breeding.
The Cassava Expression Atlas aims at developing resources for cassava genomics and offers an interactive tool to access and analyze cassava RNA-seq data. To implement the expression atlas we used an RNA-seq test dataset including three different cultivars of cassava, Arg7, SC124, and W14, under drought and control conditions, with a focus on the leaf and root organs of the plant. A Postgres database and various graphics were created to host and display the processed RNA-seq data.
While the dataset used in this work is relatively small, the Cassava Expression Atlas aims to provide an extensive data resource for the crop, including data on many biotic and abiotic conditions, plant organs, and cultivars. With such an available resource, ideally, any gene expression information can be retrieved and be made available to the international cassava research community.
My Experience
My time spent during the summer as an intern at BTI has been an enjoyable and valuable experience for me. At the start of the program, I had no prior research experience, and while I enjoy both technology and science, I realized that there was a lot I didn’t know and would need to learn. Thanks to my mentors, I was able to find out more about UNIX and the Linux command line, how RNA-seq works, and the kinds of data one can retrieve and analyze using RNA-seq. More importantly, I was able to take the information that I learned and apply it to my research project effectively. This internship has been a truly invaluable experience for me, and the skills I have learned will hopefully help me in pursuing my future college and career goals.
Intern Info
Filip Jander
Solanaceae Gene Family Analysis
Project Summary
This project consisted of analyzing the newly sequenced Iochroma cyaneum genome and comparing it to Solanum lycopersicum (tomato), Solanum tuberosum (potato) and Arabidopsis thaliana. The purpose of the project was to identify gene families and provide the groundwork for finding interesting or unique genes and gene families in Iochroma and the Solanaceae. I ran scripts on the servers to generate data that was used to create a Venn diagram showing the gene family relationships. The results matched expectations: Arabidopsis thaliana shared fewer gene families with the other species than tomato, potato and Iochroma shared with one another. A large number of singletons were found in the Iochroma annotations, likely due to the lower quality of the gene structural annotations. However, overall the Iochroma gene annotations were of acceptable quality.
The second part of the project involved creating a phylogenetic tree to develop a better understanding of how Iochroma cyaneum evolved relative to the rest of the Solanaceae family. For this part of the project I worked with various programs to analyze eight species in the Solanaceae family. The result of this portion of the project was a completed phylogenetic tree of the eight species built from a dataset of orthologous genes. This phylogeny is based on a larger amount of data than previous phylogenies for these species. The Iochroma cyaneum genome is most likely the largest source of error in this project because of the inaccuracies of draft genome assemblies and annotations.
My Experience
I learned a lot about Unix operating systems and how they are useful for large data sets and bioinformatics. This was a great learning opportunity for me because it will help me in the future if I ever want to go into computer science. Additionally, I learned some Perl, which can help me in the future if I ever choose to take on bioinformatics as a profession. Working a 9-5 job taught me a lot about time management and what it’s like to be in a real work setting. Collaborating with my mentors was also a good experience for me because it helped me understand my research and get a good idea of what I was doing. Overall this internship taught me a lot and has made me consider a future in scientific research.
Intern Info
Allison Izsak
Gene discovery, annotation and orthology in the Asian citrus psyllid genome
Project summary
The Asian citrus psyllid, Diaphorina citri, is the vector host to the citrus industry’s most threatening bacterial pathogen, Candidatus Liberibacter asiaticus. This bacterium is the causative agent of citrus greening disease and has cost the citrus industry more than $4 billion in revenue loss. The focus of my project this summer was to help find a solution to this devastating disease by working with the psyllid genome and looking for genes that might be involved in pathogen transmission or pathogen survival.
Not much is known about the genetics and genomics of the psyllid, so a literature search of related vector systems was conducted and a candidate gene list highlighting genes that play a role in immunity and gut-microbe homeostasis was compiled. The candidate genes that were successfully identified in the psyllid via BLAST were then manually annotated based on predicted gene models in Web Apollo. Additionally, an OrthoMCL analysis was done using the proteomes of 8 related hemipterans, including the psyllid, to identify conserved hemipteran proteins as well as proteins that are common to all sequenced hemipterans, but missing in the psyllid. This helped to evaluate the completeness of the psyllid genome assembly.
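The comparison of ortholog groups across species comes down to set operations. The sketch below assumes the clustering result has already been reduced to a mapping from group ID to the set of species with at least one member; it does not parse OrthoMCL's own output files, and the group IDs and species labels are made up for illustration.

```python
def conserved_and_missing(groups, all_species, focal):
    """Split ortholog groups by their presence across species.

    groups: dict mapping a group ID to the set of species that have a member.
    all_species: every species included in the analysis.
    focal: the species whose assembly completeness is being evaluated.

    Returns (conserved, missing):
      conserved = groups present in every species, including the focal one
      missing   = groups present in every other species but absent in focal
    """
    everyone = set(all_species)
    others = everyone - {focal}
    conserved = sorted(g for g, sp in groups.items() if everyone <= sp)
    missing = sorted(
        g for g, sp in groups.items() if others <= sp and focal not in sp
    )
    return conserved, missing


if __name__ == "__main__":
    species = {"psyllid", "aphid", "whitefly"}
    groups = {
        "OG1": {"psyllid", "aphid", "whitefly"},  # conserved in all
        "OG2": {"aphid", "whitefly"},             # missing only in the psyllid
        "OG3": {"psyllid", "aphid"},              # neither category
    }
    print(conserved_and_missing(groups, species, "psyllid"))
```

A long "missing" list could indicate either genuine gene loss or gaps in the assembly and annotation, which is why such groups are useful for judging completeness.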
The results from this project will enable other researchers working on this problem to focus their efforts on finding ways to inhibit proteins/genes that have been identified and manually annotated in the psyllid and, therefore, help find a solution to citrus greening faster.
My Experience
Working at BTI this summer was great. As someone who had never done research prior to this internship, I was excited and amazed to find out that the project I would be working on was of such high importance. To be given real responsibility and treated as an important member of the team was extremely rewarding. Everyone I worked with wanted me to succeed and gave me all of the support I needed to be able to do so. I really enjoyed being in an environment where learning about different aspects of plant biology, and science in general, was encouraged. Also, at the beginning of the summer I was totally new to bioinformatics research, but now I’m leaving with a skill set that I can use in many different fields. I feel lucky to have been exposed to so many areas of research and am eager to continue exploring plant science in future studies.
Intern Info
Ivana Rodriguez
Analysis and characterization of conserved non-coding sequences in species of the Legume Family (Fabaceae)
Project Summary
A character of considerable evolutionary and ecological significance is nodulation, the symbiotic fixation of atmospheric nitrogen by soil bacteria housed in specialized structures (nodules) of various angiosperms, particularly the diverse legume family. Conserved non-coding sequences (CNSs) are regions of DNA in close proximity to genes, and they serve regulatory purposes not yet fully understood in gene function. Previous work has demonstrated the mechanisms by which gene regulation provides crucial contributions and influences evolutionary change among species. Analysis of these highly conserved sequences may provide a novel way by which to track evolutionary changes and relationships among Leguminosae species on a genomics scale. More precisely, conserved non-coding sequences have great potential to supply novel evolutionary insights and may answer the question of whether these economically crucial nitrogen-fixing species have acquired their traits through a common ancestor, or through independent (convergent) means.
This project focused on assembling a pipeline by which to streamline identification of conserved sequences using whole genome data for the legume species. The location of these sequences was determined relative to coding regions to gain insight into the classes of genes the CNSs may be regulating. The identified sequences were further characterized by defining over-represented motifs, which were then queried against databases of transcription factor binding sites to obtain putative functions. Once we can deduce CNS sequences that are associated with nodulation-specific functions, additional hypotheses concerning the origins and evolution of nodulation can be deduced. Ultimately, this bioinformatics approach will serve to complement progress towards discovering evolutionary origins of nodulation.
My Experience
My internship at Boyce Thompson helped solidify my passion for discovery and research in plant genetics and genomics. My time spent at Cornell this summer helped me fulfill new goals I never thought I’d have the confidence to reach. Prior to the internship, I had almost no experience with computational biology, but with the help of my mentors, I’ve become well-oriented with bioinformatic data analysis. I was able to reach new heights in my understanding of what it means to be a scientist, and for me, that is an exhilarating and valuable feeling. I am immensely thankful to everyone in Dr. Mueller’s lab, as well as Dr. Suzy Strickler and Dr. Doyle for their endless support, faith and patience.
Intern Info
Jonathan Gomes Selman
Extending CRISPR design software features for tomato protein kinase silencing
Project Summary
The discovery of the CRISPR/Cas9 method in 2012 represents a revolutionary advance in genome editing. The system relies on two main functional groups: a Cas9 protein that cuts the targeted DNA strand and an sgRNA (single guide RNA) that guides the protein to specific locations in the genome. One of the primary challenges in implementing CRISPR systems is designing optimal guide RNAs with minimum off-target matches. Currently, several algorithms have been developed to score guide RNA effectiveness by a variety of experimentally determined factors. My research has extended the scope of currently published CRISPR tools, such as CCTop and CRISPR-P. With added functionality, researchers will have greater specificity when selecting sgRNAs. Researchers will be able to analyze guide RNA positions within a gene, giving preference to guides near the 5’ end, and view whether they target specific gene domains for increased desirability. Additionally, researchers will have the option to design sequences intended to target multiple genes within a family. By extending guide RNAs to target several genes, scientists studying multiple related genes or genes from the same family will more easily be able to experiment with simultaneously silencing several genes. Utilizing these extended techniques, I have designed multiple guide sequences for the Receptor Like Cytoplasmic Kinase (RLCK) gene family, a family that plays an important role in plant immune responses. Overall, extending the functionality of current CRISPR tools allows for more specialized guide sequences to be obtained, giving scientists a greater range of experiments when researching genes through genetic manipulation.
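Two of the features described above, preferring guides near the 5’ end and finding guides shared by several genes, can be illustrated with a short sketch. It is not the code behind CCTop, CRISPR-P or the extended tool: it assumes a 20-nt guide followed by an NGG PAM, scans only the forward strand, ignores off-target scoring entirely, and all function names are hypothetical.

```python
import re


def find_guides(seq, guide_len=20):
    """Yield (guide, relative_position) for forward-strand sites with an NGG PAM.

    relative_position is the guide start divided by the sequence length,
    so values near 0 lie close to the 5' end.
    """
    seq = seq.upper()
    # A lookahead lets overlapping candidate sites be reported as well.
    pattern = re.compile(r"(?=([ACGT]{%d})[ACGT]GG)" % guide_len)
    for match in pattern.finditer(seq):
        yield match.group(1), match.start() / len(seq)


def shared_guides(genes, guide_len=20):
    """Return guides found in every gene, ranked with the most 5' first.

    genes: dict mapping a gene name to its sequence (5' to 3').
    Ranking uses the mean relative position across genes.
    """
    per_gene = []
    for seq in genes.values():
        positions = {}
        for guide, rel in find_guides(seq, guide_len):
            positions.setdefault(guide, rel)  # keep the most 5' occurrence
        per_gene.append(positions)
    if not per_gene:
        return []
    common = set(per_gene[0]).intersection(*per_gene[1:])
    return sorted(
        common,
        key=lambda g: (sum(p[g] for p in per_gene) / len(per_gene), g),
    )


if __name__ == "__main__":
    guide = "ACGTACGTACGTACGTACGT"
    family = {
        "gene_a": guide + "TGG" + "AAAA",
        "gene_b": "TT" + guide + "AGG" + "CC",
    }
    print(shared_guides(family))
```

A fuller version would also scan the reverse complement and tolerate a small number of mismatches between family members; both are left out here to keep the idea visible.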
My Experience
Working as an intern at the Boyce Thompson Institute (BTI) has been a truly invaluable experience. Although I started the internship with a minimal background in bioinformatics and plant biology, through the amazing guidance of my mentor Dr. Noe Fernandez, I have gained an immense amount of knowledge about these areas. Every day presented an array of different challenges, teaching me the rigor and dedication needed to conduct research. My experience over the last six weeks has shown me the excitement of being in the pursuit of discovery, and has opened my eyes to the possibility of continuing to conduct research in the future. The opportunity to be imaginative, creative, and to think critically about a problem strongly appeals to my interest as an individual. I would like to extend my sincere gratitude to Tiffany Fleming, Nicole Waters Fisher, my mentor Dr. Noe Fernandez, Professor Lukas Mueller, and the entire Mueller lab for making my summer internship at BTI such an amazing and memorable experience.
Intern Info
Angela Zhang
Transcriptome characterization and evolution in cultivated and wild tomato species
Project Summary
Solanum lycopersicum, more commonly known as the tomato, is one of our most important agricultural crops. Not only is the bright red and fleshy fruit of the tomato a good source of vitamins and antioxidants, it also provides an important model system for fruit development. Despite its importance in agriculture, however, Solanum lycopersicum suffers from a severe lack of genetic variation. Due to a combination of bottleneck effects and centuries of inbreeding, individual tomato plants are nearly genetically identical to each other. As a consequence of the decreased variation, this cultivated plant is much more susceptible to disease and changes in climate. Wild tomato species, on the other hand, are rich in genetic variability and contain many adaptations to less-than-ideal conditions, such as arid climates and high altitude. Such species may be vital in breeding programs to create a more evolutionarily fit tomato.
This summer we assembled the transcriptomes of the wild tomato species Solanum peruvianum and Solanum pimpinellifolium and annotated the genomes of Solanum chilense, Solanum habrochaites, and Solanum pennellii with tools such as Trinity and Maker, respectively. We also completed a de novo genome assembly of Solanum chilense, a species more heterozygous than tomato, with Platanus. With the data from these wild tomato species, we constructed a phylogeny using programs such as ClustalX, Dnaml, and PAML. While tomato phylogenetic trees had already been constructed in previous efforts, they were based on a much smaller sample size than we have available to us now. Using a larger sample size can help resolve phylogenies for wild species and provide greater insight into the evolutionary divergence of these tomato species. Overall, the work we completed offered interesting insights into the wild tomato species and will aid further research in improving the tomato crop.
My Experience
My summer at Boyce Thompson Institute and in Lukas Mueller’s lab has allowed my bioinformatics skills to grow exponentially. Not only am I much more comfortable with UNIX systems and a multitude of bioinformatics tools, I have gained a better appreciation for the broad potential of big data in supporting the growth of plant science research. Outside of bioinformatics, I was able to witness the diversity of plant research conducted by some of the best scientists in their field. I leave the summer much more experienced and am ready to apply my newly learned skills to my continuing immersion in bioinformatics.
Intern Info
Javon Mullings
Multi gene analysis tool for Virus Induced Gene Silencing
Project Summary
Virus Induced Gene Silencing (VIGS) is a very useful method of studying gene function in plants. For this method to be effectively employed, the target gene fragment introduced into the virus vector has to have a specific sequence. Identifying that specific sequence allows for the silencing of target genes while influencing as few off-target genes as possible. To aid in this effort, the Sol Genomics Network (SGN) created the SGN VIGS Tool, a user-friendly and highly customizable web tool that helps researchers design VIGS constructs. The original version of the tool only worked with one sequence at a time, yet researchers could require the silencing of hundreds of genes to study particular gene functions in large screening experiments. With this in mind, a new algorithm was developed to improve the tool to accept multiple sequences at a time. The algorithm was implemented on a stand-alone version of the SGN VIGS Tool, incorporating several programming languages and software packages. The result is a new feature on the SGN VIGS Tool, Bulk, that accepts a file of multiple FASTA sequences and returns the appropriate construct sequences along with target and off-target gene information. The user also has the option to upload expression data to accompany the target and off-target gene information in the results. Future improvements will involve the addition of Primer3 to further aid researchers in making multi-gene virus constructs more efficiently.
My Experience
In creating the VIGS Tool Bulk, I was exposed to new programming languages and software such as Perl, Catalyst, and the Command-line interface. Often my code would return errors or I would encounter new processes, and need to do in depth research online to fix the issue. Consequently my knowledge and proficiency in those languages and software have increased significantly. I now also have extensive experience with web tool development. The members of the Mueller lab were great resources all summer, especially my mentor, Noe Fernandez. With his guidance, I was able to gain a greater understanding of the relationship between biology and computer science, and how to code more efficiently—something that will be invaluable if I pursue a career in bioinformatics.
Intern Info
Amelia Lovelace
Diversity analysis of wild tomato species in the Lycopersicum clade using transcriptomes
The genetic diversity of S. lycopersicum, the domesticated tomato, has been drastically reduced by bottlenecks during domestication, and as a result useful allele diversity has been lost from the gene pool. Fortunately, wild tomato species have high genetic variation and thus have been utilized to restore gene diversity in cultivated tomatoes. However, in order to fully understand the domesticated tomato’s genetic potential, the diversity of the wild tomato species must be further analyzed. RNA-seq data from the SRA and from Next Generation Sequencing of various wild tomato tissue samples were utilized for analyzing genetic diversity. Cleaned paired-end Illumina reads from S. arcanum, S. peruvianum, S. pimpinellifolium, S. corneliomulleri, S. chilense, and S. pennellii were mapped to the Heinz genome using the reference-based assembly program, TopHat2. The accepted hits files for each accession were then merged using Samtools. In order to analyze the coverage of the wild species data from various tissues in reference to the Heinz genome, Bedtools was utilized to get the number of reads from each wild tomato species that mapped to each Heinz gene. Because the coverage of these samples was sufficient, a consensus sequence was generated based on the read mapping using Samtools. The consensus sequences for each species can then be compared using ClustalW to generate a phylogenetic tree. A manual was written to facilitate the analysis of future data as more wild tomato samples get sent in for Next Generation Sequencing.
My Experience
This internship has helped me learn a great deal about my research interests as well as what it means to conduct bioinformatics research. I was initially overwhelmed by the amount of catching up I had to do in order to complete my project, but looking back, I am surprised by how much I have accomplished in this short ten-week internship. Not only am I more familiar with bioinformatics software and programming, but I now have a better idea of what I want to study in graduate school. Although this internship has helped me realize how much I miss working in the greenhouse and wet lab, the skills I have developed in this bioinformatics program will be applicable to my future research in plant genetics at graduate school. Most importantly, this internship has fueled my passion for plant research in that it has revealed many interesting research topics.
Intern Info
Matthew Crum
Genome assembly and analysis of wild tomato for markers associated with Tomato Yellow Leaf Curl Virus
The tomato is one of the most important fruit crops, and humans have been domesticating it for hundreds of years in order to maximize the crop’s yield. Through selective breeding, tomatoes have been made to produce much greater yields, but this process has also reduced the genetic diversity of domesticated tomatoes and limited their ability to adapt and fight diseases. Scientists and breeders have been trying to address this problem by introgressing new and diverse genes and alleles from wild tomato species, which still maintain their genetic diversity, into the domestic tomato. By taking advantage of the disease resistance of wild tomatoes, scientists have been able to transfer resistance to certain diseases into domestic tomatoes. To identify which genes or alleles the domestic tomato needs in order to resist a disease, the genetic basis of the wild tomato’s resistance must first be identified. Next-Generation Sequencing combined with bioinformatics analysis tools has made it possible to sequence and compare genomes much more efficiently. Using Bowtie2 (a tool that aligns Next-Generation Sequencing reads to a reference genome), samtools (a set of tools used to manipulate Next-Generation Sequencing reads), and GBrowse (a genome visualization tool), we have been able to map and analyze multiple wild tomato accessions, including both resistant and susceptible inbreds, and compare their genomes in order to find loci that may contribute to resistance to Tomato Yellow Leaf Curl Virus.
My Experience
This experience has helped me grow in areas ranging from the purely scientific and academic to the interpersonal skills necessary to work in team settings. I have been more thoroughly introduced to the Linux environment, improved my Perl scripting skills, and been taught to use various bioinformatics tools such as Bowtie2 and samtools. This internship gave me a chance to observe firsthand the strength of applied bioinformatics in resolving biological problems that would otherwise have required far more resources. Perhaps most importantly, my experience with my mentor, Naama Menda, has taught me the skills necessary to work in a research setting. The freedom I was allowed in working out the kinks in my project helped me develop self-reliance, while Naama’s assistance when I became stuck demonstrated the value, and often necessity, of asking for help and collaborating with others to accomplish a common goal.
Intern Info
Saji Akhil
Exploring the benefits of distributed parallel computing on large biological datasets
The scientific community is accumulating biological data at an exponential rate. Next generation sequencing technologies easily generate many terabytes of genomic data per week. Investigators require new methods to efficiently query and analyze data of this scale. Apache Hadoop is open source software designed to replicate the features of Google MapReduce and Google File System (GFS) [1]. The basic principle behind the MapReduce paradigm is to split a computationally demanding task into smaller sub-tasks that can be distributed among a cluster of nodes. The Hadoop Distributed File System (HDFS) is the second major component of the Hadoop package. HDFS is designed to be an extremely redundant and expandable storage medium, which makes it well suited to a scientific setting where data parity and large data quantities are common. Our preliminary investigation of Hadoop in a scientific setting involved two key steps. First, a thorough study of the benefits of distributed parallel computing versus linear computing on relevant datasets was required. This encompassed benchmarking the physical hardware, testing MapReduce applications, and comparing the results to their linear counterparts. The results indicated that Hadoop offered significant improvements in computation time in specific cases such as filtering large datasets. Second, a MapReduce application was designed to query Genotyping-by-Sequencing (GBS) data. Normally, querying these datasets with linear processing takes extensive time and computational power; however, using parallel computing, query times were reduced by several orders of magnitude. This project will enable scientists to make novel discoveries using large biological datasets.
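The split, map, and reduce steps described above can be illustrated with a minimal single-machine sketch in Python. It does not use Hadoop, and it is not the application built in this project: the record layout (sample, marker, genotype call), the function names, and the query (counting heterozygous calls per marker) are all assumptions made for illustration. A thread pool stands in for the cluster nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical GBS-like records: (sample, marker, genotype call).
RECORDS = [
    ("s1", "m1", "A/A"),
    ("s1", "m2", "A/T"),
    ("s2", "m1", "A/T"),
    ("s2", "m2", "T/T"),
    ("s3", "m1", "A/T"),
    ("s3", "m2", "N/N"),
]


def split(records, n_chunks):
    """Cut the record list into at most n_chunks pieces of similar size."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Map step: emit (marker, 1) for each heterozygous call in one chunk."""
    pairs = []
    for _sample, marker, call in chunk:
        allele_a, allele_b = call.split("/")
        if allele_a != allele_b:
            pairs.append((marker, 1))
    return pairs


def reduce_pairs(mapped):
    """Reduce step: sum the emitted values for each key."""
    counts = defaultdict(int)
    for pairs in mapped:
        for key, value in pairs:
            counts[key] += value
    return dict(counts)


def count_heterozygous(records, workers=3):
    """Run the map step on each chunk in parallel, then reduce the results."""
    chunks = split(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as executor:
        mapped = list(executor.map(map_chunk, chunks))
    return reduce_pairs(mapped)


if __name__ == "__main__":
    print(count_heterozygous(RECORDS))
```

Because each chunk is mapped independently, the same map and reduce functions could in principle be handed to a framework that distributes chunks across machines; only the scheduling changes.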
My Experience
The Bioinformatics Internship at the Boyce Thompson Institute (BTI) has been an eye-opening experience into science as a whole. Everything from the opportunity to attend weekly seminars to the immersive experience of working in a scientific environment every day has led me to appreciate and gain insight into the lifestyle of a scientist. I became interested in the program offered at BTI because of my innate interest in computing and biology. Throughout the past ten weeks I have thoroughly explored biology in a computational context and gained an understanding of the plethora of benefits that computation can offer to life scientists. The background and skills I have gained during this internship will be a valuable resource as I venture forward into future research endeavors during graduate school.
Intern Info
Andrew Dunford
RNA-seq Assembly of Transcriptomes in Various Tissues Collected from Solanum habrochaites, S. lycopersicum, S. pennellii, S. pimpinellifolium and Introgression Line 4-3
With the advent of high-throughput nucleotide sequencing, genomics and transcriptomics have become increasingly important topics in the field of molecular biology. Through the study of an organism’s transcriptome, scientists can gain deep insight into how genes are expressed in different tissues or under certain conditions. Understanding gene expression is important for understanding how genes relate to traits such as disease or drought resistance, which may be important in agriculture or industry. There are challenges, however, in assembling and analyzing the large sets of data produced by these next generation sequencing processes, creating demand for bioinformatic techniques. In this study, transcriptomic data from various tissues of S. habrochaites, S. lycopersicum, S. pennellii, S. pimpinellifolium and the introgression line were processed, mapped to the S. lycopersicum genome, and then analyzed for differential expression, single nucleotide polymorphisms (SNPs), insertions, and deletions. To this end, a variety of software packages were used, including fastx-toolkit, the Tuxedo Suite, samtools, snpEff and bcftools, in addition to scripts developed within BTI. The ultimate goal is to bring these data to a point where they can be loaded into an interface like GBrowse and viewed by biologists who may be interested in studying differential expression between these tomato species or samples.
My Experience
In this internship I gained experience with a wide set of bioinformatic tools, as well as coding and web-development techniques not directly related to biology. Although I was working with tomato data, the techniques I used, which were mostly for assembly and analysis of transcriptomes, are applicable to a wide spectrum of organisms and model systems. The faculty I worked with were very helpful and cleared up any questions I had. Although I probably won’t go directly into bioinformatics as my major field, I believe the skills I’ve gained will help me greatly in my studies of biochemistry in grad school and onwards.
Intern Info
Kristin Blacklock
De Novo Discovery and Comparison of Transposable Element Families in S. lycopersicum and S. pimpinellifolium
Transposable elements (TEs) are sequences of DNA capable of changing their relative position in the genome of an organism either by moving or copying themselves. Their discovery in the 1940s is credited to maize geneticist Barbara McClintock, whose suggestions of TE functionality were dismissed for decades thereafter. Recently, however, researchers have discovered several important aspects of TEs, including one unusual retrotransposon, Rider, whose activity in the SUN gene of the domesticated tomato (Solanum lycopersicum) has resulted in altered fruit morphology phenotypes. Thus, TEs may have played an important role in the speciation between the domesticated tomato and its wild ancestor, and so the identification of putative new active TE families in the S. lycopersicum genome that are absent or less abundant in the S. pimpinellifolium genome may be of particular interest for the advancement of tomato research.
This summer, I implemented a de novo transposable element discovery pipeline called the REPET Package on the tomato genome. Its two main components, TEdenovo and TEannot, are dedicated to the detection and analysis of repeats in genomic sequences, where TEdenovo returns a library of classified, non-redundant consensus sequences, and TEannot filters these results based on a similarity search with known TEs. Once obtained, the TE content of the domesticated and wild ancestor species was then compared to identify TE families with characteristics that suggest recent activity. For those TE families of interest, the presence or absence of individual elements was verified by aligning flanking sequences from the two species. The positions of TE polymorphism sites were compared to the locations of known genes to find instances of TEs that may be contributing to functional genetic variation.
My Experience
This summer internship has been an amazing experience in which I have grown both as a person and as a researcher. In these past weeks, I have gained not only great new friends, but also a deeper appreciation for bioinformatics and the answers we can find using computer science in conjunction with traditional biological research. I enjoyed my projects immensely, both the beginner project, which was to create a Catalyst-based web interface for Primer3 on the Sol Genomics website, and the main project, which dealt with the de novo identification of transposable elements in the tomato genome. I have learned so much from my mentor and others, and now feel confident that my future career will involve a blend of computer science and biology.
Intern Info
Paul Van Eck
Pedigree Visualization and Genome Referencing
The Sol Genomics Network (SGN) database contains genealogical information on genetic stocks, but the SGN website lacked a friendly way of showcasing this information. To remedy this, a pedigree visualization tool was created and integrated into the site using Perl, HTML, JavaScript, and GraphViz. Pedigree charts show the genetic history of an organism over several generations, so such a tool will allow plant breeders and biologists to easily view the genealogy of a subject of interest. This allows for better integration of information from historical breeding records and genomic resources.
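As a rough illustration of how pedigree records can be turned into a chart, the sketch below writes Graphviz DOT text from a child-to-parents mapping. It is written in Python rather than the Perl used on the site, and the data structure, function name, and accession names are hypothetical; it is not the SGN implementation.

```python
def pedigree_to_dot(parents):
    """Return Graphviz DOT text for a mapping child -> (female_parent, male_parent).

    A parent given as None is treated as unknown and is skipped.
    """
    lines = [
        "digraph pedigree {",
        "    rankdir=TB;",          # draw ancestors above descendants
        "    node [shape=box];",
    ]
    for child in sorted(parents):
        for parent in parents[child]:
            if parent is not None:
                lines.append(f'    "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical two-generation pedigree.
PEDIGREE = {
    "F1_hybrid": ("line_A", "line_B"),
    "selection_1": ("F1_hybrid", "line_C"),
}

if __name__ == "__main__":
    print(pedigree_to_dot(PEDIGREE))
```

The resulting text can be saved to a file and rendered with the Graphviz `dot` program, for example `dot -Tsvg pedigree.dot -o pedigree.svg`.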
In a separate project, in an effort to make more plant genome resources available on SGN, the rice and grape genomes were loaded into the SGN database and configured to work with GBrowse, a tool for displaying and interacting with genomic annotations. Having these available is a great resource for plant genome analysis. Similarly, gene families, which are sets of genes with similar sequences, were also loaded into GBrowse for easy viewing. These families included genes from potato, grape, tomato, Arabidopsis, and rice. Paralog and ortholog groups derived from the gene families were also used in an attempt to detect genome duplications and their pattern of distribution in the genome.
My Experience
This summer internship at BTI was undoubtedly a worthwhile and lasting experience. I was given an excellent opportunity to relate my computer science major to the field of biology, and with the assistance of a wonderful group of mentors, I was able to learn a lot about what bioinformaticians do. Even if I do not continue studying bioinformatics, much of the knowledge I gained is still highly relevant to my major. I am much more proficient in working on Unix systems, and can now more comfortably work with Perl and JavaScript. Through this internship, I met many fantastic people, got to explore much of Ithaca, and had an overall unforgettable time.
Intern Info
Dil Begum
Filling of Reference Genome Gaps Using Next Generation Sequencing
From famous entrées such as spaghetti and pasta to mouthwatering salsa, tomato has a wide variety of uses in the world’s cuisines. Tomato is also very important in our diet. According to a magazine article (Scott-Dixon, Krista), tomato is an antioxidant powerhouse. So where do tomatoes come from? What makes them so different in taste, color and appearance? Scientists are working day by day to improve tomato quality. Since the tomato genome has been sequenced, this information can be used to help us understand how genes work together to affect growth, development and the functionality of an entire genome. However, when the tomato genome was sequenced, gaps were left as a result of the assembly process, creating areas of missing sequence. Therefore, the purpose of this study is to fill as many gaps as possible, and thereby reduce the number of gaps, using de novo contigs assembled from Illumina short reads. The results were produced using tools such as BWA, Novoalign, SamTools, Picard, BLAST, SOAPdenovo, and BedTools, as well as Perl scripts and some Linux regular expressions.
My Experience
The purpose of the project was to use the reference genome, together with information about where short reads map, and to generate contigs that can be mapped to the reference genome in order to identify contigs located near gaps. The reads were obtained using Illumina next generation sequencing technology. Different tools were used to generate and process contigs from the reads, such as SamTools, Picard, SOAPdenovo, BLAST, BedTools, the Perl programming language, and some Linux regular expressions. After the contigs were generated with SOAPdenovo, they were run through BLAST, which gave information about contig ID, chromosome ID, e-values, percentage matches, etc. I was interested in only three fields: chromosome ID, contig start and contig end. With this information, I wrote a script that took the two best hits from BLAST and wrote them out in a BedTools format. I also wrote a second script that generated the locations of the gaps from start to finish and also wrote them out in a BedTools format. Then I used BedTools with the output from both scripts to show where each contig lined up on the chromosome, which chromosome it lined up to, and how far the contigs were from each other in terms of base pairs. For this project I chose regions less than 20 bp.
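The gap-location and proximity steps described above can be sketched in Python as follows. This is an illustration, not the intern’s Perl scripts or a BedTools replacement: it assumes gaps appear as runs of N in the reference sequence, that intervals use 0-based, half-open coordinates as in BED files, and that the 20 bp threshold refers to the distance between a contig hit and a gap. All names and the example data are hypothetical.

```python
import re


def find_gaps(chrom, sequence):
    """Return (chrom, start, end) for each run of N; 0-based, half-open."""
    return [(chrom, m.start(), m.end()) for m in re.finditer(r"[Nn]+", sequence)]


def distance(a_start, a_end, b_start, b_end):
    """Bases separating two half-open intervals; 0 if they overlap or touch."""
    return max(0, b_start - a_end, a_start - b_end)


def contigs_near_gaps(contig_hits, gaps, max_dist=20):
    """List (contig, gap_start, gap_end, distance) for hits within max_dist of a gap."""
    near = []
    for c_chrom, c_start, c_end, name in contig_hits:
        for g_chrom, g_start, g_end in gaps:
            if c_chrom != g_chrom:
                continue
            d = distance(c_start, c_end, g_start, g_end)
            if d < max_dist:
                near.append((name, g_start, g_end, d))
    return near


# Hypothetical reference: 100 bases, a 10-base gap, then 100 more bases.
SEQUENCE = "ACGT" * 25 + "N" * 10 + "ACGT" * 25

# Hypothetical contig hits: (chromosome, start, end, contig name).
HITS = [
    ("chr1", 90, 98, "contig_1"),   # ends 2 bases before the gap
    ("chr1", 0, 50, "contig_2"),    # 50 bases away from the gap
    ("chr2", 95, 100, "contig_3"),  # different chromosome
]

if __name__ == "__main__":
    gaps = find_gaps("chr1", SEQUENCE)
    for chrom, start, end in gaps:
        print(f"{chrom}\t{start}\t{end}")  # tab-separated BED-style line
    print(contigs_near_gaps(HITS, gaps))
```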
Intern Info
Jessica Jeffrey
GenBank Update
Solgenomics has been working on sequencing the tomato genome for some time now. The resulting reference genome can be found in GenBank, where it was submitted for other researchers to access. GenBank is an online public repository of genome sequences run by the National Institutes of Health (NIH) and accessed by labs worldwide. Due to new advances and changing data, the GenBank record was not up to date. Three scripts were needed to update it. The file containing all the information for GenBank submission was in the incorrect format. A script already existed to reformat this data; however, it did not include orientation information for the contigs (pieces of sequenced DNA). Thus the existing reformatting script was edited to preserve the orientation information. The next step was to make the reference genome as accurate as possible; this was achieved by integrating other data types. The current reference genome was created from Next Generation Sequencing, a method that cuts DNA and sequences only the ends of these small pieces. Data were also available from Bacterial Artificial Chromosomes (BACs), created with an E. coli-based sequencing technique. A new script was created to select the most accurate data from a Basic Local Alignment Search Tool (BLAST) output. Another new script was created to integrate this information into the current reference genome.
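One way such a selection step might look is sketched below in Python. The abstract does not state which criterion was used to pick the most accurate data, so keeping the highest-bitscore hit per query is an assumption, as are the function name and the example rows. The column order is that of the standard 12-column BLAST tabular output; a subject start greater than the subject end indicates a hit on the minus strand, which is how orientation is recovered here.

```python
import csv
import io

# Column order of the standard 12-column BLAST tabular output.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(blast_tab):
    """Keep the highest-bitscore hit per query (an assumed selection rule)."""
    best = {}
    reader = csv.DictReader(io.StringIO(blast_tab), fieldnames=FIELDS,
                            delimiter="\t")
    for row in reader:
        score = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or score > current["bitscore"]:
            sstart, send = int(row["sstart"]), int(row["send"])
            best[row["qseqid"]] = {
                "subject": row["sseqid"],
                "start": min(sstart, send),
                "end": max(sstart, send),
                # sstart > send means the hit lies on the minus strand.
                "strand": "+" if sstart <= send else "-",
                "bitscore": score,
            }
    return best


# Hypothetical BLAST rows, tab-joined to match the tabular format.
EXAMPLE = "\n".join("\t".join(row) for row in [
    ["BAC_1", "chr1", "99.5", "1000", "5", "0", "1", "1000", "5000", "5999", "0.0", "1800"],
    ["BAC_1", "chr2", "90.0", "400", "40", "2", "1", "400", "700", "1099", "1e-50", "500"],
    ["BAC_2", "chr3", "98.0", "800", "16", "0", "1", "800", "9800", "9001", "0.0", "1400"],
])

if __name__ == "__main__":
    for query, hit in best_hits(EXAMPLE).items():
        print(query, hit)
```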
My Experience
This summer I learned many skills related to Bioinformatics. I began the summer by learning a new computer language, Perl. As I had some programming experience in the past, I was able to pick it up fairly quickly. My project was very exciting as I was creating scripts to do something that had never been done before: integrating BAC data into current genome information. I was lucky enough to have two mentors this summer, and they both were very helpful. They would check in on me regularly, and help me solve problems and errors that I ran into. I learned that it is possible to do programming and still have a link to the lab. Many of the interns I met this summer were working in labs creating data similar to what I was working with. I was able to improve my computer programming skills while learning about the most recent biological techniques. The most important thing I learned this summer is that I would love to pursue a career in Bioinformatics.
Intern Info
Samuel Moijueh
Implementing an RSS feed on the Sol Genomics website using the Perl programming language
The Sol Genomics Network (SGN) is an open-source plant genomics database where researchers and agriculturalists can exchange information. However, this database previously did not have an RSS feed to keep SGN users aware of updates made to the database. Thus, the objective of this project was to implement an RSS feed that would make it easier for users to collaborate and share the latest information whenever it is updated, and to access aggregated content from the central repository even when they are not on the SGN website. The Perl modules used to automate the process of generating feeds were XML::Feed, Catalyst::Model::XML::Feed, and Date::Calc. Essentially, the script resides at a URL; when the user clicks the link (calls the URL), the script queries the database for loci and then automatically generates the feed. A webpage was also designed using HTML and JavaScript, among other tools, to display all the available feeds on the Sol Genomics Network. An RSS feed will prove to be a great addition to the Sol Genomics Network.
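To show what generating a feed from a list of updated loci involves, here is a minimal Python sketch that builds an RSS 2.0 document with the standard library. It is not the project’s Perl code and does not use XML::Feed; the function name, the locus names, and the example.org URLs are placeholders, and a real script would fill the item list from a database query.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(title, link, description, items):
    """Return an RSS 2.0 document as a string.

    Each item is a dict with 'title', 'link', and 'updated' (a datetime).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        # RSS expects RFC 822 style dates, which format_datetime produces.
        ET.SubElement(node, "pubDate").text = format_datetime(item["updated"])
    return ET.tostring(rss, encoding="unicode")


# Placeholder entries standing in for the result of a database query.
UPDATED_LOCI = [
    {"title": "Locus updated: example_locus_1",
     "link": "https://example.org/locus/1",
     "updated": datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)},
    {"title": "Locus updated: example_locus_2",
     "link": "https://example.org/locus/2",
     "updated": datetime(2020, 1, 2, 9, 30, tzinfo=timezone.utc)},
]

if __name__ == "__main__":
    print(build_rss("Example locus updates", "https://example.org/",
                    "Recently updated loci", UPDATED_LOCI))
```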
My Experience
This summer I worked on implementing an RSS feed on the Sol Genomics website, a plant genome database, using the Perl programming language. Today, many websites, news services and blogs use RSS feeds as a competitive means of syndicating or distributing their content over the web. Similarly, professors in academia, scientists in research institutions and even researchers in industry and development want to stay up to date with the growing knowledge of plant genomes, and an RSS feed would be a great tool for this. Thus, an RSS feed on the Sol Genomics database not only makes it easier for scientists to share information and collaborate but is also a highly marketable feature for generating revenue via RSS traffic advertising services such as Pheedo.
I am finishing my rotation at BTI with the bioinformatics interns. This experience has been helpful in getting me engaged in the field of bioinformatics. This summer I learned how to program in Perl. Specifically, I have learned how to write a simple program, run it, write tests to ensure it's working properly, and, if necessary, debug it. I have also gotten more comfortable working on the Linux command line. Additionally, I learned quite a bit about the websites and tools one should know as a bioinformatician, for example BLAST, SQL, GitHub, RNA sequencing, and dot plots, among others. Aside from acquiring these technical skills, this summer I developed a better sense of what I would like to study in graduate school. I like how the weekly seminars have exposed me to different kinds of research and to what is required to earn a PhD. I also liked working with a mentor. The cxgn work channel reminded me that there was always help available. I have enjoyed my work at BTI. I recommend that incoming interns or undergraduate scientists come with an open mind and be prepared to work.
Intern Info
Carolyn Ochoa
Carolyn is a recent graduate of Ramapo College in New Jersey. She worked in the Mueller lab on identifying transcription factors that will bind to the promoter regions of tomato unigenes involved in the metabolic pathways from SolCyc.
Intern Info
Mallory Freeberg
Mallory is currently a senior at Saint Vincent College in Latrobe, PA. In the Mueller lab, Mallory’s summer internship focused on identifying common motifs/domains within the untranslated regions of unigenes for members of the Solanaceae family using bioinformatics tools.