
Predicting Double-Strand Break Fate in Maize using Unsupervised Learning
Meiotic recombination is integral to eukaryotic reproduction and genetic diversity. The process begins with the formation of programmed double-strand breaks (DSBs) across chromosomes, which are repaired as either crossovers (COs) or non-crossovers (NCOs). In maize, only ~20 of the 200-500 initial DSBs become COs. The main goal of our experiment was to computationally predict DSB fate within a database of maize DSBs using associated feature data. We also used a database of COs as a positive control and a database of random genomic intervals as a negative control. Because it is not known exactly which DSBs result in COs, there are no labels to train on, so we used unsupervised learning to find hidden patterns and cluster data points with similar features. After normalizing the feature data and mapping it onto DSB intervals, we ran a clustering algorithm on the full dataset, which produced one large cluster and several small ones. Interestingly, one of the smaller clusters, cluster 3, had a distinct feature profile, with features often associated with COs. To corroborate this finding, we ran the same clustering algorithm on the DSBs and COs together and found that 97% of the COs fell into cluster 3. There is currently no accurate way to determine which DSBs result in COs. However, because we now know which cluster each DSB belongs to, a relationship between certain DSBs and COs could lay the groundwork for future research.
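The workflow described above (normalize features, cluster the intervals, then check where the positive-control COs land) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the analysis we actually ran: the data are synthetic, the four feature columns are hypothetical, and a small NumPy k-means stands in for the clustering algorithm, which the abstract does not name.

```python
import numpy as np

def zscore(X):
    """Normalize each feature column to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # avoid division by zero for constant features
    return (X - mu) / sd

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means (a stand-in; the real clustering method may differ)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def co_fraction_per_cluster(labels, is_co, k):
    """Fraction of CO intervals assigned to each cluster."""
    co_labels = labels[is_co]
    return np.bincount(co_labels, minlength=k) / max(len(co_labels), 1)

# Synthetic stand-in data: rows are intervals, columns are 4 hypothetical features.
rng = np.random.default_rng(42)
dsb_features = rng.normal(0.0, 1.0, size=(300, 4))  # toy "DSB" intervals
co_features = rng.normal(3.0, 0.5, size=(20, 4))    # toy "CO" positive controls

X = np.vstack([dsb_features, co_features])
is_co = np.r_[np.zeros(len(dsb_features), bool), np.ones(len(co_features), bool)]

k = 2  # assumed number of clusters for this toy example
Xn = zscore(X)
labels, centers = kmeans(Xn, k)
frac = co_fraction_per_cluster(labels, is_co, k)
print("Fraction of COs per cluster:", frac)
```

In this sketch, a cluster that captures most of the COs plays the role that cluster 3 plays in our results: the DSBs sharing that cluster are the candidates whose features most resemble known COs.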
I’m very grateful to have had the opportunity to work with and learn from the Pawlowski lab through the BTI and Cornell Plant Genome Research REU this past summer. Taking part in this program taught me a lot about computational biology as a field of research, and I feel very fortunate to have had this experience during my first summer of college.
I am incredibly thankful for the collaborative support of the entire lab. Under the guidance of my mentors, Quinn Johnson and Dr. Wojtek Pawlowski, I gained hands-on experience in designing and conducting research. Over the course of the project, I became acquainted with many computational approaches that were new to me, including Random Forest machine learning, principal component analysis (PCA), and UMAP unsupervised learning. I also greatly strengthened my problem-solving and troubleshooting skills and received a lot of valuable feedback at weekly lab meetings. In the years ahead, I am eager to continue my journey as a computational biologist and to apply what I learned here, both as an undergraduate student and beyond.