bioinformatics – Dr. Brian R. King

Chuqiao Ren, ’15

Project: A novel ensemble classifier for protein contact map prediction
Duration: Summer 2013 – Spring 2015
Funding: Bucknell University Program for Undergraduate Research, BRK Startup Fund, Geisinger BGRI Grant, CS Dept. Fund

ABSTRACT

One of the greatest challenges in bioinformatics is predicting the 3-D structure of a protein by understanding the relationship between its amino acid sequence and its structure. A protein contact map is a useful way of representing protein 3-D conformations. It is based on a distance matrix, a symmetric matrix that contains the Euclidean distance between the C-alpha atoms of each pair of residues in the folded protein.
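As a rough illustration of the relationship between a distance matrix and a contact map, the sketch below computes pairwise Euclidean distances between C-alpha coordinates and marks a pair of residues as in contact when the distance falls under a cutoff. The function names, the toy coordinates, and the 8 Å cutoff are assumptions made for this example; they are not taken from the project itself.

```python
import math

def distance_matrix(ca_coords):
    """Symmetric matrix of Euclidean distances between C-alpha atoms."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
            for i in range(n)]

def contact_map(dist, threshold=8.0):
    """Binary matrix: 1 where two residues lie within `threshold` angstroms.

    The 8.0 default is an assumed cutoff chosen for illustration only.
    """
    return [[1 if d <= threshold else 0 for d in row] for row in dist]

# Toy coordinates (hypothetical): four residues spaced 3.8 angstroms apart
# along a straight line.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
dist = distance_matrix(coords)
cmap = contact_map(dist)

for row in cmap:
    print(row)
```

A contact map predictor works in the opposite direction: it tries to estimate the entries of `cmap` from the sequence alone, without access to the coordinates.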

Our goal is to improve existing machine learning algorithms for predicting a protein contact map from a protein sequence, and to develop a novel algorithm that outperforms existing contact map predictors.

ACHIEVEMENTS

Honors Thesis – Successfully defended, April 2015
Short paper and poster – ACM BCB ’14 – ACM International Conference on Bioinformatics, Computational Biology and Biomedicine, Sept 20-23, Newport Beach, CA [link]
Poster Presentation – Fourth Annual Susquehanna Valley Undergraduate Research Symposium, SVURS 2014, August 5, Geisinger Research, Danville, PA
Poster – Kalman Research Symposium 2014, March 29, Bucknell University, Lewisburg, PA.

POST GRADUATION UPDATES

Chuqiao successfully defended her honors thesis in April 2015. She is staying a bit longer this summer to help finish and submit a journal publication before she departs. She is currently planning to pursue a graduate degree in computer science at Columbia University, starting Fall 2015. Congratulations, Chuqiao!

Charles Cole ’14

Project: Using Machine Learning to Predict the Health of HIV-Infected Patients
Duration: Summer 2012 – Spring 2014
Funding: Bucknell University PUR, Biology Dept. and CS Dept. Funding

ABSTRACT

HIV is one of the most devastating viruses to hit mankind in modern history. About half of people infected will acquire AIDS. For some, however, the virus will lie in a stage known as “clinical latency” for 10, perhaps up to 20 years; in this stage, the symptoms are mild, sometimes even nonexistent. This study aims to investigate the potential existence of a relationship between specific patterns in the genome of HIV and the prognosis of the infected patient. Discovery of such patterns could improve researchers' understanding of the genetics of HIV and identify patterns that doctors could look for to predict the prognosis of infected patients more accurately. Moreover, the identification of specific mutations or recurring patterns that are highly deleterious to the infected patient could aid in the development of drugs that target the genes containing those mutations.

ACHIEVEMENTS

Honors thesis defense passed – April 25, 2014
Short paper and poster: ACM BCB ’13 – ACM International Conference on Bioinformatics, Computational Biology and Biomedicine, Sept 22-25, Washington DC
Oral presentation: Third Annual Susquehanna Valley Undergraduate Research Symposium, SVURS 2013, August 6, Geisinger Research, Danville, PA
- Winner for oral presentation – One of three chosen out of 67 submissions!
Poster: Kalman Research Symposium 2013, April 13, Bucknell University, Lewisburg, PA.

POST GRADUATION UPDATES

Charles was accepted into a pre-med program at Temple University, and will be starting medical school immediately thereafter.

Brigitte Hofmeister ’14

Project: Modeling the Evolution of Influenza
Duration: Summer 2013 – Fall 2013
Funding: Bucknell University Program for Undergraduate Research, CS Dept. Funding

OVERVIEW

The primary aim of this project was to develop a model of the evolution of the Influenza virus. We were interested in learning whether any model could predict future variants better than random. This work was far more complex than we initially envisioned. However, some really interesting

The first project (completed Summer 2013) was to develop an alignment-free model that can assess the similarity between protein sequences. The grand objective, however, was to induce a model of evolution among one or two of the gene products of Influenza. To do this, we started with an n-gram model of the protein and computed a distance between sequences by considering not only n-grams that are identical, but also those that have high biological similarity. To this end, we incorporated a standard substitution matrix (e.g. BLOSUM62) in the distance calculation between n-grams that do not match exactly. This work became our first project, and ultimately the outcome with the most utility: Using n-gram protein models with substitution matrices for phylogenetic analysis.
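The idea of scoring n-grams that are similar but not identical can be sketched as follows. This is only a minimal illustration, not the distance actually used in the project: the scoring table is a small set of placeholder values for a four-letter alphabet standing in for a full substitution matrix such as BLOSUM62, and the normalization, best-match averaging, and all function names are assumptions made for this example.

```python
from collections import Counter

# Placeholder substitution scores for a toy 4-letter alphabet (A, I, L, V).
# These values are illustrative only; a real analysis would load a full
# substitution matrix such as BLOSUM62. Keys are stored in sorted order.
TOY_SCORES = {
    ("A", "A"): 4, ("I", "I"): 4, ("L", "L"): 4, ("V", "V"): 4,
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
    ("A", "I"): -1, ("A", "L"): -1, ("A", "V"): 0,
}

def score(a, b):
    """Look up the substitution score for a pair of residues."""
    return TOY_SCORES[(a, b) if a <= b else (b, a)]

def ngrams(seq, n):
    """Count every overlapping n-gram in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def ngram_similarity(g, h):
    """Similarity in [0, 1]: 1.0 for identical n-grams, lower otherwise."""
    raw = sum(score(a, b) for a, b in zip(g, h))
    best = max(sum(score(a, a) for a in g), sum(score(b, b) for b in h))
    return max(0.0, raw / best)

def directed_similarity(grams_a, grams_b):
    """Average, over n-grams of A, of the best match found in B."""
    total = sum(grams_a.values())
    return sum(count * max(ngram_similarity(g, h) for h in grams_b)
               for g, count in grams_a.items()) / total

def distance(seq_a, seq_b, n=3):
    """Assumed symmetric distance: 1 minus the mean directed similarity."""
    ga, gb = ngrams(seq_a, n), ngrams(seq_b, n)
    return 1.0 - (directed_similarity(ga, gb) + directed_similarity(gb, ga)) / 2

print(distance("ALIVAL", "ALIVAL"))  # identical sequences
print(distance("AIL", "AVL"))        # one conservative substitution
print(distance("AAA", "LLL"))        # dissimilar sequences
```

With exact matching only, "AIL" and "AVL" would share no 3-grams and look completely unrelated; allowing substitution scores lets the I/V difference count as a near match. A matrix of such pairwise distances could then be handed to a standard tree-building method.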

ABSTRACT

Phylogenetic analyses, specifically phylogenetic tree constructions, are important for understanding evolution and species relatedness. Most methods require a multiple sequence alignment (MSA) to be performed prior to inducing the phylogenetic tree. MSAs, however, are computationally expensive and increasingly error prone as the number of sequences increases, as the average sequence length increases, and as the sequences in the set become more divergent. We introduce a new method called ngPhylo, an n-gram based method that addresses many of the limitations of MSA-based phylogenetic methods and computes alignment-free phylogenetic analyses on large sets of proteins that also have long sequences. Unlike other methods, we incorporate the use of standard substitution matrices to improve similarity measures between sequences. Our results show that ngPhylo produces phylogenies highly similar to those of existing MSA-based methods while requiring fewer computational resources.

ACHIEVEMENTS

Short paper and poster: ACM BCB ’13 – ACM International Conference on Bioinformatics, Computational Biology and Biomedicine, Sept 22-25, Washington DC [link] [PDF]
Poster: Kalman Research Symposium 2013, April 13, Bucknell University, Lewisburg, PA.

POST GRADUATION UPDATES

Brigitte is pursuing a doctorate in Bioinformatics at the University of Georgia, starting Fall 2014.

Matthew Segar, ’12

Project: A probabilistic method for assembly of next generation sequencing instrumentation
Duration: Summer 2011 – Spring 2012
Funding: Bucknell PUR, Provost’s Office, CS Dept. Funds

ABSTRACT

With the advent of cheaper and faster DNA sequencing technologies, assembly methods have greatly changed. Instead of outputting reads that are thousands of base pairs long, new sequencers parallelize the task by producing read lengths between 35 and 400 base pairs. Reconstructing an organism’s genome from these millions of reads is a computationally expensive task. Our algorithm solves this problem by organizing and indexing the reads using n-grams, which are short, fixed-length DNA sequences of length n. These n-grams are used to efficiently locate putative read joins, thereby eliminating the need to perform an exhaustive search over all possible read pairs. Our goal is to develop a novel n-gram method for the assembly of genomes from next-generation sequencers. Specifically, a probabilistic, iterative approach will be utilized to determine the most likely reads to join through development of a new metric that models the probability of any two arbitrary reads being joined together. Tests were run using simulated short read data based on randomly created genomes ranging in length from 10,000 to 100,000 nucleotides with 16 to 20x coverage. We have been able to successfully re-assemble entire genomes up to 100,000 nucleotides in length.
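The n-gram indexing step described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the thesis algorithm: it indexes only the leading n-gram of each read, accepts only exact overlaps, and omits the probabilistic join metric entirely. The function names and the toy reads are hypothetical.

```python
from collections import defaultdict

def build_prefix_index(reads, n):
    """Map each read's leading n-gram to the ids of reads that start with it."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        if len(read) >= n:
            index[read[:n]].append(read_id)
    return index

def candidate_joins(reads, n):
    """Find (a, b, overlap) where a suffix of read a equals a prefix of read b.

    Only reads that share an n-gram are compared, so no exhaustive search
    over all read pairs is needed.
    """
    index = build_prefix_index(reads, n)
    joins = []
    for a, read_a in enumerate(reads):
        # Slide over every n-gram of read a, except the one at position 0.
        for i in range(1, len(read_a) - n + 1):
            for b in index.get(read_a[i:i + n], ()):
                if b == a:
                    continue
                overlap = len(read_a) - i
                # Verify that the whole suffix matches, not just the n-gram.
                if read_a[i:] == reads[b][:overlap]:
                    joins.append((a, b, overlap))
    return joins

def merge(read_a, read_b, overlap):
    """Join two reads, keeping the overlapping region once."""
    return read_a + read_b[overlap:]

# Toy reads (hypothetical) drawn from the sequence ACGTTGCAAGGCTTAC.
reads = ["ACGTTGCA", "TGCAAGGC", "AGGCTTAC"]
joins = candidate_joins(reads, n=4)
print(joins)

contig = merge(merge(reads[0], reads[1], 4), reads[2], 4)
print(contig)
```

In a fuller implementation, each candidate join would be scored rather than accepted outright, which is where a probability-based metric like the one described in the abstract would come in.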

ACHIEVEMENTS

Honors Thesis: A probabilistic method for assembly of next generation sequencing instrumentation
- Matt was awarded the Harold W. Miller prize, a competitive university-wide award given at graduation to one or two students who complete a highly successful honors thesis. CONGRATULATIONS, MATT! The award was well-deserved.
Poster (International Conference) – Presented at 20th Annual International Conference on Intelligent Systems for Molecular Biology, ISMB 2012, July 15-17, Long Beach, CA
Poster – Susquehanna Valley Undergraduate Research Symposium, August 9, 2011 Geisinger Research, Danville, PA
Poster – Sigma Xi Summer Research Symposium, July 27, 2011, Bucknell University, Lewisburg, PA

POST GRADUATION UPDATES

Matt completed a master's in bioinformatics at Indiana University – Purdue University Indianapolis in Spring 2014. He has now been accepted into the School of Medicine at Indiana University.

Alex Barteau, ’13

Project: Development of protein sequence analysis software
Duration: Summer 2011 – Spring 2012
Funding: University of Nebraska Medical Center
Collaboration: Dr. Chittibabu Guda

ABSTRACT

Proteins are the essence of every living organism. Every protein has a well-defined function, and must localize in the cell in order to carry out its function. This information about the protein is encoded in the protein sequence itself. Alex will be working on a project started by Professor King that analyzes protein sequences to look for recurring patterns that are related to protein localization and their function. These observed patterns can then be used to suggest information about new protein sequences. The aim of this project is to make the software publicly available for the biological and biomedical research community.

ACHIEVEMENTS

Poster – Sigma Xi Summer Research Symposium, July 27, 2011, Bucknell University, Lewisburg, PA
King BR, Vural S, Pandey S, Barteau A, Guda C. ngLOC: software and web server for predicting protein subcellular localization in prokaryotes and eukaryotes. BMC Research Notes; 2012; 5(351) [link] [PDF]