Brigitte Hofmeister ’14

Project: Modeling the Evolution of Influenza
Duration: Summer 2013 – Fall 2013
Funding: Bucknell University Program for Undergraduate Research, CS Dept. Funding


The primary aim of this project was to develop a model of the evolution of the Influenza virus. We wanted to learn whether any model could predict future variants better than random chance. The work proved far more complex than we initially envisioned. However, some really interesting

The first project (completed Summer 2013) was to develop an alignment-free model that can assess the similarity between protein sequences. The grand objective, however, was to induce a model of evolution among one or two of the gene products of Influenza. To do this, we started with an n-gram model of the protein and computed a distance between sequences by considering not only n-grams that are identical, but also those that have high biological similarity. To this end, we incorporated a standard substitution matrix (e.g., BLOSUM62) into the distance calculation between n-grams that do not match exactly. This work became our first project, and ultimately the primary outcome with the most utility: Using n-gram protein models with substitution matrices for phylogenetic analysis.
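The idea of scoring non-identical n-grams with substitution values can be sketched as follows. This is a minimal illustration, not the formula used in the project: the residue scores are toy placeholders rather than real BLOSUM62 entries, and the function names, the best-match averaging, and the normalization are all assumptions made for the example.

```python
from collections import Counter

# Toy substitution scores for a few residue pairs. These are illustrative
# placeholders, NOT values copied from BLOSUM62.
TOY_SCORES = {("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
              ("K", "R"): 2, ("D", "E"): 2}
MATCH_SCORE = 4      # assumed score for identical residues
MISMATCH_SCORE = -1  # assumed score for any pair not listed above


def residue_score(a, b):
    """Look up the (symmetric) toy score for a pair of residues."""
    if a == b:
        return MATCH_SCORE
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), MISMATCH_SCORE))


def ngram_counts(seq, n):
    """Count every overlapping n-gram in a protein sequence."""
    if len(seq) < n:
        raise ValueError("sequence is shorter than n")
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def ngram_similarity(g, h):
    """Position-wise score of two n-grams, clipped at 0 and scaled to [0, 1]."""
    raw = sum(residue_score(a, b) for a, b in zip(g, h))
    return max(raw, 0) / (MATCH_SCORE * len(g))


def directed_similarity(source, target, n):
    """Average, over n-grams of source, of the best match found in target."""
    src, tgt = ngram_counts(source, n), ngram_counts(target, n)
    total = sum(count * max(ngram_similarity(g, h) for h in tgt)
                for g, count in src.items())
    return total / sum(src.values())


def ngram_distance(s, t, n=3):
    """Symmetric distance in [0, 1]; 0 means every n-gram has an exact match."""
    return 1.0 - (directed_similarity(s, t, n) + directed_similarity(t, s, n)) / 2


reference = "MKVLIA"
conservative = "MRVLLA"   # K->R and I->L: biologically similar swaps
disruptive = "MWVLPA"     # K->W and I->P: unlisted pairs, scored as mismatches

print(ngram_distance(reference, conservative))
print(ngram_distance(reference, disruptive))
```

With exact 3-gram matching alone, both variants would look equally unrelated to the reference, since neither shares a single identical 3-gram with it. Scoring near-matches lets the conservative variant come out closer than the disruptive one.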


Phylogenetic analyses, specifically phylogenetic tree constructions, are important for understanding evolution and species relatedness. Most methods require a multiple sequence alignment (MSA) to be performed prior to inducing the phylogenetic tree. MSAs, however, are computationally expensive and increasingly error prone as the number of sequences increases, as the average sequence length increases, and as the sequences in the set become more divergent. We introduce a new method called ngPhylo, an n-gram based method that addresses many of the limitations of MSA-based phylogenetic methods and computes alignment-free phylogenetic analyses on large sets of proteins that also have long sequences. Unlike other methods, we incorporate standard substitution matrices to improve similarity measures between sequences. Our results show that ngPhylo produces phylogenies highly similar to those of existing MSA-based methods while requiring fewer computational resources.
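To illustrate how a tree can be induced from pairwise distances without any alignment, here is a small sketch. The text above does not say which tree-building algorithm ngPhylo uses, so UPGMA (average-linkage clustering) is swapped in purely as a familiar stand-in. The function name, the input layout, and the distance values are hypothetical, and the output is topology only, with no branch lengths.

```python
def upgma_newick(names, dist):
    """Build a tree topology by UPGMA and return it as a Newick string.

    names: list of sequence labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    clusters = {name: 1 for name in names}   # cluster label -> number of leaves
    d = dict(dist)                           # working copy of the distances
    while len(clusters) > 1:
        # Pick the closest pair of clusters and merge them.
        pair = min(d, key=d.get)
        a, b = sorted(pair)
        merged = f"({a},{b})"
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        del d[pair]
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        for c in clusters:
            d_ac = d.pop(frozenset((a, c)))
            d_bc = d.pop(frozenset((b, c)))
            d[frozenset((merged, c))] = (
                (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
            )
        clusters[merged] = size_a + size_b
    return next(iter(clusters)) + ";"


# Hypothetical distances among four sequences: A and B are close,
# C and D are close, and the two groups are far apart.
labels = ["A", "B", "C", "D"]
distances = {
    frozenset(("A", "B")): 0.1,
    frozenset(("C", "D")): 0.2,
    frozenset(("A", "C")): 0.8,
    frozenset(("A", "D")): 0.8,
    frozenset(("B", "C")): 0.8,
    frozenset(("B", "D")): 0.8,
}
print(upgma_newick(labels, distances))  # prints ((A,B),(C,D));
```

Any alignment-free distance between sequences could fill the `distances` table, which is what lets the expensive MSA step be skipped.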


  • Short paper and poster: ACM BCB ’13 – ACM International Conference on Bioinformatics, Computational Biology and Biomedicine, Sept 22-25, Washington DC [link] [PDF]
  • Poster: Kalman Research Symposium 2013, April 13, Bucknell University, Lewisburg, PA.


Brigitte is pursuing a doctorate in Bioinformatics at the University of Georgia, starting Fall 2014.