Fast Mapping with Lucene is a short-read mapping project that generates "handles" to look up short reads quickly. Lucene is the high-performance search engine library that powers the implementation: http://lucene.apache.org/
I (Patrick Walton) am a first-year graduate student in the Computer Science Department.
Implement and present the algorithm.
The Fast Mapping with Lucene presentation is on Google Docs and can be viewed online:
The source code is attached.
Gathered information about the theory behind short read sequencing.
Began researching short read sequencing. Designed the algorithm.
Implemented and verified the algorithm against the Human Genome Project's reference chromosome 2. Began work on the presentation.
Finalized the presentation and presented the short read mapper.
After implementing and presenting the algorithm, I noticed interesting similarities between the algorithm and the n-gram method for spell-checking: http://sujitpal.blogspot.com/2007/12/spelling-checker-with-lucene.html
Beyond the obvious fact that both are implemented with Lucene, the two share a more interesting theoretical similarity: both algorithms run a Boolean OR query over handles (n-grams, in the spell-checker's case) stored in the database to look up similar strings quickly.
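The shared idea can be sketched in a few lines. This is only an illustration in plain Python, not the project's code and not the Lucene API: the names `ngrams`, `NGramIndex`, and `search` are hypothetical, the n-gram length of 3 is an arbitrary choice, and ranking hits by the number of shared n-grams is an assumption standing in for Lucene's own scoring. Each stored string is indexed under its n-grams, and a lookup takes the union (OR) of the posting lists for the query's n-grams.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of overlapping length-n substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class NGramIndex:
    """Toy inverted index (hypothetical): n-gram -> ids of stored strings."""

    def __init__(self, n=3):
        self.n = n
        self.entries = []                  # stored strings, by id
        self.postings = defaultdict(set)   # n-gram -> set of entry ids

    def add(self, text):
        entry_id = len(self.entries)
        self.entries.append(text)
        for gram in ngrams(text, self.n):
            self.postings[gram].add(entry_id)
        return entry_id

    def search(self, query):
        """OR the query's n-grams together; rank hits by shared n-gram count."""
        scores = defaultdict(int)
        for gram in ngrams(query, self.n):
            # An entry matches if it contains ANY of the query's n-grams.
            for entry_id in self.postings.get(gram, ()):
                scores[entry_id] += 1
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [(self.entries[i], score) for i, score in ranked]


index = NGramIndex(n=3)
for word in ["mapping", "napping", "lucene", "genome"]:
    index.add(word)

# A misspelled query still shares most of its n-grams with the right entry.
results = index.search("maping")
print(results)
```

The same structure would apply to short reads if the handles generated from each read took the place of the n-grams; how those handles are actually chosen is not shown here.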