Description
The Fast Mapping with Lucene project is a short-read mapping project that uses generation of "handles" for fast lookups of short reads. Lucene refers to a high-performance search engine that powers the implementation of the project: http://lucene.apache.org/
About Me
I (Patrick Walton) am a first-year graduate student in the Computer Science Department.
Goal
Implement and present the algorithm.
Results
The Fast Mapping with Lucene presentation is on Google Docs and can be viewed online:
http://docs.google.com/Presentation?id=dcg98v9t_1ct3h4dgx
The source code is attached.
Progress Log
Week 7
Gathered information about the theory behind short read sequencing.
Week 8
Began researching short read sequencing. Designed the algorithm.
Week 9
Implemented and verified the algorithm against the Human Genome Project's reference chromosome 2. Began work on the presentation.
Week 10
Finalized the presentation and presented the short read mapper.
Further notes
After implementing and presenting the algorithm, I noticed interesting similarities between the algorithm and the n-gram method for spell-checking: http://sujitpal.blogspot.com/2007/12/spelling-checker-with-lucene.html
Besides the obvious fact that both are implemented using Lucene, there is a more interesting theoretical similarity, in that both algorithms use the notion of Boolean queries (OR) of handles (n-grams) stored in the database to do fast lookups of similar strings.