Fast Mapping with Lucene

Description

The Fast Mapping with Lucene project is a short-read mapping project that uses generation of "handles" for fast lookups of short reads. Lucene refers to a high-performance search engine that powers the implementation of the project: http://lucene.apache.org/

About Me

I (Patrick Walton) am a first-year graduate student in the Computer Science Department.

Goal

Implement and present the algorithm.

Results

The Fast Mapping with Lucene presentation is on Google Docs and can be viewed online:
http://docs.google.com/Presentation?id=dcg98v9t_1ct3h4dgx

The source code is attached.

Progress Log

Week 7

Gathered information about the theory behind short read sequencing.

Week 8

Began researching short read sequencing. Designed the algorithm.

Week 9

Implemented and verified the algorithm against the Human Genome Project's reference chromosome 2. Began work on the presentation.

Week 10

Finalized the presentation and presented the short read mapper.

Further notes

After implementing and presenting the algorithm, I noticed interesting similarities between the algorithm and the n-gram method for spell-checking: http://sujitpal.blogspot.com/2007/12/spelling-checker-with-lucene.html

Besides the obvious fact that both are implemented using Lucene, there is a more interesting theoretical similarity, in that both algorithms use the notion of Boolean queries (OR) of handles (n-grams) stored in the database to do fast lookups of similar strings.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License