Project Member:
Lorena Topete
The goal of this quarter:
I will do the easy using 2 techniques, then compare which one was faster.
Project Description:
Short reads need to be mapped to the genome for re-sequencing.
The problem:
- Given a string of length L=30, find where it matches a substring within D=2 mismatches in a length N=3,000,000,000 sequence.
Evaluate the quality based on:
- The speed of the mapping algorithm.
- The memory use of the mapping algorithm.
- The accuracy of the mapping algorithm (for approximate approaches).
- Easy: Build a small scale mapper that can map strings of length 30 to sequences of length 1,000,000.
- Medium: Build a mapper that can scale to sequences of length 3,000,000,000. It can be slow.
- Very Hard: Build a fast mapper.
Related Papers:
Reinert G., Waterman MS, On the length of the longest exact position match in a random sequence, IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM, 2007 Jan-Mar; 4(1): 153-6.
also, here are some links:
Genome Assembly with Short Reads
Genome Analyzer Applications: Overview
Illumina Sequencing Technology
Human genome map for sale on eBay
The NGFN launches one of the World’s Largest Studies to Discover Genetic Causes of 25 Diseases
See great article references in the last slide of the power point presentation.
Solexa:
The Progress:
Week 4: Built the wikipage and stared research on topic. grade: A
Week 5: did some more research, looked in the hapmap website to download data. grade: A
Week 6: started coding with this ideas in mind:
- Will generate random sequences
- Will implement 2 approaches so it can give the user 2 choices of algorithms:
- The greedy algorithm
- The hash table approach
- Will show the time it took with each approach
grade: A- because I didn't do that much coding, but I structured my ideas
Week 7: coded in c++ using simple algorithm
- generated random 1million-base sequence
- mapped a single random sequence
grade: A
Week 8: kept coding
- Decided not to do hash tables anymore because of memory
- decided to compare with xor approach
- started saving the links to some journal articles to include something interesting in the presentation
grade: A
Week 9: didn't do as much because I was away to San Francisco for a conference.
grade: B because I am still on track, but I still need the slides, which I have not done
Week 10:
- Did slides
- generated automatic tester that would run X number of mappings (short reads) Y number of times
- Generated statistics for the difference in performance
grade: A
Final Presentation file uploaded. Click "files" on the bottom right corner to see it!
Source Code uploaded