The goal of this quarter:
I will do the easy using 2 techniques, then compare which one was faster.
Short reads need to be mapped to the genome for re-sequencing.
- Given a string of length L=30, find where it matches a substring within D=2 mismatches in a length N=3,000,000,000 sequence.
Evaluate the quality based on:
- The speed of the mapping algorithm.
- The memory use of the mapping algorithm.
- The accuracy of the mapping algorithm (for approximate approaches).
- Easy: Build a small scale mapper that can map strings of length 30 to sequences of length 1,000,000.
- Medium: Build a mapper that can scale to sequences of length 3,000,000,000. It can be slow.
- Very Hard: Build a fast mapper.
Reinert G., Waterman MS, On the length of the longest exact position match in a random sequence, IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM, 2007 Jan-Mar; 4(1): 153-6.
also, here are some links:
See great article references in the last slide of the power point presentation.
Week 4: Built the wikipage and stared research on topic. grade: A
Week 5: did some more research, looked in the hapmap website to download data. grade: A
Week 6: started coding with this ideas in mind:
- Will generate random sequences
- Will implement 2 approaches so it can give the user 2 choices of algorithms:
- The greedy algorithm
- The hash table approach
- Will show the time it took with each approach
grade: A- because I didn't do that much coding, but I structured my ideas
Week 7: coded in c++ using simple algorithm
- generated random 1million-base sequence
- mapped a single random sequence
Week 8: kept coding
- Decided not to do hash tables anymore because of memory
- decided to compare with xor approach
- started saving the links to some journal articles to include something interesting in the presentation
Week 9: didn't do as much because I was away to San Francisco for a conference.
grade: B because I am still on track, but I still need the slides, which I have not done
- Did slides
- generated automatic tester that would run X number of mappings (short reads) Y number of times
- Generated statistics for the difference in performance
Final Presentation file uploaded. Click "files" on the bottom right corner to see it!
Source Code uploaded