Lorena Topete: Mapping of Reads

Project Member:

Lorena Topete

The goal of this quarter:

I will do the easy using 2 techniques, then compare which one was faster.

Project Description:

Short reads need to be mapped to the genome for re-sequencing.

The problem:

  • Given a string of length L=30, find where it matches a substring within D=2 mismatches in a length N=3,000,000,000 sequence.

Evaluate the quality based on:

  • The speed of the mapping algorithm.
  • The memory use of the mapping algorithm.
  • The accuracy of the mapping algorithm (for approximate approaches).
  1. Easy: Build a small scale mapper that can map strings of length 30 to sequences of length 1,000,000.
  2. Medium: Build a mapper that can scale to sequences of length 3,000,000,000. It can be slow.
  3. Very Hard: Build a fast mapper.

Related Papers:

Reinert G., Waterman MS, On the length of the longest exact position match in a random sequence, IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM, 2007 Jan-Mar; 4(1): 153-6.

also, here are some links:


Genome Assembly with Short Reads

Genome Analyzer Applications: Overview

Illumina Sequencing Technology

Human genome map for sale on eBay

The NGFN launches one of the World’s Largest Studies to Discover Genetic Causes of 25 Diseases

See great article references in the last slide of the power point presentation.



The Progress:

Week 4: Built the wikipage and stared research on topic. grade: A

Week 5: did some more research, looked in the hapmap website to download data. grade: A

Week 6: started coding with this ideas in mind:

  • Will generate random sequences
  • Will implement 2 approaches so it can give the user 2 choices of algorithms:
    • The greedy algorithm
    • The hash table approach
  • Will show the time it took with each approach

grade: A- because I didn't do that much coding, but I structured my ideas

Week 7: coded in c++ using simple algorithm

  • generated random 1million-base sequence
  • mapped a single random sequence

grade: A

Week 8: kept coding

  • Decided not to do hash tables anymore because of memory
  • decided to compare with xor approach
  • started saving the links to some journal articles to include something interesting in the presentation

grade: A

Week 9: didn't do as much because I was away to San Francisco for a conference.
grade: B because I am still on track, but I still need the slides, which I have not done

Week 10:

  • Did slides
  • generated automatic tester that would run X number of mappings (short reads) Y number of times
  • Generated statistics for the difference in performance

grade: A


Final Presentation file uploaded. Click "files" on the bottom right corner to see it!

Source Code uploaded

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License