mapping of reads - Thomas Mallery

Mapping of Reads - Easy


Outline

  1. Obtain / Generate Input
  2. Design and Build Program
  3. Generate Test Cases / Test

Weekly Progress Reports

  • Week:
    1. initial layout / design
    2. Continouing design
    3. Decide to minimize memory with file pointers
    4. Optimize computational time with threads
    5. ….
    6. Last minute change / stop minimizing memory for thread safety

Input

Background

Data input for this program is encoded, using 3 bits, as follows:

  • 0x0 - EOF
  • 0x1 - a
  • 0x2 - t
  • 0x3 - g
  • 0x4 - c
  • 0x5 - missing

A single byte is loaded with two nucleotides, that fit into the upper and lower half bytes.
The 4th bit in each half byte is always 0.

To aid in reading left to right, the upper four bytes the first piece and the lower four the second.

For testing purposes using actual DNA data strings are not necessary.
So random sequences for both the DNA string and the string to search for were generated.

Generate Input

Generation is simple, the DNA sequence has a length of 1,000,000, and is filled with random values of a,t,g,c.

Encode char into my dna coded element

This module was added just as a utility function to take a DNA sequence that is has each nucleotide stored as an individual char and encode it using the given structure.

Program Design

Speed up calculation

  • Using Threads to divide up the work
    • Current Implementations
      • will load the entire segment that each thread is searching into memory, on systems with limited memory compared to the data set this approach will have to be modified.
    • My Implementation
      1. Use pthreads
        1. Each thread is given a segment of the DNA, which is stored in their own memory, to cover and each thread is given a pointer to the sub DNA segment to search for

Testing

  1. Generated Random Input
  2. Test Cases
    1. test that N, works as a wild card
    2. test that both strings work up to and stop at the end
    3. test that the DNA string gets searched through repeatedly starting at the first char
    4. test that if an error is found, the algorithm correctly checks the other side of the DNA sequence before moving down the DNA chain

Results

Running time, in seconds for number of Threads:

  1. 0.016
  2. 0.009
  3. 0.011
  4. 0.009

On a dual core machine, the most efficient method will be two threads as each one can run with minimal interruption. The exact times are relatively accurate, but the quickness depends on the distribution of the dna string relative the search string. Also the number of allowed mismatches will have a great impact.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License