Short Read Mapper (Brian Lam)

Project Description

Sequencing the genome is a relatively expensive and time-consuming task. The cost of reading the genome is reduced by reading many short segments of it, rather than the entire string. However, this creates complications because we do not know the original ordering of the short reads in the actual genome. The purpose of this project is to create a program which takes short reads and finds their index in the original genome, based on a predefined set of assumptions.

About Me

I am a 4th year computer science undergraduate student. I plan to graduate this quarter and go on to work in industry.

Project Goal

My goal is to at least make an efficient and easily extensible short read mapper that can map reads of length 30 to a genome of length 1,000,000. If I have time, I would like to improve the mapper's speed and memory efficiency to be able to map to a genome of length 3,000,000,000.

Project Files

Weekly Updates

April 30, 2009

Grade for the Week: B

Weekly Progress
I wasn't able to make any progress on developing the mapper this week because of midterms, but I had planned for this ahead of time.

Next Week's Plan
My plan for the coming week is to learn how to use the sequence simulator to create files with genomes of length 1,000,000, and also to create test cases with known indices. Also, I would like to implement a naive read mapping algorithm which simply stores the entire genome in memory and compares the short read base-by-base at each position.

Problems
I did not have any free time to work on the project because of midterms.

Problems Solved
Created a plan to begin making progress on the project.

May 7, 2009

Grade for the Week: C

Weekly Progress
Although I did start planning out and writing my C++ code which will actually be the short read mapper, I didn't learn how to use or experiment with the short read simulator.

Next Week's Plan
To lay a groundwork to build the read mapper on, I plan on learning how to use the short read simulator during the next week. Specifically, I need to:

  1. Create a reference genome of length L
  2. Modify known nucleotide positions to create a target genome
  3. Create short reads from the target genome to use as test cases

I also want to update the C++ code so that it can use the output of the short read simulator as automatically generated test cases.

Problems
I had more homework and studying than I expected, so I didn't have much time to work on the project. Also, I haven't yet learned how to use the short read simulator.

Problems Solved
I began coding a couple basic C++ functions that will be the basis of my short read simulator. Right now, I am planning on using the most naive approach to mapping short reads, which is sliding the short read index by index until a match is found. To do this, I have created the interfaces and code for two functions:

int shortReadMapper(char* read, char* genome, int maxErrors, int len);
bool readMatch(char* read, char* genome, int maxErrors);

The shortReadMapper() function is called with a short read and the reference genome, along with the maximum number of permissible errors or SNPs. The readMatch() functions compares the read to the reference genome at the specific location and returns true if the read matches with <= maxErrors mismatches, or false otherwise. shortReadMapper() iterates through all positions in the reference genome, and repeatedly calls readMatch() until a matching location is found.

May 14, 2009

Grade for the Week: B

Weekly Progress
I downloaded Nick's short read simulator, and learned how to use it to create a reference genome, a sample genome, and short reads from the sample. This lays the foundation for everything that I will do in the future with my short read mapper. One of the major goals I want to achieve with the simulator is to create a set of test cases which I can use to ensure that any modifications I make to my mapper does not decrease its accuracy. This should be very easy with known test cases, because I can just run the sequencer again and make sure the output remains the same.

Next Week's Plan
Next week I plan on finishing the C++ code which takes the output files of the short read simulator and translates them into a format that I can use for my short read simulator.

Problems
The output from the short read simulator contains much more information than I need to test my short read mapper, and is also written in Python, which I have very little experience with. Although the sample code provided with the simulator gives me a good idea of how to use the basic functionality, modifying the output to match my goals will be a little confusing.

Problems Solved
I downloaded and learned how to use the short read mapper. Also, I began to code the C++ file processing code that will translate what the simulator outputs into usable data structures.

May 21, 2009

Grade for the Week: A

Weekly Progress
I finished coding file parsing functions and coded the trivial algorithm for short read mapping. This took a little bit of modifying the way my old functions worked, and what parameters they took. Although the trivial algorithm turned out to be inefficient, it does at least give me a starting point and a baseline to measure future work against.

The basic layout of the trivial algorithm is as follows. We have a reference genome, many short reads of length L, and we allow up to D mismatches per short read. These parameters are taken as inputs to the functions, so different cases can be tested very easily. The reference genome and the short reads are provided to the mapper as files which are in the format produced by Nick's short read simulator.

For each short read, slide across the reference genome until we reach a position with less than D mismatches, and then return this position. Using the GetTickCount() function in the <windows.h> library, I recorded the following performance:

Reference Genome Length Execution
1,000 3 sec
10,000 30 sec
100,000 ?

For the length 100K reference genome, I let the program run for a few minutes before manually stopping it.

Next Week's Plan
During the next week, I want to clean up my code so that future additions and modifications are much easier. Specifically, I would like to convert my functions to be part of a class, instead of global functions as they are right now. Also, I need to figure out a faster way of mapping the short reads, because the trivial algorithm is not fast enough to map to a reference of length 100K, let alone 1M.

Problems
Although I got the basic mapping functionality to work, my code is hard to modify because I've just been adding a little bit at a time, instead of following an overall design I preplanned. Also, the performance of the trivial algorithm is very bad for even a reference genome of length 100K, which means I have to either optimize it or try another solution.

Problems Solved
I wrote a file parsing function based on the format that the short read simulator outputs in, streamlining the process to create test cases for the short read mapper. Using this function, I can now create a reference genome of any length, and also a set of short reads along with their expected indices automatically. This made it very easy to test the accuracy and performance of the trivial algorithm.

May 28, 2009

Grade for the Week: B

Weekly Progress
I completely refactored my code so that all of the functions are members of a class and therefore interact with each other much more cleanly. The public interface of the class is as follows:

class ShortReadMapper
{
public:
    ShortReadMapper();
    void setReferenceGenome(char *referenceFilename, int refLen, int readLen);
    void setMaxErrors(int max);
    void processShortReads(char* filename, int readLen);
    void printSNPs();
};

Next Week's Plan
Next week, I plan on designing and implementing a second, more efficient short read mapping algorithm. This is needed because my current trivial algorithm is not fast enough to map the required 1,000,000 reads in less than an hour.

Problems
I had two more midterms this week, so I did not have enough time to start coding a faster algorithm.

Problems Solved
Converting the code to be in a class allowed the functions' logic and structure to be much cleaner and easier to modify. Before, when I used to have a list of global functions defined in my main.cpp file, I would usually have to change the parameters the function took to make a significant change or implement a new idea. However, now I can simply change the member variables that the class contains. Also, since all functions share the class variables, the need to continuously pass around state variables is eliminated.

June 4, 2009

Grade for the Week: A

Weekly Progress
I designed and completed the hash index algorithm based on the notes presented in class, and created my PowerPoint presentation and presented my results to the class. My hash index data structure stores the index of every substring of a certain length from the reference genome, and then this is used to find possible matching positions for the short reads.

The main idea is that if we're allowing D mismatches, and we divide each short read into D+1 pieces, then at least one piece will match exactly with the reference genome. So if every short read has length L, then we will say that the piece length is L/(D+1). Given this setup, a high-level layout of the hash index algorithm is as follows:

  1. For every substring of piece length in the reference genome, store its index in the hash index.
  2. Break each short read up into D+1 pieces, and for each piece look up possible matching indices in the hash index. This gives us candidate positions that may match with our short read.
  3. For each candidate position, use the trivial algorithm to check if the short read actually matches the position.

Next Week's Plan
Polish the Wiki, but other than that I'm done!

Problems
It was difficult to write the entire hash index algorithm and debug it within a week, but with careful planning it was possible.

Problems Solved
The hash index algorithm is much faster than the trivial algorithm, and can therefore be used to map to the required reference genome length of 1,000,000. Below is the recorded performance for the hash index algorithm:

Reference Genome Length Execution
1,000 260 ms
10,000 550 ms
100,000 780 ms
1,000,000 1,620 ms

Comparing to the performance of the trivial algorithm (listed in the May 21 update), mapping to a reference genome of length 1,000,000 with the hash index solution takes about half as long as the trivial algorithm takes to map to a reference length 1,000. This is a huge improvement!

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License