Read Mapping - GDarnell

Project: Mapping of Reads

Our current sequencing technology only has the ability to produce random "short" reads of 30 bases within each chromosome, which can contain hundreds of million bases. The goal of this project is to determine where in the genome these random reads are from. One of the difficulties in this process is determining the threshold for the number of mutations we expect to find in each mapped sequence. The goal of this project is to produce an efficient (speed efficient and memory efficient) computer algorithm to map the random sequences accurately to the positions in the genome from which they are originated.

Project Member

I am currently a first year undergraduate student at UCLA. My declared major is Computer Science in the Henry Samueli School of Engineering and Applied Science. I have experience programming in the following computer languages: Java, C++, C, and MIPS. Over the past few years I have attained several professional computer accreditations, in both hardware and software, including the Microsoft Professional Certification.

Project Goal

The design for this project will be an algorithm that focuses more heavily on speed efficiency than memory efficiency. The motivation for this design comes from the current resources and goals that technology companies, like Google, have today. Google stores about 33 trillion database entries which amounts to hundreds of terabytes of information; however, each search takes a fraction of a second. Data storage is not at a premium; the most important goal of this project is to implement a read mapper which will execute on the scale of seconds and not days, or years.

Weekly Schedule

Week 1

Begin researching current algorithms for efficient read mapping.

I have begun researching algorithms for efficient read mapping but have not yet found any implementations.

For Week 2:
Read current papers and find implementations of map readers that are used today.
Devise preliminary algorithms for fast mapper.

Week 2

This week I have found many papers describing fast short read mapping architectures. These papers provide an overview of short read mapping and describe a high-level method for mapping reads. These algorithms are presently mostly at a high level however; which would complicate the process of implementing a very efficient and complex read mapper.

For Week 3:
Ideally I would like to find a lower-level, or more detailed description of an efficient read mapper on which to base my reader. My primary goal for this week is to start to implement the skeleton of the read mapper code. I plan to write a program which will be able to read in a DNA sequence from a file and generate short reads.

Week 3

I started to implement the basis of my short read mapper in c++. I have the ability to read a file containing a DNA sequence, randomly generate a unique DNA sequence, and generate short reads from the sequence. The short reads are generated with random insert lengths from a given interval.

For Week 4:
Develop a statistical method for determining whether the insert lengths between the proposed mapping of reads are reasonable; and use this method to add additional precision to the read mapping. I need to decide whether to implement a variation of the mapping algorithm presented in lecture, or follow a more complex method described by a few papers. I will perform some algorithm analysis in order to determine how well a simpler algorithm will perform for an "easy" or "medium" project with sequences of length 1,000,000 or 3,000,000,000. By week 4 I will have started to implement one algorithm for short read mapping.

Week 4

The goal for this week was revised after receiving the revised specifications for our project. This week I researched the background on short read mapping in order to biologically motivate the challenge and present relevant issues short read mapping can contribute to solving.

For Week 5:
Design the mapping algorithm around the assumptions and limitations I will define for next week. Also propose future implementation enhancement and extensions of my short read mapper to eliminate some of the current limitations.

Week 5

I completed my goal for this week of defining the background for my project and algorithm. Given this background I proposed future improvements and areas of interest to expand on.

For Week 6:
Implement the C++ code for the short read mapper. I have chosen my design to be the generic hashmap lookup database for mapping individual reads to the genome.

Week 6

After further researching into short read mapping I found a key optimization that uses bit operations to compare mismatches at a significantly more efficient rate than the naive comparison algorithm I was previously using. I implemented most of the c++ code I planned on; but, I have not yet implemented the mismatch evaluation efficiency.

For Week 7:
Implement the bit operations to reduce the runtime of comparing mismatches between the read and genome position. Benchmark short read mapper while varying genome length, number of reads, and read length.

Week 7

I finished implementing and benchmarking my short read mapping code.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License