Mapping of Reads - Raymond Moy

When genomes are being sequenced, the sequencing methods only reveal small samples of a genome at a time. To construct a person's whole genome, it is easiest to take these samples and match them against a known genome to find out where in a person's DNA each sample came from. This matching is possible because most DNA is similar from person to person, so when we match a sample of DNA to a known full genome, the sample should look very similar to some location on the full genome, with the exception of one or two different nucleotides. My project is to create a mapper that can map samples of length 30 nucleotides to their location in a genome sequence of 1,000,000 nucleotides. The sample of 30 nucleotides can have up to two mismatched nucleotides when compared to the matching genome subsequence.
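
Concretely, the acceptance test a mapper has to perform at each candidate location is tiny. Here is a minimal C++ sketch of that test (the function names are mine, for illustration only):

```cpp
#include <cstddef>
#include <string>

// Count mismatched nucleotides between a read and the equal-length
// window of the reference starting at `offset`.
int countMismatches(const std::string& read, const std::string& reference,
                    std::size_t offset) {
    int mismatches = 0;
    for (std::size_t i = 0; i < read.size(); ++i)
        if (read[i] != reference[offset + i]) ++mismatches;
    return mismatches;
}

// A read maps at `offset` if it fits and differs in at most two places.
bool mapsAt(const std::string& read, const std::string& reference,
            std::size_t offset) {
    return offset + read.size() <= reference.size() &&
           countMismatches(read, reference, offset) <= 2;
}
```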

About me

I am a third-year computer science student. I split most of my time between classes and copy editing for the Daily Bruin.

Goal for end of quarter

My goal for the end of the quarter is to create a mapper program that can take in a sample of 30 nucleotides as input and return where in a sequence of 3,000,000,000 nucleotides that sample came from. My secondary goals, in order of importance, are that the mapper take up a small amount of RAM, that the mapping be fast, and that the indexing be fast. By taking up a small amount of RAM, I mean that the mapper should use less than 1 GB of RAM while it is running. For the mapping to be fast, I expect the mapping time of one read to average under a minute when the indexing time is included. When the indexing time is not included, I would like the mapper to map one read in an average of under 10 seconds. The goal of minimizing the indexing time is not as important as the goals mentioned previously; still, I would like to set a goal of under one hour for the time it takes to index a whole genome using my current indexing system.

Schedule

Week 4
Pick project and create wiki page.
Week 5
Think about structure of project and examine data. Possibly begin coding skeleton of project.
Week 6
Code bulk of project.
Weeks 7 and 8
During week 7, get a working version of the program and begin optimizing. If most of the main project is not done by week 7, coding may extend into week 8. Once a good, working version of the program is complete by week 7 or 8, begin expanding it to address this project's medium-difficulty objectives.
Weeks 9 and 10
Develop a working prototype program that can map reads to a full genome of 3 billion nucleotides.

Weekly updates

Week 10

I changed my read mapper so that it can handle a full human genome. I then mapped 50 reads to a 2-billion-nucleotide reference. Reads with 0 or 1 mutations were mapped with 100% accuracy; for reads with 2 mutations, accuracy was lower at 95%. I have put together my presentation and have included it and my final code here: project code and presentation.

To-do list for next week
Nothing!

Evaluation for this week
I successfully implemented a read mapper that can map reads to a full genome. Grade: A

Problems that arose this week
None.

Problems that were solved this week
None.


Week 9

This week I first did some tests on my original solution to determine how many nucleotides I should index at a time. I found that for up to a 10 million nucleotide index, there is no performance decline. When I did an index of 15 million nucleotides, however, there was some performance decline. Therefore, I decided to use 300 indexes of about 10 million nucleotides each in order to index an entire 3 billion nucleotide genome.

To store these indexes, I decided that I would store each index as a text file. Each line of these files would hold the locations of a certain string of nucleotides. I thought that using this method might be efficient because I wouldn't have to load each index fully into memory. Instead, I could just parse through each file to the lines of text that I need and load each line of text into my program. Unfortunately, I was wrong.
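
For reference, the format I had in mind looked roughly like the sketch below, with one line per nucleotide string listing every position at which it occurs (the exact layout in my program may have differed):

```cpp
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// Write an index as text, one line per nucleotide string, e.g.
// "ACGTACGTAC 104 55210 998431".
void writeIndex(const std::unordered_map<std::string, std::vector<int>>& index,
                const std::string& path) {
    std::ofstream out(path);
    for (const auto& entry : index) {
        out << entry.first;            // the nucleotide string itself
        for (int pos : entry.second)   // then each location it occurs at
            out << ' ' << pos;
        out << '\n';
    }
}
```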

To begin with, the indexing itself took several hours to complete, which was much longer per 10 million nucleotides than it had previously taken. I attribute the increase mostly to the file I/O I had to perform to store the indexes. Another factor was that, because I had chosen to store the indexes in a text file, I was storing ints much less efficiently than they are stored in main memory: a position like 9,999,999, for example, takes eight characters plus a delimiter as text but only four bytes as a 32-bit int. Thus, I was actually storing a higher volume of data than in my original implementation.

Once I had created indexes for about half of a full human genome, I tested them. Unfortunately, I got some very odd results on some simple test cases. After debugging my program for a few hours without much success, I've decided to change my program so that it no longer stores indexes on the hard drive. Instead, I will index each group of 10 million nucleotides every time I map a group of reads. Obviously this method will be inefficient in the long term for mapping reads; however, for my prototype program, I think it is sufficient for getting my read mapper working.

To-do list for next week

  • Add in some optimizations
  • Modify project so that it reads in 10 million nucleotides at a time, creates an index, maps to that index, and then moves on to the next 10 million nucleotides (see the sketch after this list)
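
To make the second item concrete, here is a rough sketch of the control flow I have in mind, with stubs standing in for my real indexing and mapping routines (names are illustrative only):

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

using Index = std::unordered_map<std::string, std::vector<int>>;

// Stubs for my actual routines.
Index buildIndex(const std::string& chunk) { /* index the chunk */ return Index(); }
void mapRead(const std::string& read, const Index& index, long long offset) {
    /* look up the read; record the best hit seen so far */
}

// Stream the reference 10 million nucleotides at a time: index the
// chunk in RAM, map every read against it, then move on.
// (Reads spanning a chunk boundary would need overlap handling,
// which is omitted here.)
void mapAllReads(const std::string& genomePath,
                 const std::vector<std::string>& reads) {
    const std::size_t kChunk = 10000000;
    std::ifstream genome(genomePath);
    std::string chunk(kChunk, 'N');
    long long offset = 0;  // where this chunk starts in the genome
    while (genome.read(&chunk[0], kChunk) || genome.gcount() > 0) {
        chunk.resize(static_cast<std::size_t>(genome.gcount()));
        Index index = buildIndex(chunk);  // rebuilt on every run, never stored
        for (const std::string& read : reads)
            mapRead(read, index, offset);
        offset += static_cast<long long>(chunk.size());
        chunk.resize(kChunk, 'N');
    }
}
```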

Evaluation for this week
Although I didn't make much tangible progress as far as coding was concerned, I think I progressed a great deal in terms of the design of my final project. I now have a final design for my read mapper.

Problems that arose this week
I had a problem with how to store my indexes to the hard disk.

Problems that were solved this week
I redesigned my program so that it no longer has to store the indexes to the hard disk. Now my program will simply recreate each index every time it maps reads.


Week 8

This week I obtained a computer with enough free hard drive space to store a full genome index. However, after studying the problem of indexing a whole genome at once and looking at the computational power needed to map a read when the entire genome has to be considered, I've decided to scale down my solution. Instead of indexing the entire genome, I think a more prudent solution will be to index somewhere between 1 million and 30 million nucleotides at a time. Then, for a given read, I will identify possible mapping locations within each "subindex" and bring the results together to determine the most likely overall mapping location. I believe this new solution will be faster than indexing an entire genome at once, mainly because I can store the smaller indexes entirely in RAM. Also, my new solution will be far more open to being run in parallel, should I want to explore that possibility later.

To-do list for next week

  • Determine a method for storing and then loading each "subindex"
  • Modify project so that it can use and load the indexes

Evaluation for this week
I'm already done with the main project, so everything else is kind of just extra. I think I did a decent job looking at solutions for how to expand my easy project solution to solve the medium project. Grade: A

Problems that arose this week
Thinking about the medium project made me realize that I might have a problem with storing an index for the full genome and how I might have to load only parts of it at a time.

Problems that were solved this week
To solve the problem of storing the whole genome at once, I came up with a different solution to the project that should require much less RAM at a time.


Week 7

I got the easy-difficulty project working! It actually wasn't too difficult once I got around the challenge of creating test data for my program.

Here are the preliminary data for my program:

Substring length (nt)   Time to index   Memory     Mapping time   Accuracy
5                       4009.54 ms      49,824 K   12166 ms       100%
6                       3101.66 ms      49,824 K   1213.2 ms      100%
7                       3074.91 ms      49,824 K   101.199 ms     100%
8                       3147.19 ms      49,824 K   26.8051 ms     100%
9                       3230.66 ms      49,824 K   21.9813 ms     100%
10                      3236.61 ms      49,824 K   9.50987 ms     98%
11                      3255.69 ms      49,824 K   8.71899 ms     96%
12                      3334.02 ms      49,824 K   8.50499 ms     96%
13                      3377.48 ms      49,824 K   7.90883 ms     94%
14                      3400.7 ms       49,824 K   8.16305 ms     90%

(Substring length is the number of nucleotides in the substring we try to map at a time.)
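
The trend makes sense if you think of the index as a seed-and-verify scheme: each length-k substring of a read is looked up exactly in the index, and every location it suggests is verified over the full 30 nucleotides. Shorter substrings survive mutations better but suggest many more locations to verify; longer ones suggest fewer locations but are more easily knocked out by a mutation, which matches the falling accuracy. Here is a minimal sketch of that scheme, assuming an unordered_map index (names and details are illustrative, not my exact code):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

using Index = std::unordered_map<std::string, std::vector<std::size_t>>;

// Index every length-k substring of the reference by its position.
Index buildIndex(const std::string& ref, std::size_t k) {
    Index index;
    for (std::size_t i = 0; i + k <= ref.size(); ++i)
        index[ref.substr(i, k)].push_back(i);
    return index;
}

// Look up each non-overlapping length-k seed of the read, then verify
// every candidate position by counting mismatches over the whole read.
// Returns the best start position, or -1 if nothing has <= 2 mismatches.
long long mapRead(const std::string& read, const std::string& ref,
                  const Index& index, std::size_t k) {
    long long best = -1;
    int bestMismatches = 3;  // any hit with <= 2 mismatches beats this
    for (std::size_t s = 0; s + k <= read.size(); s += k) {
        auto hit = index.find(read.substr(s, k));
        if (hit == index.end()) continue;
        for (std::size_t pos : hit->second) {
            if (pos < s || pos - s + read.size() > ref.size()) continue;
            std::size_t start = pos - s;  // implied start of the read
            int mismatches = 0;
            for (std::size_t i = 0; i < read.size(); ++i)
                if (read[i] != ref[start + i]) ++mismatches;
            if (mismatches < bestMismatches) {
                bestMismatches = mismatches;
                best = static_cast<long long>(start);
            }
        }
    }
    return best;
}
```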

To-do list for next week

  • Explore the possibility of trying to do the medium-difficulty project
  • Explore obtaining a machine with enough free hard-drive space to pursue the medium-difficulty project
  • If I decide to, begin writing a program for the medium-difficulty project

Evaluation for this week
I got everything done that I wanted to get done this week. I have a program that fulfills the requirements of the easy-difficulty project. Grade: A

Problems that arose this week
I originally created my own program to make up test cases for my mapper. I thought that using my own program would make it easier to format the test-case data for my own uses. It turns out, however, that the C++ random number generator I was using repeats after a couple hundred thousand calls, causing repeats in my sample genome.

Problems that were solved this week
To solve my problem, I first thought about modifying my program to reseed the random number generator every few thousand calls. Instead, I decided to try the Python-based generator by Nick Furlotte. His random number generator seems to be much more random, and for my 50 test cases, my mapper did not find any repeats in the reference genome created by his program.
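
For reference, the C++ <random> library sidesteps the short cycle entirely; a Mersenne Twister generator has a period of 2^19937 - 1. A minimal sketch of generating a random reference that way (illustrative; not the generator I actually used):

```cpp
#include <cstddef>
#include <random>
#include <string>

// Generate a random reference of `length` nucleotides with std::mt19937,
// whose period is far longer than any genome, unlike rand()'s cycle.
std::string randomGenome(std::size_t length, unsigned seed) {
    static const char bases[4] = {'A', 'C', 'G', 'T'};
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> pick(0, 3);
    std::string genome(length, 'A');
    for (std::size_t i = 0; i < length; ++i)
        genome[i] = bases[pick(rng)];
    return genome;
}
```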


Week 6

I didn't get as much done as I would have liked this week because I was pretty busy in my other classes. Hopefully I can make up for it next week.

This week I started out with creating test data and test cases for that data. Then I started working on how to index the data and how to set up my interface with the data files. Right now I've coded the skeleton for the indexer, but I'm still testing out a couple of variations on it to see which one is optimal.

To-do list for next week

  • Figure out which indexing algorithm makes for the optimal mapper
  • Implement mapping portion of project
  • Finish easy version of project

Evaluation for this week
I got pretty bogged down in my other classes this week. I definitely felt like I should have been able to complete at least a slow implementation of my project.
Grade: B-

Problems that arose this week
I'm still trying to determine an optimal indexing algorithm for the reference genome.

Problems that were solved this week
So far I've tried a less complicated method for indexing the reference genome. I believe it will work for the easy-difficulty project. If I move on to the medium-difficulty project, I do not believe it will be sufficient, but I would need to do more testing to verify that.


Week 5

I started out this week looking at the HapMap data to see if it would be useful for my project. After examining it, I don't think it will be, because it only contains SNPs. For my project, I would like a genome that includes the non-SNP positions as well, because both SNP and non-SNP positions make up the reference genome I need in order to determine where the sample substrings of DNA come from. Thus, I think I'll be using the random-DNA-sequence generator that was demonstrated in discussion last week.

While I was weighing the HapMap data against random data, I was also thinking about how helpful data compression would be for the base-pair data. I was mostly considering whether it would be worth transforming base-pair data that are input as characters into a representation that only requires two bits per base (since there are only four possible bases). Such a representation could reduce my memory usage by up to 75 percent for the portions of my program that have to represent base pairs. I have decided, however, not to try this compression, for a couple of reasons.

First, I'm not sure that such a representation would actually reduce my memory footprint that much. My current ideas for how to map the reads involve using indexes, which will probably need to use a lot more ints than base-pair representations. If I'm not using that many representations of base pairs, then making them use less memory is probably not going to greatly affect overall memory usage.

Second, I'm not sure the added complexity of the compression would be worth implementing. Especially in light of the first issue, I don't think that spending my time creating bit masks and such is all that wise.

Of course, if I end up trying to take on the medium-difficulty project for this topic — which involves 3,000,000,000 base pairs instead of 1,000,000 — I might reconsider.
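
For the record, the packing I was weighing would look something like this sketch (A=00, C=01, G=10, T=11), in case I revisit it for the medium-difficulty project:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Pack four bases per byte at two bits each, cutting memory for raw
// sequence by up to 75 percent versus one char per base.
// Sketch of the idea only; I have not implemented this.
std::vector<std::uint8_t> packBases(const std::string& seq) {
    std::vector<std::uint8_t> packed((seq.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < seq.size(); ++i) {
        int code;
        switch (seq[i]) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            default:  code = 3; break;  // 'T'
        }
        packed[i / 4] |= static_cast<std::uint8_t>(code << ((i % 4) * 2));
    }
    return packed;
}
```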

I also read a research paper that addressed the same topics as my project and looked a little at the program MAQ, which is basically a more complicated version of the program I am trying to create. The reading expanded on some of the ideas I had for how to implement my project.

To-do list for next week

  • Finish sketching out a full design of my program
  • Begin implementing program

Evaluation for this week
I think I got a lot done this week, especially in considering ideas for how to implement my project. I didn't fully hash out my plans though, so I'll give myself an A-. Grade: A-

Problems that arose this week
I wasn't sure whether or not to compress the base-pair data.

The research paper I read used a relatively complicated method for mapping reads. I believe I could implement a similar method in my project, but I'm not sure that it's necessary for the easy-difficulty project. If I move up to the medium-difficulty project, the more complicated method will probably be crucial, but for now I'm not sure if I should use such a complicated solution or if I should go with a simpler solution in the same vein.

Problems that were solved this week
After some thinking and the considerations listed above, I decided not to compress the base-pair data.


Week 4

Progress
This week I picked a project and created this page. I also created a schedule for completing the project.

To-do list for next week

  • Do some research to find out background information on why this project is useful for the computational genetics discipline
  • Familiarize myself with the HapMap data and what format it comes in
  • Think about any transformations I will want to make to the data to optimize for space
  • Begin designing out the main methods I will use for mapping samples

Evaluation for this week
The schedule that I have created for the project should keep me on track and mindful of the pace at which I have to complete things. I will have to add more details as I progress through the project, if not in the "Schedule" section, then at least in the "To-do list for next week" section as I fill it out each week. I believe I have created a page that will guide me through this project. Grade: A

Problems that arose this week
None.

Problems that were solved this week
None.
