Short Read Mapper - Matthew Nelson

Instead of sequencing the entirety of a person's DNA at once, sequencing technology takes advantage of a technique called "short reads" that captures small portions of a person's DNA sequence. However, the segments captured by these short reads come from random locations. A basic analogy would be a jigsaw puzzle that has to be pieced back together. In this jigsaw puzzle, however, some of the pieces overlap other pieces when the puzzle is put back together, and some of the pieces don't look exactly like the picture on the box. In this case, the picture is the genome. The purpose of this project is to create a piece of software that will take these puzzle pieces and figure out where they go by comparing them to the picture on the box.

About me

I am a third-year student currently majoring in computer science. However, I am still unsure of what specific field I wish to enter once I graduate. I have always had an interest in genetics, and thus I enrolled in this course to get a taste of how computer science applies to the field.

Goal for end of quarter

I plan on finishing this mapper early on. By finish, I mean I plan on, at the bare minimum, completing the "Easy" portion of the project. If I have spare time, I plan on tackling the "Medium" portion of the project as well. I doubt I have enough experience to make a fully-optimized program, however, so I do not plan on attempting to make a "fast" version of my mapper.

Project status

All coding for the proof-of-concept version has been completed. The program will successfully map reads back to a "genome" of length 3,000,000,000.

Algorithm

Since the maximum number of mismatches within the string of length 30 is 2, logic dictates that, if the string were broken up into three substrings of length 10, at least one of the substrings must match perfectly if there is to be even the possibility of a hit. I will be using a hash table approach. However, a problem arises: if the hash table entries are taken from the sequence of 1,000,000 (or 3,000,000,000, for that matter) base pairs so that the first ten base pairs form one entry, the next ten form another entry, and so on, a match (or at least the desired match) may never be found for a given string of 30. Let us take, for example, a substring that matches starting at position 2:

Genome: AGAGGCTTACCGATTATCGAG
String:   AGGCTTACCG

If the hash table broke it up such that the first ten base pairs were an entry and the second set of ten pairs were an entry, the string would not get a match in the hash table (or at least not a match that pointed to position 2). One solution to this is to map every possible ten-base-pair window in the genome to the hash table. This means 999,991 entries in the hash table! Taking each entry (i.e. position in the genome) in the hash table to be 32 bits and adding in 4^10 pointers of size 32 bits for the vectors containing the entries, we've just taken the original genome that could fit into 244kB and turned it into 7.81MB. This isn't so much a problem for the 1,000,000 length sequence, but it's terrible for the 3,000,000,000 length sequence. The entries for the hash table alone would take up 11.2GB of memory.
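As a sanity check on those figures, here is a quick back-of-the-envelope calculation (just arithmetic, not part of the mapper; the constants mirror the numbers above):

#include <cstdio>

int main() {
    const double kB = 1024.0, MB = 1024.0 * 1024.0, GB = MB * 1024.0;

    // Raw genome: 2 bits per base for 1,000,000 bases.
    printf("genome: %.0f kB\n", 1000000.0 * 2 / 8 / kB);              // ~244 kB

    // Index every 10-mer position: 999,991 4-byte offsets
    // plus 4^10 = 1,048,576 4-byte bucket pointers.
    printf("index (1M genome): %.2f MB\n",
           (999991.0 * 4 + 1048576.0 * 4) / MB);                      // ~7.81 MB

    // Offsets alone for the 3,000,000,000-base genome.
    printf("offsets (3G genome): %.1f GB\n", 3000000000.0 * 4 / GB);  // ~11.2 GB
}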

Instead, since I want to eventually make my program scalable to 3,000,000,000, I will be using a different approach. Instead of breaking up the strings of length 30 into three parts, I will be utilizing a smaller substring that I will move along the length of the string of length 30. This will not work for a substring of length 10. The reasoning is as follows:

Let's take the following string for example, where a 0 represents a match with the genome and a 1 represents a difference from the genome.
000000000100000000010000000000
This string represents a worst-case scenario where there are two differences spaced at least 10 apart. If this were broken up into substrings of 10 as-is, the last substring, the only one that matches perfectly across all ten positions, may not get a hit in the hash table for the very same reason explained earlier. If we tried sliding a window of length 10 along the string of 30 and testing each resulting substring, every window except the very last one would contain a mismatch, and that one clean window only lines up with the hash table's grid for one particular offset. Thus, if this approach is to be used (and result in a significantly smaller memory footprint), a smaller substring will need to be used.

The largest substring length which will always give a hit, no matter what, when using a moving window on a string of length 30 is 7. This can be seen by taking the example I provided earlier with the 0s and 1s (or any string of twenty-eight 0s and two 1s) and sliding the following string
111111122222223333333444444455
which represents the breakdown of the string of length 30 into substrings of 7, along the string of 0s and 1s. There will always be at least one full matching segment. This is because, with a substring length of <read length> / <allowed mismatches + 2> (here 30 / 4 = 7, rounded down), the string is always guaranteed to contain <allowed mismatches + 1> full substrings plus an entire substring's length of buffer, and this buffer accounts for all of the possible offsets. Since two mismatches can ruin at most two of the full substrings, at least one must match perfectly.
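Since this claim is easy to get wrong by one, here is a small brute-force check (a standalone sketch, not part of the mapper): it tries every alignment offset and every placement of two mismatches in a string of 30, and confirms that with blocks of 7 some full block always survives, while with blocks of 10 it does not.

#include <cstdio>

// Returns true if, for every alignment offset and every placement of two
// mismatches in a read of length 30, at least one full k-length block
// (aligned to the genome's grid of k) inside the read is mismatch-free.
bool always_covered(int k) {
    const int READ = 30;
    for (int off = 0; off < k; ++off) {              // read's offset into the grid
        int first = (off == 0) ? 0 : k - off;        // first full block in the read
        for (int m1 = 0; m1 < READ; ++m1)
            for (int m2 = m1 + 1; m2 < READ; ++m2) {
                bool clean = false;
                for (int s = first; s + k <= READ; s += k)
                    if ((m1 < s || m1 >= s + k) && (m2 < s || m2 >= s + k))
                        clean = true;                // this block dodges both
                if (!clean) return false;
            }
    }
    return true;
}

int main() {
    printf("k=7:  %s\n", always_covered(7)  ? "always a clean block" : "can fail");
    printf("k=10: %s\n", always_covered(10) ? "always a clean block" : "can fail");
}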

Now that we have the minimum length of the substring figured out, we can recalculate the amount of space that the hash table would take up. We have 1,000,000 / 7 ≈ 142,857 entries in the hash table of 32 bits each, plus 4^7 pointers of 32 bits each for the vectors. We now have a size of 622kB for our 1,000,000 length genome. This is a significant improvement over the 7.81MB value from before. The 3,000,000,000 length genome takes up approximately 1.6GB of memory using this method. A large value, to be sure, but once again, a significant decrease from the value calculated earlier.

Of course, having a substring length of 7 doesn't lend itself well to a genome of length 1,000,000 as there will be a few base pairs at the end which will not be included in any full match string. Therefore, the last 7 base pairs in the genome will get their own special match string which doesn't necessarily follow the standard breakdown of the genome into groups of 7.

The number of comparisons required to match a single short read against this genome is (30 – (7 – 1)) * 3,000,000,000 / 7 / 4^7 ~= 627,790 comparisons. The original algorithm (the one I modified) requires 3 * 2,999,999,991 / 4^10 ~= 8,583 comparisons. Therefore my algorithm will take longer, but it will also reside completely in memory on a less-expensive computer.

Therefore, using this algorithm and substrings of length 7, we achieve 100% coverage. But what about the other substring lengths?

The calculation to figure out how much coverage there is for strings with two mismatches is as follows: coverage = 1 - (<chance of the first mismatch falling in one of the two full substring regions> * <chance of the second mismatch falling in the other full substring region> * <chance of the string of 30 landing at an offset with only two full substrings>).

With a substring of length 8 for strings of length 30 with two mismatches, we get 1 - (16/30 * 8/30 * 1/8) ~= 98.22% coverage. There is exactly one offset that will not always be caught with substring length 8. This is depicted below. For strings with fewer mismatches, the coverage is 100%. This solution will also use 3,000,000,000 / 8 * 4B = 1.40GB of memory for the hash table entries and will require (30 – (8 – 1)) * 3,000,000,000 / 8 / 4^8 ~= 131,607 comparisons. This method would be much quicker, but it would be slightly less likely to result in a hit.

Example
 AGAGGCT|TACCGATT|ATCGAGAT|TCGGCAT
00000000|11111111|22222222|33333333

With a substring of length 9 for strings of length 30 with two mismatches, we get 1 - (18/30 * 9/30 * 5/9) = 90.00% coverage. There are five offsets that will not always be caught with substring length 9. These are depicted below. For strings with fewer mismatches, the coverage is 100%. This solution will also use 3,000,000,000 / 9 * 4B ~= 1.24GB of memory for the hash table entries and will require (30 – (9 – 1)) * 3,000,000,000 / 9 / 4^9 ~= 27,974 comparisons. This method would be even quicker, but it would have even less coverage for strings with two mismatches. This may be within acceptable parameters, however.

Example
 AGAGGCTT|ACCGATTAT|CGAGATTCG|GCAT
  AGAGGCT|TACCGATTA|TCGAGATTC|GGCAT
   AGAGGC|TTACCGATT|ATCGAGATT|CGGCAT
    AGAGG|CTTACCGAT|TATCGAGAT|TCGGCAT
     AGAG|GCTTACCGA|TTATCGAGA|TTCGGCAT
000000000|111111111|222222222|333333333

With a substring of length 10 for strings of length 30 with two mismatches, we get 1 - (20/30 * 10/30 * 9/10) = 80.00% coverage. Only one offset (the fully aligned one, depicted below) will always be caught with substring length 10. For strings with fewer mismatches, the coverage is 100%. This solution will also use 3,000,000,000 / 10 * 4B ~= 1.12GB of memory for the hash table entries and will require (30 – (10 – 1)) * 3,000,000,000 / 10 / 4^10 ~= 6,008 comparisons. This method would be even quicker still, but it would have even less coverage for strings with two mismatches. Again, this may still be within acceptable parameters.

Example
AGAGGCTTAC|CGATTATCGA|GATTCGGCAT
0000000000|1111111111|2222222222

In the other direction, with a substring of length 6, we'd have 100% coverage (as the substring length is even shorter than the 7 we calculated before). This solution would require 3,000,000,000 / 6 * 4B ~= 1.86GB of memory for the hash table entries and (30 – (6 – 1)) * 3,000,000,000 / 6 / 4^6 ~= 3,051,758 comparisons. Clearly, by making the substring length shorter, we won't be doing ourselves any favors. The memory and comparison count only get larger as the substring length decreases further.
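To tie the last few sections together, here is a short program (again just a sketch for checking the math, using the coverage model above) that reproduces the coverage, memory, and comparison figures for substring lengths 6 through 10:

#include <cstdio>
#include <cmath>

int main() {
    const int READ = 30;                              // short read length
    const double GENOME = 3000000000.0;               // genome length
    const double GB = 1024.0 * 1024.0 * 1024.0;
    for (int k = 6; k <= 10; ++k) {
        // Count alignment offsets that leave only two full k-blocks in the read.
        int bad = 0;
        for (int r = 0; r < k; ++r) {
            int full = 0;
            for (int s = (k - r) % k; s + k <= READ; s += k) ++full;
            if (full == 2) ++bad;
        }
        // Coverage model from above: 1 - (2k/30 * k/30 * bad/k).
        double coverage = 1.0 - (2.0 * k / READ) * (1.0 * k / READ) * (1.0 * bad / k);
        double memory = GENOME / k * 4 / GB;          // 32-bit entries only
        double comps = (READ - (k - 1)) * GENOME / k / pow(4.0, k);
        printf("k=%2d  coverage=%6.2f%%  entries=%.2fGB  comparisons=%.0f\n",
               k, coverage * 100, memory, comps);
    }
}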

As for how the hash table itself is handled, I use bit manipulation (shifting and masking with bitwise AND) to grab what will become the hash table index from the genome. I then use that hash table index (which consists of the short string of length 7 from the genome) to store the genome's offset, i.e. the position where that substring is found. Once I've iterated through all of the 3,000,000,000 / 7 positions in the genome, I prompt the user for input.
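In code, the table-building step looks roughly like the following sketch. The names and the 2-bit packing layout are illustrative (my actual code differs in the details), but the shift-and-mask construction is the same idea:

#include <cstdint>
#include <vector>

const int K = 7;                                      // substring length
const uint32_t MASK = (1u << (2 * K)) - 1;            // 14-bit mask for a 7-mer

// 4^7 buckets, each holding the genome offsets where that 7-mer starts.
std::vector<std::vector<uint32_t>> table(1u << (2 * K));

// Illustrative accessor: the 2-bit code of base i in a packed genome.
uint32_t base_at(const std::vector<uint64_t>& genome, uint64_t i) {
    return (uint32_t)((genome[i / 32] >> (2 * (i % 32))) & 3u);
}

void build_table(const std::vector<uint64_t>& genome, uint64_t length) {
    for (uint64_t pos = 0; pos + K <= length; pos += K) {
        uint32_t key = 0;
        for (int j = 0; j < K; ++j)                   // shift in the next base
            key = ((key << 2) | base_at(genome, pos + j)) & MASK;
        table[key].push_back((uint32_t)pos);          // 32-bit entry, as above
    }
    // The last (possibly unaligned) 7 bases get their own entry, as described
    // earlier, so the tail of the genome is still searchable.
}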

The user then inputs a string of 30 to represent the search string. I convert the search string into its bit representation and store it in a 64-bit unsigned long long (long longs are part of C but not part of C++, last I checked, so I'm cheating to an extent; the C++ compiler must support long longs in order for this to work properly). I then iterate through all of the possible substrings of length 7 inside the search string and look up their entries in the hash table. For each substring match, I iterate through each of the genome offsets and grab the corresponding string of length 30 from the genome, storing it in a 64-bit unsigned long long as well. I then perform a bitwise XOR on the two unsigned long longs to get another unsigned long long that marks where the two strings do not match. I iterate through the 30 2-bit pairs, perform a bitwise AND with the number 3, check whether the result is greater than 0, and increment a mismatch counter if so. If the number of mismatches is less than or equal to 2, I report a hit.
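The mismatch count is the heart of that loop; as a sketch (same assumed 2-bit packing as above), it boils down to this:

#include <cstdint>

// XOR the two 60-bit encoded strings; any nonzero 2-bit pair marks a
// position where the read and the genome disagree.
int count_mismatches(uint64_t read_bits, uint64_t genome_bits) {
    uint64_t diff = read_bits ^ genome_bits;
    int mismatches = 0;
    for (int i = 0; i < 30; ++i) {
        if ((diff & 3ull) != 0)       // AND with 3 isolates the low pair
            ++mismatches;
        diff >>= 2;                   // move on to the next base pair
    }
    return mismatches;
}

// A hit is reported when count_mismatches(...) <= 2.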

Description of the attached files

Short Read Mapper-mnelson.ppt - This file contains the presentation I gave in class. It contains more information about the hash algorithm that I used, including estimates for memory usage.
mapper64-mnelson.rar - This file is the short read mapper itself. It requires a 64-bit environment to guarantee proper performance. This is due to the large amount of memory that is addressed (about 3GB on a typical run, a large portion of which is overhead from the STL datatypes used).
genomegen-mnelson.rar - This file contains a short program that utilizes the Mersenne Twister random number generator algorithm to produce a genome of length exactly 3,000,000,000. The generated genome file is stored on the hard drive for use across multiple runs of the mapper program.
genometranslate-mnelson.rar - This file contains a short program that will take the genome file generated by the generator and translate a sequence of length 10000 into a form that is easier to read. The starting offset can be changed inside the source code.

Performance

In debug mode, the time required to retrieve the genome file is approximately 20 seconds. In release mode, it takes approximately 4 seconds. In debug mode, the time necessary to generate the hash table is about 4 minutes. In release mode, the time necessary to generate the hash table is about 30 seconds. The searches are near-instantaneous.
Please note that these results are not guaranteed for all systems. These are the times that were required on a Core i7 920, so your mileage may vary.
The memory usage is higher than predicted due to memory overhead from STL datatypes. The total amount of memory used is approximately 2.7GB. This value may differ depending on your input file.

Weekly schedule

Week 4: Create this wiki
Week(end) 4: Tidy up the wiki and begin work on the skeleton for the program
Week 5: Do the majority of the coding
Week 6: Finish coding "Easy" portion
Week 7: If "Easy" portion is complete, attempt "Medium" portion of the project

Weekly update

Week 4

Progress: Wiki page has been created. Have not begun actual coding.
Next week: Plan on fixing up the wiki page so it's more presentable in addition to beginning work on the coding.
Grade for week: B

Week 5

Progress: Not as much progress as I had hoped. Busy with midterms and more time-constrained projects. Updated late.
Next week: Have most of the actual coding done.
Grade for week: F

Week 6

Progress: Still busy with midterms for the early part of the week.
Next week: Finish coding project.
Grade for week: D
Problems solved this week: Please see the extensive algorithm section that has been added.

Week 7

Progress: Finished most of the coding. Will be easily completed this weekend.
Next week: Finish coding project.
Grade for week: C
Problems encountered this week: Fell asleep while working on project and forgot to update wiki before doing so.

Week 8

Progress: Still at the point I was at last week.
This weekend: Finish. Definitely.
Grade for week: D
Problems encountered this week: I was busy with a project for another course and forgot to update until after midnight again. Oops.

Week 9

Progress: None.
Next week: Finish coding. Present slides explaining algorithm.
Grade for week: D

Week 10

Progress: Presented the PowerPoint slides.
This weekend: Finish. Definitely.
Grade for week: C

Week 11 (finals week)

Progress: I'm still not satisfied with it. I'll be working on tweaking the code throughout the night. Hopefully the files won't actually be grabbed until the morning.
Grade for week: D

Week 11 (Friday)

Progress: I'm finally content. The program works as intended (well, as much as a proof-of-concept should) without any quirks.
Grade for week: B

Week 11+1 (Sunday)

Progress: Updated the wiki again. I included a few more calculations to show how the program would perform ideally under changed parameters.
Grade for week: B
