In this project, I will attempt to create a short-read mapper, aiming for at least the medium-difficulty project. Given a reference genome of up to 3 billion nucleotides and some number of reads of length 30, the goal is to find all positions in the genome where the reads match with at most two mismatches. (Currently, I'm interpreting a mismatch as a position where the read and the genome have different nucleotides; other definitions might also allow insertions or deletions.) I do not yet have concrete algorithmic proposals.
My name is Joseph Barker, and I'm a second-year PhD student with a focus on heuristic search (although I'm planning on getting a minor in bioinformatics). My current research focuses mostly on game-playing; i.e., given a non-random, complete-information game (like chess or checkers) and perfect play by both players, determine who wins. It's fun stuff. Since we're apparently supposed to put in an interesting factoid about ourselves, I'll put out there that I've lived in six countries, had malaria three times, and climbed Kilimanjaro last Christmas.
Goals For This Quarter
Create a short-read mapper that can map reads to a reference genome as fast as the current state of the art. Failing that, the mapper should be able to map reads in a reasonable span of time.
Week of April 20
Write this project summary page. Download reference data (currently using CloudBurst's sample data). Brainstorm about algorithmic ideas. I feel like I've done A-quality work this week.
Week of April 27
Downloaded and played with maq, which I understand is currently the state-of-the-art short-read mapper. I've figured out how to run it for use in benchmarking. I've also decided on an initial algorithmic approach: a two-step scanning process. In the first step, I scan a lossily-compressed version of the genome and find possible matches — this can be done much more quickly than an exact matching algorithm. In the second step, I scan only the portions of the file that correspond to possible matches and do exact matching there. I have an implementation of the first-pass scan completed — it can find possible matches of length 36 on a 100-megabase reference in 10 seconds (including all I/O). The next step is to implement the second-pass search. I feel like this constitutes A-quality work, especially for this early in the term.
Update: I've implemented the second-pass search. I can now match a read of length 36 against a 4-gigabase reference in 5 minutes, with the runtime almost entirely dominated by file I/O. It turns out I'd misunderstood the format of the reads, but I believe I can adapt my algorithm with no significant cost to runtime. Should be trivial.
Week of May 4
Delayed update, since I forgot to write it on the correct day. This week I reimplemented my parser to map reads using the paired-end read format. This had no impact on run-time at all. However, it turns out that I was wrong in the first place, and the format I'd been using last week was actually correct for this project. Nonetheless, I think I deserve an A for this week's work.
Weeks of May 11 through June 1
I forgot to update my Wiki for these weeks, so this section summarizes my work for those weeks.
I developed two reference implementations against which to compare my implementation. The first is a no-memory implementation, which keeps only a segment of the genome the same size as a read in memory and does naive string comparison against it. The second is a no-compression implementation, which reads the entire genome into memory without performing any compression and uses a naive string comparison (the advantage of this approach is that it is trivially threadable, compared to the no-memory version).
Using these two implementations as baselines, I performed a number of tests on my implementation; the results are summarized in my final report.
Unsurprisingly, I think I deserve an A for my work.
My final presentation is available at http://cs.ucla.edu/~jbarker/cs224/Presentation.pptx
My source code (including a small reference genome) is available at http://cs.ucla.edu/~jbarker/cs224/code.tar.bz2
To create a memory-efficient short-read mapper: a tool that takes a number of short reads (short sequences of nucleotides) and, for each one, finds the locations on the genome where it aligns with the fewest mismatches.
Due to the huge size of genomes, these mappers generally take more memory than is available on a modern computer. My goal is to create an implementation of a short-read mapper that can be efficiently run on a modern computer using a maximum of one gigabyte of RAM.
I assume that reads differ from the genome only by nucleotide replacement (and not by deletions, insertions, etc.), and that each read differs from the genome by a maximum of only two replacements (although the algorithm does not rely on this requirement).
In short, the algorithm follows these steps:
- The genome and reads are read into memory and compressed into a binary representation: A and C map to bit 0, and T and G map to bit 1. The compressed representations are stored as arrays of chars (i.e., each byte holds 8 nucleotides).
- Using essentially a naive algorithm, each read is slid along the genome, using clever bitwise techniques to compare 8 nucleotides of the genome with those of the read using only a few operations (described later). Any locations that have at most two mismatches are recorded as candidate matches (due to the binary representation, it is possible that this procedure generates matches that are not actually valid).
- The portions of the genome corresponding to candidate matches are read into memory (this time uncompressed), and a simple string comparison determines which of the candidates are valid.
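The compression in the first step can be sketched as follows. The A/C → 0, T/G → 1 mapping is from the description above; the MSB-first bit order within each byte and the function name are my own assumptions, not necessarily what the project code does.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Pack a nucleotide string into one bit per nucleotide:
// A and C become bit 0, T and G become bit 1, 8 nucleotides per byte.
std::vector<uint8_t> compress(const std::string& seq) {
    std::vector<uint8_t> packed((seq.size() + 7) / 8, 0);
    for (size_t i = 0; i < seq.size(); ++i) {
        uint8_t bit = (seq[i] == 'T' || seq[i] == 'G') ? 1 : 0;
        packed[i / 8] |= bit << (7 - (i % 8));  // fill each byte MSB-first
    }
    return packed;
}
```

Note that the mapping is lossy by design: A and C (likewise T and G) become indistinguishable, which is why the first pass can only produce candidate matches.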
To compare reads with the genome, the algorithm uses a simple bitwise comparison. A byte of the read (representing 8 nucleotides) is XORed with a byte of the genome; in the resulting byte, any bit with value 1 marks a position where the genome and read differ. That byte is then used as an index into a lookup table of size 256, where each entry is simply the number of ones in the binary representation of its index. Using this technique, 8 nucleotides can be compared in two operations, rather than the 16 that would be required by a more naive approach.
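The XOR-and-lookup comparison might look like the sketch below; `make_popcount_table` and `mismatches` are illustrative names, not the project's own.

```cpp
#include <array>
#include <cstdint>

// Build a 256-entry table mapping every possible byte value to the
// number of set bits in its binary representation.
std::array<uint8_t, 256> make_popcount_table() {
    std::array<uint8_t, 256> table{};
    for (int i = 0; i < 256; ++i)
        for (int b = 0; b < 8; ++b)
            table[i] += (i >> b) & 1;
    return table;
}

// Number of differing positions among 8 packed nucleotides:
// one XOR plus one table lookup. XOR leaves a 1 bit exactly where
// the genome byte and read byte disagree.
int mismatches(uint8_t genome_byte, uint8_t read_byte,
               const std::array<uint8_t, 256>& table) {
    return table[genome_byte ^ read_byte];
}
```

The table occupies only 256 bytes, so it stays resident in L1 cache during the scan.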
(The astute reader will notice that the algorithm as described only finds matches that start on an 8-nucleotide boundary, since the read can be shifted no less than one byte — i.e., 8 nucleotides — at a time. To avoid this problem, I actually store 8 different representations of each read, each one bit-shifted by a different offset, and use all 8 when searching.)
My implementation, as well as my two reference implementations, is written in C++ using the Boost and zlib libraries. My implementation is approximately 475 lines long (including empty lines and comments).
The bulk of the results are covered in my presentation, but I'll cover some highlights here.
An entire genome of 3 billion nucleotides can easily be stored in memory on a computer with only a gigabyte of RAM: at one bit per nucleotide, 3 billion nucleotides occupy 375 million bytes, or approximately 357 megabytes. Any commodity computer has more than enough memory to store the entire genome and at least one million reads without any trouble.
As the intent of this project was to reduce memory usage rather than running time, my algorithm is not especially fast. With a genome of 3 billion nucleotides, a read length of 20, and two threads performing the search, my algorithm takes approximately 30 seconds per read. At that rate, a mapping of 1,000,000 reads would take almost a year (30 s × 1,000,000 ≈ 347 days). Again, not exceptional performance, but my algorithm is intended to reduce memory usage and uses a very naive matching technique.