James_Chin_Read_Mapper

Introduction

Bio
James Chin, 4th year undergraduate Computer Science & Engineering major. Graduating in Fall '09 and will be working in industry for about a year, while applying for Electrical Engineering graduate school for 2010.

Personal interests include music, particularly composition and live performance, as well as studio recording and production. I play the guitar and piano and have my own home recording studio.

Project Description

I will be working on Project 22: Mapping of reads

Given a string of length L=30, find where it matches a substring within D=2 mismatches in a length N=3,000,000 sequence.

Easy: Build a small scale mapper that can map strings of length 30 to sequences of length 1,000,000.
Medium: Build a mapper that can scale to sequences of length 3,000,000,000. It can be slow.
Very Hard: Build a fast mapper.

Goal for the quarter

I would like to pursue the "Very Hard" path, since I have a lot of experience with data structures and algorithms from previous courses. I believe I can leverage some assumptions about the nature of the genome to my advantage, for example 1) the alphabet set to be matched is very small {A, C, T, G} compared to regular alphabets, meaning optimizations will work differently compared to regular algorithms for matching text. Also 2) we know the reference genome Never Changes, so I will use data structures that trade seek speed for setup time.

Worst case, if the optimizations end up being too much of a hassle to work the bugs out within time constraints, I will default back to the Medium difficulty and at least make sure that works properly.

Project Log

Week 4:

  • Weekly Progress: Chose the project 22
  • Plan For Next Week: Start pre-production, basic research
  • Weekly Grade: A, on track
  • Problems That Arose: No problems yet
  • Problems Solved: No problems yet

Week 5:

  • Weekly Progress: Began coding, also compiling special assumptions that can be made to optimize algorithms
  • Plan For Next Week: Lots more coding
  • Weekly Grade: A, on track
  • Problems That Arose: No problems yet
  • Problems Solved: No problems yet

Week 6:

  • Weekly Progress: Most of the main body of the code is now usable, can open genome file into memory, open file containing collection of short reads to be matched into memory. Interesting note, tried some string matching algorithms like KMP and it actually worked SLOWER than regular greedy. Assuming this is due to the overhead penalty for false matches. The alphabet set is so small that false matches occur very aggressively and thwart many "intelligent" matching algorithms.
  • Plan For Next Week: Implement data structures
  • Weekly Grade: A, on track
  • Problems That Arose: Emailed professor for advice on special assumptions and previous work done on the subject matter, have not heard back. Was hoping to find out more about existing work on the topic. No problem, will keep working on it nevertheless.
  • Problems Solved: No problems yet

Week 7:

  • Weekly Progress: Began coding the data structure code and header. I decided to use a 4-legged tree structure to index the divisions of the short read index, keeping the basic algorithm idea presented by Prof. Eskin, but I don't agree with using some kind of table or hash index to store the data. It seems to me there will inherently be a lot of wasted space since not all buckets will be distributed evenly. Furthermore it seems unfeasible to allocate arrays of "real" sizes in memory. I.e. 4^15 or 4^20 is already ridiculous and unfeasible with current computing technology. Therefore choosing a table or hash index would force using shorter reads or larger error threshold. Not really acceptable. The same seek speeds can be attained with a tree, but with dramatically reduced memory footprint. Also thinking about using a B+ Tree (commercial standard), however it would take way too much time to code and debug — much more time than I have to spend.
  • Plan For Next Week: Lots more coding
  • Weekly Grade: A, on track
  • Problems That Arose: No problems yet
  • Problems Solved: No problems yet

Week 8:

  • Weekly Progress: Spent most of this week's budgeted time debugging the tree structure coding. Getting compile errors I can't seem to figure out.
  • Plan For Next Week: Fix bugs, if can't fix by next week, revert to Easy/Medium project instead of Hard.
  • Weekly Grade: A-, ran into roadblock
  • Problems That Arose: See above
  • Problems Solved: Not solved

Week 9:

  • Weekly Progress: Could not resolve bugs with the working data structure coding within the budgeted time this week. Reverting back to the Easy/Medium project and will aim to freeze the project work by next week.
  • Plan For Next Week: Project Freeze, work on report and presentation.
  • Weekly Grade: A-, on track
  • Problems That Arose: See above
  • Problems Solved: Roadblock came up, going with Plan B instead.

Week 10:

  • Weekly Progress: Project Freeze, locked the code into a working form, included statistical tracking such as process time per record. The program works for genome of up to 4,294,967,294 length, however it's quite slow at 3,000,000,000. At the specified 1,000,000 it matches at an average of 0.06 secs per record, pretty acceptable. The test cases were evenly distributed. Also completed work on the Project Report and Presentation, will upload to site shortly. Will give presentation in front of class soon.
  • Plan For Next Week: None, finals week
  • Weekly Grade: A
  • Problems That Arose: None
  • Problems Solved: Easy/Medium project specifications completed and delivered.

Project Report (also see attached files)

James Chin
(undergraduate)
SID: 903452928
UCLA CS 124

READ MAPPING
(6/3/09)

Motivation
Current technologies only allow fractional reads of DNA sequence information from a complete genome; only very small subsections can be recovered from any single read. The goal is to reconstruct a complete, contiguous genome from these subsets. The problem is we do not know where in our complete genome that these short reads belong. Furthermore, differences occur between our short reads and our reference genome from differences between people's genome, single nucleotide polymorphisms (SNP's), mutations, and read errors.

The Computational Problem
We are given short reads of length L, a reference genome that is very similar to the genome from which the short reads are taken. The number D represents the permissible number of mismatches within any short read of length L and still be considered a match to some subsection of our reference genome. With this information, determine where the short reads belong in the newly re-sequenced genome and reconstruct the complete string.

Solution
Our basic solution will take the form of a program which reads the reference genome string in from a file and stores it into memory. The program then opens a second file which contains the short reads and greedily iterates the short reads against the reference string one position at a time, allowing for up to D mismatches before moving on to the next iteration. If an acceptable match is found, the starting position of that iteration is output and the program moves on to the next short read to be matched. If the program reaches the end of the reference string before a match is found, the short read is discarded and the program moves on to the next short read to be matched.

Prototype

prototype.png

[see source code below/attached]

Performance Analysis

This program can match against a genome up to 4,294,967,294 length, though it is probably not fast enough to want to scale up that high.

Benchmark tests:
Size of reference genome: 1,000,000
Size of short read: 30
Number of permissible mismatches: 2
Number of short reads processed: 200
Time elapsed: 12.23 secs
Average process time per short read: 0.0612 secs

Estimated average time per match for reference genome of size 3,000,000,000 is 183 secs, or 3 minutes per short read, all else held equal.

performance.png

Average processing time increases linearly with number of mismatches allowed

Assumptions

Case:

  • If more than D errors within any section of length L, Action: Discard
  • If a section of length L matches more than one region of the reference genome (within D tolerance), Action: Pick the first one
  • When a newly-matched short read overlaps existing data in the re-sequenced structure, Action: Overwrite

Improvements

As explained previously….

oldpic.png

In the worst case scenarios, there is always at least a L/(D+1) size perfect match. Assume no more than D mismatches within any segment of length L.

oldpic2.png

(images taken from Dr. Eskin)

So, create a sorted index of every possible L/(D+1) subset, each "bucket" containing all positions where that subset occurs in the reference genome.

We look up each of the (D+1) divisions of the short read in the index and calculate if the number of mismatches is <= D, and if the (D+1) pieces are really next to each other.

Hash/Table Index

If we use the model described above, the speed would rely primarily on the seek time.

The find time for a hash table is technically O(1), however there are some disadvantages to consider: Hash tables can use up a lot of memory overhead, especially if a large contiguous array of buckets have to be created in memory (where we would traverse each bucket to find the positions in the reference genome that match the divisions of our short read).

Another consideration is in the wasted space. Invariably some buckets will be empty, while other buckets with more common or repeated patterns will be very deep.

While these may be acceptable if the index length is 10, i.e. 410 or 1,048,576 buckets, even a short index length of 20 requires 420 or 1,099,511,627,776 buckets — totally impossible to fit such an array into memory. Likewise there is expected to be a tremendous amount of wasted space.

Using this data structure for this application forces us to use shorter reads (smaller L) and/or higher error tolerance (higher D), where we take longer short reads and have to divide them into smaller indexes. Having to checking more indexes of course slows down our process.

Tree Index

How can we do better?

There are two assumptions we can leverage to our advantage:

  • Small alphabet {A, C, T, G}
  • Reference Genome Never Changes

The reference genome never changes, therefore we can use data structures that provide faster seek time and smaller footprint, in exchange for higher resizing/rebalancing costs.

We can use a Four-Legged Tree Index:

tree.png

(image: James Chin)

All the buckets holding the position data are all at the bottom leaf level. And as demonstrated above, each node can descend into the tree through between 1 to 4 branches. A node with zero branches is impossible, since this would suggest an index was created with less than L / (D + 1) characters. The proof will be left to the reader.

Furthermore, notice none of the bucket nodes at the bottom leaf level can ever be empty. There are no "dead-end" branches and no empty-leaf buckets, therefore no wasted space.

The find time for this structure is O(log n), however this also reduces to O(1) since our tree has a fixed depth of log n levels. N never changes! The depth of the tree (and depth of the search) is always exactly equal to the length of our index.

And while undoubtedly the tree could still turn into a very large chunk of memory, the overhead in tree space is mitigated by the fact that each level of the tree shrinks by a factor of log n base 4. In other words, the lowest level of our tree will never be anywhere near the size of our hash case, and the rest of the upper levels disappear extremely quickly.

And so, using a tree index would give us the same constant seek speed, but with a dramatically reduced memory footprint.

B+ Tree

Can we do even better? Let's say we eventually want the ability to modify our reference genome in real time — or perhaps even write our newly re-sequenced genome into the same tree form. This would inevitably require some overwrites or changes in the tree, for example when a newly-matched short read overlaps existing data in the re-sequenced structure. However, re-sizing and re-balancing large trees are a huge pain in the ass! So how?

We can use a B+ Tree, which is commonly utilized in large commercial databases.

Some properties of the B+ Tree:

  • Allows changes to the data with minor overhead (changes do not have to propagate all the way up the tree nor require any dramatic re-balancing)
  • Retains the logarithmic seek speed of the tree data structure
800px-Btree.svg.png

(image taken from Wikipedia)

Information about the how a B+ Tree works will be left to the reader to read up on.

Other Considerations

Other specialized string-matching algorithms were tried and in many cases were actually found to be slower than a standard greedy iterative search. This is due to the small alphabet in this case (A, C, T, G) and therefore higher number of mismatches or mispredictions when compared to, say, the english alphabet. Most of these special algorithms (for example, the Knuth-Morris-Pratt or "KMP" algorithm) rely on efficiency gained by not having to repeat previously matched characters; to not have to repeat work if possible. However, these algorithms incur increased performance penalties for mispredictions. For example, using the KMP algorithm on our genome would result in a very high number of false matches and actually work far slower than greedy methods.

Therefore our efforts should be concentrated on reducing the seek times and memory footprints of our indexing data structures so that we may process larger reads faster with the currently available computing power.

APPENDIX

Source Code for Simple Read Mapper

#include <iostream>
#include <fstream>
#include <string>
#include <time.h>
using namespace std;

//Author:    James Chin
//Class:      UCLA CS 124
//Date:       6/3/09
//Program:   Read Mapper
//
//
//How to use this program:
//    Run the source in the same directory as "genome.txt" and "substring.txt"
//
//Assumptions:
//    The string to be matched is properly formatted
//    The data to be matched against is properly formatted
//    The first occurence of a sequence will be chosen if two or more matches are found in the genome
//    Discard the short read if more than D mismatches within any section of length L

int main () {
    int L = 30;    //string length
    int D = 2;    //permissible number of mismatches

    int i = 0;            //current genome position index
    int j = 0;            //current substring position index
    int mismatches = 0;    //mismatches counter

    clock_t init, final;        //for time statistical tracking
    double time_elapsed = 0;

    int num_of_reads = 0;    //read counter

    string genome;
    string substring;

    //Load genome data from file
    ifstream genomefile ("genome.txt");
    if (genomefile.is_open())
    {
        while (! genomefile.eof() )
        {
            getline (genomefile,genome);
            //cout << genome[2] << endl;
        }
        genomefile.close();
    }
    else cout << "Unable to open file";

    //Load substring data from file
    ifstream substringfile ("substring.txt");
    if (substringfile.is_open())
    {
        //START CLOCK HERE
        init = clock();

        while (! substringfile.eof() )
        {
            getline (substringfile,substring);
            //cout << substring[2] << endl;
            for (i = 0; i < (genome.size() - L + 1); i++) {
                j = 0;
                mismatches = 0;

                while ((mismatches <= D) && (j < substring.size())) {
                    if (genome[i+j] != substring[j])
                        mismatches++;
                    j++;
                }
                if (mismatches <= D) {
                    cout << substring << " match found at position: " << i << endl;
                    break;
                }
            }

            num_of_reads++;
        }

        //STOP CLOCK HERE
        final = clock() - init;

        substringfile.close();
    }
    else cout << "Unable to open file";

    time_elapsed = (double)final / ((double)CLOCKS_PER_SEC);

    cout << endl << endl << "========== STATISTICS ==========" << endl << endl;
    cout << "Size of reference genome is: " << genome.size() << endl;
    cout << "Size of short read is: " << substring.size() << endl;
    cout << "Number of permissible mismatches is: " << D << endl;
    cout << "Number of short reads processed is: " << num_of_reads << endl;
    cout << "Time elapsed: " << time_elapsed << " secs." << endl;
    cout << "Average process time for each short read is: " << time_elapsed / num_of_reads << " secs." << endl << endl;

    return 0;
}
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License