Evan Zhen's Mapper Project

Description of Project
I am working on the mapper project. The point is to perform mapping of short reads to the genome for resequencing. It takes a string of certain length, and it'll try to find all matches in a sequence with a maximum mismatch of 2 characters. Ideally, it can map short reads to a sequence of any size. My aim is to work on the medium portion of the project at least, where I could build a mapper that scales to sequences of length 3million. If given enough time, perhaps I could do the hard portion, where it can find all possible matches quickly.

A little bit about myself
My name is Evan Zhen. I am a 1st year Master's student in Computer Science. I did my undergraduate studies at UCLA as well, and I went straight to Master's as soon as I graduated. I probably won't be pursuing a Ph.D. anytime soon. If I do, it'll be maybe after working in the industry for a couple of years.

Goal by the end of the quarter
My goal is to at least complete the medium portion of the project (since I'm required to anyway). Hopefully, I'll be able to complete the hard portion as well.

Schedule for each week

  • Simple search algorithm
  • make the map
  • search using the map
  • test exhaustively and optimize it
  • if possible, make it faster (optional)

Final Project Submission
Basically, here is my submission of my final project. It is written in Java (bad choice, I know, but when I realized it, it was too late to change to a different language). The main program file is HapMapProject.java. Basically, my solution to the medium project (3 billion length) is to use an index stored in a separate file on harddisk.
A very short summary
My project is written in Java. To handle the medium project, I use external files to create an index (so it uses harddisk space). I index every single possible sequence. To handle the error cases, what I did was to permutate every single combination possible for the search sequence. For example, if searching for "CAT" with an error of 1, I will search for: CAT, AAT, GAT, TAT, CCT, CGT, …
Documentation/Notes: click here
Files: click here

Week of 6/8
Finals week. I managed to finish up my method, although I couldn't test it fully. The thing about using Lucene is that it uses diskspace, and I don't have enough diskspace to handle indexing a sequence of 3 billion in length. What I did was test on 100 million length sequence. I consider my project to be complete. Although I would like to work on it some more, I simply don't have the time, and I will not have the time during the summer as well.
Grade for this week: A (since I manage to finish working and testing my project)
Overall grade for project: B (since my project will be hard to be tested for medium project unless one has a sufficient amount of diskspace)

Week of 6/1
Presentation this week (my presentation). Through the presentations, I realized that my method will not be very suitable for the medium project (the 3 billion case). I have came up with a new method that uses a Java tool called Lucene that uses file index to perform searching. So pretty much, I'm back to square one for this week.
Grade for this week: C (since I pretty much need to redo my entire project. Also, didn't update wiki on time)
Overall grade for project so far: C (since my project is pretty much "useless" in terms of the medium project)

Week of 5/25
Still tested using Nick's simulator. I didn't use actual data yet because I forgot what was the link to get the actual genome. I added a timer to compute the time it takes for calculation, and it turns out my map+index algorithm took slightly longer to compute at each iteration. It probably has something to do with my method of handling mismatches. I also haven't been able to figure out a better way to reduce the memory usage within my code without resorting to forcing Java to increase the heap size.
Grade for this week: B+ (since still trying to fix bugs and some problems. Also, didn't update wiki on time)
Overall grade for project so far: B (even with more testing, there wasn't much real improvement of project from last week)

Week of 5/18
Begin testing using Nick's simulator. Tested with sequence of 1million in length. Because it was coded in Java (bad choice, now that I realized it), it ran into memory issues when creating the map. Had to increase the heap space before error disappeared. Preliminary tests seem fine after some major bug fixes. However, I didn't time the project to see which was faster.
Grade for this week: B+ (since performed the tests, although not very exhaustively. Also, didn't update wiki on time)
Overall grade for project so far: B- (done with implementation and preliminary testing seems to work, although there's still much more testing that needs to be done. Also, need to fix the memory issue with Java)

Week of 5/11
Implemented the search using the map. Didn't test it yet though because of incredibly busy schedule (unforseen).
Grade for this week: B (since no testing, and didn't update wiki on time)
Overall grade for project so far: C+ (done with implementation, but no testing means many possible errors can occur)

Week of 5/4
Implemented the map portion of the search. Preliminary tests seem fine.
Grade for this week: A- (since actually followed schedule, but didn't update wiki on time)
Overall grade for project so far: C (implemented simple search and made a map, but didn't do anything with the map)

Week of 4/27
Implemented the simple search algorithm. Didn't test it yet though because of school work.
Grade for this week: B (since no testing, and didn't update wiki on time)
Overall grade for project so far: D (only implemented the simple search so far)

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License