I am working on project 22: Mapping of Reads for my project along with my partner Armen Babakhanian. We will do the medium version of this project, which requires a more efficient implementation and use of data structures like hash tables.
I am a third year computer science major. I like math, programming, and playing sports like basketball and tennis. I am of Armenian descent (small country on the border of Turkey).
Armen was born in Iran and is of Persian-Armenian descent. From the time that I know have know him, I can say that he likes art, programming, and playing tennis. He also enjoys driving cars with a manual transmission and listening to indie rock bands.
Goals for the Quarter
My goal is to come up with an efficient solution to this problem and learn more about computational genetics and statistics.
April 20 - 27 Progress
This week I set up the wiki and next week I will learn more about the project's objectives and maybe begin coding.
Grade for Week: A
April 27 - May 4 Progress
Because of our midterm on Monday and studying for a midterm next week, I have unfortunately not been able to make any progress with this project. My goal next week is after Wednesday to talk with the professor and the ta to get a very strong grasp of exactly what I need to do for this project and begin working on it.
Grade for Week: B (didn't get much done because of midterm)
May 4 - May 11 Progress
This was the week that I ended up working with Armen as he no longer wanted to work on his project. I went to the professor's office hours to see if it is okay, and he said "yes" as long as you work on the medium project. Armen and I began discussing how to approach this project. We talked in general terms about how our hash tables should be set up and how we can avoid using a lot of memory and make our program run fast.
Grade for Week: A- (got a new partner and began thinking about what needs to be implemented)
May 11 - May 18 Progress
This week we did in fact begin to make substantial progress in the project. I went to the professor's office hours to ask concrete questions about the implementation. In particular, I asked him to give a concrete explanation of what it was we were trying to solve with this project and whether we need any hapmap data to do it. He explained that the process of your project should be that you randomly create a sequence of DNA and call it your reference DNA. Then, you make mutations to this DNA and call it your DNA. Then, you perform multiple reads of subsets of this DNA (with the number of reads on the order of the DNA sequence length) to try to recover the mutations. By finding the mutations and using the reference DNA, you are able to determine your DNA. The professor also explained that we could use a python program written by another student as a starting point for doing the reads. With this more concrete understanding of the project, we had a pretty good idea of the computer science problem we were to tackle. Next week, we will begin the coding.
Grade for Week: A (got a good concrete idea of what to do and now know what needs to be implemented).
May 18 - May 25 Progress
Both Armen and I used the python program to figure out what it was doing and found it to be pretty confusing. We weren't sure exactly what the code was doing and how we were to use it. I decided to sit down and write our own version of what the python program does. Basically, this version allows you to create a random reference DNA, make random mutations to it, and do reads from the mutated DNA. Also, in programming this, I made the key parameters variable so that we could change them during testing. When we want to evaluate the efficacy of our solution, we can change parameters like readlength, number of reads, number of mutations, dna size, and see how these parameters affect our results. Of course, our solution makes certain assumptions about the DNA (the most essential being that the DNA of two generic human beings are 99.9% identical). As a result, when there are a lot of mutations and/or not enough reads, our solution will not be sufficient. By the end of this week, I was able to implement this program as a good way to get us started.
After this code was implemented, Armen used the code I had made to create a hash table class that is used to read in the reference DNA. Essentially, the way this thing works is that it reads in n-char string from the reference DNA (if the string has 100 characters and you want a 10-character key, you have to read in 90 10 char-string from pos. 0,1,2,….90). These strings are used as the key in our hash table and the data it maps to is an integer vector containing all of the positions where this n-char string pattern appears in the DNA. We will use this to take our DNA and map it to the reference DNA. By the end of this week, we were able to get this to work so now we just have to implement the algorith that uses our hash table to find the mutations and recreate the DNA.
Grade for Week: A (got initital code done and setup the hash table)
May 25 - June 1 Progress
This week Armen worked on the algorithm that is used to recover a DNA sequence from a reference DNA. This algorithm was quite similar to the one our professor discussed in class. The algorithm assumes that there will be a perfect match at some point (perhaps more than one point) in a sequence read. When the match occurs, you go to all of the different positions in the reference DNA sequence that have this match and compare the DNA in that region with the read generated. After going through all the possible suspects, you accept the one with the lowest difference ratio between the read and the reference DNA. After working on the algorithm, we found that the error rate was quite low as long as the percentage of mutations was relatively low. Since the DNA of two individuals is likely to be very similar, the program we had created worked reasonably well.
After the coding was done, Armen and I worked together to run many tests on our program and see when it performs well and when it doesn't. Some of the results we got were fairly consistent with our intuition. For example, we found that as we increased the mutation rate, the percentage of success in our program began to decrease. However, when we increased the read length, the percentage of success began to increase again. This convinced us of the vital importance of having longer read lengths if possible. Of course, if the mutation rate was relatively small (like .1%), bigger read lengths were not as vital because even relatively short read lengths resulted in 100% accuracy. Another thing we found which was less intuitive is that when we decreased the key size of our hash table, we got better accuracy. This seemed somewhat counterintuitive. However, decreasing the key size has a negative impact on performance because you would have more collisions in your hash table and thus have more possible candidates to test. Thus, while decreasing key length has a positive impact on memory usage and accuracy, it results in slower execution of the algorithm. After we collected all of these results, we graphed them using Excel and incorporated into our final presentation.
Grade for Week: A (project completed - implementation and presentation. Only thing left is our presentation)
June 1- June 8 Progress
On Monday of this week we presented our project. It seemed that the professor liked what we did, and we were glad to have our project completed.
Grade for Week: Based on professor's evaluation, hopefully an A.
June 8 - June 15 Progress
Final code and powerpoint presentation uploaded to the site.