Stephen Douglass' Project Page

+ Paragraph Description of the project:

I'm going to be doing the short sequence read mapper project, medium difficulty. For this project I will align reads of length 30 to a genome of up to 3 billion bases, allowing up to 2 mismatches in each read. Reads with more than 2 mismatches will be thrown out, and reads that map to multiple regions in the genome with <=2 mismatches will have all positions reported (with the corresponding number of mismatches). The mapping will be done by first looking for perfect matches of length-10 subsequences in the reference genome and then comparing the entire 30-base read only at sites where one of its 10-mers matches perfectly (each read is split into 3 non-overlapping 10-mers, and a read with at most 2 mismatches must have at least one of the three unchanged). To do this the genome will be broken down into 10-mers to allow quick access (each 10-mer will correspond to one or more positions in the genome). I will break up the project into two separate pieces (if it makes computational sense after comparing efficiency). Part 1 will create an index for the genome, in which I break the genome down into unique 10-mers, and save it to a file so it can be accessed by part 2. Part 2 will map the reads onto the genome and create a text file output with each mapped sequence and all of the positions it maps to in the reference genome. The cost of breaking it into 2 steps is the additional save and load time for the index; the benefit is more rapid mapping of many different sets of sequences, because the index only has to be created once per genome instead of once per dataset. I expect the two-step procedure to be slower for a single set of sequences but faster overall for 2+ sets.
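
The seed-and-compare idea described above can be sketched in a few lines of Python. This is only a minimal in-memory illustration, not the project's actual code: the function names (`build_index`, `count_mismatches`, `map_read`) are made up for this example, and it assumes the whole genome and index fit in memory, which will not be true for a 3-billion-base genome.

```python
from collections import defaultdict

K = 10               # seed length from the description
MAX_MISMATCHES = 2   # reads with more mismatches are thrown out


def build_index(genome, k=K):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def count_mismatches(a, b, limit):
    """Count mismatches between two equal-length strings, stopping once past limit."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                break
    return mismatches


def map_read(read, genome, index, k=K, max_mismatches=MAX_MISMATCHES):
    """Return sorted (position, mismatches) pairs for every site within the limit."""
    hits = {}
    # Non-overlapping seeds: offsets 0, 10, 20 for a read of length 30.
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for seed_pos in index.get(seed, ()):
            start = seed_pos - offset  # where the full read would begin
            if start < 0 or start + len(read) > len(genome) or start in hits:
                continue
            window = genome[start:start + len(read)]
            mm = count_mismatches(read, window, max_mismatches)
            if mm <= max_mismatches:
                hits[start] = mm
    return sorted(hits.items())
```

Because every site is reached through a perfectly matching seed, a read with 3 or more mismatches spread across all three 10-mers is never even compared at its true position, which is exactly the behavior wanted here.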

+ Paragraph about me:

I'm a first year graduate student and I'm about to transfer into the new bioinformatics PhD program. I did my undergraduate here at UCLA as a Molecular, Cell and Developmental Biology major with a Computing specialization (mostly classes through the PIC department). I don't have a lot of favorites, and I lie when asked about my favorite color. One interesting thing about me is that I have a hard time filling an entire paragraph with information about myself.

+ Goal for end of quarter:

Make a reasonably efficient short read mapper that can map reads of length 30 to a genome of length 3 billion and output the results of all matches (not just one match per read) in an easy-to-read format. Allow it to retain efficiency when used to map multiple files (such as reads from person A, person B, and person C) to the same reference genome.

+ Weekly schedule:

Week of April 20th: Make this page and then study for the midterm on Monday.
Week of April 27th: Don't start coding, but create a plan with solid ideas and pseudocode to prove to myself that what I'm trying to do is possible for me without too much difficulty.
Week of May 4th: Download the human genome and finish part 1. Create the index for the human (or mouse) genome and check that it is working correctly.
Week of May 11th: Generate several files with various sequencing error/polymorphism rates to test part 2 with, start coding part 2.
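
One possible way to generate test reads with a chosen error rate is sketched below. It is only an illustration under assumptions: the function name `simulate_reads`, its parameters, and the choice to model errors as independent per-base substitutions are all hypothetical and not taken from the project. Recording the true start position with each read makes it easy to check the mapper's output later.

```python
import random


def simulate_reads(genome, n_reads, read_len=30, error_rate=0.01, seed=0):
    """Sample reads from the genome and substitute bases at the given rate.

    Returns a list of (true_start, read) pairs.
    """
    rng = random.Random(seed)  # fixed seed so test files are reproducible
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Always substitute a different base so each error is a real mismatch.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads
```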
Week of May 18th: Finish part 2 and apply test files from above, make sure everything works properly. See how efficient it is with various error rates.
Week of May 25th: Be ready to present this week (or even the Friday before). I want to present on the first day available.

+ Week of April 20th:

What I did this week
As promised, I did nothing on the project this week.
What I did last week
Last week I made this lovely page.
How what I did compared to what I planned to do
Spot on! I can't imagine how I could come any closer to meeting my expectations for this week!
What grade I think I deserve for my work on the project for the week
I don't think I did very well in making progress on the project this last week, but if you take the timeline into account I'm right on schedule! So, I'll be nice and give myself a positive grade.

+ Week of April 27th:

What I did this week
I did a little less of what I planned to do this week, but went ahead and did some of what I was going to do next week. This week I worked out how I'm planning to implement the design so that my program can run on a machine with 4 gigs of RAM (aka my laptop) without taking it too close to the crashing point, while still using enough RAM that it doesn't go too slowly. I didn't, however, get any pseudocode written as I wanted to this week. I also downloaded the human genome, which I originally intended to do next week, and I canceled a day trip this weekend, so hopefully I'll be able to catch up and maybe even get a little ahead of schedule for next week. I'm still planning on following the schedule above once I catch up.
Next week's plan
I plan to finish part 1 (creating the index) and also generate pseudocode for part 2.
What grade I think I deserve for my work on the project for the week
I'll give myself a good grade this week since working out the details wasn't as easy as I thought (mainly I can't store the entire index in memory). Go me!
Problems that came up
The big problem that came up was when I did a quick calculation and found out that the index for the human genome can't be stored in 4 gigs (at least not the way I'm planning on doing it). After class this became even more obvious.
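
A back-of-the-envelope version of that kind of calculation is sketched below. The layout is an assumption made purely for illustration (one 4-byte integer per genome position plus an 8-byte offset per possible 10-mer) and is not necessarily the layout planned for this project; the name `index_size_bytes` is likewise hypothetical.

```python
GENOME_LENGTH = 3_000_000_000  # bases, from the project description
K = 10                         # seed length, from the project description
BYTES_PER_POSITION = 4         # assumption: 32-bit ints (max 4,294,967,295 > 3 billion)
BYTES_PER_OFFSET = 8           # assumption: 64-bit offset per k-mer bucket
RAM_BYTES = 4 * 1024**3        # 4 gigs of RAM


def index_size_bytes(genome_length, k, bytes_per_position):
    """Estimate the size of a position list plus a per-k-mer offset table."""
    n_positions = genome_length - k + 1   # one entry per k-mer start
    n_kmers = 4 ** k                      # 1,048,576 possible 10-mers
    offsets_table = (n_kmers + 1) * BYTES_PER_OFFSET
    return n_positions * bytes_per_position + offsets_table


size = index_size_bytes(GENOME_LENGTH, K, BYTES_PER_POSITION)
print(f"index: {size / 1024**3:.1f} GiB, RAM: {RAM_BYTES / 1024**3:.1f} GiB")
```

Under these assumptions the position list alone is roughly 12 billion bytes, about three times the available RAM, before counting the genome sequence itself.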
Problems solved this week
I think I came up with a solution to the above problem, but I won't know until next week when things start running.

+ Week of May 4th:

What I did this week
This week (and by this week I mean pretty much entirely Saturday) I created the index for the human genome that I will be using. My project first creates and saves the index and then does the actual alignment as a separate step, which should improve performance when the user repeatedly aligns files to a single genome. I also worked out on paper how I'm going to do the remaining parts of the project.
Next week's plan
Next week (probably another Saturday) I intend to finish the coding portion of the project. I hope to be completely done coding by Monday so that I can be set to present by Friday of next week. The plan will be to finish coding this weekend so that I can write the powerpoint and do the efficiency analysis (and the homework) during the week.
What grade I think I deserve for my work on the project for the week
I'll give myself a good grade this week since I did everything I set out to do in my outline.
Problems that came up
Not too many problems came up this week. This portion of the program takes a long time to run (~15 hours on my laptop) so I broke it up into pieces so that I could run some other stuff on my laptop between chromosomes without my computer slowing to a crawl.
Problems solved this week
I kind of addressed it above: I made it so that the index is created one chromosome at a time, so that I can pause between chromosomes to run programs for work I'm doing in the lab or whatever.
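
A minimal sketch of building and saving the index one chromosome at a time might look like the following. The file naming scheme, the use of `pickle`, and the function names `index_chromosome` and `load_chromosome_index` are assumptions for illustration only; the on-disk format actually used in the project is not described on this page.

```python
import pickle
from collections import defaultdict
from pathlib import Path


def index_chromosome(name, sequence, out_dir, k=10):
    """Build the k-mer index for one chromosome and save it to its own file.

    If the file already exists the chromosome is skipped, so a long run can
    be stopped between chromosomes and resumed later.
    """
    out_path = Path(out_dir) / f"{name}.k{k}.idx"
    if out_path.exists():
        return out_path
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    # Write to a temporary file first so an interrupted run never leaves
    # a half-written index that looks finished.
    tmp_path = out_path.with_suffix(".tmp")
    with open(tmp_path, "wb") as fh:
        pickle.dump(dict(index), fh)
    tmp_path.replace(out_path)
    return out_path


def load_chromosome_index(path):
    """Load a saved per-chromosome index back into memory."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Keeping one file per chromosome also means the mapping step only needs to hold a single chromosome's index in memory at a time.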

+ Week of May 11th:

What I did this week
This week I finished coding the assignment and ran it under a few conditions to see how it performs. I also wrote my powerpoint and got ready to present the results, although the efficiency analysis is somewhat lacking (I varied the number of input sequences but not their length).
Next week's plan
Next week I plan on presenting.
What grade I think I deserve for my work on the project for the week
I'll give myself a good grade this week since I finished the thing.
Problems that came up
Not too many problems came up this week. The length of time it takes to complete a run gets in the way a little bit but nothing that a surplus of time can't solve.
Problems solved this week
All of them. Feels good to be done… even if it isn't very efficient.

+ Week of May 18th:

What I did this week
This week I ran the program under a few more conditions and wrote the powerpoint to present. I'm planning to present this week.
Next week's plan
Do nothing 'cause I'll be done!
What grade I think I deserve for my work on the project for the week
I'll give myself an A this week because I finished the project and was ready to present on the first day available.
Problems that came up
No real problems came up this week since I was just re-running the program to generate a comparison table and writing the powerpoint.
Problems solved this week
Didn't solve any problems because I didn't have any…

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License