Week Four (4/19-4/25)
Done
- Create this page (yay!)
- First goal of project written out
To Do Next
- Better describe the (first?) goal of the project
- Plan out first goal implementation/theory before any coding begins
First Goal: Develop a mapper that can take many short reads and use a reference genome of length L to discover the target sequence the reads came from. The mapper should be able to identify the indels in the target sequence of length one. We will produce a random reference sequence, create single base indels in it, feed this new target sequence to a read simulator, and then pass these reads and original sequence into the mapper to test it.
Grade: A-/B+. Finally started doing things but not nearly as much as I would have liked to.
Week Five (4/26-5/2)
Done
- More explicit description of the problem and goal
- "slider": Slides a read along a reference to show matches, mismatches, etc
To Do Next
- Use the slider to see patterns in deletions/insertions as the read slides along the reference
- Algorithm description (naive at first)
Description:
A mapper that can handle single bp-length indels in a target sequence. It will take a reference sequence and multiple reads from a target to attempt to recreate the target sequence.
Valid Input
1. Reference genome
2. Multiple reads from a valid target sequence
Valid Target Sequence
1. Through a serience of single bp-length indels, the target sequence and reference sequence are equivalent
2. The indels have to be indelDistmin (=30) bps apart
Example of construction of a target:
(1) Start with a valid reference
(2) Choose a random spot x that is >= indelDistmin away from another spot that has been chosen and perform one of the 5 allowable operations: delete(x), insert(x,'A'), insert(x,'C'), insert(x,'G'), or insert(x,'T').
(3) Repeat (2) as desired
Valid Reference
We will assume that any two 30mers in a reference genome will have an edit distance of greater than 2.
Valid Output
1. Predicted target sequence somehow indicating (list of ops, highlighting, showing reference as well, etc.) of how the target and sequence are different from one another
testSliderMatcher.py: A few functions that check matches and mismatches. Very basic and naive algorithm that slides the read along the reference. It outputs the positions matched against in the reference, number of (mis)matches, percentage of matches, and the positions of the (mis)matches.
Grade: A-. Did a lot more this week and it seems like I'm on a path to figuring out a good way to find the indels.
Week Six (5/3-5/9)
Week timeline…
- (5/3) I think I have found a way to "detect" indels when sliding a read along a reference. I'll need to write up the notes better and upload if it turns out to be true.
- (5/4) Idea confirmed in lecture and in talks with professor. Try to post notes later.
- (5/5) Bulk of code to do a naive version of this algorithm completed. At this time only slides one read across the reference and outputs the operation (insert, delete, none). If there is time, will make it more complete.
Done
- Naive algorithm to do Mapper (1,0)
- Parametrized problem
To Do Next
- Start writing paper for Master's project
Problem
Given a set of reads from a valid target T and a valid reference, a mapper (from a set of mappers) should be able to recover the target sequence. A valid target is one having up to i+d differences from the reference within any distmin bp window. A valid reference genome is one where any read from all the reads only maps to the reference in one location. If invalid targets and/or reference sequences are used, the mapper(s)’ behavior will not be define. The mapper will output the target sequence or a list of operations to perform on the reference to create the target.
naiveOneIndelNoSNP.py: Naive algorithm for i=1 and d=0. It will find insertions and deletions if they are at least length of the read bp apart. It will spit back out the operations necessary to construct the target sequence. Some operations may not be the exact same as the originals, but should produce the same target. This is due to repetitions.
Grade: A. Did a lot this week in terms of code and parametrization of problem.
Week (5/10-5/16)
Done
- General outline of the paper
To Do Next
Rough Draft Intro to Paper
In recent years, the trend of DNA sequencing has gone towards high-throughput sequencing. This sequencing technology produces millions of short reads (30-100 bp long) in a much shorter time period than traditional methods. Because these reads are short relative to the target genome and they come from random locations, there is not enough information known to determine mutations (indels and mismatches) in the original genome. Traditional methods and algorithms for sequencing cannot handle these short sequences so a new class of algorithms needed to be created to be able to determine the location of a read of the original sequence against a reference sequence. Ideally, the algorithms are able to output a partial or complete target sequence so that analysis can be done to determine mutations in the forms of insertions, deletions, mismatches, etc. These algorithms “re”-sequence the reads back into the original sequence they came from.
Many of the more popular methods of mapping these many short reads do a good job of mapping the reads to a reference and “resequencing” the original sequence the reads came from. However, most of these methods only attempt to handle the mismatches and ignore other kinds of mutations such as indels (insertions & deletions). Typically, a mapping algorithm for mismatches attempts to map a read to a reference location allowing for a certain number of mismatches. Then using the results of several reads, the algorithm determines if these mismatches are caused by read errors in the DNA sequencer or are actually SNPs. Indels cause shifts in the original genome with respect to the reference genome. These shifts will often cause several mismatches to occur and the mapping algorithm will determine that the read does not belong there. While finding these mismatches is good for the biology being studied, it is also biologically important to find indels in the sequence. The indel mutations don’t exist in as high numbers as mutations in the form of SNPs, but at about [indel]% of the genome, they can be significant in their effect. Mappers of short reads need to be able to handle mismatches and indels.
Grade: B+/A-. Mostly working on the paper for the Master's version of the project. No new code to show.
Week Eight (5/17-5/23)
- (5/21) Uploaded a draft of the paper (draft #2)
Done
- This week was more working on the paper
To Do Next
- Continuing to work on the paper
"Finding" Indels
For each spot $i$ we have in the read $r=r_1 r_2 \cdots r_{\left|r\right|}$, we need to retain the state $M_i=\left[M_i^b,M_i^o,M_i^a\right]$ where each value contains the number of matches before, on, and after the position $i$. We also define: $M_i^j=M_i \mbox{ at step/time } j$.
Insertions:
For an insertion at position $x$, the following must hold at step $j$:
(1)
\begin{align} M_{x,j-1}^a=\left|r\right|-x \mbox{ and } M_{x,j}^b=x-1 \end{align}
Or to simplify…
(2)
\begin{align} M_{x,j-1}^a+M_{x,j}^b=\left|r\right|-1 \end{align}
Deletions:
For a deletion at position $x$, the following must hold at step $j$:
(3)
\begin{align} M_{x,j-1}^b=x-1 \mbox{ and } M_{x,j}^o=1 \mbox{ and } M_{x,j}^a=\left|r\right|-x \end{align}
Or to simplify…
(4)
\begin{align} M_{x,j-1}^b+M_{x,j}^o+M_{x,j}^a=\left|r\right| \end{align}
Grade: B+. Worked a lot on the paper but I wish I had done a little more.
Week Nine (5/24-5/30)
- (5/26) Received feedback on paper and starting to make changes on the paper
- (5/28) Finished first pass of revision of paper
- (5/29) Did some simulations to find mismatches and indels at the same time
Done
- Preliminary work on searching for indels and mismatches at the same time. Seems like to simply know about the existence of a mismatch with an indel, it is necessary to use one of those $M_{i,j}$ state formulas except check for values decreased by one. Then, if a mismatch exists, checking the other values to see where the mismatch is, etc. I need to do tests to see what happens if I delete a character in a read and then change the base of the position that took place of the deleted base.
Problems
To Do Next
Grade: A. Did a lot of writing for the paper.
Week Ten (5/31-6/6)
Done
- Presentation (and uploaded)
- Worked out how to find mismatches and indels
Working On
- Working on indexing and faster Mapper (1,0)
Grade: A-. Did a lot more on paper and work on finding indels and mismatches at once
Finals Week (6/7-6/13)
- (6/7) First version of Index Mapper (1,0). Didn't quite work the way I want to. Has lot's of corner case issues.
- (6/11) Finished the second version of the Index Mapper (1,0). It seems to work but there is an issue when I give it a lot of reads
- (6/11) Uploaded more source code files
- (6/12) Finished indexing mapper. Works!
- (6/13) Graduation!
Done
Working On
- Index version of Mapper (1,0). Having issues
Grade: A-. Almost donee with paper, did a decent job at presentation, and finished indexed mapper.