Short reads are mapped to the genome and then compared to identify SNPs. However, some reads contain sequencing errors, and these errors can look like SNPs. Fortunately, a true SNP will appear in many reads, while any particular error occurs much less often. A SNP caller identifies the SNPs from these reads while being resilient to errors.
I'm a first-year computer science master's student. I'm taking this course for fun, and I have zero experience in biology and near-zero experience in statistics. My hobbies include working on open source projects (Sonata, Git) and surfing. My research interests are in systems and programming languages.
This project will focus on writing a SNP caller and then using the caller to estimate how many reads are required to call SNPs accurately.
|Week|Plan|
|---|---|
|4|Set up wiki site and study for midterm|
|5|Gather data and work on a simple short read generator if needed|
|6|Start work on caller|
|7|Continue work on caller|
|8|Start formulating estimate|
|9|Finish formulating estimate|
The wiki looks good, so this week is a success. Now I've got to get to studying for the midterm, though…
Morale(Grade for week): Good
This week was spent looking around the web for SNP calling algorithms. I found this PDF describing how Illumina does their calling. It's very basic, but it's a good outline. I also found an open source project from the Broad Institute of MIT and Harvard called Birdsuite. Next week I start the caller, so hopefully that goes well.
Morale(Grade for week): Neutral
I was busy with other classes this week, so sadly I don't have anything to report. I hope to get started on the code next week and not fall too far behind schedule.
Morale(Grade for week): Poor
I got most of the initial work done for the caller this week.
I'm using a modified version of Nick's read generator. Essentially, I've split it into smaller modules that I can run from the command line. This way I can generate FASTA sequences with one program, pipe the FASTA to a SNP-generating program, and then pipe that result to a read generator to produce the reads. Admittedly these are small Python programs, but this work lets me generate all the data in one line, with the added bonus of being able to customize the pipeline without editing any files. Also, the kernel parallelizes the pipeline stages for me.
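As a rough illustration, the SNP-generating stage might look something like this (a minimal sketch with made-up names and a plain sequence in place of real FASTA parsing; the actual scripts are on GitHub):

```python
import random

def introduce_snps(sequence, snp_rate=0.01, seed=None):
    """Replace a fraction of bases with a different base.

    Returns the mutated sequence and a list of (position, old, new)
    tuples recording the SNPs that were introduced.
    """
    rng = random.Random(seed)
    bases = list(sequence)
    snps = []
    for i, base in enumerate(bases):
        if rng.random() < snp_rate:
            # Pick a replacement base that differs from the original.
            new = rng.choice([b for b in "ACGT" if b != base])
            snps.append((i, base, new))
            bases[i] = new
    return "".join(bases), snps
```

In the real pipeline, each module wraps a function like this with stdin/stdout plumbing so the stages can be chained with shell pipes.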
I've uploaded the code to github too, so you can see it here.
The caller uses the consensus method right now. The algorithm is outlined as follows:
For every read:
- Compare the read's alleles to the corresponding positions on the reference sequence.
- If an allele differs from the reference:
  - Increment the read coverage counter for that position.
  - Update that position's count for the allele.
- If it doesn't differ:
  - Increment that position's read coverage counter.

For every position with at least one allele recorded:
- For each recorded allele, calculate the fraction of covering reads that had the allele.
- If the fraction is greater than 0.5, call that allele as the SNP.
- Otherwise, return unknown (no allele gained a majority).
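The steps above can be sketched roughly like this (alignment is simplified to exact start offsets, and the names are mine, not necessarily the repo's):

```python
from collections import defaultdict

def call_snps(reference, reads):
    """Consensus SNP calling as outlined above.

    `reads` is a list of (start, sequence) pairs aligned to `reference`.
    Returns {position: allele}, where the allele is None when no
    alternate allele gains a majority (the "unknown" case).
    """
    coverage = defaultdict(int)  # position -> number of reads covering it
    allele_counts = defaultdict(lambda: defaultdict(int))  # pos -> allele -> count

    for start, seq in reads:
        for offset, allele in enumerate(seq):
            pos = start + offset
            coverage[pos] += 1
            if allele != reference[pos]:
                allele_counts[pos][allele] += 1

    calls = {}
    for pos, counts in allele_counts.items():
        calls[pos] = None  # unknown unless some allele gains a majority
        for allele, count in counts.items():
            if count / coverage[pos] > 0.5:
                calls[pos] = allele
    return calls
```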
I'm hoping that next week I can introduce errors into the reads and work on calculating the statistic for consensus correctness empirically. After that I might look into reducing the memory overhead of the tabulation.
Morale(Grade for week): Good
I introduced errors into my generated reads this week. I also wrote a simple script to verify the output of the SNP caller; it reports the number of miscalls and the total number of SNPs. Overall, the coding is complete. I'm now going to move on to verifying the error distribution for the consensus method. I hope to do that next week and finish up soon.
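For reference, a verifier along these lines might count miscalls like this (a sketch under my own scoring assumptions; the script in the repo may count differently):

```python
def score_calls(true_snps, calls):
    """Compare caller output against the known SNPs.

    `true_snps` maps position -> true alternate allele;
    `calls` maps position -> called allele (None for unknown).
    A miscall is a true SNP that was missed or called with the wrong
    allele, or a call at a position with no SNP.
    Returns (miscalls, total_snps).
    """
    miscalls = 0
    for pos, allele in true_snps.items():
        if calls.get(pos) != allele:
            miscalls += 1  # missed or wrong-allele call
    for pos, allele in calls.items():
        if allele is not None and pos not in true_snps:
            miscalls += 1  # false positive
    return miscalls, len(true_snps)
```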
Morale(Grade for week): Neutral
I've written the verifier this week. Essentially, all it does is check the output of the caller against the SNP FASTA file. I also wrote a statistic-gathering script to calculate the statistic I'm searching for. A formula was given in class, but my caller is performing worse than that formula says it should.
Here is a comparison of my caller versus the expected results from the class formula.
As you can see, my caller performs worse in almost every case. For the 1x coverage case, I believe the error rate is high because many SNPs are being called with only one read of coverage; in that case the consensus is just that single read, and there is no dissenting allele. Sadly, this doesn't explain the results for 25x coverage, where my caller breaks down past the 0.5 error rate barrier.
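For what it's worth, a common simple model for consensus miscalls — assuming independent reads that are each wrong with probability e, and a miscall whenever the errors form a strict majority — is the binomial tail (this may or may not be the exact formula from class):

```python
from math import comb

def consensus_miscall_prob(coverage, error_rate):
    """P(errors form a strict majority) under a simple binomial model:
    each of `coverage` independent reads is wrong with probability
    `error_rate`, and the consensus is miscalled when more than half
    of the reads are wrong."""
    c, e = coverage, error_rate
    threshold = c // 2 + 1  # smallest strict majority
    return sum(comb(c, k) * e**k * (1 - e)**(c - k)
               for k in range(threshold, c + 1))
```

Under this model the miscall probability at 25x stays tiny for error rates below 0.5 and shoots toward 1 once the error rate passes 0.5, which at least matches the shape of the breakdown I'm seeing.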
I've also created a graph to compare read coverage versus miscalls.
As you can see, the miscall rate is significantly lower with higher coverage, but as the error rate increases, the miscall rate converges to 100%.
Morale(Grade for week): Good but confused!
My presentation can be found here.