Project Description (final presentation: nfurlotte224FinalPresentation.ppt)
Genomic variation plays a major role in determining many of our characteristics. Everything from hair color to disease susceptibility is at least in part determined by our particular patterns of variation. In the last few years, researchers have switched focus from genomic elements like microsatellites to single nucleotide polymorphisms (SNPs), single base differences that are common in the population. These SNPs have served well as a way to quantify genetic variation. It has been known for a long time that many other types of variation exist in the genomes of organisms, but the available technology has limited our ability to find them. In particular, structural variations are thought to account for a significant portion of our genomic variation.
One type of structural variation is called a copy number variation (CNV). CNVs are segments of genomic DNA that appear in a variable number of copies in the population. CNVs have been implicated in a number of human diseases, and it has been shown that CNVs contribute to a significant portion of human variation (Henrichson et al. 2009). Recent studies have attempted both to find CNVs and to associate them with phenotypic variation. The association study aspect is particularly exciting, as it is hypothesized that CNV effects will be very large and thus associations will be detectable with very small sample sizes. Unfortunately, the techniques for finding CNVs are not very fine-grained. The recent studies actually focus on finding copy number variable regions (CNVRs), which are larger segments that include one or more individual CNVs. In other words, the general area of the CNV is found, but its actual location and sequence are not recovered.
New high-throughput sequencing technology gives us the opportunity to pinpoint the exact locations of CNVs and recover the actual sequence that makes up the different copies. In this project, I plan both to present a formalization of the problem of finding CNVs using paired-end sequence reads and to present and implement an algorithm to solve this problem.
Change in Project and Revised Goals
My original idea was both to find the locations of potential CNVs using paired-end sequence data and to attempt to reconstruct the genomic sequences. After talking with Dan and Eleazar, I realized that each part of the problem was pretty hard and that it made sense to split it into two and work on each part separately. Dan was making progress on the reconstruction problem, so I decided to focus on the problem of finding the CNVs. I therefore revised my goals, as listed in the following section.
Goals for this Quarter
- Write a read simulator that will take a genomic sequence and output FASTQ reads.
- Map these reads using MAQ
- Use the MAQ mapping to locate CNVs
- Assign a significance to each CNV and calculate a probability corresponding to the number of copies.
- Validate this method using both the simulated data and the mouse sequence data that we have.
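The first goal can be illustrated with a minimal sketch of a paired-end read simulator. This is not the project's simulator (which was written in C); it is a Python illustration under simplifying assumptions: reads are error-free, every fragment has the same insert length, and all base qualities are a constant placeholder. The names simulate_pairs, fastq_record, read_len and insert_len are hypothetical.

```python
import random


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def simulate_pairs(genome, n_pairs, read_len=36, insert_len=200, seed=0):
    """Yield (name, forward_read, reverse_read) tuples.

    Assumption: each fragment is exactly insert_len bases long and is
    sampled uniformly from the genome, with no sequencing errors.
    """
    rng = random.Random(seed)
    for i in range(n_pairs):
        start = rng.randint(0, len(genome) - insert_len)
        fragment = genome[start:start + insert_len]
        fwd = fragment[:read_len]
        # The reverse read comes from the opposite strand at the far end.
        rev = reverse_complement(fragment[-read_len:])
        yield f"pair{i}", fwd, rev


def fastq_record(name, seq, qual_char="I"):
    """Format one read as a four-line FASTQ record with constant quality."""
    return f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n"
```

Writing the forward and reverse reads of each pair to two separate FASTQ files, in the same order, gives the kind of input a paired-end mapper expects.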
My name is Nick Furlotte. I am a first year CS PhD student working in Dr. Eskin's lab. I am interested in solving problems that lie in the realm of computer science, statistics and genetics.
Accomplishments This Quarter
- Implemented the simulator. Details: Written in C, reads and writes MAQ map files, computes coverage for a given map file and region, generates random sequence, generates FASTQ reads
- Defined a method for finding CNVs given the mapping data. The method takes advantage of discordant read pairs and has three steps, described in the next three items.
- Find discordant read pairs, that is, pairs that have the reverse read mapped upstream of the forward read. This configuration implies a CNV.
- Cluster discordant reads. The clustering is a really simple greedy approach. The mapping locations of the forward reads that belong to the same CNV can only be a certain distance apart, so the clustering is easy: just group together all discordant reads that fit this property. The assumptions are that the maximum insert length is known and that the CNVs are far apart.
- Once the clusters of discordant reads have been formed, each of them corresponds to one CNV. The boundary estimates for a CNV are then obtained by taking the minimum of the reverse read mapping positions as the start, and the maximum of the forward read mapping positions plus the read length as the end.
- I attempted to calculate a probability for the number of copies using the assumption that the number of reads mapping to a specific window follows the Poisson distribution with lambda = coverage*windowLength/readLength. I calculated a chi-square statistic based on the observed number of reads and the number of reads expected under this distribution. This statistic was unable to accurately predict the number of copies for my true positives. I think that using an empirically derived distribution based on the coverage at each position might be better. Instead of using this approach, I decided to use the simpler coverage ratio (mean coverage in the window / expected coverage).
- Applied the method to simulated data. Data was simulated by taking random sub-sequences from an actual mouse chromosome and stringing them together to create a 10 Mb reference genome. I inserted 5 CNVs, produced reads and mapped them to the fake reference using MAQ. I applied the method and was able to recover all of the CNVs without any false positives. The coverage ratio for the CNV regions was always around 2, suggesting 2 copies, which was indeed the case. Basically, the method worked as expected in the simple simulated case.
- I then applied the method to the real mouse data and found over 1400 possible CNVs. Many of them were obviously wrong. I filtered the results and was able to narrow down to 5 that looked kind of interesting. I have not been able to find a source to validate them. For the real data, the method did not work that well and still needs a lot of work.
- My final presentation is linked here: nfurlotte224FinalPresentation.ppt
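The detection steps listed above can be sketched roughly as follows. This is a Python illustration, not the project's C code, and it makes several assumptions: each pair is reduced to a tuple (forward_pos, reverse_pos) of leftmost mapping coordinates, a cluster is closed once a forward read lies more than max_insert beyond the first forward read in that cluster, and the mean coverage in a window is approximated as reads * readLength / windowLength. All function names are hypothetical.

```python
def find_discordant(pairs):
    """Keep pairs whose reverse read maps upstream of the forward read."""
    return [(fwd, rev) for fwd, rev in pairs if rev < fwd]


def cluster_discordant(discordant, max_insert):
    """Greedily group discordant pairs by forward-read position.

    Assumption: CNVs are far apart, so pairs whose forward reads fall
    within max_insert of a cluster's first forward read share one CNV.
    """
    clusters = []
    for fwd, rev in sorted(discordant):
        if clusters and fwd - clusters[-1][0][0] <= max_insert:
            clusters[-1].append((fwd, rev))
        else:
            clusters.append([(fwd, rev)])
    return clusters


def estimate_boundaries(cluster, read_len):
    """Start = min reverse position; end = max forward position + read length."""
    start = min(rev for _, rev in cluster)
    end = max(fwd for fwd, _ in cluster) + read_len
    return start, end


def expected_reads(coverage, window_len, read_len):
    """Poisson mean from the text: lambda = coverage*windowLength/readLength."""
    return coverage * window_len / read_len


def coverage_ratio(n_reads, window_len, read_len, expected_coverage):
    """Mean coverage in the window divided by the expected coverage."""
    mean_coverage = n_reads * read_len / window_len
    return mean_coverage / expected_coverage
```

For each cluster, the boundaries from estimate_boundaries define the window whose read count is passed to coverage_ratio; a ratio near 2 would then suggest two copies, as in the simulated case described above.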
Joint Project with Dan
Dan and I combined our projects in order to produce a complete method that can identify and reconstruct the CNVs given mapped reads. We submitted the abstract to the ASHG conference and will hopefully be presenting a poster there in October. Below is the title and abstract from this submission.
Title: Discovery and Reconstruction of Copy Number Variations
Structural variations such as copy number variations (CNVs) account for a large portion of human genetic variance. The accurate discovery and reconstruction of the genomic regions containing CNVs is important for the understanding of phenotypic variation. There exist array-based methods for discovering CNVs, but these methods are unable to determine the regions in copy number with high resolution. The rapidly decreasing cost associated with next generation sequencing technologies offers an alternative for finding CNVs. In this project, we describe a computational method that utilizes the properties of paired-end read sequences to identify CNVs. Our method first determines the genomic regions that likely contain CNVs by clustering discordant paired-end reads. Each potential CNV is assigned a confidence score and probabilities corresponding to the number of copies in the region. Once an estimate of the region has been obtained, we are then able to reconstruct the exact sequences making up the CNV using un-mapped reads, namely reads that span the junctions between copies and that do not match anywhere in the reference genome. Given the high coverage of next generation sequencing technologies, all junctions are covered by un-mapped reads with high probability. Copies that differ from the original reference region will do so by having different starting and ending positions, which will result in different prefixes and suffixes for these copies. Therefore, by using un-mapped reads we are able to recover the exact start and end positions of each copy within the original CNV region by mapping their prefixes and suffixes back to the reference sequence. Candidate positions can be identified when both prefix and suffix match the CNV region in the reference sequence. SNPs in different copies can then be used to recover the order of these copies. For experimental evaluation, we tested our method on both simulated random sequences and real genomic sequences. 
Specifically, we generated realistic genomic sequences by randomly sampling and concatenating large sub-sequences from an actual chromosome. This process maintains the repeat structure found in real genomes, which does not exist in completely random sequence. CNVs of differing lengths were inserted at various spacings, and simulated reads were then created and mapped with a widely used mapping program. The results indicate that our method is able to reconstruct the CNVs accurately for both datasets.