CNV-Reconstruction

My name is Dan He, first year Ph.D student.
My project is to design efficient algorithms to reconstruct CNVs (Copy Number Variations). This is joint work with Nick. Nick designed a probabilistic model to detect CNV regions and CNV-counts. My program then takes these as inputs to further reconstruct the CNV regions, where each copy of CNV may have different prefix and suffix. A complete introduction to CNV is here http://cs124project-2009.wikidot.com/cnvfindingnick.

Week 1 & 2:
The CNV-reconstruction problem is relatively new so there is no clear definition of the problem. The hardest part of this project is to define the problem clearly. I spent a lot of time thinking of possible definitions and parameters used to define the problem. Quite a few definitions were proposed. Prof. Eleazar and I discussed these definitions for a long time. Finally we agree on the following definition:

The CNV-reconstruction problem is defined as:

• Input: Reference sequence, Set of paired-end reads randomly sampled from donner sequence, Coverage ratio.
• Output: Donner sequence such that (1) The number of CNV is consistent with the coverage ratio (2) The number of mismatches between the set of paired-end reads and the donner sequence is minimized.

Week 3
I developed an algorithm to solve the naive version of CNV-reconstruction problem, where the reads are error-free and the CNV regions are clean, without other structural variations.

The algorithm consists of three major steps:

2. Find junctions between copies using Un-Mapped Reads: (1) split the Un-Mapped Read into 2 parts at each internal position of the Read.(2) match both parts to the reference sequence. If both parts match the reference sequence, record the matched positions.
3. Order the junctions such that the donor sequence is valid, when you have more than 2 copies and the copies are of difference length.

Week 4
I extended the algorithm to handle the cases where the reads may contain errors. The same algorithm can be applied here. However, more false positive junction candidates will be generated. A clustering algorithm is then applied to group junctions candidates adjacent to the same cluster. Thus the bigger the cluster is, the more likely the true positive junctions are in the cluster.

Week 5 & 6
I implemented the algorithm. I also conducted a set of experiments to validate the performance of the algorithm. Here are the experimental settings and some of the results:

• Reference sequence length: 1000
• CNV length: around 100
• Length threshold for significant match: 10
• Compare cases for 3 copies and 4 copies, with allowed errors as 0, 1, 2, respectively
• Simulate reference sequence, CNV and reads using Nick’s simulator
• Coverage Ratio: 40x

RUNNING TIME VS. NUMBER OF COPIES: Conclusion:

• Copy-counts affects the running time positively.
• Error rate affects the running time positively.

NUMBER OF CANDIDATE JUNCTIONS VS. NUMBER OF ERRORS Conclusion:

• Copy-counts affects the # of candidate junctions positively.
• Error rate affects the # of candidate junctions positively.