**Project Description**

This project is about cancer genomics. Cancer can often cause a deletion or amplification of a gene. In this project, we consider deletions only. This project computes the likelihood of a region of consecutive homozygous SNPs. If the likelihood is low, it may be indicative of a cancer region.

**About Me**

I am a third-year computer science major. I am planning to pursue graduate school after I finish at UCLA, and I am hoping to learn more about information retrieval.

**Goal**

My goals for the spring quarter are to design a method that takes a region of genetic data and computes the likelihood of each run of homozygous SNPs, and to write a small program that implements it.

**Weekly Schedule**

- Week 5- Perform research and background reading.
- Week 6- Design method to compute the likelihood.
- Week 7- Design method to compute the likelihood.
- Week 8- Implement program.
- Week 9- Implement program. Review and analyze effectiveness of method and program.

**Weekly Updates**

6-9-09

- Progress for Week
- I downloaded the data for Chromosome 19 and verified that its longest runs were also very long and that, consequently, their probabilities were very low.

- Plan for Next Week
- N/A

- Evaluation of Week
- Done!

- Problems solved
- Found that the same phenomenon occurs on other chromosomes too. It looks like this really happens.

6-2-09

- Progress for Week
- No updates before finals!

5-25-09

- Progress for Week
- Verified that the results make sense. It's surprising, but it appears that one should not use the allele frequencies alone to form this prediction!

- Plan for Next Week
- Try it out on another chromosome and see if the same thing happens.

- Evaluation of Week
- I'm done with the easy project, so the week went by very well.

- Problems solved
- Verified the results again.
- Learned that running in Release mode makes the code orders of magnitude faster.

5-18-09

- Progress for Week
- I verified that the longest run reported does actually exist. I also verified (again) that given that the length of the run is accurate, the probability is the correct order of magnitude. I modified my program to use less memory, because I noticed the CPU was underutilized and the memory size was enormous.
- I tested Chromosome 1 in its entirety for the CEU panel of 90 individuals. All probabilities were extremely close to 0. The highest probability was on the order of $10^{-60}$, even after adjusting for the fact that the run could start at any location on the chromosome.
- One possible explanation is that the assumption that all SNPs are independent is badly wrong. We already know the assumption is not strictly true because of linkage disequilibrium, but it is still surprising to see that it completely invalidates our results.
- I began working on using correlation. I obtained the correlation data from the HapMap website.
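
The adjustment for the run's starting position can be sketched as a union bound. This is one plausible form of the adjustment, not necessarily the one used in the program. The symbols $N$ (number of SNPs on the chromosome), $L$ (run length), $p_i$ (allele frequency at SNP $i$), $h_i$ (probability that SNP $i$ is homozygous), and $h_{\max} = \max_i h_i$ are introduced here for illustration.

$$
P(\text{some run of length } L) \;\le\; \sum_{s=1}^{N-L+1} \prod_{i=s}^{s+L-1} h_i \;\le\; (N-L+1)\, h_{\max}^{L}, \qquad h_i = p_i^2 + (1-p_i)^2
$$

Taking logarithms, the correction adds at most $\log_{10}(N-L+1)$ to the exponent. With, say, $N = 10^5$ SNPs (an illustrative value), it can raise the probability by no more than five orders of magnitude, so it cannot offset a product that is dozens of orders of magnitude below 1.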

- Plan for Next Week
- Next week, I plan to identify a method that uses the correlation between SNPs to estimate the likelihood of a run of SNPs occurring.

- Evaluation of Week
- The week went slowly, but I did some much-needed work verifying my results and improving performance. Overall, my progress for the week was average.

- Problems that came up
- The correlation data was humongous. It is over 2 gigabytes for a single chromosome. File IO consumes a lot of time, and I cannot even open it in a Windows text editor.
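
A file of this size does not have to be opened in an editor or held in memory; it can be streamed one line at a time. The sketch below assumes each line carries whitespace-separated columns in the order position 1, position 2, population, rs ID 1, rs ID 2, D', r², followed by further columns that are ignored. That layout is an assumption about the HapMap LD files and should be checked against the download's documentation. The function name `count_strong_pairs` and the threshold `min_r2` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Count SNP pairs whose r^2 is at least min_r2, reading one line at a time
// so that memory use stays constant regardless of file size.
// Assumed column order: pos1 pos2 population rs1 rs2 Dprime r2 ...
std::size_t count_strong_pairs(std::istream& in, double min_r2) {
    assert(min_r2 >= 0.0 && min_r2 <= 1.0);
    std::size_t count = 0;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        long pos1, pos2;
        std::string population, rs1, rs2;
        double dprime, r2;
        // Lines that do not match the assumed layout are skipped.
        if (fields >> pos1 >> pos2 >> population >> rs1 >> rs2 >> dprime >> r2) {
            if (r2 >= min_r2) ++count;
        }
    }
    return count;
}
```

Passing a `std::ifstream` opened on the correlation file processes it in a single pass. The same loop could write the filtered pairs to a much smaller file for later runs.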

- Problems solved
- Verified that the resulting probabilities really are effectively zero.
- Decreased the time and memory needed to run the program.
- Obtained correlation data.

5-11-09

- Progress for Week
- No updates this week because of papers!

5-4-09

- Progress for Week
- I talked to Professor Eskin about the project. I thought of how to compute the probability of a run of homozygous SNPs occurring. The probability of a single homozygous SNP occurring is $p_A^2 + (1-p_A)^2$, where $p_A$ is the frequency of allele A. The probability of a run occurring is the product of the individual probabilities. I designed an algorithm to determine an individual's longest run of homozygous SNPs. Feeling pretty confident, I went ahead and implemented a quick version to run on a small data set (10,000 SNPs on 90 individuals). The probabilities are extremely small, which seems odd. It suggests that everybody definitely has cancer. I verified that the probabilities were the correct order of magnitude, assuming the length of the homozygous run is correct. It is possible that the assumption the easy project makes, which is that the SNPs are independent, is a faulty one.
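
The method described above can be sketched in a few lines. This is an illustration, not the program written for the project: the names `homozygous_prob`, `Run`, `longest_homozygous_run`, and `log10_run_prob` are hypothetical, genotypes are assumed to be two-character strings such as "AA" or "AG", and a call containing 'N' is assumed to mean missing data and to break a run. The product is accumulated as a sum of logarithms so that very small probabilities do not underflow.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Probability that a single SNP is homozygous, given the frequency pA of
// allele A: pA^2 + (1 - pA)^2.
double homozygous_prob(double pA) {
    return pA * pA + (1.0 - pA) * (1.0 - pA);
}

struct Run {
    std::size_t start;   // index of the first SNP in the run
    std::size_t length;  // number of consecutive homozygous SNPs
};

// Longest run of homozygous genotypes for one individual (single pass).
// Assumption: a genotype containing 'N' is missing data and ends the run.
Run longest_homozygous_run(const std::vector<std::string>& genotypes) {
    Run best{0, 0};
    std::size_t cur_start = 0, cur_len = 0;
    for (std::size_t i = 0; i < genotypes.size(); ++i) {
        const std::string& g = genotypes[i];
        bool homozygous = g.size() == 2 && g[0] == g[1] && g[0] != 'N';
        if (homozygous) {
            if (cur_len == 0) cur_start = i;
            ++cur_len;
            if (cur_len > best.length) best = Run{cur_start, cur_len};
        } else {
            cur_len = 0;
        }
    }
    return best;
}

// log10 of the probability of the run, treating SNPs as independent.
// freqs[i] is the frequency of allele A at SNP i.
double log10_run_prob(const std::vector<double>& freqs, const Run& run) {
    assert(run.start + run.length <= freqs.size());
    double log_p = 0.0;
    for (std::size_t i = run.start; i < run.start + run.length; ++i) {
        log_p += std::log10(homozygous_prob(freqs[i]));
    }
    return log_p;
}
```

Because each factor is at most 1 and at least 0.5, a run of a few hundred SNPs already gives a vanishingly small product under the independence assumption, which is consistent with the tiny values observed.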

- Plan for Next Week
- Next week, I need to verify that the program is computing the lengths of the homozygous runs correctly. I also plan to investigate other reasons why the probability is so low.
- I want to consider how I can do the medium project, which is to use haplotype frequencies to estimate the probability.

- Evaluation of Week
- The week went very well. I was able to design the method, as planned, and even write some code to test my design.

- Problems that came up
- IO was extremely slow, as expected. I was able to greatly improve read times by using fread instead of >> or fscanf; fread was orders of magnitude faster. However, when I then parsed the string I had read in using sscanf, the total time was roughly the same as before. Thus, I switched back to >> and fscanf, since the much more complicated fread/sscanf code did not speed things up. Perhaps there is a more efficient way to parse the string that is read in.
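
One way to approach that last question is to replace sscanf with a hand-written scanner that walks the buffer once and records where each field begins and ends, without copying. The sketch below is not the project's code: `read_file`, `Field`, `to_string`, and `tokenize` are hypothetical names, and the input is assumed to be whitespace-separated text with one record per line. Whether this actually beats >> and fscanf would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Read a whole file into memory with a single fread call.
// Returns an empty string if the file cannot be opened.
// Note: ftell returns long, so very large files need a 64-bit alternative.
std::string read_file(const char* path) {
    std::string buf;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return buf;
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    if (size > 0) {
        buf.resize(static_cast<std::size_t>(size));
        std::size_t got = std::fread(&buf[0], 1, buf.size(), f);
        buf.resize(got);
    }
    std::fclose(f);
    return buf;
}

// A field is a half-open character range [begin, end) inside the buffer.
struct Field {
    const char* begin;
    const char* end;
};

// Copy a field out of the buffer only when a std::string is really needed.
std::string to_string(const Field& f) {
    assert(f.begin <= f.end);
    return std::string(f.begin, f.end);
}

// Split the buffer into rows of whitespace-separated fields in one pass.
// Blank lines are skipped. The fields point into buf, so buf must outlive them.
std::vector<std::vector<Field>> tokenize(const std::string& buf) {
    std::vector<std::vector<Field>> rows;
    std::vector<Field> row;
    const char* p = buf.data();
    const char* end = p + buf.size();
    while (p < end) {
        if (*p == '\n') {
            if (!row.empty()) { rows.push_back(row); row.clear(); }
            ++p;
        } else if (*p == ' ' || *p == '\t' || *p == '\r') {
            ++p;
        } else {
            const char* start = p;
            while (p < end && *p != ' ' && *p != '\t' && *p != '\r' && *p != '\n') ++p;
            row.push_back(Field{start, p});
        }
    }
    if (!row.empty()) rows.push_back(row);
    return rows;
}
```

A genotype field can then be checked for homozygosity by comparing its two characters directly in the buffer, which avoids both the format-string interpretation of sscanf and per-field string allocation.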

- Problems solved
- Designed method to compute probabilities.
- Found test data from the HapMap data.
- Parsed test data.
- Found longest homozygous run.
- Computed the probability of a given run of homozygous SNPs.

4-27-09

- Progress for Week
- I did background reading on my topic this week, which was cancer genomics. I read about how cancer mutations can lead to amplification, deletion, and copy number variation.

- Plan for Next Week
- Next week, I plan to lay down a high-level strategy for the project. I also want to look into each step of that strategy so that, the week after, I can fill in the details.

- Evaluation of Week
- The week went as planned. It brought up some questions about the project.

- Problems that came up
- I need to clarify the project with Professor Eskin. In particular, I'm not clear on what a run of homozygous nucleotides means. If there is a deletion on one chromosome due to a cancerous mutation, would that cause it to copy the allele of its matching chromosome? Or does it mean that, on a single chromosome, there will be a run of positions with the same nucleotide? If so, isn't that more like copy number variation or amplification?
- Also, I need to find data with long continuous blocks of genes, not just the common SNPs.

- Problems solved
- I decided that using the HapMap data is insufficient. Since I need to detect a run of nucleotides, I need to check adjacent gene loci. The HapMap data only provides common SNPs, which are not guaranteed (or likely) to be adjacent.