**Project Member**

George Wang

**The goal of this quarter**

- To implement the relatedness estimator (Project 1).

**About me**

- I am a 4th-year undergraduate majoring in CSE.

**The Schedule for the quarter / Grade**

DONE - set up the wiki page

DONE - problem analysis

DONE - prototyping

DONE - code

DONE - Presentation

**Project Description**

- The problem comes from genetics: given the genotypes of several individuals, how are they all related? Each parent transmits one copy of each chromosome to each child. Siblings share approximately 50% of their DNA, and 1st cousins share about 12.5%. The challenge is that some individuals may also share DNA by chance.

- In our study, a SNP is assumed to have a minor allele frequency (MAF) of 0.1. This leads to questions such as: if your brother has the allele on both chromosomes, what is the probability that you have the allele on both of your chromosomes? What changes if the minor allele frequency is changed? And across the range of frequencies, which MAF is the most informative?

- At the easy level of difficulty, the purpose of this project is to construct a method for determining whether 2 individuals are siblings.
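The brother question above can be worked out numerically. The following is only a minimal sketch, assuming Hardy-Weinberg genotype proportions in the parents and writing the minor allele as A and the other allele as C; the function name `prob_sibling_also_homozygous` is made up for illustration and is not part of the project code.

```python
from itertools import product

def prob_sibling_also_homozygous(maf):
    """P(you carry the minor allele on both chromosomes | your sibling does).

    Assumption: parents are drawn at Hardy-Weinberg proportions,
    with A as the minor allele and C as the other allele.
    """
    p, q = maf, 1.0 - maf
    freq = {"AA": p * p, "AC": 2 * p * q, "CC": q * q}
    joint = 0.0  # P(both siblings are AA)
    for father, mother in product(freq, repeat=2):
        # Chance that one child inherits A from the father and A from the mother.
        child_aa = (father.count("A") / 2) * (mother.count("A") / 2)
        # Given the parents, the two children are independent draws.
        joint += freq[father] * freq[mother] * child_aa ** 2
    return joint / freq["AA"]

if __name__ == "__main__":
    for maf in (0.01, 0.1, 0.3, 0.5):
        print(maf, prob_sibling_also_homozygous(maf), maf ** 2)
```

Under these assumptions, a MAF of 0.1 gives a conditional probability of roughly 0.30, compared with 0.01 (the plain population frequency of AA) for an unrelated person. The last column printed is that unrelated baseline, so the gap between the two columns can be compared across MAF values.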

**Week 1**

After visiting the professor's office hour, I had a basic idea of how to implement the likelihood estimation. The first task is to find, for a given MAF, the probability that an individual has each SNP genotype: AA, AC, or CC. Then, for any 2 individuals, I need to work out both the related and the unrelated case, where child1 and child2 may or may not be siblings. The probability for the unrelated case is simply the genotype probability of one individual multiplied by the genotype probability of the other, so we can form a table with 9 cells. The related case is a bit more complicated, because now we need to take the parents into account. The statistics are complex and require a lot of effort to check and recheck. I made a rough spreadsheet and will visit the professor in his next office hour for revision and to ask about the next step. Overall rate: A
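The two 9-cell tables described above can be sketched as follows. This is an illustrative reconstruction rather than the spreadsheet itself: it assumes Hardy-Weinberg genotype proportions, treats A as the minor allele, and all function names are hypothetical.

```python
from itertools import product

GENOTYPES = ("AA", "AC", "CC")  # assumption: A is the minor allele

def genotype_freqs(maf):
    """Genotype probabilities for one individual, assuming Hardy-Weinberg proportions."""
    p, q = maf, 1.0 - maf
    return {"AA": p * p, "AC": 2 * p * q, "CC": q * q}

def child_dist(father, mother):
    """Distribution of one child's genotype given both parents' genotypes."""
    dist = {g: 0.0 for g in GENOTYPES}
    # Each parent passes on one of its two alleles with probability 1/2.
    for a, b in product(father, mother):
        dist["".join(sorted(a + b))] += 0.25
    return dist

def unrelated_table(maf):
    """9-cell table: product of the two individuals' genotype probabilities."""
    f = genotype_freqs(maf)
    return {(g1, g2): f[g1] * f[g2] for g1 in GENOTYPES for g2 in GENOTYPES}

def sibling_table(maf):
    """9-cell table for siblings: sum over all 9 father/mother combinations."""
    f = genotype_freqs(maf)
    table = {(g1, g2): 0.0 for g1 in GENOTYPES for g2 in GENOTYPES}
    for father, mother in product(GENOTYPES, repeat=2):
        d = child_dist(father, mother)
        for g1, g2 in table:
            # Given the parents, child1 and child2 are independent draws.
            table[(g1, g2)] += f[father] * f[mother] * d[g1] * d[g2]
    return table

if __name__ == "__main__":
    for name, table in (("unrelated", unrelated_table(0.1)),
                        ("sibling", sibling_table(0.1))):
        print(name, "sum =", sum(table.values()))
        for g1 in GENOTYPES:
            print(g1, [round(table[(g1, g2)], 6) for g2 in GENOTYPES])
```

Two quick sanity checks for either table are that its 9 cells sum to 1 and that swapping child1 and child2 gives the same value.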

**Week 2**

When I revisited the professor, I learned that there was a bug in my spreadsheet. It is easy to spot by looking at the table I generated, because the table must be symmetric. The bug was in one of the cases where, given father = AA, AC, or CC and mother = AA, AC, or CC, we need to find the probability that child1 = AC and child2 = CC. With the bug fixed, the probabilities in the table sum to 1, and I can move on to the next step: constructing the estimator. For that, we randomly generate a set of numbers representing SNPs. We assume they are drawn from one known case, for example the unrelated case, so we use the unrelated table derived above. Since each random number is generated in (0,1), we need to find the cumulative upper bound of each cell's probability in the table and map the number to the cell whose range contains it. After mapping the random numbers to cells of the table, the next step is to decide whether each SNP is more likely under the unrelated or the related case. To do this, I pretend that I have forgotten which case the SNPs came from. Then I compare the 2 tables cell by cell for the observed genotype pair and apply the following rule: if the probability of the observed pair in case 1 is higher than the probability of the same pair in case 2, then the SNP is assigned to case 1. Here, case 1 is unrelated and case 2 is related. This process takes a long time, and I believe completing this step will be the major milestone of the project. Overall rate: A
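The simulation and comparison steps described above might look roughly like this. It is a sketch under stated assumptions, not the project's actual code: Hardy-Weinberg proportions, A as the minor allele, ties assigned to "unrelated", and hypothetical names (`build_tables`, `draw_pair`, `classify`, `error_rate`).

```python
import random
from itertools import product

GENOTYPES = ("AA", "AC", "CC")  # assumption: A is the minor allele

def build_tables(maf):
    """Return (unrelated, sibling) 9-cell tables, assuming Hardy-Weinberg proportions."""
    p, q = maf, 1.0 - maf
    freq = {"AA": p * p, "AC": 2 * p * q, "CC": q * q}
    cells = list(product(GENOTYPES, repeat=2))
    unrelated = {(g1, g2): freq[g1] * freq[g2] for g1, g2 in cells}
    sibling = dict.fromkeys(cells, 0.0)
    for father, mother in cells:  # the same 9 pairs also enumerate the parents
        child = dict.fromkeys(GENOTYPES, 0.0)
        for a, b in product(father, mother):  # one allele from each parent
            child["".join(sorted(a + b))] += 0.25
        for g1, g2 in cells:
            sibling[(g1, g2)] += freq[father] * freq[mother] * child[g1] * child[g2]
    return unrelated, sibling

def draw_pair(table, rng):
    """Map a uniform random number to a cell using cumulative upper bounds."""
    u = rng.random()
    upper = 0.0
    for cell, prob in table.items():
        upper += prob
        if u < upper:
            return cell
    return cell  # floating-point round-off: fall back to the last cell

def classify(cell, unrelated, sibling):
    """Assign one SNP to whichever case gives the observed pair the higher probability."""
    return "related" if sibling[cell] > unrelated[cell] else "unrelated"

def error_rate(maf, truth, n_snps=10000, seed=0):
    """Fraction of simulated SNPs assigned to the wrong case when the truth is known."""
    rng = random.Random(seed)
    unrelated, sibling = build_tables(maf)
    source = sibling if truth == "related" else unrelated
    wrong = sum(classify(draw_pair(source, rng), unrelated, sibling) != truth
                for _ in range(n_snps))
    return wrong / n_snps

if __name__ == "__main__":
    for maf in (0.01, 0.1, 0.3, 0.5):
        print(maf, error_rate(maf, "unrelated"), error_rate(maf, "related"))
```

Running the script prints the per-SNP error rate for each true case at a few MAF values, which is one way to compare how informative different frequencies are under this rule.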

**Week 3**

This week's goal was to get the project done. I revisited the professor for permission to present the project on Friday. It turned out that I needed to make the presentation slides easier to follow, since this project is very conceptual and there are many abstract ideas to state before getting to the real math. So I created a few slides to help the class understand the project more comfortably. I re-sent the slides to the professor for more advice, and they are now in good shape. I did several rehearsals of the presentation and am ready to present. Overall rate: A

**Week 4**

After a good presentation of the estimator project to the class, a problem came up with the final conclusion. Another team concluded that the higher the MAF, the more informative the data. This contradicts my conclusion, which states that the lower the MAF, the more informative the result. Let's assume that the MAF is 0.0000001. At that frequency, the probability of getting the non-minor allele on both chromosomes is essentially (for all practical purposes) the same in both cases. However, when the probabilities are actually computed, the related matrix will always give a slightly higher value. (This is because several very unlikely combinations can already be discounted: AA x GG, where A is the minor allele, can be ignored completely.) Since our algorithm only checks which probability is larger, it will always say the pair is related, regardless of the fact that the chance of relatedness is near 50/50. Since these pairs account for a large percentage of the simulated pairs, it makes sense that the error is very high. This should be corrected by disregarding any comparison where the two probabilities are too close together (marking it as inconclusive, because it effectively is). This raises the question of what threshold should be used.
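One possible shape for the "inconclusive" rule is sketched below. The parameter `min_ratio` and the function name are hypothetical, and the value 1.01 used in the example is arbitrary, since the right threshold is exactly the open question. The two probabilities for the all-major-allele pair are written in closed form, assuming Hardy-Weinberg proportions.

```python
def classify_with_margin(p_related, p_unrelated, min_ratio):
    """Return 'related', 'unrelated', or 'inconclusive' for one genotype pair.

    min_ratio (>= 1) is a hypothetical tuning parameter: one probability must
    exceed the other by at least this factor before a call is made.
    With min_ratio = 1 this reduces to the plain "which is larger" rule.
    """
    if p_related > min_ratio * p_unrelated:
        return "related"
    if p_unrelated > min_ratio * p_related:
        return "unrelated"
    return "inconclusive"

# Example: both individuals carry the non-minor allele on both chromosomes.
maf = 1e-7
q = 1.0 - maf
p_unrelated = q ** 4                       # q^2 for each unrelated individual
p_related = q ** 2 * ((1.0 + q) / 2) ** 2  # sibling case, summed over the parents

if __name__ == "__main__":
    print(p_related / p_unrelated)  # only very slightly above 1
    print(classify_with_margin(p_related, p_unrelated, min_ratio=1.0))
    print(classify_with_margin(p_related, p_unrelated, min_ratio=1.01))
```

With the plain comparison the pair is called related even though the two probabilities differ only in the seventh decimal place or so; with any margin noticeably above 1 it is marked inconclusive instead.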

Overall rate: A