4th Year Undergrad - Computer Science
The goal of this quarter
I will be attempting to implement the relatedness estimator (Project 1). I will finish the easy project and attempt the medium project.
The Schedule for the Quarter / Grade
Week 4 - set up the wiki page (A - Finished)
Week 5 - problem analysis (easy part) (A- Finished)
Week 6 - prototyping (A - Finished)
Week 7 - code (A - Finished)
Week 8,9 - analysis (A - Finished)
Easy Project: Construct a method for determining whether 2 individuals are siblings.
Given the minor allele frequency (MAF) of a SNP, what is the probability that unrelated individuals will share the allele? What is the probability that related individuals will share the allele?
The MAF is always the probability of the least likely allele.
Each individual carries the SNP on both copies of the chromosome, one inherited from each parent.
Let's assume that A is the minor allele with a MAF of p. The probability of the non-minor allele (G) is 1-p.
Since each individual has two chromosomes, the minor allele can appear on neither, one, or both of them.
The probability of AA = p * p
The probability of AG = 2 * p * (1-p) (the factor of 2 counts both AG and GA, which are the same genotype)
The probability of GG = (1-p) * (1-p)
The probability that two unrelated individuals have a specific combination of genotypes is the product of each individual's genotype probability.
AA x AA = (p*p) * (p*p)
AG x AA = (2*p*(1-p)) * (p*p)
The unrelated individual tables can be calculated as shown above.
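The unrelated table can be sketched in a few lines of Java. This is only an illustration, not the original application: the class and method names are hypothetical, and it assumes AG and GA are counted as a single genotype, so the heterozygote probability is 2 * p * (1-p).

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not taken from the original application. */
final class UnrelatedTable {
    static final String[] GENOTYPES = {"AA", "AG", "GG"};

    /** Genotype probabilities for MAF p, indexed 0 = AA, 1 = AG, 2 = GG (AG is unordered). */
    static double[] genotypeProbs(double p) {
        double q = 1.0 - p;
        return new double[] {p * p, 2.0 * p * q, q * q};
    }

    /** table[i][j] = P(individual 1 has genotype i) * P(individual 2 has genotype j). */
    static double[][] unrelatedPairs(double p) {
        double[] g = genotypeProbs(p);
        double[][] table = new double[3][3];
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                table[i][j] = g[i] * g[j];
            }
        }
        return table;
    }

    public static void main(String[] args) {
        double[][] t = unrelatedPairs(0.1);
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                System.out.printf(Locale.ROOT, "%s x %s = %.6f%n", GENOTYPES[i], GENOTYPES[j], t[i][j]);
            }
        }
    }
}
```

Under these assumptions the nine entries of the table sum to 1, which is a quick sanity check on the genotype probabilities.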
The related probabilities are slightly more complicated to calculate. The probability is determined by taking every possible combination of parent genotypes (e.g. AA x AA, AG x GG) and calculating the probability that those parents produce the given pair of siblings. The probability of the parent combination itself must also be factored in. This is best illustrated by an example.
In order to get a sibling pair of AA and AG there are several parent options:
Mother = AA, Father = AG: possible children are AA, AG, AA, AG, so the probability of AA + AG = 0.5 * 0.5 = 0.25
Mother = AG, Father = AA: possible children are AA, AA, GA, GA, so the probability of AA + AG = 0.5 * 0.5 = 0.25
Mother = AG, Father = AG: possible children are AA, AG, GA, GG, so the probability of AA + AG = 0.25 * 0.5 = 0.125
The probability of the children is then multiplied by the probability of getting that mother and father pair. These products are summed over all parent pairs to get the total related probability. The mother and father are assumed to be unrelated, so the probability of each parent pair can be taken from the unrelated probability table.
AG and GA are the same thing because the selection of which chromosome is first is arbitrary.
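The sum over parent pairs can be sketched as follows. This is a hedged illustration rather than the original code: the names are hypothetical, the sibling pair is treated as ordered (first sibling, second sibling) as in the example above, and the parent genotype probabilities assume AG is unordered with probability 2 * p * (1-p).

```java
/** Hypothetical helper for illustration; not taken from the original application. */
final class SiblingTable {
    /** Genotype probabilities for MAF p, indexed 0 = AA, 1 = AG, 2 = GG. */
    static double[] genotypeProbs(double p) {
        double q = 1.0 - p;
        return new double[] {p * p, 2.0 * p * q, q * q};
    }

    /** P(child genotype | parent genotypes); each parent passes on one of its two alleles at random. */
    static double[] childProbs(int mother, int father) {
        // Probability that a parent with genotype AA, AG, GG transmits the A allele.
        double[] passA = {1.0, 0.5, 0.0};
        double m = passA[mother];
        double f = passA[father];
        return new double[] {m * f, m * (1 - f) + (1 - m) * f, (1 - m) * (1 - f)};
    }

    /** table[i][j] = P(sibling 1 has genotype i and sibling 2 has genotype j). */
    static double[][] siblingPairs(double p) {
        double[] g = genotypeProbs(p);
        double[][] table = new double[3][3];
        for (int mo = 0; mo < 3; mo++) {
            for (int fa = 0; fa < 3; fa++) {
                double parents = g[mo] * g[fa];      // parents are assumed unrelated
                double[] c = childProbs(mo, fa);
                for (int i = 0; i < 3; i++) {
                    for (int j = 0; j < 3; j++) {
                        // The two children are independent draws given the parents.
                        table[i][j] += parents * c[i] * c[j];
                    }
                }
            }
        }
        return table;
    }
}
```

With a MAF of 0.1 and these assumptions, `siblingPairs(0.1)[0][1]` (siblings AA and AG) works out to 0.00495, compared with 0.0018 for the same genotypes in unrelated individuals.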
Once the tables are calculated it is just a simple comparison from the values in the tables to determine which is more likely, unrelated or related, for a given sibling pair.
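A minimal sketch of that comparison is shown here. The names are hypothetical, the tables are assumed to be indexed by the two genotypes (0 = AA, 1 = AG, 2 = GG), and the tie rule (equal probabilities are called unrelated) is my assumption, since the write-up does not say how ties are handled.

```java
/** Hypothetical helper for illustration; not taken from the original application. */
final class PairClassifier {
    enum Call { RELATED, UNRELATED }

    /**
     * Compares the related and unrelated probabilities for one genotype pair.
     * Assumption: a tie is reported as UNRELATED.
     */
    static Call classify(double[][] related, double[][] unrelated, int g1, int g2) {
        return related[g1][g2] > unrelated[g1][g2] ? Call.RELATED : Call.UNRELATED;
    }
}
```

This is a bare comparison with no margin, so a pair whose two probabilities are almost equal still receives a definite call.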
The application was coded in Java with a Swing UI. The algorithm is fairly simple, and the mappings from parent genotypes to child genotypes can be hard-coded.
The application was capable of calculating the probability tables for any possible MAF. A screenshot of the application is shown. The probabilities and simulation results are calculated with a MAF of 0.1.
There are several interesting things that happen when the probability tables are examined. First, as the MAF goes down the probability of getting the minor alleles obviously decreases. However, the probability of two individuals with those minor alleles being related becomes vastly greater than the unrelated probability.
On the other hand, if the MAF is very close to 0.5, all the probabilities within each table and between the two tables start to get very close to each other.
I ran a simulation on the algorithm. It was conducted by first generating a fixed number of related and unrelated sibling pairs based on the probabilities in the table. The application would then be run and the number of related pairs would be tabulated and compared against the fixed number. The following is a graph of the result.
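The simulation procedure can be sketched roughly as follows. This is an illustration under assumptions, not the code that produced the graph: the names are hypothetical, both tables are passed in already computed, pairs are sampled cell by cell from each table, and a pair is called related only when its related probability is strictly higher.

```java
import java.util.Random;

/** Hypothetical helper for illustration; not taken from the original application. */
final class RelatednessSimulation {
    /** Draws one genotype pair {i, j} with probability table[i][j]. */
    static int[] samplePair(double[][] table, Random rng) {
        double u = rng.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < table.length; i++) {
            for (int j = 0; j < table[i].length; j++) {
                cumulative += table[i][j];
                if (u < cumulative) {
                    return new int[] {i, j};
                }
            }
        }
        // Guard against floating-point rounding: fall back to the last cell.
        int last = table.length - 1;
        return new int[] {last, table[last].length - 1};
    }

    /** Generates nRelated + nUnrelated pairs and returns how many the comparison calls related. */
    static int countCalledRelated(double[][] related, double[][] unrelated,
                                  int nRelated, int nUnrelated, Random rng) {
        int calledRelated = 0;
        for (int k = 0; k < nRelated + nUnrelated; k++) {
            // The first nRelated pairs come from the related table, the rest from the unrelated one.
            int[] pair = samplePair(k < nRelated ? related : unrelated, rng);
            if (related[pair[0]][pair[1]] > unrelated[pair[0]][pair[1]]) {
                calledRelated++;
            }
        }
        return calledRelated;
    }
}
```

The returned count is then compared against the known number of related pairs (`nRelated`); the further apart the two numbers are, the larger the error.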
I determined that the lower the MAF, the higher the error. The lower the MAF, the more likely it is that a pair carries no minor allele at all (GG and GG). For such a pair the related probability is only very slightly higher than the unrelated probability, so when the MAF is sufficiently low it is effectively a 50/50 call. Because related is still marginally more likely, the application marks all of the non-minor allele pairs as related, which leads to about 50 percent error.
On the flip side, when the MAF was closer to 0.5, the application was much more accurate because there were a lot more occurrences of the minor allele. However, the difference in the probability between unrelated and related was very low and probably should be deemed inconclusive.
A threshold needs to be derived for practical relatedness estimation.
A MAF "sweet spot" must be determined for optimal relatedness estimation. The MAF cannot be so low that the minor allele never shows up, and it cannot be so high that every genotype pair has nearly the same probability, leaving all the tests inconclusive.
Part II and Part III need to be developed as well.
The slides are attached to this page.