Haodong_Project 14_Disease_Models

Hi everyone, I'm Haodong, a first-year graduate student in the ACCESS Program, which represents most of the molecular, cellular and other life sciences programs at UCLA.

Project Description

Genetic variation is the main cause of many diseases. Some diseases are caused by a single gene, while others are caused by variants in multiple genes. Even for single-gene disorders there are several disease models, such as autosomal dominant, autosomal recessive, X-linked dominant, X-linked recessive and Y-linked. There are also different types of dominance, such as complete dominance, incomplete dominance, codominance and dominant negative. Models for polygenic diseases are more complex.

To keep the statistical models simple in class, we assumed that each individual carries only one chromosome. In reality, we carry two copies of each chromosome. Therefore, we need to modify our statistics in order to investigate different disease models.

Nature Reviews Genetics has a Focus on Statistical Analysis with four excellent reviews, which I believe can help us become familiar with statistical modeling in genetics.


The goal is to develop new statistical methods that take different disease models into account.

This week's work

  • Define the different disease models
  • Review basic mathematics and statistics
  • Search the relevant literature
  • If possible, construct a statistical model for the autosomal dominant case


Autosomal Complete Dominant Model

In this model, one important assumption is that the probability of having the disease is the same for an individual with two copies of SNP A as for an individual with only one copy of SNP A.

\begin{equation} P(+|AA)=P(+|Aa) \end{equation}

Another assumption is that the genotype frequencies of SNP A follow the Hardy–Weinberg principle, with $p_{A}$ the frequency of allele A:

\begin{equation} P(AA)=p^{2}_{A} \end{equation}
\begin{equation} P(Aa)=2p_{A}(1-p_{A}) \end{equation}
\begin{equation} P(aa)=(1-p_{A})^{2} \end{equation}
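As a quick illustration of the Hardy–Weinberg frequencies above, here is a minimal Python sketch. The function name `hardy_weinberg` and the allele frequency 0.2 are only illustrative choices, not part of the project.

```python
def hardy_weinberg(p_a):
    """Return (P(AA), P(Aa), P(aa)) for allele frequency p_a under Hardy-Weinberg."""
    if not 0.0 <= p_a <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p_a                        # frequency of the other allele, a
    return p_a ** 2, 2.0 * p_a * q, q ** 2


# Illustrative allele frequency (an assumption, not a measured value)
p_AA, p_Aa, p_aa = hardy_weinberg(0.2)
print(p_AA, p_Aa, p_aa)
print(p_AA + p_Aa + p_aa)                # the three genotype frequencies sum to 1
```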

To test whether SNP A is associated with a certain disease, a case-control study could be conducted in which every individual is genotyped. To reduce the cost of genotyping, we could test only whether an individual carries SNP A or not. The readout would then be

\begin{align} \hat{p}_{SNP A}^{+}=\hat{p}_{AA}^{+}+\hat{p}_{Aa}^{+} \end{align}

Here $\hat{p}_{SNP A}^{+}$ is the observed frequency, in the case group, of individuals who carry at least one copy of SNP A. This value is different from the $\hat{p}_{A}^{+}$ we learned in class, which is the observed frequency of the allele A itself. $\hat{p}_{AA}^{+}$ is the observed frequency of individuals whose genotype is AA. Under this reduced genotyping scheme we do not observe this value, although full genotyping would provide it.
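To make the difference between the carrier frequency and the allele frequency concrete, here is a small Python sketch. The genotype counts are made up for illustration, and the function names are hypothetical.

```python
def carrier_frequency(n_AA, n_Aa, n_aa):
    """Fraction of individuals with at least one copy of A (AA or Aa)."""
    total = n_AA + n_Aa + n_aa
    return (n_AA + n_Aa) / total


def allele_frequency(n_AA, n_Aa, n_aa):
    """Fraction of chromosomes carrying A (each individual has two chromosomes)."""
    total = n_AA + n_Aa + n_aa
    return (2 * n_AA + n_Aa) / (2 * total)


# Made-up genotype counts in a case group of 100 individuals
counts = (10, 40, 50)
print(carrier_frequency(*counts))   # carrier frequency, the readout used here
print(allele_frequency(*counts))    # allele frequency, the quantity used in class
```

With the reduced genotyping scheme only the first quantity is observable, because AA and Aa individuals cannot be told apart.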

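Our null hypothesis is that $p_{SNP A}^{+} = p_{SNP A}^{-}$, i.e. the true carrier frequency is the same in cases and controls. Next, we show that $\hat{p}_{SNP A}^{+}$ approximately follows a normal distribution, where $N$ is the number of individuals in the case group: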

\begin{align} \hat{p}_{AA}^{+}\sim N (p_{AA}^{+},\frac{p_{AA}^{+}(1-p_{AA}^{+})}{N}) \end{align}
\begin{align} \hat{p}_{Aa}^{+}\sim N (p_{Aa}^{+},\frac{p_{Aa}^{+}(1-p_{Aa}^{+})}{N}) \end{align}


\begin{align} \hat{p}_{SNP A}^{+} \sim N(p_{AA}^{+}+p_{Aa}^{+},\frac{p_{AA}^{+}(1-p_{AA}^{+})}{N}+\frac{p_{Aa}^{+}(1-p_{Aa}^{+})}{N}) \end{align}

The next step is to normalize this distribution to a standard normal distribution. First, we relate $P(Aa|+)$ to $P(AA|+)$:

\begin{align} P(Aa|+)=\frac{P(+|Aa)\cdot P(Aa)}{P(+)} =\frac{P(+|AA)\cdot P(Aa)}{P(+)} =P(AA|+)\cdot\frac{P(Aa)}{P(AA)} \end{align}

I give myself an A- for this week. This project seems a little more difficult than I expected.


Autosomal Additive Model

This week I decided to work out the autosomal additive model, or incomplete dominance model, because I found that the dominant and recessive models are special cases of the additive model.

For the additive model, I assume that

\begin{align} P(+|AA)=\gamma_{AA}\cdot P(+|aa) \end{align}
\begin{align} P(+|Aa)=\gamma_{Aa}\cdot P(+|aa) \end{align}
\begin{align} \gamma_{AA}\geq \gamma_{Aa}\geq 1 \end{align}
\begin{align} \gamma_{AA}>1 \end{align}

In the complete dominant model, $\gamma_{AA}=\gamma_{Aa}$; in the recessive model, $\gamma_{Aa}=1$.

Next, we relate $P(Aa|+)$ to $P(AA|+)$:

\begin{align} P(Aa|+)=\frac{P(+|Aa)\cdot P(Aa)}{P(+)}=\frac{\frac{\gamma_{Aa}}{\gamma_{AA}}\cdot P(+|AA)\cdot P(Aa)}{P(+)}=\frac{\gamma_{Aa}}{\gamma_{AA}}\cdot \frac{P(Aa)}{P(AA)}\cdot P(AA|+) \end{align}
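The relation between $P(Aa|+)$ and $P(AA|+)$ can be checked numerically with Bayes' rule. The sketch below is only a sanity check under assumed values: the genotype frequencies, the relative risks and the baseline penetrance `baseline` (standing for $P(+|aa)$) are all illustrative, and the function name is hypothetical.

```python
def posterior_genotype_probs(prior, gamma_AA, gamma_Aa, baseline=0.01):
    """Bayes' rule: P(genotype | +) from P(genotype) and the penetrances.

    prior    -- dict with keys "AA", "Aa", "aa" whose values sum to 1
    baseline -- P(+|aa); a hypothetical value that cancels in the result
    """
    penetrance = {
        "AA": gamma_AA * baseline,
        "Aa": gamma_Aa * baseline,
        "aa": baseline,
    }
    p_disease = sum(penetrance[g] * prior[g] for g in prior)   # P(+)
    return {g: penetrance[g] * prior[g] / p_disease for g in prior}


prior = {"AA": 0.04, "Aa": 0.32, "aa": 0.64}   # illustrative genotype frequencies
gamma_AA, gamma_Aa = 3.0, 1.5                   # illustrative relative risks
post = posterior_genotype_probs(prior, gamma_AA, gamma_Aa)

# Right-hand side of the relation: (gamma_Aa / gamma_AA) * P(Aa)/P(AA) * P(AA|+)
rhs = (gamma_Aa / gamma_AA) * (prior["Aa"] / prior["AA"]) * post["AA"]
print(post["Aa"], rhs)   # the two values should agree up to rounding
```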

Grade: A


From this week on, I use $p_{A\_}^{+}$ instead of $p_{SNP A}^{+}$. Both denote the frequency of individuals who carry at least one copy of SNP A among individuals who have the disease.

This week I found a mistake I made previously. Although $\hat p_{AA}^{+}$ and $\hat p_{Aa}^{+}$ each follow a normal distribution (6)(7), they are not independent. Therefore, the distribution of $\hat p_{SNP A}^{+}$ cannot be obtained by simply adding the variances of $\hat p_{AA}^{+}$ and $\hat p_{Aa}^{+}$ as in (8). However, since $\hat p_{A\_}^{+}$ equals $1- \hat p_{aa}^{+}$, and $\hat p_{aa}^{+}$ follows a normal distribution, $\hat p_{A\_}^{+}$ follows this distribution:

\begin{align} \hat{p}_{A\_}^{+}\sim N (p_{A\_}^{+},\frac{p_{A\_}^{+}(1-p_{A\_}^{+})}{N}) \end{align}

$\hat{p}_{A\_}^{-}$ also follows a similar normal distribution.
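The effect of the dependence between $\hat p_{AA}^{+}$ and $\hat p_{Aa}^{+}$ can be seen with a short calculation. The sketch below assumes the genotype counts are multinomial, in which case the covariance of two observed proportions is $-p_{AA}\,p_{Aa}/N$ (a standard result, not derived in these notes). The numbers are illustrative.

```python
def carrier_variance(p_AA, p_Aa, n):
    """Variance of p_hat_AA + p_hat_Aa for n individuals with multinomial genotype counts."""
    var_AA = p_AA * (1 - p_AA) / n
    var_Aa = p_Aa * (1 - p_Aa) / n
    cov = -p_AA * p_Aa / n              # multinomial covariance of the two proportions
    return var_AA + var_Aa + 2 * cov


p_AA, p_Aa, n = 0.04, 0.32, 1000        # illustrative values
p_carrier = p_AA + p_Aa

naive = p_AA * (1 - p_AA) / n + p_Aa * (1 - p_Aa) / n   # sum of variances, ignores covariance
exact = carrier_variance(p_AA, p_Aa, n)                  # includes the covariance term
binomial = p_carrier * (1 - p_carrier) / n               # variance of a single binomial proportion

print(naive, exact, binomial)
```

Including the covariance term reproduces the binomial variance of the carrier frequency, while the plain sum of the two variances is too large.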

So, with $N$ cases and $N$ controls sampled independently, we have

\begin{align} \hat{p}_{A\_}^{+} - \hat{p}_{A\_}^{-} \sim N(p_{A\_}^{+}-p_{A\_}^{-}, \frac{1}{N} [ p_{A\_}^{+}(1- p_{A\_}^{+})+p_{A\_}^{-}(1- p_{A\_}^{-})]) \end{align}

Next, by assuming that

\begin{align} p_{A\_}^{+}(1- p_{A\_}^{+})+p_{A\_}^{-}(1- p_{A\_}^{-}) \approx 2p_{A\_}(1-p_{A\_}) \end{align}

we can normalize the distribution (16):

\begin{align} \frac{\hat{p}_{A\_}^{+}-\hat{p}_{A\_}^{-}}{\sqrt{2/N}\cdot\sqrt{p_{A\_}(1-p_{A\_})}}\sim N(\frac{p_{A\_}^{+}-p_{A\_}^{-}}{\sqrt{2/N}\cdot\sqrt{p_{A\_}(1-p_{A\_})}},1) \end{align}
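The normalized statistic can be computed from observed carrier counts as in the sketch below. It assumes equal numbers of cases and controls, and it estimates $p_{A\_}$ by the average of the two observed carrier frequencies, which is my own choice rather than something fixed by the derivation. The counts and the function name are made up.

```python
import math


def carrier_z_test(case_carriers, control_carriers, n):
    """Two-sided z-test on carrier frequencies with n cases and n controls."""
    p_case = case_carriers / n
    p_control = control_carriers / n
    p_pooled = (p_case + p_control) / 2            # assumed estimate of p_{A_}
    if p_pooled in (0.0, 1.0):
        raise ValueError("carrier frequency of 0 or 1 gives zero variance")
    se = math.sqrt(2.0 / n) * math.sqrt(p_pooled * (1 - p_pooled))
    z = (p_case - p_control) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))     # two-sided tail area of N(0, 1)
    return z, p_value


# Made-up counts: 450 carriers among 1000 cases, 360 among 1000 controls
z, p_value = carrier_z_test(case_carriers=450, control_carriers=360, n=1000)
print(z, p_value)
```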

The next problem is to relate $p_{A\_}^{+}$ and $p_{A\_}^{-}$ to $p_{A}$, $\gamma_{AA}$ and $\gamma_{Aa}$:

\begin{eqnarray} \lefteqn{P(AA|+)=\frac{P(+|AA)\cdot P(AA)}{P(+)}}\\&&{=\frac{P(+|AA)\cdot P(AA)}{P(+|AA)\cdot P(AA)+P(+|Aa)\cdot P(Aa)+P(+|aa)\cdot P(aa)}} \\&&=\frac{\gamma_{AA}\cdot P(AA)}{\gamma_{AA}\cdot P(AA)+\gamma_{Aa}\cdot P(Aa)+P(aa)} \end{eqnarray}

Applying the Hardy–Weinberg principle gives

\begin{align} p_{AA}^{+}=\frac{\gamma_{AA}\cdot p_{A}^{2}}{(\gamma_{AA}-2\gamma_{Aa}+1)p_{A}^{2}+2(\gamma_{Aa}-1)p_{A}+1} \end{align}
\begin{align} p_{Aa}^{+}=\frac{\gamma_{Aa}\cdot 2p_{A}(1-p_{A})}{(\gamma_{AA}-2\gamma_{Aa}+1)p_{A}^{2}+2(\gamma_{Aa}-1)p_{A}+1} \end{align}


\begin{align} p_{A\_}^{+}=\frac{\gamma_{AA}\cdot p_{A}^{2}+\gamma_{Aa}\cdot 2p_{A}(1-p_{A})}{(\gamma_{AA}-2\gamma_{Aa}+1)p_{A}^{2}+2(\gamma_{Aa}-1)p_{A}+1} \end{align}

For $p_{A\_}^{-}$, we can assume that $p_{A\_}^{-}=p_{A\_}$ (the controls resemble the general population, which is reasonable when the disease is rare), so

\begin{align} p_{A\_}^{-}=p_{A\_}=p_{AA}+p_{Aa}=p_{A}^{2}+2p_{A}(1-p_{A}) \end{align}
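Putting the pieces together, the sketch below computes $p_{A\_}^{+}$ and $p_{A\_}^{-}$ from $p_{A}$, $\gamma_{AA}$ and $\gamma_{Aa}$ and plugs them into the mean of the normalized statistic. The power calculation at the end is a standard two-sided normal calculation that I add as an assumption; it is not derived in these notes. All parameter values and function names are illustrative.

```python
import math
from statistics import NormalDist


def carrier_freq_cases(p_a, gamma_AA, gamma_Aa):
    """p_{A_}^+ under Hardy-Weinberg with relative risks gamma_AA and gamma_Aa."""
    numerator = gamma_AA * p_a ** 2 + gamma_Aa * 2 * p_a * (1 - p_a)
    denominator = (gamma_AA - 2 * gamma_Aa + 1) * p_a ** 2 + 2 * (gamma_Aa - 1) * p_a + 1
    return numerator / denominator


def carrier_freq_controls(p_a):
    """p_{A_}^-, taken equal to the population carrier frequency."""
    return p_a ** 2 + 2 * p_a * (1 - p_a)


def noncentrality(p_a, gamma_AA, gamma_Aa, n):
    """Mean of the normalized statistic for n cases and n controls."""
    p_case = carrier_freq_cases(p_a, gamma_AA, gamma_Aa)
    p_pop = carrier_freq_controls(p_a)             # plays the role of p_{A_}
    se = math.sqrt(2.0 / n) * math.sqrt(p_pop * (1 - p_pop))
    return (p_case - p_pop) / se


def power(ncp, alpha=0.05):
    """Two-sided power of a z-test whose statistic is N(ncp, 1) (added assumption)."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    return std.cdf(ncp - z_crit) + std.cdf(-ncp - z_crit)


# Illustrative complete-dominant example: gamma_AA = gamma_Aa = 2
ncp = noncentrality(p_a=0.2, gamma_AA=2.0, gamma_Aa=2.0, n=500)
print(carrier_freq_cases(0.2, 2.0, 2.0), carrier_freq_controls(0.2))
print(ncp, power(ncp))
```

Setting both relative risks to 1 gives equal carrier frequencies in cases and controls, so the statistic has mean 0, as expected under the null hypothesis.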



First of all, I decided not to use the Hardy–Weinberg principle, because SNP genotype frequencies can be obtained from HapMap. Therefore I use P(AA) instead of $p^{2}_{A}$ for the frequency of the genotype homozygous for the minor allele of SNP A, P(Aa) for the heterozygous genotype and P(aa) for the genotype homozygous for the reference allele.

Then we have

\begin{eqnarray} \lefteqn{P(AA|+)=\frac{P(+|AA)\cdot P(AA)}{P(+)}}\\&&{=\frac{P(+|AA)\cdot P(AA)}{P(+|AA)\cdot P(AA)+P(+|Aa)\cdot P(Aa)+P(+|aa)\cdot P(aa)}} \\&&=\frac{\gamma_{AA}\cdot P(AA)}{\gamma_{AA}\cdot P(AA)+\gamma_{Aa}\cdot P(Aa)+P(aa)} \end{eqnarray}
\begin{eqnarray} \lefteqn{P(Aa|+)=\frac{P(+|Aa)\cdot P(Aa)}{P(+)}}\\&&{=\frac{P(+|Aa)\cdot P(Aa)}{P(+|AA)\cdot P(AA)+P(+|Aa)\cdot P(Aa)+P(+|aa)\cdot P(aa)}} \\&&=\frac{\gamma_{Aa}\cdot P(Aa)}{\gamma_{AA}\cdot P(AA)+\gamma_{Aa}\cdot P(Aa)+P(aa)} \end{eqnarray}
\begin{eqnarray} P(A\_|+)=P(AA|+)+P(Aa|+) \end{eqnarray}
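These formulas only need the three population genotype frequencies and the two relative risks, so they are easy to evaluate directly. In the sketch below the genotype frequencies are made-up numbers that deliberately do not satisfy Hardy–Weinberg; they are not real HapMap values, and the function name is hypothetical.

```python
def case_genotype_freqs(p_AA, p_Aa, p_aa, gamma_AA, gamma_Aa):
    """Return (P(AA|+), P(Aa|+), P(A_|+)) from genotype frequencies and relative risks."""
    if abs(p_AA + p_Aa + p_aa - 1.0) > 1e-9:
        raise ValueError("genotype frequencies must sum to 1")
    weight = gamma_AA * p_AA + gamma_Aa * p_Aa + p_aa   # shared denominator
    post_AA = gamma_AA * p_AA / weight
    post_Aa = gamma_Aa * p_Aa / weight
    return post_AA, post_Aa, post_AA + post_Aa


# Made-up genotype frequencies and relative risks
post_AA, post_Aa, post_carrier = case_genotype_freqs(
    p_AA=0.10, p_Aa=0.30, p_aa=0.60, gamma_AA=2.0, gamma_Aa=1.5
)
print(post_AA, post_Aa, post_carrier)
```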