Lecture 3
Lecture 3, Oct 3, 2025
The dataset I used in the Lecture can be downloaded here.
HW 3 (Due Oct 10, Friday).
1. A SNP locus is genotyped for a total of 1000 individuals, among which 485 is AA, 421 is Aa, and 94 is aa. Compute the $X^2$ statistic for Hardy-Weinberg Equilibrium, and find the P-val. You can either use function pchisq or use your simulated reference distribution.
2. Genetic Association Test
Genetic Association studies have been very popular in last two decades. Billions of dollars have been spent and tens of millions individuals have their genomes genotyped. They all look for a simple signal. That is, the distribution of genotypes differ among people with different phenotypes.
Consider a study with diabetes. Say we recruited 1000 diabetes patients, and 1500 controls. Assume that they are from the same genetic background, which is very important. If the patients are recruited in US, and controls are recruited in Mexico, then all we would be looking at the genetic difference between two populations. Say a candidate loci is genotyped, and the results can be summarized in the table
$$
\begin{array}{c|ccc}\hline
& AA & Aa & aa\\ \hline
Cases & O_{10} &O_{11} & O_{12}\\
Controls & O_{20} & O_{21} & O_{22} \\ \hline
\end{array}
$$
The null hypothesis here is that the loci has nothing to do with diabetes. Under this hypothesis, we expect that the genotypes has the same distribution in the cases and controls. Therefore, we estimate the proportion of AA to be $\hat{p}_{AA}=(O_{10}+O_{20})/N$, where N=2500 in this example, and the proportion of Aa and aa to be $\hat{p}_{Aa}=(O_{11}+O_{21})/N$ and $\hat{p}_{aa}=(O_{12}+O_{22})/N$.
The expected count of AA, Aa, and aa in the cases are
$E_{10}=1000\hat{p}_{AA}$,
$E_{11}=1000\hat{p}_{Aa}$,
$E_{12}=1000\hat{p}_{aa}$, respectively.
Similarly, the expected count of AA, Aa, and aa in the controls are
$E_{20}=1500\hat{p}_{AA}$,
$E_{21}=1500\hat{p}_{Aa}$,
$E_{22}=1500\hat{p}_{aa}$.
The Pearson $X^2$ can then be obtained using its general form. However, as I mentioned in the class, Karl Pearson actually had trouble to get the correct degrees of freedom for the Chisq distribution associated with this problem. But we can do a numerical experiment to find it out.
The table below lists the genotype counts at a particular loci. Under the null hypothesis, we can estimate $p_{AA}$ as $\hat{p}_{AA}=(489+632)/2500=0.4484$, etc. As part of this homework, finish the computation of the Pearson $X^2$ statistic. Going through this exercise should help you writing the R-code.
$$
\begin{array}{c|ccc}\hline
& AA & Aa & aa\\ \hline
Cases & 489 & 441 & 70\\
Controls & 632 & 670 & 198 \\ \hline
\end{array}
$$
Do a numerical experiment to generate the sampling distribution of $X^2$ statistics for the association test, at two different genotype frequencies and the sample sizes, and determine if the sampling distribution approximately follows a Chi-Square distribution with a certain degree of freedom. Note the the mean of the sampling distribution will give you a hint on the degree of freedom.
Below are template R-code with some comments
# Our goal is to simulate Pearson Chi-square statistics for genetic association tests.
# set number of cases and controls
# you can later change it to any other sample size
n1 <- 1000
n2 <- 1500
# We pretend to be God and set-up frequency of genotypes, can be any three numbers adds up to one, for example, c(0.3,0.5,0.2).
## you should change these numbers to something different
p1.geno <- c(0.3,0.5,0.2)
# Under the null hypothesis, the cases and controls have the same genotype distribution
p2.geno <- p1.geno
# Simulate the observed counts from multinomial distributions
# specify the sample size by replacing the ???
O1 <- rmultinom(1,size=??? ,prob=p1.geno)
O2 <- rmultinom(1,size=??? ,prob=p2.geno)
# Now, we need to come up with the expected genotype counts in Cases and Controls respectively
# first estimate the genotype frequency under H0
# note that under H0, cases and controls have the same genotype frequency
# finish the following three lines of R code
f.AA.hat <-
f.Aa.hat <-
f.aa.hat <-
# using the estimated genotype frequency to estimate the expected count
# note that both E1 and E2 are vectors of length 3.
E1 <-
E2 <-
# Compute the Pearson chi-square statistics, the first sum is over the cases, the second of the controls
PChi2 <- sum((O1-E1)^2/E1)+sum((O2-E2)^2/E2)
# Now repeat above 10000 times using a for-loop.
PChi2 <- rep(0,10000)
for(i in 1:10000){
O1 <- rmultinom(1,size=,prob=p1.geno)
O2 <- rmultinom(1,size=,prob=p2.geno)
f.AA.hat <-
f.Aa.hat <-
f.aa.hat <-
E1 <-
E2 <-
PChi2[i] <- sum((O1-E1)^2/E1)+sum((O2-E2)^2/E2)
}
# get a hint on the degree of freedom
E1 <-
E2 <-
# Compute the Pearson chi-square statistics, the first sum is over the cases, the second of the controls
PChi2 <- sum((O1-E1)^2/E1)+sum((O2-E2)^2/E2)
# Now repeat above 10000 times using a for-loop.
PChi2 <- rep(0,10000)
for(i in 1:10000){
O1 <- rmultinom(1,size=,prob=p1.geno)
O2 <- rmultinom(1,size=,prob=p2.geno)
f.AA.hat <-
f.Aa.hat <-
f.aa.hat <-
E1 <-
E2 <-
PChi2[i] <- sum((O1-E1)^2/E1)+sum((O2-E2)^2/E2)
}
# get a hint on the degree of freedom
mean(PChi2)
# use qqplot to compare the statistic against random sample from a chi square distribution (try to guess the df)
qqplot(,)
# add straight line x=y
abline(0,1)
# add straight line x=y
abline(0,1)
3. Dowland data here
Comments
Post a Comment