Posts

Showing posts from October, 2025

Lecture 8

Slides, Jupyter Notebook. HW8 (Due 10/24): In the plasma dataset, test whether fibrinogen predicts whether ESR < 20.
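One common way to approach a question like HW8 is logistic regression, since the outcome (ESR below 20 or not) is binary. Whether this is the method intended in the lecture is an assumption. The sketch below runs on simulated stand-in data, not the course's plasma dataset; the data frame `plasma_sim`, the column `high_esr`, and the coefficients used to generate the data are all hypothetical.

```r
# Stand-in data: simulated values, NOT the course's plasma dataset.
set.seed(1)
n <- 32
fibrinogen <- runif(n, 2, 5)
# Hypothetical relationship, used only to generate example data.
high_esr <- rbinom(n, 1, plogis(-6 + 1.8 * fibrinogen))
plasma_sim <- data.frame(fibrinogen = fibrinogen, high_esr = high_esr)

# Logistic regression: log-odds of ESR >= 20 as a linear function of fibrinogen.
fit <- glm(high_esr ~ fibrinogen, family = binomial, data = plasma_sim)

# Wald z-test for the fibrinogen coefficient.
coef_table <- summary(fit)$coefficients
print(coef_table)

# Likelihood-ratio test against the intercept-only model.
fit0 <- glm(high_esr ~ 1, family = binomial, data = plasma_sim)
lrt <- anova(fit0, fit, test = "Chisq")
print(lrt)
```

With the real data, the binary ESR variable would take the place of `high_esr`; the rest of the call stays the same.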

Mixed-Effects Models

Data: rats.txt, Jupyter Notebook, Slides. HW7 (Due Oct 22nd): Find an example in your research area that should be modeled with a mixed-effects model instead of an ordinary linear regression model. You only need to describe the study design.

Lecture 7

Slides, Jupyter Notebook, Lab3. HW7 (Due Oct 22): Compare the statistical power of the one-way ANOVA F-test and the permutation test on the range statistic under the following scenario:
Number of groups: 4
Number of observations per group: 10
Data in each group are generated from the following normal distributions: N(0,1), N(-1,1), N(1,1), N(0,1).
Note that this numerical experiment might take a while to finish, as we will perform permutation tests 1000 times. Below is a code template to help you get started.
pval.F <- rep(0,1000)
pval.R <- rep(0,1000)
for(i in 1:1000){
    ### Simulate Data
    x <- rep(c("A","B","C","D"),rep(10,4))
    y <-
    ### F statistic and degrees of freedom
    fstat <- summary(lm(y~x))$fstatistic
    ### get P-val from the F-distribution
    pval.F[i] <- 1- pf(fstat[1],fstat[2],fstat[3])
    ### do...
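Separately from the truncated template above, a single permutation test on a range statistic could be sketched as follows. This is not the course's code: the helper names `range_stat` and `perm_pval` are hypothetical, and defining the range statistic as the largest group mean minus the smallest group mean is an assumption about what the lecture means by "range statistic".

```r
# Range statistic: spread between the largest and smallest group means
# (one possible definition; an assumption, not taken from the lecture).
range_stat <- function(y, x) {
  m <- tapply(y, x, mean)
  max(m) - min(m)
}

# Permutation p-value: shuffle the group labels and recompute the statistic.
perm_pval <- function(y, x, n_perm = 999) {
  obs <- range_stat(y, x)
  perm <- replicate(n_perm, range_stat(y, sample(x)))
  (1 + sum(perm >= obs)) / (1 + n_perm)
}

set.seed(2025)
x <- rep(c("A", "B", "C", "D"), rep(10, 4))
mu <- rep(c(0, -1, 1, 0), rep(10, 4))  # group means from the scenario
y <- rnorm(length(x), mean = mu, sd = 1)

p_F <- anova(lm(y ~ x))[["Pr(>F)"]][1]  # one-way ANOVA F-test
p_R <- perm_pval(y, x)                  # permutation test on the range
print(c(F_test = p_F, permutation = p_R))
```

Estimating power would then mean repeating this over many simulated datasets and recording the fraction of p-values below the chosen significance level for each test.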

Lecture 6

Slides, Jupyter Notebook. HW6 (Due Oct 17): Analyze the data set warpbreaks, which is included in the R distribution. It records the number of breaks in yarn during weaving. Describe how the two factors affect the response. You can view the description and access the data in R with data(warpbreaks). Briefly explain the difference between "correlation" and "interaction". Self-administer Lab 2. You do not have to show me your work.
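As a starting point for HW6, the sketch below looks at the cell means and fits a two-way model with an interaction term. The choice of a linear model on the raw counts is an assumption made for illustration; other models for count data may be more appropriate.

```r
data(warpbreaks)

# Mean number of breaks for each wool x tension combination.
cell_means <- with(warpbreaks, tapply(breaks, list(wool, tension), mean))
print(cell_means)

# Two-way model with interaction: does the tension effect depend on wool type?
fit <- lm(breaks ~ wool * tension, data = warpbreaks)
tab <- anova(fit)
print(tab)
```

The `wool:tension` row of the table addresses interaction, that is, whether the effect of one factor on the response changes with the level of the other. Correlation, by contrast, describes how two variables move together. `interaction.plot(tension, wool, breaks)` inside `with(warpbreaks, ...)` gives a visual version of the same comparison.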

Lecture 5

Jupyter Notebook, Slides, lymphocyte data. HW5: For the example given in the lecture, with the same data, use the R function lm to fit a one-way ANOVA:
aov.fit <- lm(count ~ drug,data=lympho)
Then use the function anova to get the same ANOVA table as shown in class:
anova(aov.fit)
You are not required to submit this homework assignment!
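To see what the anova table reports, the sketch below fits the same kind of model on simulated stand-in data and reproduces the F statistic by hand from the between-group and within-group sums of squares. The data frame `lympho_sim`, its three drug levels, and its values are hypothetical; the lecture's lympho data is not reproduced here.

```r
# Stand-in data: simulated, NOT the lecture's lymphocyte data.
set.seed(5)
lympho_sim <- data.frame(
  drug  = factor(rep(c("A", "B", "C"), each = 5)),
  count = rnorm(15, mean = rep(c(10, 12, 15), each = 5), sd = 2)
)

# One-way ANOVA through lm(), as in the homework.
aov.fit <- lm(count ~ drug, data = lympho_sim)
tab <- anova(aov.fit)
print(tab)

# The same F statistic computed by hand.
grand_mean <- mean(lympho_sim$count)
group_mean <- ave(lympho_sim$count, lympho_sim$drug)  # each row's group mean
ss_between <- sum((group_mean - grand_mean)^2)
ss_within  <- sum((lympho_sim$count - group_mean)^2)
df_between <- nlevels(lympho_sim$drug) - 1
df_within  <- nrow(lympho_sim) - nlevels(lympho_sim$drug)
f_manual <- (ss_between / df_between) / (ss_within / df_within)
print(f_manual)
```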

The final project

The final project: How much does an LLM "know" about statistical analysis?
1. Identify a set of original data from your research lab, generated either by you or by other members, whose goal is to explore the relationship between at least TWO variables. For unpublished data, the data and your analysis will be kept strictly confidential and will not be shared with anybody, including the TAs. If you have trouble accessing the original data, you can use recently published data from your lab. Prepare a brief description of the background and hypothesis, and a clear description of the study design and data collection, to present to the LLM.
2. Use an LLM (ChatGPT, Gemini, Llama, etc.), present your question and the data, and see how it responds. You should try different prompts and push it to test its limits. You may use multiple LLMs.
3. Follow the LLM's instructions to carry out the analysis. Feed the results back to it and ask it to interpret them.
4...

Lecture 4

Slides, Jupyter Notebook. HW4:
1. Following Problem 3 of HW 3 on the relation between body weight and brain weight,
a) Reproduce the residual plot, but without the red line. In the lm object, you can get the fitted values with your.lm.object$fit, and plot them against the residuals.
b) Find the correlation coefficient between brain weight and body weight (before and after the log-transformation). Does the value of the correlation coefficient change?
2. Generate two random vectors of the same length, take the position of the lowest value of the first vector, and see where the value at the same position of the second vector is ranked. Then take the positions whose values are below the median of the first vector, and see where the values at the same positions of the second vector are ranked.
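The experiment in problem 2 could be set up along these lines. The vector length of 1000 and the use of independent standard normal draws are assumptions; the assignment does not fix either.

```r
set.seed(4)
n <- 1000
v1 <- rnorm(n)
v2 <- rnorm(n)   # generated independently of v1

r2 <- rank(v2)   # rank of each element of the second vector (1 = smallest)

# Position of the lowest value in the first vector, and the rank of the
# second vector's value at that same position.
pos_min <- which.min(v1)
print(r2[pos_min])

# Positions where the first vector is below its median, and the ranks of
# the second vector's values at those positions.
below <- which(v1 < median(v1))
print(summary(r2[below]))
```

Repeating the run with different seeds shows how much the rank at `pos_min` varies from one draw to the next.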

Lecture 3

Lecture 3, Oct 3, 2025. Slides, Jupyter Notebook. The dataset I used in the lecture can be downloaded here.
HW 3 (Due Oct 10, Friday).
1. A SNP locus is genotyped for a total of 1000 individuals, among which 485 are AA, 421 are Aa, and 94 are aa. Compute the $X^2$ statistic for Hardy-Weinberg Equilibrium, and find the P-val. You can either use the function pchisq or use your simulated reference distribution.
2. Genetic Association Test. Genetic association studies have been very popular in the last two decades. Billions of dollars have been spent and tens of millions of individuals have had their genomes genotyped. They all look for a simple signal: that the distribution of genotypes differs among people with different phenotypes. Consider a study of diabetes. Say we recruited 1000 diabetes patients and 1500 controls. Assume that they are from the same genetic background, which is very important. If the patients are recruited in the US, and co...
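For problem 1 of HW 3, the $X^2$ computation with an allele frequency estimated from the data could be sketched as below. The helper name `hwe_chisq` is hypothetical, and the use of 1 degree of freedom (3 genotype classes, minus 1, minus 1 estimated allele frequency) is the standard convention assumed here rather than something stated in the post.

```r
# Chi-squared test of Hardy-Weinberg Equilibrium from genotype counts.
hwe_chisq <- function(n_AA, n_Aa, n_aa) {
  n <- n_AA + n_Aa + n_aa
  p <- (2 * n_AA + n_Aa) / (2 * n)   # estimated frequency of allele A
  expected <- n * c(p^2, 2 * p * (1 - p), (1 - p)^2)
  observed <- c(n_AA, n_Aa, n_aa)
  x2 <- sum((observed - expected)^2 / expected)
  list(p = p, expected = expected, x2 = x2,
       pval = pchisq(x2, df = 1, lower.tail = FALSE))
}

# Genotype counts from the problem statement.
res <- hwe_chisq(485, 421, 94)
print(res)
```

The `pval` entry uses the chi-squared reference distribution; a simulated reference distribution could be substituted by comparing `res$x2` against simulated statistics instead.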

Lecture 2 + Lab

Slides. Jupyter Notebook for the lab. Jupyter Notebook for detailed notes, R code and results.
Homework 2
1. Finish Task 4 of the Lab. Show your code and results.
2. Do a numerical experiment to generate the sampling distribution of the $X^2$ statistic for HWE (the more realistic one with estimated allele frequency), but choose different allele frequencies and sample sizes, to determine whether the sampling distribution is approximately $\chi^2$ regardless of the underlying allele frequencies.
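The numerical experiment in problem 2 could be organized as in the sketch below, which draws genotype counts under HWE, re-estimates the allele frequency in each replicate, and compares the simulated statistics with a chi-squared distribution on 1 degree of freedom. The function name `simulate_hwe_x2`, the number of replicates, and the single setting n = 500, p = 0.3 are illustrative assumptions; the assignment asks for several allele frequencies and sample sizes.

```r
# Simulate the X^2 statistic for HWE with the allele frequency re-estimated
# in every replicate. Very small n or extreme p can yield replicates with
# p_hat equal to 0 or 1, where the statistic is undefined (NaN).
simulate_hwe_x2 <- function(n, p, n_sim = 2000) {
  probs <- c(p^2, 2 * p * (1 - p), (1 - p)^2)         # genotype probs under HWE
  counts <- rmultinom(n_sim, size = n, prob = probs)  # 3 x n_sim matrix
  apply(counts, 2, function(obs) {
    p_hat <- (2 * obs[1] + obs[2]) / (2 * n)
    expected <- n * c(p_hat^2, 2 * p_hat * (1 - p_hat), (1 - p_hat)^2)
    sum((obs - expected)^2 / expected)
  })
}

set.seed(2)
x2 <- simulate_hwe_x2(n = 500, p = 0.3)

# Compare simulated quantiles with chi-squared(1) quantiles.
probs_to_check <- c(0.5, 0.9, 0.95)
print(rbind(simulated = quantile(x2, probs_to_check),
            chisq_1df = qchisq(probs_to_check, df = 1)))

# Fraction of simulated statistics beyond the nominal 5% cutoff.
reject_rate <- mean(x2 > qchisq(0.95, df = 1))
print(reject_rate)
```

Calling `simulate_hwe_x2` over a grid of `n` and `p` values, and repeating the quantile comparison for each, extends this to the full experiment.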