Using a mixture model for finding informative genes in microarray studies

Jing Qin, Ph.D.

Assistant Attending Biostatistician

Department of Epidemiology and Biostatistics

Memorial Sloan-Kettering Cancer Center

The statistical analysis of microarray data focuses on the association between gene expression and an outcome variable. For each gene, a test of no association generates a p-value. The multiple testing that arises from the tens of thousands of genes typically included in a microarray analysis poses an interesting statistical challenge. Our approach is a two-stage procedure. First, a test is performed of the global null hypothesis that gene expression is not associated with the outcome variable. If this hypothesis is rejected, a parametric mixture model is proposed for the set of p-values to determine which genes are associated with the outcome. The p-values are modeled as a mixture of uniform(0, 1) random variables, representing the genes with no association with the outcome variable, and Beta(ε, θ) random variables, representing those related to the outcome. The parameters of the Beta distribution are constrained so that its density is decreasing in p. Likelihood analysis is used to estimate the mixture proportion and the Beta parameters. These estimates, together with the Bayesian false discovery rate, are used to determine a threshold p-value such that there is a high level of confidence that all genes with p-values below the threshold are associated with the outcome variable. A neuroblastoma dataset is discussed.
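
The second stage can be illustrated with a minimal Python sketch; it is not the speaker's implementation. It assumes the mixture density takes the form f(p) = pi0 + (1 - pi0) * Beta(p; a, b), where a and b stand in for ε and θ, and that the decreasing-density constraint is imposed as a <= 1 <= b. The Bayesian false discovery rate at a threshold t is taken here to be pi0 * t / (pi0 * t + (1 - pi0) * F(t)), with F the Beta distribution function. The function names, starting values, bounds, optimizer, and grid search are all illustrative choices.

```python
import numpy as np
from scipy import optimize, stats


def neg_log_lik(params, p):
    """Negative log-likelihood of the uniform/Beta mixture for p-values p."""
    pi0, a, b = params
    density = pi0 + (1.0 - pi0) * stats.beta.pdf(p, a, b)
    return -np.sum(np.log(density))


def fit_mixture(p):
    """Maximum-likelihood estimates (pi0, a, b), constrained to a <= 1 <= b.

    pi0 is the assumed symbol for the proportion of genes with no association.
    """
    # Keep p-values strictly inside (0, 1) so the Beta density stays finite.
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0 - 1e-12)
    # Box constraints: a <= 1 <= b makes the Beta density decreasing in p.
    bounds = [(1e-6, 1.0 - 1e-6), (1e-3, 1.0), (1.0, 100.0)]
    result = optimize.minimize(
        neg_log_lik, x0=[0.8, 0.5, 2.0], args=(p,),
        method="L-BFGS-B", bounds=bounds,
    )
    return result.x


def bayesian_fdr(t, pi0, a, b):
    """Assumed Bayesian FDR: P(no association | p-value <= t) under the fit."""
    null_mass = pi0 * t
    alt_mass = (1.0 - pi0) * stats.beta.cdf(t, a, b)
    return null_mass / (null_mass + alt_mass)


def threshold(pi0, a, b, level=0.05):
    """Largest grid value t whose Bayesian FDR does not exceed `level`."""
    grid = np.linspace(1e-6, 1.0, 100_000)
    ok = bayesian_fdr(grid, pi0, a, b) <= level
    return float(grid[ok].max()) if ok.any() else 0.0


if __name__ == "__main__":
    # Simulated p-values (not the neuroblastoma data): 90% null, 10% Beta.
    rng = np.random.default_rng(0)
    p = np.concatenate([rng.uniform(size=9000), rng.beta(0.3, 4.0, size=1000)])
    pi0, a, b = fit_mixture(p)
    print("pi0, a, b:", pi0, a, b)
    print("threshold p-value:", threshold(pi0, a, b, level=0.05))
```

Because a decreasing Beta density has a concave distribution function, the Bayesian FDR in this sketch is nondecreasing in t, so taking the largest grid value below the target level gives a well-defined threshold.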

If time permits, I will also discuss semiparametric approaches to estimating the proportion of informative genes.