Xianghong Zhou, Ph.D.

Research Fellow

Department of Biostatistics

Harvard University

School of Public Health

Boston, Massachusetts

 

 

 

Computational analysis of cellular systems:

Functional study of the transcriptome and cell classification with the proteome

 

High-throughput technologies have generated tremendous amounts of biological information at the levels of molecular sequences, gene expression, and protein activities. I will present two novel approaches that utilize such biological information to study cellular systems.

 

First, I will present a graph-theoretic approach to annotating gene functions from microarray expression data. Current methods for the functional analysis of microarray gene expression data implicitly assume that genes with similar expression profiles have similar functions in cells. However, among genes involved in the same biological pathway, not all gene pairs show high expression similarity. We propose that transitive expression similarity, in which two genes are linked through a chain of intermediate genes with pairwise similar profiles, can be used as an important attribute to link genes of the same biological pathway.

 

Based on large-scale yeast microarray expression data, we use shortest-path analysis to identify transitive genes between two given genes from the same biological process. The analysis identifies functionally related genes both with and without correlated expression profiles. For the uncorrelated case, we compare our method to hierarchical clustering and show that our method reveals functional relationships among genes more precisely. Finally, we show that our method can be used to reliably predict the function of unknown genes from known genes lying on the same shortest path. We assigned functions to 146 yeast genes that are considered unknown by the Saccharomyces Genome Database and by the Yeast Proteome Database. These genes constitute around 5% of the unknown yeast ORFome.
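The shortest-path idea can be illustrated with a minimal sketch. It is not the implementation used in this work: the Pearson correlation, the 0.7 threshold, the edge weight 1 - r, and the gene names and profiles are all assumptions chosen for illustration. Genes are joined by an edge when their profiles are sufficiently correlated, and the genes lying on the shortest path between two query genes are reported as transitive genes.

```python
import heapq
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def build_graph(profiles, min_corr=0.7):
    """Connect gene pairs whose correlation is at least min_corr.

    The threshold and the weight 1 - r are illustrative assumptions.
    """
    genes = list(profiles)
    graph = {g: {} for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson(profiles[g], profiles[h])
            if r >= min_corr:
                graph[g][h] = graph[h][g] = 1.0 - r
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns the gene list or None if unreachable."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical toy profiles: geneA and geneC are only weakly correlated
# with each other, but both correlate with geneB.
profiles = {
    "geneA": [1.0, -1.0, 0.0, 0.0],
    "geneB": [1.0, -1.0, 0.7, -0.7],
    "geneC": [0.35, -0.35, 1.0, -1.0],
    "geneD": [1.0, 1.0, -1.0, -1.0],
}
graph = build_graph(profiles)
path = shortest_path(graph, "geneA", "geneC")
transitive_genes = path[1:-1]
```

In this toy example geneA and geneC share no direct edge, yet the path through geneB links them. An uncharacterized gene in geneB's position would be a candidate for the function shared by the two known genes at the ends of the path.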

 

In the second part of my talk, I will present a statistical framework for classifying cells according to the set of peptide masses obtained by mass spectrometric analysis of digests of whole-cell protein extracts. We have used defined bacterial strains to test this approach. For each bacterium, the analysis is repeated on extracts obtained at different points in the growth curve in order to define an invariant set of signals that uniquely identifies the bacterium. We present algorithms for the creation of this cell fingerprint database and develop a Bayesian classification scheme for deciding whether or not an unknown bacterium has a match in the database. Our initial testing, based on a limited dataset of three bacteria, indicates that our approach is feasible. In a jackknife test, our Bayesian classification scheme correctly identified the bacterium in 67.8% of the cases.
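The fingerprint-and-classify idea can be sketched as follows. The abstract does not specify the statistical model, so everything here is an assumption for illustration: masses are binned to the nearest 1 Da, a fingerprint is the set of bins seen in every extract of a strain, each bin is treated as an independent Bernoulli event with hypothetical detection rates p_hit and p_bg, priors are uniform, and an explicit "no match" hypothesis is included. The strain names and mass values are invented.

```python
import math

NO_MATCH = "no match"


def bin_masses(masses, width=1.0):
    """Map peptide masses (Da) to integer bins of the given width."""
    return {round(m / width) for m in masses}


def build_fingerprint(spectra, width=1.0):
    """Keep only bins present in every extract (the invariant signals)."""
    binned = [bin_masses(s, width) for s in spectra]
    return set.intersection(*binned)


def classify(observed, database, p_hit=0.9, p_bg=0.05, width=1.0):
    """Posterior over database strains plus a 'no match' hypothesis.

    Assumed model: a fingerprint bin of the true strain is observed with
    probability p_hit; any other bin is observed with probability p_bg.
    """
    obs = bin_masses(observed, width)
    vocab = set().union(*database.values())
    hypotheses = dict(database)
    hypotheses[NO_MATCH] = set()  # every bin is background under this one
    log_post = {}
    for name, fingerprint in hypotheses.items():
        ll = 0.0
        for b in vocab:
            p = p_hit if b in fingerprint else p_bg
            ll += math.log(p if b in obs else 1.0 - p)
        log_post[name] = ll  # uniform prior adds the same constant to all
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {k: math.exp(v - top) / norm for k, v in log_post.items()}


# Invented spectra for two hypothetical strains at several growth points.
database = {
    "strain_x": build_fingerprint([
        [842.4, 1045.6, 1296.7, 1500.2],
        [842.3, 1045.7, 1296.8, 2100.1],
        [842.4, 1045.6, 1296.7],
    ]),
    "strain_y": build_fingerprint([
        [905.2, 1180.3, 1622.9],
        [905.1, 1180.4, 1622.8, 1750.0],
    ]),
}

posterior = classify([842.4, 1045.7, 1296.6, 1750.0], database)
best = max(posterior, key=posterior.get)

posterior_unknown = classify([3000.1], database)
best_unknown = max(posterior_unknown, key=posterior_unknown.get)
```

Under these assumptions the first query is assigned to strain_x, while the second, which shares no fingerprint bin with either strain, falls to the "no match" hypothesis. A jackknife evaluation would rebuild each fingerprint with one extract held out and then classify the held-out extract.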