
Spike and slab gene selection for multigroup microarray data.

Publication: Journal of the American Statistical Association
Publication Date: 01-SEP-05

Article Excerpt
1. INTRODUCTION

Many invasive diseases, such as cancers, undergo significant transformations during their life spans. Staging of cancers is based on the extent of anatomical invasion from the primary site of development. However, although cancers have well-defined morphological evolution corresponding to clinical stage, very little is known about molecular changes that characterize the stages; this is particularly true for colon cancer, the focus of our present application. High-throughput microarray technology provides a unique opportunity to study this problem. DNA microarrays provide a snapshot of the simultaneous expression of thousands of mRNA transcripts at a given point in time via a single assay. To a first approximation, DNA microarray data provide information about a cell's proteomic composition (using nuclear RNA instead), and thus biological insight into what functional genomic changes might have taken place across the spectrum of a disease. (See Nguyen, Arpat, Wang, and Carroll 2002 for a general overview of biological and technical aspects of microarrays.)

At the same time, microarray data pose a serious statistical challenge due to the sheer volume of information being processed. It is the norm to see data collected on tens of thousands of gene expressions from only a small handful of tissue samples. Data analysis is further complicated because of heterogeneity of variances and correlation of gene expressions due to biological effect or technological artifact. Because it is expected that most genes show no differential gene expression across disease states, the potential for type I errors or false detections is large. For two-group problems, a common strategy is to control the false discovery rate (FDR) using the method of Benjamini and Hochberg (1995) or empirical Bayes methods (Efron, Tibshirani, Storey, and Tusher 2001; Tusher, Tibshirani, and Chu 2001; Storey 2002). However, although these methods work well in controlling FDR, the price paid is often a conservativeness that leads to missing important genes (Ishwaran and Rao 2003). Indeed, in two-group problems, the total number of misclassified genes can be derived in closed form under normality (Genovese and Wasserman 2002, thm. 3). Such calculations suggest that when the fraction of differentially expressing genes is relatively low, misclassification will be high unless FDR is controlled at a high value, thus defeating the purpose of such control.
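As a point of reference for the FDR control mentioned above, the following is a minimal sketch of the Benjamini and Hochberg (1995) step-up rule applied to a vector of per-gene p values. The function name `benjamini_hochberg` and the parameter `q` are illustrative choices, not taken from the article, and the sketch assumes the p values have already been computed by some per-gene test.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q.

    Step-up rule: sort the p values, find the largest rank k with
    p_(k) <= q * k / m, and reject the k smallest p values.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices of p values, ascending
    thresholds = q * np.arange(1, m + 1) / m   # q * k / m for k = 1..m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()         # largest rank meeting its threshold
        rejected[order[:k + 1]] = True         # reject everything up to that rank
    return rejected

# Hypothetical p values for ten genes (made up for illustration).
p_example = [0.041, 0.001, 0.212, 0.008, 0.039, 0.06, 0.074, 0.205, 0.042, 0.216]
print(benjamini_hochberg(p_example, q=0.05))
```

Because the rule is step-up, a gene whose own p value exceeds its rank threshold can still be rejected if a larger p value further down the sorted list meets its threshold.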

1.1 Multigroup Data

Multigroup data refers to microarray data collected over different experimental conditions, such as from distinct stages of a disease process. Because of the many questions that could be asked when analyzing such data, most approaches start by simplifying the problem into a composite question that can be tested using a one-dimensional test statistic for each gene. Although this strategy is certainly convenient (e.g., making it possible to apply standard error control methods such as the FDR), it may not be optimal for several reasons. First, the underlying test statistic is likely to be fairly elementary, and thus highly variable, because it will not be regularized, that is, constructed in a way that carefully uses information across all genes and samples. Regularization is an important concept in microarray settings, where sample sizes are small. Second, composite statistics are seriously limited in the information that they provide. Consider an analysis using contrasts to check for a specific pattern of differential expression across groups. For example, consider a gene that differentially expresses early on in a disease process such as colon cancer, significantly affecting the biological milieu making it possible for other genes to act, but then later vanishes in terms of biological effect. We call this a hit-and-run hypothesis. A contrast, or set of contrasts, looking for hit-and-run genes would simply provide what is equivalent to a p value for rejecting a null hypothesis of no such pattern being present. What it would not identify is which genes among all such interesting genes are most likely to be truly off for the remainder of the biological process.

Beyond such conventional approaches, very little research seems to have been directed toward the multigroup problem. One notable exception is the recent work by Kendziorski, Newton, Lan, and Gould (2003), which used a parametric empirical Bayes method to compute posterior odds for group expression patterns. (There are approximately g! of these if there are g groups.) This requires that gene expression values be exchangeable from a common null and alternative distribution adequately approximated as either a gamma or a lognormal mixture distribution. Genes are classified into a group pattern according to maximum posterior odds. However, although classifying genes into hit-and-run categories is relatively straightforward with a posterior odds approach, it might be difficult to optimally rank genes within a category.

1.2 Contributions and Outline of the Article

Recently, Ishwaran and Rao (2003) introduced a method for detecting differentially expressing genes between two biological groups, termed Bayesian ANOVA for microarrays (BAM). This method recasts the statistical problem as a high-dimensional variable selection problem and uses a specific Bayesian hierarchical model geared toward adaptive shrinkage. Using model averaging, a way of accounting for model uncertainty, BAM provides gene effect estimates shrunken relative to standard least squares estimates, with the shrinkage falling primarily on the nondifferentially expressing gene effects. This is a general phenomenon that we call selective shrinkage, which plays a crucial role in our extension of BAM to multigroup data.

This extension differs in some subtle but important ways from the original methodology. One key innovation is our use of orthogonality. We show how to cast the multigroup microarray problem in terms of an ANOVA framework and then transform the problem into a high-dimensional orthogonal model after a simple dimension-reduction and rescaling step. The transformed data are then modeled using a Bayesian rescaled spike and slab model as introduced by Ishwaran and Rao (2005). Besides leading to computational simplifications, orthogonality is a key ingredient in establishing selective shrinkage and other theoretical properties of the method. These results, outlined shortly, provide a deeper understanding of the methodology than those of Ishwaran and Rao (2003). Another important advancement is our ability to systematically deal with heterogeneity of variances across genes. We show how a weighted regression clustering technique used in tandem with graphical diagnostic plots can effectively deal with this problem without resorting to global transformations that can distort signal to noise ratios.

Sections 3-5 contain the theoretical underpinnings of our methodology. Section 6 illustrates the methodology on a large database involving colon cancer. The article concludes with a discussion in Section 7. Our key contributions and primary findings are summarized as follows:

1. Approaching the problem through an ANOVA framework is advantageous, because it allows estimation of all gene-group differential effects simultaneously and avoids having to resort to pairwise group comparisons or user-constructed contrasts to identify interesting gene expression profiles across experimental groups.

2. False detection rates for least squares based test statistics are inflated due to a regression to the mean effect in multigroup problems. This effect, which is due to correlation between test statistics, is mitigated using a rescaled spike and slab approach because of the effects of shrinkage.

3. Under a suitably chosen bimodal hypervariance prior, we are able to achieve a selective shrinkage property in which Bayesian test statistics are large when the differential effect is real (Corollary 1) and shrunk to zero for differential effects that are zero, with this latter effect quantified using an exact representation (Thm. 3). These Bayesian test statistics can be viewed as solutions to a least squares constrained optimization problem in which each gene-group differential parameter has a unique penalty term that is adaptively estimated from the data.

4. We demonstrate that selective shrinkage is sufficient for oracle-like uniformly low-risk misclassification (Thm. 2) and that risk performance improves with sparsity of the parameter space. Given that many gene-group differential parameters are expected to be zero, this suggests that misclassification performance of rescaled spike and slab models naturally improves in multigroup problems.

5. We derive an adaptive data-driven graphical cutoff rule. We characterize the rule by showing that it has the property of optimally (asymptotically) separating genes that are differentially expressing from those that are not (Thm. 4). This shows that the method is risk consistent.

6. We show how rescaled spike and slab models can be used to pull out complex patterns of differential gene expression across stages of colon cancer and liver metastasis that include the patterns described earlier as well as many other interesting patterns. A detailed biological functional analysis of selected genes provides insight into pathways that are activated or deactivated across stages of this disease. Graphical plots for simultaneously visualizing all stagewise differential effects are introduced.
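The penalized least squares view in contribution 3 can be loosely illustrated with a generalized ridge estimate in which every coefficient has its own penalty. The sketch below is not the rescaled spike and slab model: the penalties are fixed by hand rather than adaptively estimated from the data, and the design, the effect sizes, and the names `generalized_ridge` and `penalties` are all assumptions made for illustration. It only shows that, under an orthogonal design, a small penalty leaves an estimate nearly untouched while a large penalty shrinks it toward zero.

```python
import numpy as np

def generalized_ridge(X, y, penalties):
    """Minimize ||y - X b||^2 + sum_k penalties[k] * b[k]**2."""
    X = np.asarray(X, dtype=float)
    return np.linalg.solve(X.T @ X + np.diag(penalties), X.T @ y)

rng = np.random.default_rng(1)
n, K = 200, 4

# Orthogonal design rescaled so that X'X = n * I (an assumption of this sketch).
Q, _ = np.linalg.qr(rng.standard_normal((n, K)))
X = np.sqrt(n) * Q

# Hypothetical truth: two real effects and two null effects.
beta_true = np.array([2.0, 0.0, -1.5, 0.0])
y = X @ beta_true + rng.standard_normal(n)

ols = X.T @ y / n                               # least squares under X'X = n * I
penalties = np.array([0.1, 1e4, 0.1, 1e4])      # hand-picked: small for real, large for null
shrunk = generalized_ridge(X, y, penalties)

# With an orthogonal design each coordinate is shrunk separately:
# b_k = n * ols_k / (n + penalty_k).
closed_form = n * ols / (n + penalties)

print("least squares:", np.round(ols, 3))
print("shrunk       :", np.round(shrunk, 3))
```

Orthogonality is what makes the coordinates decouple here, so each coefficient is shrunk by its own factor n / (n + penalty) without affecting the others.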

2. RESCALED SPIKE AND SLAB MODELS FOR MULTIGROUP DATA

The data used in illustrating our approach come from a large microarray repository of colon cancer samples of various stages collected at the Ireland Cancer Center of Case Western Reserve University. All gene expression data were compiled using high-density 59K-on-one gene chips developed by EOS Biotechnology. These are Affymetrix-derived chips with proprietary probe sets. The high density of probe sets reflects known genes and ESTs (expressed sequence tags), as well as predicted exons.

Consider Figure 1, which is part of a detailed analysis given later in Section 6. The figure is based on data from four distinct types of colon tissue samples: Duke's B, C, and D, and liver METS, as expressed by the Astler-Coller-Duke staging system (Cohen, Minsky, and Schilsky 1997). The Duke B's in our dataset were actually Duke BSurvivors, comprising patients still alive from the time of initial diagnosis, and represent an intermediate stage of cancer. Duke's C tumors represent a progressive worsening of the disease as the cancer begins to spread deeper into the colon wall from the innermost tissue layers, and also to nearby lymph nodes. Liver METS (METS) represent the most advanced stage of the disease, where the tumor has metastasized to a distant site, in this case the liver (the other major site is the lung). Duke's D tumors correspond to the tumor deposit remaining in the primary organ site after metastasis.

[FIGURE 1 OMITTED]

Plotted in Figure 1 are Bayesian estimated differential gene effects (defined later as Zcut values) for comparing the METS and D's with the BSurvivors (x- and y-axes). Also overlaid on the plot are triangles for identifying genes turning off or on for stage C relative to the BSurvivors. In the figure we have used color to highlight stagewise gene effects of biological interest. Points colored in magenta are genes with significant differential expression across the D's and METS, being either turned on or turned off relative to the BSurvivors. For example, the small cluster of magenta triangles in the bottom-left quadrant indicates genes that turn off throughout the C, D, and METS samples. Data points colored in green and blue indicate genes that are significant (in either direction) for only the stage D's and not the METS or for only the METS and not the stage D's. In particular, green points that hug the y-axis are those exhibiting significant changes from BSurvivors to D's but whose METS expression resembles the BSurvivors. These are hit-and-run genes, mentioned in Section 1, and are of particular importance, because they have a very specific early effect only.

Least squares estimates (Z-tests) from a standard ANOVA model provide a strikingly different plot (Fig. 2). Especially apparent is the ellipsoid nature of the figure. As we show later (Sec. 3), this is due to a regression to the mean effect caused by the correlation between the Z-statistics for the METS versus BSurvivors and the D's versus BSurvivors. Regression to the mean inflates false detections and makes it difficult to delineate signal from noise. Notice how difficult it is to identify any hit-and-run candidates. For example, early hit-and-run genes might be the ones in the quadrants indicated by the two arrows, but this is not so clear.
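The correlation between the two Z-statistics can be seen in a small simulation. The sketch below is not the article's analysis: it assumes equal group sizes, unit-variance Gaussian noise, and no differential expression in any gene, and the variable names and sizes are made up. Because both comparisons subtract the same baseline mean, the two statistics share a common noise term, and under these assumptions their theoretical correlation is 0.5 even though no gene carries any signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_per_group = 20000, 10   # hypothetical sizes

# Null data: no gene is differentially expressed in any group.
baseline = rng.standard_normal((n_genes, n_per_group))   # stands in for BSurvivors
stage_d = rng.standard_normal((n_genes, n_per_group))    # stands in for D's
mets = rng.standard_normal((n_genes, n_per_group))       # stands in for METS

def z_vs_baseline(group, base):
    """Per-gene Z-statistic for a group mean minus the baseline mean (known unit variance)."""
    diff = group.mean(axis=1) - base.mean(axis=1)
    return diff / np.sqrt(1.0 / group.shape[1] + 1.0 / base.shape[1])

z_d = z_vs_baseline(stage_d, baseline)
z_mets = z_vs_baseline(mets, baseline)

# Both statistics reuse the same baseline mean, which induces positive correlation.
corr = np.corrcoef(z_d, z_mets)[0, 1]
print(f"correlation between the two Z-statistics: {corr:.2f}")
```

A scatterplot of `z_mets` against `z_d` from this simulation forms a tilted ellipse, so pure-noise genes that are extreme on one axis tend to look extreme on the other as well.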

[FIGURE 2 OMITTED]

2.1 Multigroup...
