
A two-way semilinear model for normalization and analysis of cDNA microarray data.

Publication: Journal of the American Statistical Association
Publication Date: 01-SEP-05

Article Excerpt
1. INTRODUCTION

Microarray technology has become a useful tool for quantitatively monitoring gene expression patterns and has been widely used in functional genomics (Schena, Shalon, Davis, and Brown 1995; Brown and Botstein 1999). In a cDNA microarray experiment, cDNA segments representing the collection of genes and expressed sequence tags (ESTs) to be probed are amplified by polymerase chain reaction and spotted in high density on glass microscope slides using a robotic system. Such slides are called microarrays. Each microarray contains thousands of reporters of the collection of genes or ESTs. The microarrays are queried in a cohybridization assay using two fluorescently labeled biosamples prepared from the cell populations of interest. One sample is labeled with the fluorescent dye Cy5 (red); the other, with the fluorescent dye Cy3 (green). Hybridization is assayed using a confocal laser scanner to measure fluorescence intensities, allowing simultaneous determination of the relative expression levels of all of the genes represented on the slide (Hegde et al. 2000).

A basic issue in analyzing cDNA microarray data is normalization, the purpose of which is to remove systematic bias in the observed expression values by establishing a normalization curve across the whole dynamic range. A proper normalization procedure ensures that the normalized intensity ratios provide meaningful measures of relative expression levels. Normalization is needed because many factors (including differential efficiency of dye incorporation, differences in the amount of RNA labeled between the two channels, uneven hybridization, and differences in the printing pinheads, among others) may cause bias in the observed expression values. Therefore, proper normalization is a critical component in the analysis of microarray data and can have an important impact on higher-level analyses, such as detection of differentially expressed genes, classification, and cluster analysis.

Yang, Dudoit, Luu, and Speed (2001) systematically considered several normalization methods, including global, intensity-dependent, and dye-swap normalization. The global normalization method assumes a constant normalization factor for all of the genes and rescales the red and green channel intensities so that the mean or median of the intensity log-ratios is 0. For intensity-dependent normalization, Yang et al. (2001) proposed using the locally weighted linear scatterplot smoother (lowess) (Cleveland 1979) in the scatterplot of log-intensity ratio versus log-intensity product (the M-A plot) and using the resulting residuals as the normalized log-intensity ratios. The analysis of variance (ANOVA) method (Kerr, Martin, and Churchill 2000) and the mixed linear model method (Wolfinger et al. 2001) take into account array and dye effects, among others, in a linear model framework and assume constant normalization factors. Fan, Tam, Vande Woude, and Ren (2004b) proposed a semilinear in-slide model (SLIM) method that uses replications of a subset of the genes in an array. Fan, Peng, and Huang (2004a) generalized the SLIM method to account for across-array information, resulting in an aggregated SLIM, so that replication within an array is no longer required. Park et al. (2003) conducted comparisons of a number of normalization methods, including global, linear, and lowess normalization methods. All of the methods described here, except the ANOVA method and the method of Fan et al. (2004a), treat normalization as a step separate from the subsequent significance analysis, in which the variation due to normalization is not taken into account.
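As a minimal sketch of the global normalization described above (not code from any of the cited papers), the log-ratios of one array can be shifted by a single constant so that their median is 0. The function name `global_normalize` and the sample values are illustrative assumptions.

```python
import statistics

def global_normalize(log_ratios):
    """Shift log-intensity ratios by one constant so that their median is 0."""
    center = statistics.median(log_ratios)
    return [y - center for y in log_ratios]

# Hypothetical log2 ratios for five spots on one array.
raw = [0.9, 1.1, 1.0, 3.0, 0.8]
normalized = global_normalize(raw)
print(statistics.median(normalized))  # 0.0
```

Because the shift is the same for every spot, this correction cannot remove a bias that changes with intensity, which is the motivation for the intensity-dependent methods.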

The lowess normalization is one of the most widely used normalization methods. It requires that at least one of two biological assumptions be satisfied: (a) the proportion of differentially expressed genes is small, or (b) there is symmetry in the expression values between up-regulated and down-regulated genes. These two assumptions reduce the likelihood that the differentially expressed genes are incorrectly "normalized." For experiments in which both assumptions are violated, the lowess normalization method is not appropriate, and Yang et al. (2001) suggested using dye-swap normalization. This approach assumes that the normalization curves in the two dye-swapped slides are the same. Because of slide-to-slide variation, this assumption may not always be satisfied. To alleviate the dependence of the lowess normalization method on assumption (a) or (b), Tseng, Oh, Rohlin, Liao, and Wong (2001) proposed using a rank-based procedure to first select a set of invariant genes that are likely to be constantly expressed, then carrying out lowess normalization using this set of genes. However, these authors pointed out that the set of selected genes may be relatively small and may not cover the whole dynamic range of the expression values, and that extrapolation is needed to fill in the gaps not covered by the invariant genes.

We propose a two-way semilinear model (TW-SLM) for normalization of cDNA microarray data. This model is motivated in part by examining the lowess normalization from a semiparametric regression standpoint. The TW-SLM normalization method does not make the assumptions underlying the lowess normalization method, and does not require preselection of invariant genes or replicated genes in an array. The TW-SLM also provides a framework for incorporating variability due to normalization into significance analysis of microarray data. In a brief, preliminary version of this article, Huang et al. (2003) provided a description of the TW-SLM and reported an application of it to a hippocampus dataset from the Soares Lab at the University of Iowa. We organize the article as follows. In the next section we describe the TW-SLM for microarray data. In Section 3 we describe a Gauss-Seidel algorithm for computing the normalization curves and the estimated relative expression values based on the TW-SLM model. In Section 4 we present a method for detecting differentially expressed genes based on the TW-SLM. In Section 5 we provide theoretical results for the proposed estimators of TW-SLM. In Section 6 we illustrate the proposed method by an example. We also use simulation to compare the proposed method with the lowess normalization method and an analog of the lowess method where splines are used in curve fitting instead of local regression. We give some concluding remarks in Section 7.

2. A TWO-WAY SEMILINEAR MODEL FOR MICROARRAY DATA

To motivate the proposed TW-SLM model for normalization, we first give a description of the lowess normalization method from the semiparametric regression standpoint. Because the proposed TW-SLM can be considered an extension of the standard semiparametric regression model (SRM) (Wahba 1984; Engel, Granger, Rice, and Weiss 1986), we also give a brief description of this model.

2.1 The Lowess Normalization

Suppose that there are $J$ genes and $n$ arrays in the study and that each gene is spotted once in an array. Let $u_{ij}$ and $v_{ij}$ be the intensity levels of gene $j$ in array $i$ from the type 1 and type 2 samples. Following Chen, Dougherty, and Bittner (1997) and Yang et al. (2001), let $y_{ij}$ be the log-intensity ratio of the $j$th gene in the $i$th array, and let $x_{ij}$ be the corresponding average of the log-intensities. That is,

$$y_{ij} = \log_2\frac{u_{ij}}{v_{ij}}, \qquad x_{ij} = \frac{1}{2}\log_2(u_{ij}v_{ij}), \qquad i = 1, \ldots, n, \quad j = 1, \ldots, J. \tag{1}$$
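The transformation in (1) can be sketched for a single spot as follows. The function name `log_ratio_and_mean` and the intensity values are illustrative assumptions; real scanner output would also need background correction, which is not covered here.

```python
import math

def log_ratio_and_mean(u, v):
    """Return (y, x) from (1): y = log2(u / v), x = 0.5 * log2(u * v)."""
    if u <= 0 or v <= 0:
        raise ValueError("intensities must be positive to take logarithms")
    log_u, log_v = math.log2(u), math.log2(v)
    y = log_u - log_v          # log-intensity ratio
    x = 0.5 * (log_u + log_v)  # average log-intensity
    return y, x

# Hypothetical intensities for one spot in the two channels.
y, x = log_ratio_and_mean(8.0, 2.0)
print(y, x)  # 2.0 2.0
```

Working with the two logarithms separately avoids forming the product of two large intensities before taking the logarithm.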

For the $i$th array, $i = 1, \ldots, n$, the lowess normalization fits the nonparametric regression

$$y_{ij} = f_i(x_{ij}) + \epsilon^*_{ij}, \qquad j = 1, \ldots, J, \tag{2}$$

using Cleveland's lowess method. Let $\hat{f}_i$ be the lowess estimator of $f_i$, and let the residuals from the nonparametric curve fitting be

$$\hat{\epsilon}^*_{ij} = y_{ij} - \hat{f}_i(x_{ij}), \qquad i = 1, \ldots, n, \quad j = 1, \ldots, J.$$

These residuals are defined as the normalized data and are used as the input to the subsequent analysis. Thus the overall analysis usually consists of two steps: normalization, and analysis based on the normalized data $\hat{\epsilon}^*_{ij}$. For example, in comparing two DNA samples using a direct-comparison design (i.e., the two cDNA samples are competitively hybridized on an array), a typical approach is to first normalize the data using the lowess normalization, then make inference about differentially expressed genes based on the normalized data. The underlying statistical framework of such a two-step analysis in the direct-comparison design can be described using two models. The first of these models is the nonparametric regression for normalization given in (2); the second concerns the residual

$$\epsilon^*_{ij} = \beta_j + \epsilon_{ij}, \tag{3}$$

where $\beta_j$ is the underlying relative expression value of gene $j$. The goal of the significance analysis is to detect genes with $\beta_j \neq 0$. In the two-step approach, (2) and (3) are used as standalone models for each of the two steps, and the effects of the approximation $\hat{\epsilon}^*_{ij} \approx \epsilon^*_{ij}$ typically are completely ignored in the analysis.

The lowess normalization is usually applied using all of the genes in a study. In general, if all of the genes are used, then the differentially expressed genes may be incorrectly "normalized," because such genes tend to pull the normalization curve toward themselves. Thus the two-step analysis approach may yield biased estimators of both $f_i$ and $\beta_j$ and inefficient test statistics for the inference of $\beta_j$ (e.g., relatively large $p$ values for two-sided tests compared with more efficient procedures).
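The normalization step in (2) and the resulting residuals can be sketched with a simplified smoother. This is an assumption-laden stand-in, not Cleveland's full lowess: it fits a local straight line with tricube weights but omits the robustness iterations. The names `tricube`, `local_linear_smooth`, `normalize_array`, and the parameter `frac` are hypothetical.

```python
import math

def tricube(t):
    """Tricube kernel; zero outside |t| < 1."""
    t = abs(t)
    return (1.0 - t ** 3) ** 3 if t < 1.0 else 0.0

def local_linear_smooth(x, y, frac=0.5):
    """Fit a weighted straight line around each x value; return fitted values."""
    n = len(x)
    k = min(n, max(3, math.ceil(frac * n)))  # points used in each local fit
    fitted = []
    for x0 in x:
        h = sorted(abs(xi - x0) for xi in x)[k - 1]  # distance to k-th nearest point
        w = [tricube((xi - x0) / h) if h > 0 else float(xi == x0) for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 1e-12 else 0.0  # fall back to a local mean
        fitted.append(my + slope * (x0 - mx))
    return fitted

def normalize_array(x, y, frac=0.5):
    """Return the residuals y - f_hat(x), i.e. the normalized log-ratios."""
    return [yi - fi for yi, fi in zip(y, local_linear_smooth(x, y, frac))]

# Hypothetical array whose log-ratios carry a purely linear dye bias.
x = [6.0 + 0.5 * j for j in range(12)]
y = [0.2 * xi - 1.0 for xi in x]
print(max(abs(r) for r in normalize_array(x, y)))
```

Because every spot contributes to the fitted curve, a spot with a truly nonzero gene effect also shifts the curve near its own intensity, which is the bias discussed above.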

2.2 The Semiparametric Regression Model

Suppose that the data consist of $n$ triplets $(y_i, x_i, z_i)$, $i = 1, \ldots, n$, where $y_i$ is the response variable and $(x_i, z_i)$ is the covariate. The SRM is

$$y_i = f(x_i) + z_i'\beta + \epsilon_i, \qquad i = 1, \ldots, n, \tag{4}$$

where $f$ is an unknown function, $\beta$ is the regression parameter, and $\epsilon_i$ is the residual. This model is useful in many situations, for example, when $z_i$ is a dichotomous variable representing two conditions (e.g., treatment vs. placebo) and we are interested in the treatment effect $\beta$ but need to adjust for the effect of the continuous covariate $x_i$. For a $p$-dimensional covariate $x_i = (x_{i1}, \ldots, x_{ip})'$, it is useful to impose an additive structure on $f$ (Hastie, Tibshirani, and Friedman 2001). A semiparametric generalized additive model is

$$y_i = f_1(x_{i1}) + \cdots + f_p(x_{ip}) + z_i'\beta + \epsilon_i, \qquad i = 1, \ldots, n. \tag{5}$$

Models (4) and (5) are two basic semiparametric models. There are two important considerations about parameter estimation in (4) and (5). First, both $f$ and $\beta$ should be estimated jointly. For example, it is incorrect to fix $\beta$ at 0, obtain an estimate of $f$, treat this estimate of $f$ as a known quantity, substitute it back into (4), and then estimate $\beta$. Second, the uncertainty due to estimation of $f$ generally needs to be taken into account in estimating $\beta$, according to semiparametric information theory (see, e.g., Bickel, Klaassen, Ritov, and Wellner 1993, pp. 107-109).

2.3 The Two-Way Semilinear Model

We first describe the proposed model for the special case of a direct-comparison design, in which two cDNA samples from the respective cell populations are competitively hybridized on the same array. Let $y_{ij}$ and $x_{ij}$ be the log-intensity ratio and the average log-intensity defined in (1). The proposed (simple) TW-SLM is

$$y_{ij} = f_i(x_{ij}) + \beta_j + \epsilon_{ij}, \qquad i = 1, \ldots, n, \quad j = 1, \ldots, J, \tag{6}$$

where $f_i$ is the intensity-dependent normalization curve for the $i$th array, $\beta_j \in \mathbb{R}$ represents the normalized relative expression value of gene $j$, and $\epsilon_{ij}$ has mean 0 and variance $\sigma_{ij}^2$.

The TW-SLM can be considered a combination of the two models implicitly used in the lowess normalization, (2) and (3). Specifically, we obtain (6) by simply substituting (3) into (2). Combining these two models enables us to estimate normalization curves and gene effects simultaneously. This is desirable, because we typically do not know which genes are constantly expressed (i.e., with $\beta_j = 0$). Approximately unbiased normalization could be carried out using only constantly expressed genes if a large set of such genes could be identified, but this is rarely the case.
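The idea of estimating the curves and the gene effects in (6) simultaneously can be illustrated with a generic backfitting loop that alternates between the two sets of unknowns. This is only a sketch under stated assumptions and is not the Gauss-Seidel algorithm of Section 3, which this excerpt does not show: it uses a simple local linear smoother instead of lowess or splines, assumes equal error variances so that each gene effect is a plain average, and centers the gene effects to sum to 0 as an assumed identifiability constraint. All function names, `frac`, `n_iter`, and the toy data are hypothetical.

```python
import math

def smooth(x, y, frac=0.6):
    """Local linear smoother with tricube weights (a simple stand-in for lowess)."""
    n = len(x)
    k = min(n, max(3, math.ceil(frac * n)))
    fitted = []
    for x0 in x:
        h = sorted(abs(xi - x0) for xi in x)[k - 1]
        w = []
        for xi in x:
            d = abs(xi - x0)
            if h > 0:
                w.append((1.0 - (d / h) ** 3) ** 3 if d < h else 0.0)
            else:
                w.append(1.0 if d == 0 else 0.0)
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 1e-12 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def tw_slm_backfit(x, y, n_iter=20, frac=0.6):
    """x[i][j], y[i][j] for array i and gene j. Returns (beta, curves)."""
    n, J = len(y), len(y[0])
    beta = [0.0] * J
    curves = [[0.0] * J for _ in range(n)]
    for _ in range(n_iter):
        # Update each curve f_i with the gene effects held fixed.
        for i in range(n):
            partial = [y[i][j] - beta[j] for j in range(J)]
            curves[i] = smooth(x[i], partial, frac)
        # Update each gene effect with the curves held fixed.
        beta = [sum(y[i][j] - curves[i][j] for i in range(n)) / n for j in range(J)]
        # Assumed identifiability constraint: gene effects sum to 0.
        mean_beta = sum(beta) / J
        beta = [b - mean_beta for b in beta]
    return beta, curves

# Hypothetical toy data: 3 arrays, 12 genes, only gene 0 differentially expressed.
n, J = 3, 12
true_beta = [1.5] + [0.0] * (J - 1)
x = [[6.0 + 0.5 * j + 0.1 * i for j in range(J)] for i in range(n)]
y = [[0.3 * (i + 1) * (x[i][j] - 9.0) + true_beta[j] for j in range(J)] for i in range(n)]
beta_hat, _ = tw_slm_backfit(x, y)
print([round(b, 2) for b in beta_hat])  # estimates are centered, so compare differences
```

Starting from all gene effects equal to 0, the first pass of the loop reproduces the two-step approach; later passes let the current gene effects be removed before each curve is refit.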

We call (6) a two-way model because it also can be considered a semiparametric generalization of the two-way ANOVA model. That is, when $f_i = \alpha_i$, $i = 1, \ldots, n$, where $\alpha_i$ is a constant parameter, (6) simplifies to the two-way ANOVA. The TW-SLM is an extension of, but is different from, the SRM (4). Clearly, it is also different...


