
Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency.

Publication: Journal of the American Statistical Association
Publication Date: 01-SEP-05

Article Excerpt
1. INTRODUCTION

DNA microarrays monitor the expression of tens of thousands of genes in a single hybridization experiment using oligonucleotide or cDNA probes. The technique has been widely used in biomedical research and biological studies. A challenge in analyzing microarray data is the presence of systematic biases due to variations in experimental conditions, such as the efficiency of dye incorporation, intensity effect, DNA concentration on arrays, amount of mRNA, variability in reverse transcription, and batch variation, among others. Normalization is required to remove the systematic effects of confounding factors so that meaningful biological results can be obtained.

Several useful normalization techniques aim to remove the systematic biases such as the dye, intensity, and print-tip block effects. The simplest such technique is the global normalization method featured in software packages such as GenePix4.0 and analyzed by Kroll and Wolfl (2002). Such a technique implicitly assumes that there is no print-tip block effect and no intensity effect. Without such an assumption, the method is statistically biased. The "lowess" method of Dudoit et al. (2002) significantly relaxes the foregoing assumption. But it assumes that the average expression levels of up-regulated and down-regulated genes at each intensity level are about the same in each print-tip block. This assumption was further relaxed by Tseng, Oh, Rohlin, Liao, and Wong (2001) to only a subset of more conservative genes based on a rank-invariant selection method. As admitted by Tseng et al. (2001), the method is not expected to be useful when there are far more genes that are up-regulated (or down-regulated). Such situations can occur when cells are treated with some reagents (Grolleau et al. 2002; Fan, Tam, Vande Woude, and Ren 2004). In an attempt to further relax the foregoing biological assumption, Huang, Wang, and Zhang (2003) and Huang and Zhang (2003) introduced a semilinear model to account for the intensity effect and to aggregate information from other arrays to assess the intensity effect. The method is expected to work well when the gene effect is the same across arrays.

In an attempt to further relax the aforementioned statistical and biological assumptions in the cDNA microarray normalization, Fan et al. (2004) developed a new method of estimating the intensity and print-tip block effects by aggregating information from the replications within a cDNA array. cDNA microarray chips are usually constructed by dipping a printer head containing 16 spotting pins into a 96-well plate containing cDNA solutions, printing these 16 spots on the slide, washing the spotting pins, dipping them into 16 different wells and printing again, and so on (see Craig, Black, and Doerge 2003 for details). For the specific designs of cDNA microarrays used by Fan et al. (2004), there are 111 clones that are printed twice on the cDNA chips. The locations of these 222 replications appear random in the 32 blocks (see Sec. 4.2). In other words, replications are achieved not by printing the same 16 spots on a printer head twice, but by constructing the wells in the plates themselves. The replications of the clones in the cDNA chips contain much information about systematic biases, such as the print-tip block and intensity effects. In fact, for two identical clones of cDNA in the same slide, apart from the random errors, the expression ratios should be the same. Observed differences in expression ratios therefore carry considerable information about the print-tip block and intensity effects. The seemingly random patterns of replications enable one to unveil the print-tip block effect. This cannot be achieved if only the same 16 spots in a well plate are printed twice.

To put the foregoing problem into a statistical framework, let G be the number of genes and let I be the number of replications of gene g within an array. (Strictly, I should depend on g, because most genes do not have replications.) Following Dudoit et al. (2002), let [R.sub.gi] and [G.sub.gi] be the red (Cy5) and green (Cy3) intensities of the gth gene in the ith replication. Let [Y.sub.gi] be the log-intensity ratio of red over green channels of the gth gene in the ith replication, and let [X.sub.gi] be the corresponding average of the log intensities of the red and green channels, that is,

$$Y_{gi} = \log_2\left(R_{gi}/G_{gi}\right), \qquad X_{gi} = \tfrac{1}{2}\log_2\left(R_{gi} G_{gi}\right).$$
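As a minimal sketch of this transformation, the following Python fragment computes the log-intensity ratio Y and the average log intensity X from raw channel intensities. The function name `ma_transform` and the intensity values are hypothetical and chosen only so that the arithmetic is easy to follow; they are not taken from the data of Section 4.2.

```python
import numpy as np

def ma_transform(red, green):
    """Return (Y, X): the log2 ratio and the average log2 intensity."""
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    y = np.log2(red / green)          # Y = log2(R / G)
    x = 0.5 * np.log2(red * green)    # X = (1/2) log2(R * G)
    return y, x

# Hypothetical intensities: two spots with swapped channels.
y, x = ma_transform([1024.0, 256.0], [256.0, 1024.0])
```

Swapping the two channels flips the sign of Y but leaves X unchanged, which is why X can serve as a channel-symmetric measure of overall spot intensity.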

To model the intensity and print-tip block effects, we consider the following high-dimensional partial linear model for microarray data:

$$Y_{gi} = \alpha_g + \beta_{r_{gi}} + \gamma_{c_{gi}} + m(X_{gi}) + \varepsilon_{gi}, \tag{1}$$

where [[alpha].sub.g] is the treatment effect associated with the gth gene; [r.sub.gi] and [c.sub.gi] are the row and column of the print-tip block where the gth gene of the ith replication resides; [beta] and [gamma] are the row and column effects with constraints

$$\sum_{i=1}^{r} \beta_i = 0 \quad \text{and} \quad \sum_{j=1}^{c} \gamma_j = 0,$$

where r and c are the numbers of rows and columns of the print-tip blocks; m(*), with constraint Em([X.sub.gi]) = 0, is a smooth function of X representing the intensity effect; and [[epsilon].sub.gi] is random error with mean 0 and variance [[sigma].sup.2]([X.sub.gi]).
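To make the structure of model (1) concrete, the following Python sketch simulates data from it. Everything numerical here is an assumption made for illustration: the effect sizes, the sinusoidal form of the intensity effect, the uniform distribution of X, and a constant error variance (the model itself allows the variance to depend on X). Only G = 111, I = 2, r = 8, and c = 4 are borrowed from the illustrative example.

```python
import numpy as np

rng = np.random.default_rng(0)
G, I = 111, 2          # replicated genes and replications per gene
r, c = 8, 4            # rows and columns of print-tip blocks

# Gene (treatment) effects alpha_g: free nuisance parameters.
alpha = rng.normal(0.0, 1.0, size=G)

# Row and column effects, centered to satisfy the sum-to-zero constraints.
beta = rng.normal(0.0, 0.2, size=r)
beta -= beta.mean()
gamma = rng.normal(0.0, 0.2, size=c)
gamma -= gamma.mean()

# Block row/column of each spot; replications land in random blocks.
rows = rng.integers(0, r, size=(G, I))
cols = rng.integers(0, c, size=(G, I))

# Average log intensities and a hypothetical smooth intensity effect.
X = rng.uniform(6.0, 14.0, size=(G, I))
m_raw = 0.3 * np.sin(X / 2.0)
m_vals = m_raw - m_raw.mean()   # empirical stand-in for E m(X) = 0

# Homoscedastic errors (an assumption; the model permits sigma^2(X)).
eps = rng.normal(0.0, 0.15, size=(G, I))

Y = alpha[:, None] + beta[rows] + gamma[cols] + m_vals + eps
```

Each row of `Y` holds the two replications of one gene. They share the same alpha but generally differ in block position and intensity, which is the source of information about the block and intensity effects.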

In our illustrative example in Section 4.2, there are 19,968 genes in an array, residing in 8 X 4 blocks with r = 8 and c = 4. Among those are 111 genes with two replications. For those genes without replications, because [[alpha].sub.g] is free, they provide no information about the parameters [beta] and [gamma] and the smooth function m(*). We need to estimate the parameters from the genes with replications. With a slight abuse of notation, for this illustrative example, G = 111 and I = 2.

For the purpose of normalization, our aim is to find a good estimate of the print-tip block and intensity effects. Let [^.[beta]], [^.[gamma]], and [^.m](*) be good estimates for model (1). Then the normalization is to compute

$$Y^*_g = Y_g - \hat{\beta}_{r_g} - \hat{\gamma}_{c_g} - \hat{m}(X_g) \tag{2}$$

for all genes. Interpolations and extrapolations are needed to expedite the computation when m(*) is estimated over a set of fine grid points. According to model (1), [Y*.sub.g] [approximately equal to] [[alpha].sub.g] + [[epsilon].sub.g], in which the effects of confounding factors have been removed. Thus, as far as the process of the normalization is concerned, the parameters [beta] and [gamma] and the function m(*) are of primary interest and the parameters {[[alpha].sub.g]} are nuisance parameters. Of course, in the analysis of treatment effect on genes, the parameters {[[alpha].sub.g]} represent biological fold changes and are of primary interest.
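The normalization step (2) can be sketched in Python as follows, assuming that estimates of the block effects and of m on a grid are already available. The function name `normalize`, the grid, and all numerical values are hypothetical. Linear interpolation via `np.interp` is one simple choice; it holds the boundary value constant outside the grid, which is a crude form of extrapolation rather than the authors' procedure.

```python
import numpy as np

def normalize(y, x, rows, cols, beta_hat, gamma_hat, grid, m_hat_grid):
    """Subtract estimated row, column, and intensity effects from y."""
    # Evaluate m-hat at each x by linear interpolation on the grid;
    # values of x outside the grid receive the nearest endpoint value.
    m_hat = np.interp(x, grid, m_hat_grid)
    return y - beta_hat[rows] - gamma_hat[cols] - m_hat

# Hypothetical estimates for a 2 x 2 layout of blocks.
beta_hat = np.array([0.1, -0.1])
gamma_hat = np.array([0.05, -0.05])
grid = np.array([6.0, 10.0, 14.0])
m_hat_grid = np.array([-0.2, 0.0, 0.2])

y = np.array([1.0, -0.5])
x = np.array([8.0, 16.0])      # the second value lies beyond the grid
rows = np.array([0, 1])
cols = np.array([1, 0])

y_star = normalize(y, x, rows, cols, beta_hat, gamma_hat, grid, m_hat_grid)
```

Because the estimates are stored once and then indexed or interpolated, the same function applies to every gene on the array, including those without replications.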

Model (1) has a much wider spectrum of applications than at first appearance. First, if there is no replication within an array but there are four (say) replications across arrays, by imagining a super array that contains these four arrays, "within-array" replications are artificially created. In this case I is the number of arrays, and G is the number of genes per array. The basic assumption behind this method is that the treatment effect on the genes remains the same across arrays. This is not an unreasonable assumption when the same experiment is repeated several times. Second, by removing the row and column effects and applying model (1) directly to each block of microarrays, resulting in

$$Y_{gib} = \alpha_g + m_b(X_{gi}) + \varepsilon_{gib}, \tag{3}$$

the model allows a nonadditive effect between the intensity and blocks. [The index b can be removed from model (3), and hence this model becomes a submodel of (1).] In this case G is the number of genes within a block. For example, if there are 624 genes in a block and 4 replications of arrays, then G = 624 and I = 4. Third, the idea can also be adapted to normalize the Affymetrix arrays by imagining "treatment" and "control" arrays as the outputs from green channels and red channels. This will enable us to remove intensity effects in the Affymetrix arrays. Finally, by thinking of rows as blocks and deleting the column effects, model (1) can accommodate nonadditive column and row effects. The additivity in model (1) is to facilitate the applications in which G is relatively small.

The challenge of our problem is that the number of nuisance parameters is large. In fact, for many practical situations, I = 2 and G can be large, on the order of hundreds or larger. So our asymptotic results are based on the assumption that G [right arrow] [infinity]. This is in contrast with the assumption of Huang and Zhang (2003), where I tends to infinity. The number of nuisance parameters in (1) grows with the sample size. In our illustrative example, half of the parameters are nuisance ones. The question is whether the parameters of primary interest can be consistently estimated in the presence of a large number of nuisance parameters and how much it costs to estimate these parameters. Such a problem is poorly understood, and a thorough investigation is needed.

To provide more insight into the problem, consider writing model (1) in the matrix form as

$$Y_n = B_n \alpha_n + Z_n \beta + M + \varepsilon_n, \qquad n = G \times I, \tag{4}$$

where [Y.sub.n] = ([Y.sub.1],...,[Y.sub.n])[.sup.T], [B.sub.n] is an n X G design matrix, [Z.sub.n] is an n X d random matrix with d being the sum of the numbers of rows and columns, [beta] = ([[beta].sub.1],...,[[beta].sub.r], [[gamma].sub.1],...,[[gamma].sub.c]) is the print-tip block effect, M = (m([X.sub.1]),...,m([X.sub.n]))[.sup.T] is the intensity effect, and [[epsilon].sub.n] = ([[epsilon].sub.1],...,[[epsilon].sub.n])[.sup.T]. The theory on the partial linear model is usually based on the assumption that G is fixed or at least G/n tends to 0 at a certain rate (see Hardle, Liang, and Gao 2000). However, in our application, [[alpha].sub.n] cannot be consistently estimated, because G/n = 1/I in (4). It is not clear whether the parameters [beta] and the function m(*) can be consistently estimated. The answer might not be affirmative for general matrix [B.sub.n]. For our application, the matrix [B.sub.n] is in a specific form, [B.sub.n] = [I.sub.G] [cross product] [1.sub.I], where [cross product] is the Kronecker product, [I.sub.G] is the G X G identity matrix, and [1.sub.I] is a vector of length I with...
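The special form of the design matrix can be explored numerically. The Python sketch below assumes, as the notation suggests (the excerpt breaks off before saying so), that the length-I vector in the Kronecker product is a vector of ones, and that observations are ordered gene by gene. It builds the matrix for small, hypothetical G and I and checks a standard linear-algebra fact: projecting out its column space is the same as centering each gene's replications at their own mean, which removes the gene effects. This is an illustration of the structure only, not a statement of the estimation procedure used in the article.

```python
import numpy as np

G, I = 4, 2                       # small hypothetical sizes
n = G * I

# Design matrix B = I_G (Kronecker) 1_I: row g*I + i selects gene g.
B = np.kron(np.eye(G), np.ones((I, 1)))

# Orthogonal projection onto the column space of B.
P = B @ np.linalg.inv(B.T @ B) @ B.T

rng = np.random.default_rng(1)
y = rng.normal(size=n)

# Residual after projecting out the gene effects.
resid = y - P @ y

# Equivalent computation: subtract each gene's mean over replications.
y_mat = y.reshape(G, I)
centered = (y_mat - y_mat.mean(axis=1, keepdims=True)).ravel()

# The gene-effect columns themselves are annihilated by the projection.
annihilated = B - P @ B
```

Here `resid` and `centered` agree up to rounding error, so the G nuisance parameters can be eliminated by within-gene differencing without being estimated individually.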
