Home | Business News | Browse by Publication | J | Journal of the American Statistical Association

CATS: clustering after transformation and smoothing.

Publication: Journal of the American Statistical Association
Publication Date: 01-SEP-05
Format: Online - approximately 6688 words
Delivery: Immediate Online Access

Article Excerpt
1. INTRODUCTION

CATS--clustering after transformation and smoothing--is a technique for nonparametrically estimating and clustering a large number of curves (or profiles). We first screen out curves that are nearly flat, then smooth the remaining curves, and then cluster the smoothed curves. A novel feature of our method is that we estimate the error due to the fact that we are clustering the estimated curves rather than the true curves.

CATS is quite general, but for clarity, we discuss the method in the context of microarray experiments. This problem is challenging because of the large number of expression profiles. Our motivating example is a genetic microarray experiment conducted at the University of Pittsburgh. This experiment produced time series of gene expression levels for 5,355 genes over 15 time points.

There is now a substantial literature on genetic microarrays on various topics such as clustering (Eisen, Spellman, Brown, and Botstein 1998; Hastie et al. 2000; Bar-Joseph, Gerber, Gifford, and Jaakkola 2002; Wakefield, Zhou, and Self 2002) and multiple testing (Dudoit, Yang, Callow, and Speed 2000; Efron, Tibshirani, Storey, and Tusher 2001; Fraley and Raftery 2002; Newton, Kendziorski, Richmond, Blattner, and Tsui 2001). Related work on curve clustering in the context of microarray data has been done by Bar-Joseph et al. (2002) and Wakefield et al. (2002).

2. THE MODEL

We consider data of the form

[Y.sub.ij] = [f.sub.i]([t.sub.ij]) + [[sigma].sub.i][[epsilon].sub.ij], i = 1,..., N, j = 1,..., m, (1)

where E([[epsilon].sub.ij]) = 0. Thus [Y.sub.ij] is the jth observation on the ith curve. In the examples of interest, N is much larger than m. In the microarray setting, [Y.sub.ij] is the log gene expression of gene i at time [t.sub.ij].

We assume that the curves [f.sub.i] belong to a class of smooth functions F as defined in the Appendix. Let [[phi].sub.1], [[phi].sub.2],... be an orthonormal basis for F and write

[f.sub.i](t) = [[infinity].summation over (j=1)] [[theta].sub.ij][[phi].sub.j](t), (2)

where

[[theta].sub.ij] = [integral] [f.sub.i](t)[[phi].sub.j](t) dt. (3)

Generally, we cannot estimate more parameters than data points, so rather than estimate [f.sub.i] we actually estimate its projection, [[summation].sub.j=1.sup.m][[theta].sub.ij][[phi].sub.j](t), onto the first m basis functions, which we continue to denote by [f.sub.i]. We estimate [f.sub.i] by

[^.f.sub.i.sup.J](t) = [J.summation over (j=1)][^.[theta].sub.ij] [[phi].sub.j](t), (4)

where the estimates, [^.[theta].sub.ij], and the choice of smoothing parameter, 1 [less than or equal to] J [less than or equal to] m, are as described later. We call a curve [f.sub.i] null or inactive if [f.sub.i] is constant as a function of t. Otherwise, [f.sub.i] is nonnull or active. Let A denote the set of active curves.

Let [[theta].sub.i] = ([[theta].sub.i1],..., [[theta].sub.im]) be the vector of coefficients for curve [f.sub.i]. We view the ([[theta].sub.i], [[sigma].sub.i])'s as a random draw from some distribution P. We assume that P has compact support. In a slight abuse of notation, we also use P to denote the marginal law of the [[theta].sub.i]'s.

3. CLUSTERS OF CURVES

Because our goal is to cluster the curves, we need a measure of the efficacy of a set of clusters. Let C = {[f.sub.1],..., [f.sub.N]} denote a finite set of curves. A clustering algorithm may be viewed as a map,

T:C X C [right arrow] {0, 1},

where

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

The cluster map T induces a partition {[C.sub.1],..., [C.sub.k]} of C, where two curves, f and g, are in the same partition element if and only if T(f, g) = 1. The numbering of the partition elements is arbitrary. Generally, one uses an algorithm that can produce k clusters for any given k. Thus let us write [T.sub.k] for the cluster map that yields k clusters. For example, [T.sub.k] might be the output of the k-means clustering algorithm.

We address two questions for the efficacy of the clusters: (1) How good are the estimated clusters? and (2) how close is the clustering using the estimated curves [^.C] = {[^.f.sub.1],..., [^.f.sub.N]} to the clustering using the true curves C = {[f.sub.1],..., [f.sub.N]}? The first question concerns cluster quality; the second concerns estimation error. We define two parameters, [OMEGA] and [eta], associated with these questions.

3.1 Cluster Quality

Many such criteria have been proposed to measure cluster quality. We use the following. Suppose that [C.sub.1],..., [C.sub.k] are clusters; that is, they form a partition of C. Define the cluster quality,

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],

where

[rho](f, g) = [[integral](f(x) - [bar.f])(g(x) - [bar.g]) dx]/[square root of ([integral](f(x) - [bar.f])[.sup.2] dx [integral](g(x) - [bar.g])[.sup.2] dx)]

and [bar.f] = [integral] f(x) dx.

Thus [rho](f, g) is the Pearson correlation between the curves f and g, and [OMEGA] measures the worst pairwise correlation over all of the clusters. Note that -1 [less than or equal to] [OMEGA] [less than or equal to] 1 and [OMEGA] = 1 if and only if all of the curves in each cluster are proportional to each other. We write [OMEGA](k) if we want to emphasize the dependence on the number of clusters k.

In the Fourier domain, we can rewrite [OMEGA] as follows. Let f = [[summation].sub.j][a.sub.j][[phi].sub.j] and g = [[summation].sub.j][b.sub.j][[phi].sub.j]. From a = ([a.sub.1], [a.sub.2],...), define a new vector [~.a] = ([~.a.sub.2], [~.a.sub.3],...) obtained by discarding [a.sub.1] and normalizing

[~.a.sub.j] = [a.sub.j]/[square root of ([[summation].sub.j=2.sup.m][a.sub.j.sup.2])], j [greater than or equal to] 2. (5)

Define [~.b] = ([~.b.sub.1], [~.b.sub.2],...) similarly. Then

[rho](f, g) = 1 - [[||[~.a] - [~.b]||[.sup.2]]/2]. (6)

Hence correlation clustering in function space is equivalent to Euclidean clustering in the Fourier domain, after the transformation a [right arrow] [~.a].

Generally, [OMEGA](k) will increase as k increases. We examine [OMEGA] as a function of k. If we want to choose one value of k, then we take the smallest k such that [OMEGA] [greater than or equal to] 1 - [epsilon] for some user-specified [epsilon]. This gives the smallest number of clusters that guarantees that all curves within a cluster are (1 - [epsilon])-similar.

3.2 Estimation Error

Regarding estimation error, we proceed as follows. Let C = {[f.sub.1],..., [f.sub.n]} denote the true curves, and let [^.C] = {[^.f.sub.1],..., [^.f.sub.n]} denote the estimated...

View this article FREE - Now for a Limited Time, try Goliath Business News
Free for 3 Days!



More articles from Journal of the American Statistical Association
Ranked Set Sampling: Theory and Applications.(book by Zehua Chen, Zhid..., September 01, 2005
Association Schemes: Designed Experiments, Algebra and Combinatorics.(..., September 01, 2005
Diagnostic Checks in Time Series.(book by Wai Keung Li )(Book Review), September 01, 2005
Introduction to Rare Event Simulation.(book by James Antonio Bucklew)(..., September 01, 2005
Transdimensional Markov chains: a decade of progress and future perspe..., September 01, 2005

Looking for additional articles?
Search our database of over 3 million articles.

Looking for more in-depth information on this industry?
Search our complete database of Industry & Market reports by text, subject, publication name or publication date.

About Goliath
Whether you're looking for sales prospects, competitive information, company analysis or best practices in managing your organization, Goliath can help you meet your business needs.

Our extensive business information databases empower business professionals with both the breadth and depth of credible, authoritative information they need to support their business goals. Whether it be strategic planning, sales prospecting, company research or defining management best practices - Goliath is your leading source for accurate information.