
Functional data analysis for sparse longitudinal data.

Publication: Journal of the American Statistical Association
Publication Date: 01-JUN-05

Article Excerpt
1. INTRODUCTION

We develop a version of functional principal components (FPC) analysis, in which the FPC scores are framed as conditional expectations. We demonstrate that this extends the applicability of FPC analysis to situations in longitudinal data analysis, where only few repeated and sufficiently irregularly spaced measurements are available per subject, and refer to this approach as principal components analysis through conditional expectation (PACE) for longitudinal data.

When the observed data are in the form of random curves rather than scalars or vectors, dimension reduction is mandatory, and FPC analysis has become a common tool to achieve this, by reducing random trajectories to a set of FPC scores. However, this method encounters difficulties when applied to longitudinal data with only few repeated observations per subject.

Beyond dimension reduction, FPC analysis attempts to characterize the dominant modes of variation of a sample of random trajectories around an overall mean trend function. There exists an extensive literature on FPC analysis when individuals are measured at a dense grid of regularly spaced time points. The method was introduced by Rao (1958) for growth curves, and the basic principle has been studied by Besse and Ramsay (1986), Castro, Lawton, and Sylvestre (1986), and Berkey, Laird, Valadian, and Gardner (1991). Rice and Silverman (1991) discussed smoothing and smoothing parameter choice in this context, whereas Jones and Rice (1992) emphasized applications. Various theoretical properties have been studied by Silverman (1996), Boente and Fraiman (2000), and Kneip and Utikal (2001). (For an introduction and summary, see Ramsay and Silverman 1997.) Staniswalis and Lee (1998) proposed kernel-based functional principal components analysis for repeated measurements with an irregular grid of time points. The case of irregular grids was also studied by Besse, Cardot, and Ferraty (1997) and Boularan, Ferré, and Vieu (1993). However, when the time points vary widely across subjects and are sparse, down to one or two measurements, the FPC scores defined through the Karhunen-Loève expansion are not well approximated by the usual integration method.

Shi, Weiss, and Taylor (1996), Rice and Wu (2000), James, Hastie, and Sugar (2001), and James and Sugar (2003) proposed B-splines to model the individual curves with random coefficients through mixed effects models. James et al. (2001) and James and Sugar (2003) emphasized the case of sparse data, postulating a reduced-rank mixed-effects model through a B-spline basis for the underlying random trajectories. In contrast, we represent the trajectories directly through the Karhunen-Loève expansion, determining the eigenfunctions from the data. Perhaps owing to the complexity of their modeling approach, James et al. (2001) did not investigate the asymptotic properties of the estimated components in relation to the true components, such as the behavior of the estimated covariance structure, eigenvalues, and eigenfunctions, especially for the sparse situation. Instead, they constructed pointwise confidence intervals for the individual curves using the bootstrap. With our simpler and more direct approach, we are able to derive asymptotic properties, using tools from functional analysis. We can also derive both pointwise and simultaneous bands for predicted individual trajectories. This requires first obtaining uniform convergence results for nonparametric function and surface estimates under the dependence structure that follows from the longitudinal nature of the data. The dependence is a consequence of the assumed random nature of the observed sample of trajectories, which sets our work apart from previous results where the observed functions either are nonrandom with independent measurements (Kneip 1994), are random vectors of large but fixed dimensions (Ferré 1995), or are random trajectories sampled on dense and regular grids (Cardot, Ferraty, and Sarda 1999).

The contributions of this article are as follows. First, we provide a new technique, PACE, for longitudinal and functional data, a method designed to handle sparse and irregular longitudinal data for which the pooled time points are sufficiently dense. Second, the presence of additional measurement errors is taken into account, extending previous approaches of Staniswalis and Lee (1998) and Yao et al. (2003). Third, an emphasis is on the derivation of asymptotic consistency properties, by first establishing uniform convergence for smoothed estimates of the mean and covariance functions under mild assumptions. These uniform consistency results are developed for smoothers in the situation where repeated, and thus dependent, measurements are obtained for the same subject. Then we couple these results with the theory of eigenfunctions and eigenvalues of compact linear operators, to obtain uniform convergence of estimated eigenfunctions and eigenvalues. To our knowledge, there exist only few published asymptotic results for FPC (Dauxois, Pousse, and Romain 1982; Bosq 1991; Silverman 1996), and none for functional data analysis in the sparse situation. Fourth, we derive the asymptotic distribution needed to obtain pointwise confidence intervals for individual trajectories, and obtain asymptotic simultaneous bands for these trajectories.

The main novelty of our work is that we establish the conditional method for the case of sparse and irregular data, show that this provides a straightforward and simple tool for the modeling of longitudinal data, and derive asymptotic results for this method. Under Gaussian assumptions, the proposed estimation of individual FPC scores in PACE corresponds to the best prediction, combining the data from the individual subject to be predicted with data from the entire collection of subjects. In the non-Gaussian case, it provides an estimate for the best linear prediction. The proposed PACE method extends to the case of sparse and irregular data, provided that as the number of subjects increases, the pooled time points from the entire sample become dense in the domain of the data. We suggest one-curve-leave-out cross-validation for choosing auxiliary parameters, such as the degree of smoothing and the model dimension, corresponding to the number of eigenfunctions to be included, similar to the approach of Rice and Silverman (1991). For faster computing, we also consider the Akaike information criterion (AIC) to select the number of eigenfunctions.
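To make the idea of a score as a conditional expectation concrete, the following is a minimal sketch in the notation of model (1) in Section 2.1. It assumes joint Gaussianity of scores and errors and a truncation at K components, so that the conditional mean of the scores given a subject's observations follows from the standard formula for conditioning in a multivariate normal distribution. The excerpt does not display this formula, so the function name `conditional_scores`, its arguments, and all numeric values are hypothetical; in practice the mean, eigenfunctions, eigenvalues, and error variance would be replaced by estimates from the pooled data.

```python
import numpy as np

def conditional_scores(y, mu_i, phi_i, lam, sigma2):
    """Gaussian conditional expectation E[xi_i | Y_i] for one subject.

    y      : (N_i,) observations of the subject
    mu_i   : (N_i,) mean function evaluated at the subject's time points
    phi_i  : (N_i, K) first K eigenfunctions evaluated at those time points
    lam    : (K,) eigenvalues
    sigma2 : measurement error variance
    """
    # cov(Y_i) = Phi Lambda Phi^T + sigma^2 I under model (1), truncated at K terms
    sigma_y = (phi_i * lam) @ phi_i.T + sigma2 * np.eye(len(y))
    # E[xi_ik | Y_i] = lam_k * phi_ik^T cov(Y_i)^{-1} (Y_i - mu_i)
    return lam * (phi_i.T @ np.linalg.solve(sigma_y, y - mu_i))

# Hypothetical sparse subject with only three measurements on [0, 1].
t = np.array([0.12, 0.40, 0.83])
phi_i = np.column_stack([np.sqrt(2) * np.sin(2 * np.pi * t),
                         np.sqrt(2) * np.cos(2 * np.pi * t)])
lam = np.array([4.0, 1.0])
y = np.array([1.9, 0.3, -2.2])
scores = conditional_scores(y, np.zeros(3), phi_i, lam, sigma2=0.25)
```

The subject's own data enter through `y`, whereas `mu_i`, `phi_i`, `lam`, and `sigma2` carry the information borrowed from the entire sample, which is why the approach remains usable with very few measurements per subject.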

The remainder of the article is organized as follows. In Section 2 we introduce the PACE approach, that is, the proposed conditional estimates for the FPC scores. We present asymptotic results for the proposed method in Section 3, with proofs deferred to the Appendix. We discuss simulation results that illustrate the usefulness of the methodology in Section 4. Applications of PACE to longitudinal CD4 data and time-course gene expression data for yeast cell cycle genes are the theme of Section 5, followed by concluding remarks in Section 6 and proofs and theoretical results in the Appendix.

2. FUNCTIONAL PRINCIPAL COMPONENTS ANALYSIS FOR SPARSE DATA

2.1 Model With Measurement Errors

We model sparse functional data as noisy sampled points from a collection of trajectories that are assumed to be independent realizations of a smooth random function, with unknown mean function $EX(t) = \mu(t)$ and covariance function $\mathrm{cov}(X(s), X(t)) = G(s, t)$. The domain of $X(\cdot)$ typically is a bounded and closed time interval $T$. Although we refer to the index variable as time, it could also be a spatial variable, such as in image or geoscience applications. We assume that there is an orthogonal expansion (in the $L^2$ sense) of $G$ in terms of eigenfunctions $\phi_k$ and nonincreasing eigenvalues $\lambda_k$: $G(s, t) = \sum_k \lambda_k \phi_k(s) \phi_k(t)$, $t, s \in T$. In classical FPC analysis, it is assumed that the $i$th random curve can be expressed as $X_i(t) = \mu(t) + \sum_k \xi_{ik} \phi_k(t)$, $t \in T$, where the $\xi_{ik}$ are uncorrelated random variables with mean 0 and variance $E\xi_{ik}^2 = \lambda_k$, where $\sum_k \lambda_k < \infty$, $\lambda_1 \geq \lambda_2 \geq \cdots$.
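The orthogonal expansion of G can be illustrated numerically. The sketch below assumes a covariance surface evaluated on an equispaced midpoint grid over [0, 1] and approximates the integral operator by a Riemann sum; the test covariance, the grid size, and the function name `eigen_decompose` are hypothetical choices made so that the result can be checked against known eigenpairs.

```python
import numpy as np

# Hypothetical covariance on T = [0, 1], built from two known orthonormal
# eigenfunctions so that the decomposition can be verified.
n_grid = 200
grid = (np.arange(n_grid) + 0.5) / n_grid      # midpoint grid
h = 1.0 / n_grid                               # grid spacing
phi1 = np.sqrt(2.0) * np.sin(2 * np.pi * grid)
phi2 = np.sqrt(2.0) * np.cos(2 * np.pi * grid)
G = 4.0 * np.outer(phi1, phi1) + 1.0 * np.outer(phi2, phi2)

def eigen_decompose(G, h, n_components):
    """Approximate the leading eigenpairs of the covariance operator with kernel G."""
    # Riemann sum: integral of G(s, t) phi(s) ds is approximated by h * G @ phi
    evals, evecs = np.linalg.eigh(G * h)
    order = np.argsort(evals)[::-1][:n_components]   # nonincreasing eigenvalues
    lam = evals[order]
    phi = evecs[:, order] / np.sqrt(h)               # rescale to unit L2 norm
    return lam, phi

lam, phi = eigen_decompose(G, h, 2)
G_rebuilt = (phi * lam) @ phi.T   # sum_k lam_k phi_k(s) phi_k(t)
```

The eigenvectors returned by `np.linalg.eigh` have unit Euclidean norm, so dividing by the square root of the grid spacing converts them to functions with unit L2 norm on the grid. Their signs are arbitrary, which does not affect the reconstructed surface.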

We consider an extended version of the model that incorporates uncorrelated measurement errors with mean 0 and constant variance $\sigma^2$ to reflect additive measurement errors (see also Rice and Wu 2000). Let $Y_{ij}$ be the $j$th observation of the random function $X_i(\cdot)$, made at a random time $T_{ij}$, and let $\epsilon_{ij}$ be the additional measurement errors that are assumed to be iid and independent of the random coefficients $\xi_{ik}$, where $i = 1, \ldots, n$, $j = 1, \ldots, N_i$, $k = 1, 2, \ldots$. Then the model that we consider is

$$Y_{ij} = X_i(T_{ij}) + \epsilon_{ij} = \mu(T_{ij}) + \sum_{k=1}^{\infty} \xi_{ik} \phi_k(T_{ij}) + \epsilon_{ij}, \qquad T_{ij} \in T, \tag{1}$$

where $E\epsilon_{ij} = 0$, $\mathrm{var}(\epsilon_{ij}) = \sigma^2$, and the number of measurements $N_i$ made on the $i$th subject is considered random, reflecting sparse and irregular designs. The random variables $N_i$ are assumed to be iid and independent of all other random variables.
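A small simulation may help to fix ideas about what data from model (1) look like. The sketch below assumes a truncation at two components, Gaussian scores and errors, uniformly distributed measurement times on [0, 1], and a number of measurements drawn uniformly from 1 to `max_obs`. The mean function, the eigenfunctions, and the name `simulate_sparse` are hypothetical choices for illustration only.

```python
import numpy as np

def simulate_sparse(n, lam, sigma, max_obs, rng):
    """Draw n subjects (times, observations) from a two-component version of model (1)."""
    data = []
    for _ in range(n):
        n_i = rng.integers(1, max_obs + 1)           # random number of measurements N_i
        t = np.sort(rng.uniform(0.0, 1.0, n_i))      # random measurement times T_ij
        xi = rng.normal(0.0, np.sqrt(lam))           # scores with mean 0, variance lam_k
        phi = np.column_stack([np.sqrt(2) * np.sin(2 * np.pi * t),
                               np.sqrt(2) * np.cos(2 * np.pi * t)])
        mu = t + np.sin(t)                           # hypothetical mean function
        eps = rng.normal(0.0, sigma, n_i)            # measurement errors, variance sigma^2
        data.append((t, mu + phi @ xi + eps))
    return data

data = simulate_sparse(n=100, lam=np.array([4.0, 1.0]), sigma=0.5,
                       max_obs=5, rng=np.random.default_rng(0))
n_pooled = sum(len(t) for t, _ in data)   # pooled number of measurements
```

Each subject contributes at most five points, far too few to recover an individual trajectory by smoothing that subject alone, whereas the pooled time points cover the domain densely.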

2.2 Estimation of the Model Components

We assume that mean, covariance, and eigenfunctions are smooth. We use local linear smoothers (Fan and Gijbels 1996) for function and surface estimation, fitting local lines in one dimension and local planes in two dimensions by weighted least squares. In a first step, we estimate the mean function $\mu$ based on the pooled data from all individuals. The formula for this local linear smoother is in (A.1) in the Appendix. Data-adaptive methods for bandwidth choice are available (see Müller and Prewitt 1993 for surface smoothing and Rice and Silverman 1991 for one-curve-leave-out cross-validation); subjective choices are often adequate. (For issues of smoothing dependent data, see Lin and Carroll 2000.) Adapting to estimated correlations when estimating the mean function did not lead to improvements (simulations not reported); therefore, we do not incorporate such adjustments.
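The mean estimation step can be sketched as a generic local linear smoother applied to the pooled observations. This is not a transcription of (A.1), which is not shown in this excerpt; the Epanechnikov kernel, the fixed bandwidth, the simulated pooled data, and the name `local_linear` are assumptions made for illustration.

```python
import numpy as np

def local_linear(t_obs, y_obs, t_eval, bandwidth):
    """Local linear fit by weighted least squares, evaluated at each point of t_eval."""
    t_obs = np.asarray(t_obs, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    fitted = np.empty(len(t_eval))
    for m, t0 in enumerate(t_eval):
        u = (t_obs - t0) / bandwidth
        # Epanechnikov kernel weights (an assumed kernel choice)
        w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
        X = np.column_stack([np.ones_like(t_obs), t_obs - t0])   # local line
        XtW = X.T * w
        beta, *_ = np.linalg.lstsq(XtW @ X, XtW @ y_obs, rcond=None)
        fitted[m] = beta[0]          # the local intercept is the estimate at t0
    return fitted

# Hypothetical pooled data: all subjects' (time, observation) pairs stacked together.
rng = np.random.default_rng(0)
t_pooled = rng.uniform(0.0, 1.0, 300)
y_pooled = t_pooled + np.sin(t_pooled) + rng.normal(0.0, 0.5, 300)
mu_hat = local_linear(t_pooled, y_pooled, np.linspace(0.0, 1.0, 21), bandwidth=0.2)
```

The smoother ignores which subject each point came from, in line with the remark that adjusting for estimated correlations did not improve the mean estimate.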

Note that in model (1), $\mathrm{cov}(Y_{ij}, Y_{il} \mid T_{ij}, T_{il}) = \mathrm{cov}(X(T_{ij}), X(T_{il})) + \sigma^2 \delta_{jl}$, where $\delta_{jl}$ is 1 if $j = l$ and 0 otherwise. Let $G_i(T_{ij}, T_{il}) = (Y_{ij} - \hat{\mu}(T_{ij}))(Y_{il} - \hat{\mu}(T_{il}))$ be the "raw" covariances, where $\hat{\mu}(t)$ is the estimated mean function obtained from the previous step. It is easy to see that $E[G_i(T_{ij}, T_{il}) \mid T_{ij}, T_{il}] \approx \mathrm{cov}(X(T_{ij}), X(T_{il})) + \sigma^2 \delta_{jl}$. Because the diagonal terms ($j = l$) are inflated by $\sigma^2$, the diagonal of the raw covariances should be removed; that is, only $G_i(T_{ij}, T_{il})$, $j \neq l$, should be included as input data for the covariance surface smoothing step (as previously observed in Staniswalis and Lee 1998). We use one-curve-leave-out cross-validation to choose the smoothing parameter for this surface smoothing step.
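The construction of the input data for the surface smoothing step can be sketched as follows for a single subject. The function name `raw_covariances` and the toy values are hypothetical, and the mean estimate is passed in as any callable.

```python
import numpy as np

def raw_covariances(t, y, mu_hat):
    """Off-diagonal raw covariances of one subject as (s, t, value) triples."""
    t = np.asarray(t, dtype=float)
    resid = np.asarray(y, dtype=float) - mu_hat(t)   # Y_ij - mu_hat(T_ij)
    prod = np.outer(resid, resid)                    # all products over (j, l)
    off_diag = ~np.eye(len(t), dtype=bool)           # drop j == l (inflated by sigma^2)
    s_grid, t_grid = np.meshgrid(t, t, indexing="ij")
    return s_grid[off_diag], t_grid[off_diag], prod[off_diag]

# Toy subject with three measurements and a zero mean estimate.
zero_mean = lambda t: np.zeros_like(t)
s_pts, t_pts, g_vals = raw_covariances([0.1, 0.5, 0.9], [1.0, 2.0, 4.0], zero_mean)
```

A subject with a single measurement yields no off-diagonal pairs and therefore contributes nothing to the covariance surface, although it still contributes to the mean estimate.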

The variance $\sigma^2$ of the measurement errors is of interest in model (1). Let $\hat{G}(s, t)$ be a smooth surface estimate [see (A.2) in the Appendix] of $G(s, t) = \mathrm{cov}(X(s), X(t))$. Following Yao et al. (2003), because the covariance of $X(t)$ is maximal along the diagonal, a local quadratic rather than a local linear fit is expected to better approximate the shape of the surface in the direction orthogonal to the diagonal. We thus fit a local quadratic component along the direction perpendicular to the diagonal and a local linear component in the direction of the diagonal; implementation of this local smoother is achieved by rotating the coordinates by 45 degrees and then minimizing weighted least squares [similar to (A.2)] in rotated coordinates with local quadratic and linear components; see (A.3) in the Appendix.
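The coordinate rotation can be sketched in a few lines. The exact parametrization in (A.3) is not shown in this excerpt, so the design matrix below (intercept, linear term along the diagonal, quadratic term across it) is an assumption about one natural way to set up such a fit, and the names `rotate_45` and `design_matrix` are hypothetical.

```python
import numpy as np

def rotate_45(s, t):
    """Rotate (s, t) by 45 degrees: u runs along the diagonal, v across it."""
    u = (s + t) / np.sqrt(2.0)
    v = (t - s) / np.sqrt(2.0)
    return u, v

def design_matrix(s, t, s0, t0):
    """Local design at (s0, t0): linear in u, quadratic in v (assumed form)."""
    u, v = rotate_45(np.asarray(s, dtype=float), np.asarray(t, dtype=float))
    u0, v0 = rotate_45(s0, t0)
    return np.column_stack([np.ones_like(u), u - u0, (v - v0) ** 2])

# Points on the diagonal s = t have zero perpendicular coordinate v.
u_diag, v_diag = rotate_45(np.array([0.2, 0.7]), np.array([0.2, 0.7]))
X_local = design_matrix([0.1, 0.4, 0.6], [0.5, 0.3, 0.6], s0=0.4, t0=0.4)
```

Weighted least squares with a design of this kind would be solved at each diagonal point, with the fitted intercept giving the surface estimate there.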

Denote the diagonal of the resulting surface estimate by $\tilde{G}(t)$ and a local linear smoother focusing on diagonal values $\{G(t, t) + \sigma^2\}$ by $\hat{V}(t)$, obtained by...
