1. INTRODUCTION
Longitudinal data modeling is essential to describe both trend and variation for biological processes, such as growth curves, effects over time of medical intervention on physiological characteristics, and monitoring of human exposure to carcinogens. Methods and approaches have been described in texts by Jones (1993), Hand and Crowder (1996), Verbeke and Molenberghs (2000), and Diggle, Heagerty, Liang, and Zeger (2002).
A promising approach for functional data analysis is to treat longitudinal pathways as realizations of a smooth stochastic process (Ramsay and Silverman 1997). This concept proved useful for describing the effects of certain treatments on a response curve (Church 1966), and naturally progressed to the modeling of a collection of random curves (Rice and Silverman 1991) and semiparametric and nonparametric models for the effects of time-dependent covariates on longitudinal observations (Martinussen and Scheike 1999, 2000).
In this article we describe an approach for capturing the correlation structure between multivariate longitudinal responses, leading to the notion of dynamical correlation to describe the correlation among multivariate longitudinal curves. A classical approach to describing the correlation between subsets of elements of random vectors is canonical correlation (Hotelling 1936). Canonical correlation has been extended to multivariate time series under the assumption of stationarity (Brillinger 1975). An extension to functional data was proposed by Leurgans, Moyeed, and Silverman (1993), who pointed out the need for regularization. Moreover, functional canonical correlation requires restrictive assumptions to be well defined, because it corresponds to an inverse problem (He, Müller, and Wang 2003). For these reasons, simple and efficient alternative measures to describe the dependency of multivariate functional data are needed (Service, Rice, and Chavez 1998).
Dynamical factor analysis has been discussed in the psychological literature to investigate intra-individual variation and lagged relationships for multivariate longitudinal data (Molenaar 1985). However, these methods require restrictive designs and are applied to a single individual only; extensions to samples of subjects have not been established. Methods based on the notion of causality between the components of multivariate stochastic processes have been discussed by Boudjellaba, Dufour, and Roy (1992) and Sy, Taylor, and Cumberland (1997). Like the approaches of Molenaar (1985) and Brillinger (1975), the method of Boudjellaba et al. (1992) is used for multivariate time series and does not generalize to multiple-subject longitudinal data; Sy et al. (1997) relied on fairly restrictive assumptions on the nature of the correlations and subsequent determination of causality. A spline-based method for modeling bivariate longitudinal data, including investigation of the correlation between responses, has been presented by Wang, Guo, and Brown (2000).
It is the purpose of this article to define simple, efficient, nonparametric correlation measures for multivariate longitudinal data, which include derivatives and lags. These measures are first obtained at the subject level. Then consistent estimates for population dynamical correlations are easily obtained by averaging over the subjects in the sample.
An advantage of the proposed dynamical correlation over functional canonical correlation is its explicit representation given by (5), whereas functional canonical correlation is implicitly defined by (12) via the solution of a maximization problem and thus does not have a comparably simple interpretation as an average of individual correlations. Section 5.2 provides a more detailed comparison of the practical performance of dynamical and functional canonical correlation. An additional benefit of dynamical correlation is its stability. As we show, even if a presmoothing step is included, dynamical correlation is quite insensitive to the choice of bandwidth for this step, whereas the estimation of functional canonical correlation depends critically on regularization, because it corresponds to an ill-defined inverse problem (Leurgans et al. 1993). Also, as shown later, these regularized correlation estimates break down easily and then are not useful, whereas dynamical correlation generally leads to reliable results. Related time-averaged correlation measures between two regression functions were discussed by Heckman and Zamar (2000). Limitations of the proposed dynamical correlation, especially in the application to longitudinal studies, are that the number of repeated measurements per subject cannot be too small and that the times of measurements need to fill out the domain of the random trajectories for which the correlation measure is desired.
The data used to demonstrate the methods in this article come from a nephrologic study (Kaysen et al. 2000). In this study, 35 hemodialysis patients were followed for up to 230 days, with measurements of five acute-phase blood proteins taken longitudinally. Observed repeated measurements for two serum proteins, albumin and C-reactive protein, for a randomly selected subject, are shown in Figure 1. The graphs suggest a negative relationship over time for these two acute-phase proteins. Although such simple graphical representations are useful, it is important to have quantitative summary measures of correlation taking the entire variation over time into account.
The article is organized as follows. We discuss the underlying model and basic definition of dynamical correlation between two components of random multivariate longitudinal curves in Section 2. In Section 3 we describe how to estimate the dynamical correlation between two sets of longitudinal data obtained for a sample of independent subjects. In Section 4.1 we provide extensions of dynamical correlation to derivatives of curves and time-shifted curves. In Section 4.2 we discuss a two-stage bootstrap procedure for obtaining a nonparametric interval estimate for the correlation measure when smoothing of the original data is required. We report results of simulation studies in Section 5, including a sensitivity analysis of the correlation estimate under a range of bandwidth choices and a comparison of the performance of functional canonical correlation and dynamical correlation. We cover the application of the proposed methods to the serum protein data in Section 6. We give some concluding remarks in Section 7, and provide additional details and proofs in the Appendix.
[FIGURE 1 OMITTED]
2. DEFINING DYNAMICAL CORRELATION
In this section we define dynamical correlation and discuss some basic properties. The setting is as follows: for a randomly selected subject (experimental unit), we observe $p$ random functions or curves, $f_1, \ldots, f_p$, $p \ge 2$, where $f_k \in L^2(dw)$, the space of functions that are square integrable with respect to the measure $dw = w\,dt$, with $E\{\int f_k^2(t)\,w(t)\,dt\} < \infty$ for $1 \le k \le p$. Here $dt$ is Lebesgue measure and $w$ is a nonnegative weight function with $\int w(t)\,dt = 1$ and $\int w^2(t)\,dt < \infty$ (see Ash and Gardner 1975). Usually, $w$ will be chosen to have compact support.
A simple and convenient choice for the weight function $w$ is an indicator function, $w(t) = \frac{1}{b-a} I_{[a,b]}(t)$, for $a < b$. Other choices may relate to a varying degree of uncertainty with which the functions are observed over time, to a nonconstant variance function of the underlying stochastic process, or to inhomogeneities in the design, that is, the timing of the longitudinal measurements.
The notions of inner products and angles between functions that we use herein are extensions from the familiar multivariate concepts to Hilbert space (Conway 1985; Ramsay and Silverman 1997). Using the notation $\langle f, g \rangle = \int f(t)\,g(t)\,w(t)\,dt$ in $L^2(dw)$, our approach is based on the "functional inner products,"
$$M_k = \langle f_k, 1 \rangle, \qquad M_{k,l} = \langle f_k, f_l \rangle, \qquad 1 \le k, l \le p. \tag{1}$$
The basic assumptions ensure that the moments $E M_k$, $E M_k^2$, and $E M_{k,l}$ are all well defined.
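As a rough numerical illustration of the functional inner products in (1), the following sketch approximates the weighted inner product with the uniform weight on [a, b] by a composite trapezoid rule. The function names (`uniform_weight`, `inner_product`), the grid size, and the two example trajectories are assumptions made only for this illustration; they are not taken from the article.

```python
import math
from typing import Callable

Curve = Callable[[float], float]


def uniform_weight(a: float, b: float) -> Curve:
    """Return w(t) = 1/(b - a) on [a, b] and 0 elsewhere, so that w integrates to 1."""
    if not a < b:
        raise ValueError("need a < b")
    return lambda t: 1.0 / (b - a) if a <= t <= b else 0.0


def inner_product(f: Curve, g: Curve, w: Curve,
                  a: float, b: float, n: int = 2000) -> float:
    """Approximate <f, g> = integral of f(t) g(t) w(t) dt over [a, b]
    with a composite trapezoid rule on n equal subintervals."""
    h = (b - a) / n
    total = 0.0
    for i in range(n + 1):
        t = min(a + i * h, b)               # guard against rounding past b
        coef = 0.5 if i in (0, n) else 1.0  # trapezoid end-point weights
        total += coef * f(t) * g(t) * w(t)
    return total * h


def one(t: float) -> float:
    """The constant function 1 that appears in M_k = <f_k, 1>."""
    return 1.0


# Two made-up trajectories for a single subject on [0, 1] (illustration only).
def f1(t: float) -> float:
    return math.sin(2.0 * math.pi * t)


def f2(t: float) -> float:
    return 0.5 - math.sin(2.0 * math.pi * t)


a, b = 0.0, 1.0
w = uniform_weight(a, b)

M_1 = inner_product(f1, one, w, a, b)   # weighted time average of f1
M_2 = inner_product(f2, one, w, a, b)   # weighted time average of f2
M_12 = inner_product(f1, f2, w, a, b)   # functional inner product <f1, f2>
```

For these example curves, `M_1` is approximately 0, `M_2` is approximately 0.5, and `M_12` is approximately -0.5. With observed data, the curves would typically be available only on a discrete and possibly irregular grid, so the integral would be approximated from the measurement times rather than from closed-form functions.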
Any given $L^2$ function can be represented uniquely in $L^2$ through the following random-effects model, without any restriction, and it is convenient to define dynamical correlation within the framework of this representation. In particular, the $k$th functional component can always be represented as
$$f_k(t) = \mu_k(t) + \mu_{0,k} + \sum_{r=0}^{\infty} \varepsilon_{r,k}\,\eta_r(t) = \mu_k(t) + (\mu_{0,k} + \varepsilon_{0,k}) + \sum_{r=1}^{\infty} \varepsilon_{r,k}\,\eta_r(t), \qquad 1 \le k \le p. \tag{2}$$
Here $\mu_k$ is a fixed mean function with $\mu_k \in L^2(dw)$ and $\langle \mu_k, 1 \rangle = 0$, and $(\mu_{0,k} + \varepsilon_{0,k})$ is an intercept term, the "static random part" of the model, which includes a constant term $\mu_{0,k}$ and a random variable $\varepsilon_{0,k}$, neither of which depends on time, with $E(\varepsilon_{0,k}) =$ and $\operatorname{var}(\varepsilon_{0,k}) =$...