Estimation in partially linear models with missing covariates.

Publication: Journal of the American Statistical Association
Publication Date: 01-JUN-04

Article Excerpt
1. INTRODUCTION

Perhaps the most common model used in analyzing observational studies of the causal effect of a possibly multivariate treatment or exposure [X.sup.T] = ([X.sub.1],..., [X.sub.p]) on a continuous response Y when data are available on one or more continuous pretreatment confounding variables Z is the partial linear model

Y = [X.sup.T][beta] + v(Z) + [epsilon], (1)

where [beta] is an unknown parameter, v(*) is a smooth unknown function of Z, E([epsilon]|X, Z) = 0, and the joint distribution of the regressors (X, Z) is left completely unspecified. Robins, Mark, and Newey (1992) proved that this model arises whenever we assume (a) no unmeasured confounders (i.e., ignorability of treatment X within levels of Z) and (b) a constant additive effect of treatment X on the mean of Y. In particular, given assumption (a), this model is guaranteed to be correctly specified under the causal null hypothesis of no effect of treatment X on Y, because the causal null hypothesis implies (1) with [beta] = 0. Thus, under (a), an asymptotically correct 1 - [alpha] confidence interval for [beta] in model (1) provides an asymptotically distribution-free [alpha]-level test of the causal null hypothesis of no exposure effect. Tests of [beta] = 0 based on lower-dimensional models that impose parametric functional forms on v(Z) and/or the density of X|Z do not provide asymptotically distribution-free tests of the causal null hypothesis under (a). Even when (a) cannot be assumed to hold, model (1) remains useful and robust, because a large-sample test of [beta] = 0 under model (1) remains an asymptotically distribution-free test of the important associational hypothesis that (a) Y is mean independent of X given Z and that (b) Y is conditionally independent of X given Z.

For these reasons, estimation of [beta] in model (1) has been the subject of considerable study (see Härdle, Liang, and Gao 2000 for a summary). Our contribution in this article is to study model (1) when data on X are not fully observed for some study subjects, whether by design (as in two-stage studies) or by happenstance. The problem of missing exposure variables in regression has been treated in great detail by Robins, Rotnitzky, and Zhao (1994); however, these authors assumed a parametric functional form for v(Z). For the aforementioned reasons, it is clearly important to relax, as we do in this article, the assumption that the functional form of v(Z) is known. As was done by Robins et al. (1994), we allow the missingness probabilities to depend on both Y and Z, but not on the unobserved value of X. Our results include both the case where the missingness probabilities are known (as in a designed two-stage study) and the case where they are unknown. Our results build on the work of Wang, Wang, Gutierrez, and Carroll (1998), who considered the nonparametric problem (no X) with missing data (see also Cheng 1990, 1994; Cheng and Chu 1996).

The article is organized as follows. In Section 2 we define the missing-data mechanism for the problem and define our methods of estimation. In Section 3 we describe our asymptotic results. Not only do we derive the asymptotic distribution of our estimators of [beta], but we also describe three extensions. First, we compare our methods with methods that use only the complete data with appropriate Horvitz-Thompson (HT) weighting, and show that our methods are asymptotically more efficient. Second, along with deriving analytic standard error estimates, we also justify the use of the nonparametric bootstrap in this context. Finally, we show that our methods can be extended to longitudinal and clustered data when working independence is used as the method of estimation, thus extending the work on nonparametric regression for correlated data using working independence (Zeger and Diggle 1994; Hoover, Rice, Wu, and Yang 1998; Fan and Zhang 2000; Lin and Ying 2001) to the missing-data context.

In Section 4 we study asymptotic efficiency for estimation of [beta] in model (1). Here we derive the semiparametric efficient score function and the semiparametric information bound. The semiparametric efficient score function is a solution to a complex integral equation, but in a special case we are able to derive the score function explicitly and compare the result with our methods. In Section 5 we report the results of a small simulation study, and in Section 6 we present the results of the analysis of an AIDS study. We provide concluding remarks in Section 7, and give proofs in the Appendix. Our asymptotic work uses the general asymptotic theory for semiparametric models developed by Newey (1994) and Robins et al. (1994).

2. THE MODEL AND ESTIMATORS

Let [delta] = 1 if X is observed and [delta] = 0 otherwise. Assume that the X's are missing at random (MAR) in the sense that

[pi]([Y.sub.i], [Z.sub.i]) = P([[delta].sub.i] = 1|[X.sub.i], [Z.sub.i], [Y.sub.i]) = P([[delta].sub.i] = 1|[Z.sub.i], [Y.sub.i]). (2)

In this article, we first assume that the missing-data probability, [pi](Y, Z), is known. Later we show that its estimation with an error of order [o.sub.p]([n.sup.-1/4]) can be undertaken without affecting the asymptotic properties of our proposed estimate of [beta]. Moreover, we first assume that ([Y.sub.i], [X.sub.i], [Z.sub.i], [[delta].sub.i]), i = 1,..., n, are independent and identically distributed (iid). We then extend the results to the longitudinal/clustered data setting in Section 3.4.
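The MAR condition (2) with a known selection probability can be illustrated by a small simulation. The sketch below is only an illustration under assumed choices that are not in the article: a scalar X, the particular functions used for v(Z) and [pi](Y, Z), and the hypothetical helper name `simulate_mar`. It compares the complete-case mean of X with the HT-weighted mean, which reweights each observed X by 1/[pi](Y, Z).

```python
import numpy as np

def simulate_mar(n, beta=1.0, seed=0):
    """Draw (Y, X, X_obs, Z, delta, pi) from model (1) with X missing at random."""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(0.0, 1.0, n)
    X = Z + rng.normal(0.0, 1.0, n)                   # scalar exposure, E(X) = 0.5
    Y = beta * X + np.sin(2 * np.pi * Z) + rng.normal(0.0, 0.5, n)
    # Assumed known selection probability: a function of (Y, Z) only, as in (2).
    pi = 0.3 + 0.6 / (1.0 + np.exp(-(Y - Z)))
    delta = (rng.uniform(size=n) < pi).astype(float)  # 1 if X is observed
    X_obs = np.where(delta == 1, X, np.nan)           # unobserved X is hidden
    return Y, X, X_obs, Z, delta, pi

Y, X, X_obs, Z, delta, pi = simulate_mar(200_000)
cc_mean = np.nanmean(X_obs)                           # complete-case mean of X
# Horvitz-Thompson mean: observed X weighted by 1/pi, missing X contributes 0.
ht_mean = np.mean(delta / pi * np.where(delta == 1, X_obs, 0.0))
```

Because selection here favors large Y, and Y increases with X, the complete-case mean should overstate E(X) = 0.5, whereas the HT-weighted mean should be close to it.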

For general parametric models E(Y|X) = g(X, [theta]), Robins et al. (1994) proposed the estimating equation

[PSI] (*, [theta]) = [[delta]/[pi]] p(X){Y - g(X, [theta])} - [[[delta]-[pi]]/[pi]][phi](Y, [theta]) (3)

for some user-supplied function p(*), where [phi](y, [theta]) is a general function. When there is no Z, the optimal choice of [phi](*) is [phi](Y) = E{p(X)(Y - [X.sup.T][beta])|Y}. If [beta] = 0 (no X) in (1), then the topic becomes a nonparametric problem, for which Wang et al. (1998) developed HT weighted local linear kernel methods. Our methods can be looked on as combining the parametric procedures of Robins et al. with the nonparametric procedure of Wang et al.

Here is the intuition behind our method. If there were no missing data, then let [^.v](Z, [beta]) and [^.m](Z) be nonparametric regressions of Y - [X.sup.T][beta] and X on Z. Then under normality and homoscedasticity, the semiparametric optimal score function for [beta] is {X - [^.m](Z)}{Y - [X.sup.T][beta] - [^.v](Z, [beta])}. Effectively, what we do is apply (3) to this score function to compensate for the missing data, with p(X) being X - [^.m](Z) and g(X, [theta]) being [X.sup.T][beta] + [^.v](Z, [beta]), the result being (5).
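For the complete-data case, setting the sum of the score function above to zero gives a closed-form estimate of [beta]. The following is a minimal sketch, assuming a scalar X and using a Nadaraya-Watson smoother as a stand-in for the unspecified nonparametric regressions. The function names, bandwidth, and simulated design are hypothetical choices made for illustration.

```python
import numpy as np

def nw_smooth(Z, q, h):
    """Nadaraya-Watson (Gaussian kernel) estimate of E(q | Z) at each observed Z."""
    d = (Z[None, :] - Z[:, None]) / h
    k = np.exp(-0.5 * d**2)
    return (k @ q) / k.sum(axis=1)

def complete_data_beta(Y, X, Z, h):
    """Solve sum_i {X_i - m(Z_i)}{Y_i - X_i*beta - v(Z_i, beta)} = 0 for scalar beta."""
    m_hat = nw_smooth(Z, X, h)        # estimate of E(X | Z)
    g_hat = nw_smooth(Z, Y, h)        # estimate of E(Y | Z)
    # The smoother is linear in its response, so v_hat(Z, beta) = g_hat - m_hat*beta
    # and the score is linear in beta.
    x_res, y_res = X - m_hat, Y - g_hat
    return np.sum(x_res * y_res) / np.sum(x_res**2)

rng = np.random.default_rng(1)
n, beta_true = 1000, 2.0
Z = rng.uniform(0.0, 1.0, n)
X = np.cos(2 * np.pi * Z) + rng.normal(0.0, 1.0, n)
Y = beta_true * X + np.sin(2 * np.pi * Z) + rng.normal(0.0, 0.5, n)
beta_hat = complete_data_beta(Y, X, Z, h=0.05)
```

The estimate is a regression of the residual of Y on the residual of X after both have been smoothed on Z.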

To understand our methods, we first define the HT weighted local linear kernel method. Let K(*) be a symmetric density function and let h be a suitable bandwidth. Then for any response q(Y, X), the local estimate at [z.sub.0] is the intercept in the regression of q(Y, X) on (Z - [z.sub.0])/h with weights [K.sub.h](Z - [z.sub.0])[delta]/[pi](Y, Z), where [K.sub.h](v) = [h.sup.-1]K(v/h), that is, the solution [[alpha].sub.0] in the equation

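0 = [n.summation over (i=1)][[[delta].sub.i]/[[pi]([Y.sub.i], [Z.sub.i])]][K.sub.h]([Z.sub.i] - [z.sub.0]){q([Y.sub.i], [X.sub.i]) - [[alpha].sub.0] - [[alpha].sub.1]([Z.sub.i] - [z.sub.0])/h}[{1, ([Z.sub.i] - [z.sub.0])/h}.sup.T]. (4)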
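The weighted least-squares problem described before (4) can be sketched in a few lines. This is an illustrative implementation only. The Gaussian kernel, the function name `ht_local_linear`, and the simulated data are assumptions rather than choices stated in the article. The intercept [[alpha].sub.0] is obtained from the closed-form solution of the 2 x 2 weighted normal equations.

```python
import numpy as np

def ht_local_linear(z0, Z, q, delta, pi, h):
    """Intercept alpha_0 of the HT-weighted local linear fit of q on (Z - z0)/h.

    Entries of q with delta == 0 receive zero weight and are never used,
    so they may be NaN placeholders.
    """
    z0 = np.atleast_1d(np.asarray(z0, dtype=float))
    q = np.where(delta == 1, q, 0.0)                  # guard against NaN placeholders
    d = (Z[None, :] - z0[:, None]) / h                # (Z_i - z0)/h for each z0
    # Weights K_h(Z_i - z0) * delta_i / pi_i with a Gaussian density K (assumed).
    k = (delta / pi)[None, :] * np.exp(-0.5 * d**2) / (h * np.sqrt(2 * np.pi))
    s0, s1, s2 = k.sum(1), (k * d).sum(1), (k * d**2).sum(1)
    t0, t1 = (k * q).sum(1), (k * d * q).sum(1)
    # Solve [[s0, s1], [s1, s2]] (alpha_0, alpha_1)^T = (t0, t1)^T for alpha_0.
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1**2)

rng = np.random.default_rng(2)
n = 4000
Z = rng.uniform(0.0, 1.0, n)
X = Z + rng.normal(0.0, 0.5, n)                       # so m(z) = E(X | Z = z) = z
Y = X + np.sin(2 * np.pi * Z) + rng.normal(0.0, 0.5, n)
pi = 0.3 + 0.6 / (1.0 + np.exp(-(Y - Z)))             # assumed known pi(Y, Z)
delta = (rng.uniform(size=n) < pi).astype(float)
X_obs = np.where(delta == 1, X, np.nan)
grid = np.array([0.2, 0.5, 0.8])
m_hat = ht_local_linear(grid, Z, X_obs, delta, pi, h=0.1)   # estimates m(grid)
lin = ht_local_linear(grid, Z, 1.0 + 2.0 * Z, delta, pi, h=0.1)
```

A local linear fit reproduces exactly linear responses whatever the weights, which is why `lin` should equal 1 + 2 * grid up to rounding. This gives a quick check of the normal-equation algebra.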

Motivated by (4), we define our estimator as follows. The procedure comprises four steps and is simple and noniterative.

Step 1. For any [beta], let [^.v](*, [beta], [pi]) be a weighted nonparametric regression of Y - [X.sup.T][beta] on Z using the HT local linear kernel method, that is, (4) with q(Y, X) = Y - [X.sup.T][beta].

Step 2. Form the corresponding HT local linear kernel regression function [^.m](z, [pi]) for estimating m(z) = E(X|z) by regressing X on Z, that is, (4) with q(Y, X) = X.

Step 3. Let [^.[phi]](Y, Z, [beta], [^.v], [^.m]) be a function of Y and Z that is linear in [beta], specifically, an estimate of E[{X - E(X|Z)}{Y - v(Z, [beta]) - [X.sup.T][beta]}|Y, Z] [denote [phi](Y, Z)]; see the statement following Step 4.

Step 4. Solve for [beta] in the equation

0 = [n.summation over (i=1)]{[X.sub.i] - [^.m]([Z.sub.i], [pi])} × {[Y.sub.i] - [^.v]([Z.sub.i], [beta], [pi]) - [X.sub.i.sup.T][beta]}[[[delta].sub.i]/[[pi]([Y.sub.i], [Z.sub.i])]] - [n.summation over (i=1)][^.[phi]]([Y.sub.i], [Z.sub.i], [beta], [^.v], [^.m])[[[[delta].sub.i] - [pi]([Y.sub.i], [Z.sub.i])]/[[pi]([Y.sub.i], [Z.sub.i])]]. (5)

Because (5) is linear in [beta], it can be solved without iteration. We call the solution [^.[beta].sub.all].
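One way to see the linearity is sketched below. It assumes that the HT local linear smoother is linear in its response q, so that the same kernel weights apply to Y and to X, and it takes the estimate of [phi] in the explicit form given in the remark on Step 3. The shorthand symbols \hat g, \hat m_i, \tilde X_i, \tilde Y_i, a_i, B_i, w_i, and c_i are introduced here for illustration and do not appear in the original.

```latex
% Assumed shorthand (not in the original): \hat g(z) is the HT local linear fit
% of Y on Z, and \hat m_i = \hat m(Z_i, \pi).
\begin{align*}
\hat v(z,\beta,\pi) &= \hat g(z) - \hat m(z,\pi)^{T}\beta, \\
\tilde X_i &= X_i - \hat m_i, \qquad \tilde Y_i = Y_i - \hat g(Z_i), \\
Y_i - \hat v(Z_i,\beta,\pi) - X_i^{T}\beta &= \tilde Y_i - \tilde X_i^{T}\beta, \\
\hat\phi(Y_i,Z_i,\beta,\hat v,\hat m) &= a_i - B_i\beta, \\
a_i &= \{\hat E(X\mid Y_i,Z_i) - \hat m_i\}\{Y_i - \hat g(Z_i)\}, \\
B_i &= \hat E(XX^{T}\mid Y_i,Z_i) - \hat E(X\mid Y_i,Z_i)\,\hat m_i^{T}
       - \hat m_i\,\hat E(X\mid Y_i,Z_i)^{T} + \hat m_i\hat m_i^{T}.
\end{align*}
% With w_i = \delta_i/\pi(Y_i,Z_i) and c_i = \{\delta_i-\pi(Y_i,Z_i)\}/\pi(Y_i,Z_i),
% equation (5) is a linear system in \beta with solution
\begin{equation*}
\hat\beta_{\mathrm{all}} =
\Bigl\{\sum_{i=1}^{n} w_i\tilde X_i\tilde X_i^{T} - \sum_{i=1}^{n} c_i B_i\Bigr\}^{-1}
\Bigl\{\sum_{i=1}^{n} w_i\tilde X_i\tilde Y_i - \sum_{i=1}^{n} c_i a_i\Bigr\}.
\end{equation*}
```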

Step 3 is the only point requiring comment, because it anticipates nonparametric regression with the two "covariates" Y and Z. In particular, we let [^.[phi]](y, z, [beta], [^.v], [^.m]) = [^.E](X|y, z)y - [^.E](X|y, z)[^.v](z, [beta], [pi]) - {[^.E](X[X.sup.T]|y, z)}[beta] - [^.m](z, [pi])y + [^.m](z, [pi])[^.v](z, [beta], [pi]) + [^.m](z, [pi]){[^.E](X|y, z)}[.sup.T][beta], where [^.E](X|y, z) and [^.E](X[X.sup.T]|y, z) are the HT bivariate local linear estimators of E(X|y, z) and E(X[X.sup.T]|y, z), similar to the definition of [^.m](z, [pi]). For example, the HT bivariate local linear estimator of the jth element of E(X|y, z) is the solution of [[alpha].sub.0] to

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],

where [K*.sub.[[lambda].sub.1],[[lambda].sub.2]](*, *) is a two-dimensional density function with bandwidths [[lambda].sub.1] and [[lambda].sub.2] and [X.sub.ij] is the jth element of [X.sub.i]. Similarly, one may define the estimator of each element of...
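Steps 1 to 4 can be put together in a short sketch for scalar X and known [pi]. Several choices here are assumptions for illustration and not the authors' implementation: Gaussian kernels, a product kernel for the bivariate fits, the bandwidth values, the simulated design, and the helper names `ll1`, `ll2`, and `beta_all`. The code relies on the smoother being linear in its response, so that the Step 1 fit equals g - m * beta and the Step 3 function takes the form a - B * beta.

```python
import numpy as np

def ll1(Z, Q, w, h):
    """HT-weighted univariate local linear fits of the columns of Q on Z, at each Z_i."""
    d = (Z[None, :] - Z[:, None]) / h
    k = w[None, :] * np.exp(-0.5 * d**2)
    s0, s1, s2 = k.sum(1), (k * d).sum(1), (k * d**2).sum(1)
    t0, t1 = k @ Q, (k * d) @ Q
    return (s2[:, None] * t0 - s1[:, None] * t1) / (s0 * s2 - s1**2)[:, None]

def ll2(Y, Z, Q, w, l1, l2):
    """HT-weighted bivariate local linear fits of the columns of Q on (Y, Z)."""
    out = np.empty_like(Q)
    for i in range(len(Y)):
        dy, dz = (Y - Y[i]) / l1, (Z - Z[i]) / l2
        k = w * np.exp(-0.5 * (dy**2 + dz**2))        # product Gaussian kernel
        D = np.column_stack([np.ones_like(dy), dy, dz])
        A, b = D.T @ (k[:, None] * D), D.T @ (k[:, None] * Q)
        out[i] = np.linalg.lstsq(A, b, rcond=None)[0][0]   # intercept row
    return out

def beta_all(Y, X, Z, delta, pi, h, l1, l2):
    """Closed-form solution of estimating equation (5) for scalar X and known pi."""
    w = delta / pi                                    # HT weights
    c = (delta - pi) / pi                             # augmentation weights
    Xo = np.where(delta == 1, X, 0.0)                 # missing X never enters
    S = ll1(Z, np.column_stack([Y, Xo]), w, h)
    g, m = S[:, 0], S[:, 1]                           # HT fits of Y and X on Z
    # Step 1: v_hat(Z, beta) = g - m * beta, by linearity of the smoother.
    E = ll2(Y, Z, np.column_stack([Xo, Xo**2]), w, l1, l2)
    e1, e2 = E[:, 0], E[:, 1]                         # fits of E(X|y,z), E(X^2|y,z)
    # Step 3: phi_hat = a - B * beta.
    a = (e1 - m) * (Y - g)
    B = e2 - 2.0 * e1 * m + m**2
    # Step 4: (5) is linear in beta, so solve it directly.
    xr, yr = Xo - m, Y - g
    num = np.sum(w * xr * yr) - np.sum(c * a)
    den = np.sum(w * xr**2) - np.sum(c * B)
    return num / den

rng = np.random.default_rng(3)
n, beta_true = 600, 1.5
Z = rng.uniform(0.0, 1.0, n)
X = np.sin(np.pi * Z) + rng.normal(0.0, 1.0, n)
Y = beta_true * X + np.sin(2 * np.pi * Z) + rng.normal(0.0, 0.5, n)
pi = 0.3 + 0.6 / (1.0 + np.exp(-(Y - Z)))             # assumed known pi(Y, Z)
delta = (rng.uniform(size=n) < pi).astype(float)
X_obs = np.where(delta == 1, X, np.nan)
beta_hat = beta_all(Y, X_obs, Z, delta, pi, h=0.1, l1=0.8, l2=0.2)
```

Dropping the two sums involving `c` would leave the complete-case HT-weighted estimator. The augmentation terms use the fully observed (Y, Z) of every subject, including those with missing X.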
