
Semiparametric Bayesian analysis of matched case-control studies with missing exposure.

Publication: Journal of the American Statistical Association
Publication Date: 01-JUN-05

Article Excerpt
1. INTRODUCTION

This article concerns matched case-control studies when one of the covariates is partially missing. Case-control studies are the dominant form of analytical research in epidemiology, with matched case-control studies having the advantage of being matched on the basis of important stratification variables. (See Breslow and Day 1980; Breslow 1996 for background and references.) With no missing data, there is a complementary Bayesian literature for ordinary and matched case-control studies (see Zelen and Parker 1986; Muller and Roeder 1997; Diggle, Morris, and Wakefield 2000; Seaman and Richardson 2001; Ghosh and Chen 2002).

In a matched analysis, some variables may not be observed for all study subjects. For example, in the Los Angeles study on endometrial cancer in a retirement community (Breslow and Day 1980), a binary variable denoting obesity was missing in approximately 16% of respondents. There have been two main approaches to handling missingness in matched case-control problems. Lipsitz, Parzen, and Ewell (1998) and Rathouz, Satten, and Carroll (2002) modeled the missingness process directly. More germane to the present article is the likelihood approach that involves modeling the missing covariate distribution among the controls (see Satten and Kupper 1993a,b, who initiated this approach). Wang, Wang, and Carroll (1997) and Paik and Sacco (2000) attacked the missing-data problem directly by positing that the exposure distribution among the controls belongs to a canonical exponential family. Although the Paik and Sacco model is fully parametric, their estimation method is a pseudolikelihood methodology rather than full likelihood, and in the absence of missing data it reduces to the usual conditional logistic regression (CLR). Satten and Carroll (2000) generalized the Satten-Kupper and Paik-Sacco methodology to a full parametric likelihood analysis that allows essentially any exposure distribution among the controls; their approach does not reduce to the usual CLR when there are no missing data.
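For reference, the usual CLR estimates the log odds ratio parameters by maximizing a conditional likelihood in which the stratum-specific intercepts cancel. The display below is the standard textbook form for n strata of size M + 1 with complete data, not a formula taken from this excerpt. The symbols \beta_1 and \beta_2 (the coefficients of X and Z) and the convention that subject j = 1 in each stratum is the case are assumptions introduced here for illustration.

```latex
L_c(\beta_1, \beta_2)
  = \prod_{i=1}^{n}
    \frac{\exp\!\left(\beta_1 X_{i1} + \beta_2^{\top} Z_{i1}\right)}
         {\sum_{j=1}^{M+1} \exp\!\left(\beta_1 X_{ij} + \beta_2^{\top} Z_{ij}\right)}
```

With 1:1 matching (M = 1), each factor reduces to a logistic function of the within-pair differences X_{i1} - X_{i2} and Z_{i1} - Z_{i2}, which is why any quantity that is constant within a stratum drops out of the analysis.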

The present article develops a Bayesian approach to case-control studies with a disease indicator D, a completely observed covariate vector Z and an exposure or risk factor X with possibly missing values. We also include variables S = ([S.sub.o], [S.sub.u]), which define the strata used for matching. These are generally a vector of covariates, so that [S.sub.o] denotes covariates that are explicit factors making up the strata and [S.sub.u] denotes factors that are not explicitly used to form strata but nonetheless are associated with the strata. The variable X that is partially missing can be discrete or continuous.

An example will help clarify the strata issue. This example, specifically considered numerically in this article, is due to Kim, Cohen, and Carroll (2002). They described a 1:1 matched case-control study that investigated the association of various management practices with equine colic. Participating veterinarians were asked to provide data monthly for one horse that was treated for colic and one horse that received emergency treatment for any condition other than colic between March 1, 1997 and February 28, 1998. A case of colic was defined as the first horse treated during a given month for signs of intra-abdominal pain. A control horse was defined as the next horse that received emergency treatment for any condition other than colic. Two individual-level variables of interest were the age of the horse, X, and an indicator, Z, of whether the horse had undergone a recent change in diet.

In this example it is difficult to model the "next sick horse in the clinic" matching method parametrically through explicit variables [S.sub.o]. One can certainly try, for example, by accounting for month, veterinarian, region of Texas, urban/rural location, and other data, but the dimensionality is likely to be high, and clearly there is more to the matching than can be measured explicitly. However, it is entirely possible that some or all of these factors affect the distribution of X among the controls, and a likelihood analysis would have to account for them. The dimensionality of the effects quickly gets out of hand in our example, and it is likely that in such cases the possible effect of stratification on the distribution of X will be ignored.

A second general example is also of interest. There has been a recent resurgence of interest in genetic case-control studies to explore a variety of gene-disease and gene-environment interactions. The prime objective here is to examine the association between a candidate gene and the occurrence of a disease. In such problems, it is very common to stratify a population that comprises subpopulations with different allele frequencies for the gene under consideration. If different subpopulations have different risks for the disease, then failure to take these stratum effects into account will lead to misleading estimates of the association between the disease and the candidate gene. (See Satten, Flanders, and Yang 2001 for a nice description of this phenomenon in the unmatched case-control context.) In our case of a matched case-control study, population stratification is handled by the matching when there are no missing data, but, interestingly, with missing data, population stratification may become an issue. Our approach can be viewed as a semiparametric method for handling such an issue if the stratification is caused by a continuous variable.

The likelihood approaches referenced earlier start with an explicit parametric distribution for X among the controls, D = 0, and given Z as well as possibly the explicit parts [S.sub.o] of the strata. Wang et al. (1997) and Paik and Sacco (2000) did not include the strata in the model for X, whereas Satten and Carroll (2000) included the observed components [S.sub.o] as a linear term. In the equine example, we of course believe that it is effectively impossible to measure all of the stratification variables and then model their effect on X, which is likely to be in the form of an unexplained heterogeneity.

Our approach is to generalize the previous methods to allow for the possibility of unmeasured stratification effects, that is, unexplained heterogeneity in the distribution of X given Z among the controls. Alternatively, our approach can be looked at as a way of allowing for high-dimensional effects of stratification. We model the stratification effects via a Dirichlet process prior with a normal base measure for the distribution of the stratum effects, then use the Bayesian machinery. By this route, we achieve a measure of model robustness and acknowledge that the distribution of X among the controls can be affected by stratification, especially by unmeasured factors.
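To make the Dirichlet process prior concrete, the following minimal sketch draws one set of stratum effects from a Dirichlet process with a normal base measure, using the standard Polya urn representation. It illustrates the prior only, not the authors' model or MCMC sampler. The function name `draw_stratum_effects`, the concentration parameter `alpha`, and the base-measure parameters `base_mean` and `base_sd` are hypothetical names chosen for this illustration.

```python
import random


def draw_stratum_effects(n_strata, alpha, base_mean=0.0, base_sd=1.0, seed=None):
    """Draw n_strata effects from a Dirichlet process prior (Polya urn scheme).

    alpha must be positive; the base measure is Normal(base_mean, base_sd**2).
    All parameter names are illustrative assumptions, not the paper's notation.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    effects = []
    for i in range(n_strata):
        # With probability alpha / (alpha + i), draw a fresh value
        # from the normal base measure (always true for the first stratum).
        if rng.random() < alpha / (alpha + i):
            effects.append(rng.gauss(base_mean, base_sd))
        else:
            # Otherwise reuse the effect of a uniformly chosen earlier stratum,
            # which is what produces clusters of strata with equal effects.
            effects.append(rng.choice(effects))
    return effects


if __name__ == "__main__":
    draws = draw_stratum_effects(n_strata=20, alpha=1.0, seed=0)
    print(len(set(draws)), "distinct effects among", len(draws), "strata")
```

A small `alpha` makes most strata share a few common effects, while a very large `alpha` makes the effects behave like independent draws from the normal base measure, which is the usual parametric random-effects limit of a Dirichlet process prior.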

The outline of the article is as follows. Section 2 introduces the model and notation, and Section 3 derives the appropriate likelihoods and the Markov chain Monte Carlo (MCMC) method for computing Bayesian inference. We note in passing that a limiting case of our approach is a full parametric Bayesian analysis of matched case-control studies with missing data, something that does not appear to be available in the literature.

Section 4 provides data analyses for three examples covering two special cases of the general exponential family: (a) a continuous exposure in the equine epidemiology problem, (b) a binary exposure in an endometrial cancer study, and (c) a binary exposure in a study of low birth weight in newborns. Section 5 contains a small simulation study to assess the performance of our proposed methods. We show that, at least for this simulation, our methods lose nothing in the way of efficiency when there are no unmeasured stratification effects on the missing covariate X, but gain quite a bit of efficiency when stratification does affect the missing covariate. Section 6 presents some concluding remarks. All technical details are collected in the Appendix.

In concluding this section, we highlight some of the newer aspects of this article. To our knowledge, this is the first article that takes a Bayesian approach, both parametric and semiparametric, toward the analysis of matched case-control studies with missing exposure. Our method is applicable to both discrete and continuous data. Moreover, our proposed method explicitly captures the unmeasured stratum effects, which may not be possible just by introducing additional covariates.

2. MODEL AND NOTATION

For subject j in stratum i, i = 1,..., n, j = 1,..., M + 1, let [D.sub.ij] represent the disease status, namely [D.sub.ij] = 1 for a case and [D.sub.ij] = 0 for a control. We consider a single exposure variable [X.sub.ij], which may be partially missing, and a completely observed p-dimensional...
