|
Article Excerpt 1. INTRODUCTION
Simultaneous inference on both model and parameter space is an issue that is fundamental to modern statistical practice. In general, for observed data x we might consider a countable set of models, M = {[M.sub.1], [M.sub.2],...}, indexed by a parameter, k [member of] K, each with a parameter vector defined on [[theta].sub.k] [member of] [[THETA].sub.k] of length [n.sub.k]. Under a Bayesian framework, we would relate each model to a posterior distribution,
[M.sub.k]:[~.[pi].sub.k]([[theta].sub.k]|x) = [[[pi].sub.k]([[theta].sub.k]|x)]/[[m.sub.k](x)] [proportional] [L.sub.k](x|[[theta].sub.k])[p.sub.k]([[theta].sub.k]),
generally known only up to a constant of proportionality, [m.sub.k](x)[.sup.-1], where [L.sub.k] and [p.sub.k] denote the likelihood and parameter prior under model [M.sub.k]. Explicitly expressing [m.sub.k](x) = [[integral].sub.[THETA].sub.k][L.sub.k](x|[[theta].sub.k])[p.sub.k]([[theta].sub.k])d[[theta].sub.k] as the marginal or predictive densities of x under model [M.sub.k], the normalized posterior probability of model [M.sub.k] is given by
[M.sub.k](x) = [[[rho].sub.k][m.sub.k](x)]/[[[summation].sub.i = 1.sup.|K|][[rho].sub.i][m.sub.i](x)] = (1 + [summation over (i[not equal to]k)][[[rho].sub.i]/[[rho].sub.k]][B.sub.ik])[.sup.-1], (1)
where [B.sub.ik] = [m.sub.i](x)/[m.sub.k](x) is the Bayes factor of model [M.sub.i] to [M.sub.k], and [[rho].sub.k] is the prior probability of model k. (See, e.g., Chipman, George, and McCulloch 2001; Berger and Pericchi 2001, 2004; Kass and Raftery 1995; Ghosh and Samanta 2001; Barbieri and Berger 2004; Robert 2001; George and McCulloch 1996; Madigan and Raftery 1994 for discussion of Bayesian model selection techniques.) As an alternative to the selection of a single model, a common approach within the Bayesian framework is that of model averaging, which incorporates model uncertainty in addition to parameter uncertainty. Here interest would be in some predictive density,
[pi](y|x) = [[integral].sub.K][[integral].sub.[THETA].sub.k][pi](y|[[theta].sub.k])[~.[pi].sub.k]([[theta].sub.k]|x)d[[theta].sub.k]dk,
with integration over both model and parameter space. There is a fantastic literature on the application of Bayesian methods for model uncertainty. General review articles such as those by Clyde and George (2004), Chipman et al. (2001), Clyde (1999a), and Hoeting, Madigan, Raftery, and Volinsky (1999) contain a wealth of information and references. Similarly, Andrieu, Doucet, and Robert (2004) reviewed the recent computational and technological advances relating to Bayesian analyses, and Muller and Quintana (2004) and Heikkinen (2003) summarized the current state of nonparametric Bayesian inference.
A typical Bayesian analysis based on the foregoing will encounter two related problems. First, the density [m.sub.k](x) in general will be unavailable because of analytic intractability. Second, the number of candidate models, |K|, will often be very large, prohibiting a brute-force calculation of [M.sub.k](x) via (1). One of the more flexible and popular techniques used to overcome these problems is Markov chain Monte Carlo (MCMC) methodology. For the approximation of [m.sub.k](x) in particular, these include various forms of Metropolis-Hastings samplers. (See, e.g., Robert and Casella 2004; Cappe and Robert 2000; Gilks, Richardson, and Spiegelhalter 1996 for a discussion of standard MCMC methods and implementation issues.) However, techniques capable of simultaneously considering a large number of candidate models did not become available until the mid-1990s.
Almost exactly a decade ago, Green (1995) recast the terms and definitions involved in the Metropolis-Hastings algorithm in a more rigorous manner. In particular, the idea of using a time reversibility condition for the transition kernel of a Markov chain to ensure convergence to the desired stationary distribution was extended to more general state spaces. In integral form, the detailed balance condition for a general transition kernel P and its invariant distribution [pi] can be written as
[[integral].sub.(x,x')[member of]AXB][pi](dx)P(x,dx') = [[integral].sub.(x,x')[member of]AXB][pi](dx')P(x',dx) (2)
for all Borel sets A X B [subset] [THETA] for a general state space [THETA] (see, e.g., Green 2001; Tierney 1998). The novelty of this reexpression was that the generality of the state space under consideration now included formulations that could encompass multiple models. One often considered instance is [THETA] = [[union].sub.k[member of]K][[THETA].sub.k] X {k}, that is, a countable union of subspaces of possibly varying dimensionality. Via standard Metropolis-Hastings updates, this development enabled the implementation of Markov chains simultaneously spanning both parameter and model space, [THETA], with stationary distribution [pi] that is absolutely continuous in [[THETA].sub.k] for each k [member of] K with respect to the [n.sub.k]-dimensional Lebesgue measure. As a result, the estimation of posterior model probabilities and other marginal densities of interest is in theory easily obtainable, irrespective of the order of K. The general class of Markov chains that admit transitions between states of differing dimension have since been developed further to achieve a broad family of interrelated sampling frameworks. Given their model-spanning nature, these have recently been termed transdimensional Markov chains.
This article aims to achieve three objectives: (1) to provide an accessible evaluation of the current state of the art in terms of the practical implementation of transdimensional sampling technologies; (2) to provide an overview of the most recent and most important developments in multimodel sampling frameworks, in terms of ongoing theoretical and methodological development; and (3) to suggest some perspective of residual open problems and highlight those facets that would strongly benefit from further research.
Although transdimensional sampling algorithms are now well documented in the literature, in the first part of this article (Sec. 2) we present both a brief overview of the main frameworks that have been developed in the last 10 years and an assessment of their impact and implementation, including an evaluation of the freely available software for this purpose. We also consider non-Bayesian applications, the efficient estimation of Bayes factors, and the highly important, although frequently overlooked, issue of convergence assessment.
In the second part of this article (Sec. 3), we examine one facet of transdimensional sampling schemes that holds considerable potential for future research: the development of algorithms with increasing degrees of efficiency and automation. Achieving this--one of the fundamental goals of modern sampling frameworks--would permit routine implementation of transdimensional samplers by nonexpert practitioners, perhaps via stand-alone software packages such as the popular WinBuGS suite (Gilks, Thomas, and Spiegelhalter 1992; Spiegelhalter, Thomas, Best, and Gilks 1996b). Recent developments have made great strides in this direction, providing advances in the areas of between model transitions in terms of both efficiency and constructing generic mappings, the extension of perfect sampling schemes to the transdimensional case, and progress in default prior specifications over joint model and parameter spaces.
2. TRANSDIMENSIONAL MARKOV CHAINS AND THEIR IMPLEMENTATION
A number of frameworks have been proposed since the mid-1990s that supplement or extend the existing fixed-dimension Monte Carlo sampling schemes to encompass across model stochastic simulation. Each scheme may be related to others in a conceptually straightforward manner, facilitating natural settings for sampler comparison.
2.1 Sampling Frameworks
Introducing the [THETA] = [[union].sub.k[member of]K][[THETA].sub.k] X {k} formulation of model space, Grenander and Miller (1994) proposed a sampling strategy based on continuous time jump-diffusion dynamics. Such a Markov process essentially jumps between parameter spaces (and therefore models) at random times, and between the jumps follows a diffusion process according to a Langevin stochastic differential equation indexed by time, t, satisfying
d[[theta].sub.k.sup.t] = d[B.sub.k.sup.t] + [1/2][nabla]log[pi]([[theta].sub.k.sup.t])dt, (3)
where d[B.sub.k.sup.t] denotes an increment of Brownian motion and [nabla] denotes the vector of partial derivatives. In practice, (3) is approximated by a discrete-time version with a Metropolis-Hastings step to preserve the stationary distribution [pi] (Roberts and Tweedie 1996). This method has found some application in signal processing and other Bayesian analyses (e.g., Miller, Srivastava, and Grenander 1995; Phillips and Smith 1996), but has in general been superseded by the more accessible reversible-jump sampler (Green 1995). In fact, correcting for the time-discretization approximation via the Metropolis-Hastings acceptance probability template, the dump-diffusion sampler can be shown to result in an implementation of the reversible-jump algorithm (Besag 1994).
The widely implemented reversible-jump sampler was introduced by Green (1995) in a Bayesian model determination setting. Tierney (1998) and Green (2003a) both provided interesting expositions on this theme. One reason for the popularity of this algorithm in particular is conceptual: The framework is a natural generalization of the standard Markov chain theory, lending it certain appeal. In general, assuming that detailed balance (2) is satisfied, we may denote the acceptance probability of a proposed between-model move from ([[theta].sub.k], k) in model [M.sub.k] to the state ([[theta]'.sub.k'], k') in model [M.sub.k'] to be min{1, A[([[theta].sub.k], k) [right arrow] ([[theta]'.sub.k'],k')]}. Here
A[([[theta].sub.k], k) [right arrow] ([[theta]'.sub.k'], k')] = [[[L.sub.k'](x|[[theta]'.sub.k'])[p.sub.k']([[theta]'.sub.k'])[[rho].sub.k']q(k'[right arrow]k)[q.sub.k']([u'.sub.k'])]/[[L.sub.k](x|[[theta].sub.k])[p.sub.k]([[theta].sub.k])[[rho].sub.k]q(k [right arrow]) k')[q.sub.k]([u.sub.k])]]|[[partial derivative][g.sub.k[right arrow]k']([[theta].sub.k], [u.sub.k])]/[[partial derivative]([[theta].sub.k], [u.sub.k])]|,
where [[theta]'.sub.k'] = [g.sub.k[right arrow]k']([[theta].sub.k], [u.sub.k]) for a random vector [u.sub.k] [approximately] [q.sub.k](u; [[psi].sub.k]) with parameter vector [[psi].sub.k], and q(k [right arrow] k') is the probability of proposing to move from model [M.sub.k] to [M.sub.k']. Here [g.sub.k[right arrow]k']:[[THETA].sub.k] X [[THETA].sub.k.sup.q] [right arrow] [[THETA].sub.k'] denotes a mapping of the state ([[theta].sub.k], k) together with the vector [u.sub.k] to the state ([[theta]'.sub.k'], k'). The mapping satisfies [g.sub.k'[right arrow]k]([g.sub.k[right arrow]k']([[theta].sub.k], [u.sub.k]), [u'.sub.k']) = [[theta].sub.k] and requires that [n.sub.k] + [d.sub.k] = [n.sub.k'] + [d.sub.k'], where [d.sub.k] is the dimension of [[THETA].sub.k.sup.q] (known as "dimension matching").
It must be noted that the reversible-jump algorithm is not limited to the countable set of models M, although it is frequently presented in this context. In fact, one may implement the sampler without knowing the size of the model space beforehand (which may contribute somewhat to the popularity of the algorithm), although at least some knowledge of the model space is recommended for the construction of an efficient chain. The most common setting involving an uncountable number of models is in Bayesian nonparametrics, where both the number of basis functions and the functions themselves are free to vary--via fractional polynomial regression (Royston and Altman 1994) or free-knot splines, for example. (See Denison, Holmes, Mallick, and Smith 2002; DiMatteo, Genovese, and Kass 2001; Denison, Mallick, and Smith 1998; Smith and Kohn 1996 for useful instances of Bayesian nonparametric curve fitting.)
Providing an alternative to samplers designed for implementation on unions of model spaces, a number of approaches have developed conventional Markov chain technology on a product supermodel space, [THETA]* = K X [[cross product].sub.k[member of]K][[THETA].sub.k], that encompasses all model spaces jointly, thereby circumventing the necessity of between-model transitions. Defining a composite parameter vector, [theta]*, consisting of a concatenation of all parameters under all models, Carlin and Chib (1995) proposed a formulation in which the posterior distribution for the composite model space was given by
[pi](k, [theta]*|x)[proportional][L.sub.k](x|[[theta]*.sub.I.sub.k])[p.sub.k]([[theta]*.sub.I.sub.k])[p.sub.k]([[theta]*.sub.I.sub.-k]|[[theta]*.sub.I.sub.k])[[rho].sub.k],
where [I.sub.k] and [I.sub.-k] are index sets identifying and excluding the parameters [[theta].sub.k] from [theta]*. Here [I.sub.k] [intersection] [I.sub.k'] = for all k [not equal to] k', so that the parameters for each model are distinct. In practice, this setting requires that the size of the model space, |K|, be finite, thereby restricting its range of application relative to (say) the reversible-jump algorithm. Even when this setting is appropriate, a number of impracticalities are apparent for even a moderately sized model space. Although some of the computation in sampling the full parameter vector, [theta]*, may be avoided (Godsill 2003; Dellaportas, Forster, and Ntzoufras 2002; Green and O'Hagan 1998), this approach requires the definition of [p.sub.k]([[theta]*.sub.I.sub.-k]|[[theta]*.sub.I.sub.k]), termed pseudopriors. Although their specification is essentially arbitrary in terms of obtaining the desired marginal distributions, sampler performance is sensitive to their specification, introducing practical problems in terms of efficiency and tractability (see Godsill 2001, 2003; Green 2003a for a discussion). However, it is believed that, in contrast to the lack of memory of previously visited states inherent in the reversible-jump sampler, in the product space formulations (which contain a perfect memory), the information contained within [[theta]*.sub.I.sub.-k] may be useful in generating efficient between-model transitions when in model [M.sub.k]. This idea was exploited by Brooks, Guidici, and Roberts (2003c), and we consider it in more detail in Section 3.1.
Comparing the methods of Green (1995) and Carlin and Chib (1995), Godsill (2001) proposed a further generalization of the foregoing that achieves enhanced flexibility by permitting individual model parameter vectors to overlap arbitrarily; that is, the restriction that [I.sub.k] [intersection] [I.sub.k'] = for all k [not equal to] k' is relaxed. This may seem intuitive for, say, nested models. The approach of Godsill (2001)...
|