1. INTRODUCTION
With the advent of high-throughput genotyping data, geneticists are able to identify the genetic regions responsible for quantitative traits, known as quantitative trait loci (QTL). Let us assume that there are two loci, A and B, that influence the quantitative trait of interest. Under an additive linear model, the genetic contribution to the quantitative trait can be written as
$$X_{ijn} = \theta + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijn}, \tag{1}$$
where $X_{ijn}$ is the phenotype of the quantitative trait, $i$ and $j$ index the different genotypes of locus A and locus B, $n$ indexes the replicates within a genotype combination, and $\epsilon_{ijn}$ is noise. Traditional approaches using linear regression (Haseman and Elston 1972) and maximum likelihood (Lander and Botstein 1989) both assume normality of the error distribution. Zou, Fine, and Yandell (2002) adopted a semiparametric approach that assumes only that the log ratio of error densities follows a linear model with the baseline unspecified. With no prior knowledge of the error distribution, there is a great need for a robust nonparametric method for testing linkage and interaction effects. Kruglyak and Lander (1995) proposed a generalized Wilcoxon rank-sum test to perform interval mapping; however, this method is restricted to a search for one QTL at a time. Considering two QTLs simultaneously, we can view the testing of QTL linkage and gene-gene interaction as testing for the main and interaction effects in a two-way factorial design. For a specified pair of loci A and B, the number of replicates for each genotype combination cannot be controlled by the experimenter, because the inheritance events occur randomly and the genotypes can be determined only after the samples have been genotyped. This results in unbalanced factorial designs in the setting of QTL mapping.
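To illustrate why the design is unbalanced, the following minimal sketch simulates model (1) for a hypothetical two-locus backcross. Everything numerical here is an assumption made for illustration and is not taken from the article: two genotypes per locus, unlinked loci (so each genotype is drawn with probability 1/2), standard normal noise, and the particular effect sizes. The function name `simulate_two_locus_backcross` is likewise hypothetical.

```python
import random
from collections import Counter


def simulate_two_locus_backcross(n_samples, theta, alpha, beta, gamma, seed=0):
    """Draw (i, j, x) triples from model (1).

    Assumptions of this sketch: each locus has two genotypes, the two
    loci are unlinked, and the noise is standard normal.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        i = rng.randrange(2)  # genotype at locus A, not chosen by the experimenter
        j = rng.randrange(2)  # genotype at locus B
        x = theta + alpha[i] + beta[j] + gamma[i][j] + rng.gauss(0.0, 1.0)
        samples.append((i, j, x))
    return samples


samples = simulate_two_locus_backcross(
    n_samples=200,
    theta=10.0,
    alpha=(0.0, 0.5),                # illustrative main effects of locus A
    beta=(0.0, 0.3),                 # illustrative main effects of locus B
    gamma=((0.0, 0.0), (0.0, 0.4)),  # illustrative interaction effects
)

# Cell sizes n_ij are random outcomes, so they are generally unequal.
cell_counts = Counter((i, j) for i, j, _ in samples)
for cell in sorted(cell_counts):
    print(cell, cell_counts[cell])
```

Because the genotype of each sample is drawn at random, the four cell counts are determined only after "genotyping" and will typically differ from one another.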
Rank procedures for balanced factorial designs are well established in the literature. Aligned tests (Hodges and Lehmann 1962; Puri and Sen 1971) use the method of "ranking after alignment." The rank-transform method (Conover and Iman 1976; Hora and Conover 1984) applies the classical ANOVA test on the overall rankings. However, the aforementioned tests are not applicable to unbalanced designs, which constrains their use in genetic settings.
Akritas (1990) proposed a modified rank-transform method for nested effects in unbalanced designs by investigating the asymptotic version of the rank-transform statistics. However, the test proposed for main effects is restricted to orthogonal cases. Modified Friedman-type statistics (Mack and Skillings 1980; Rai 1987; Wittkowski 1988) have been proposed to test for main effects in unbalanced designs with no interaction. These tests are based on within-block rankings, which results in potential loss of power. Special cases of unbalanced designs with interaction have been investigated by Boos and Brownie (1992), Brunner and Dette (1992), Akritas (1993), Brunner and Denker (1992), and Brunner, Puri, and Sun (1995). Nonparametric hypothesis tests in unbalanced designs have been developed by Akritas, Arnold, and Brunner (1997). The theory is based on a general nonparametric model, which is different from the linear model considered in this article.
Considering the testing problems in the linear model setting, Hettmansperger and McKean (1998) described a unified framework for robust general linear model analysis that includes a rank regression test based on a measure of dispersion, a Wald-type test based on the rank estimate of the parameters, and a least squares-type test based on the concept of pseudo-observations. Mansouri (1999) presented an aligned rank-transform method, which constructs the statistic in the same way as the least squares F-statistic except that the rank scores of the residuals of the reduced model are used instead of the original observations. Admittedly, these approaches are powerful and robust tools for testing general linear hypotheses, but they are not pure rank tests and require estimation of the effects.
Motivated by the QTL mapping problem, we aim to develop a unified nonparametric approach to hypothesis testing for arbitrary unbalanced designs with and without interaction. Under this framework, pure rank statistics can be constructed to test for main effects as well as for nested and interaction effects without the need for alignment or R-estimation. The methodology rests on the notions of composite linear rank statistics (CLRSs) introduced by Gao and Alvo (2005) and of weighted ranks defined in Section 2.
The article is organized as follows. In Section 2 the definition of weighted linear rank statistics (WLRSs) is introduced, and its connection with composite linear rank statistics is established. The asymptotic normality of WLRS is proved under mild conditions. In Section 3 a new rank-transform method based on WLRS is proposed for main effects in unbalanced designs without interaction. Limiting distributions under both the null hypothesis and the Pitman alternatives are derived, and consistent covariance estimation is provided. The test is a generalized version of the Hora-Conover (HC) statistic. In Sections 4 and 5, the use of WLRSs is extended to the problem of testing for nested effects and for interaction effects in unbalanced designs. In Section 6 asymptotic relative efficiencies (AREs) of the proposed tests versus parametric counterparts are evaluated under Pitman alternatives. In Section 7 Monte Carlo simulations are conducted to verify the small-sample performance of the proposed tests for different cell sizes and different error distributions. In Section 8 two simulated genetic datasets from a backcross study are analyzed to demonstrate the potential application of the generalized HC statistic in QTL mapping.
Throughout the article we use the following notation. Let $I_I$ and $I_J$ denote identity matrices of dimensions $I$ and $J$. Let $J_I$ and $J_J$ denote matrices with all entries equal to 1, of dimensions $I$ and $J$. A subscript "." denotes summation over all values of the index it replaces.
2. WEIGHTED LINEAR RANK STATISTICS
Consider a two-way unbalanced layout with $I$ blocks, $J$ treatments, and $n_{ij}$ replications in the $(i,j)$ cell. To tackle the problem of unbalanced ranking, Benard and van Elteren (1953) proposed to "center" the within-block ranks by subtracting $\frac{1}{2}(n_{i\cdot} + 1)$, the mean of the $n_{i\cdot}$ ranks in the $i$th block. As noted by Prentice (1979), the centered ranks are unsatisfactory when block sizes are very different. Instead, Prentice proposed scaling the centered within-block ranks by a factor of $1/(n_{i\cdot} + 1)$. Skillings and Mack (1981), Rai (1987), and Wittkowski (1988) further advocated the idea of "centering" and "scaling" and proposed different versions of standardized ranks that adjust only for different block sizes using weighted sums of within-block ranks. Such adjustments do not have direct extensions to the rank-transform statistics.
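The centering and scaling described above can be sketched as follows. This is a minimal illustration for a single block, assuming no ties among the observations; the function names and the example values are hypothetical and are not part of the cited methods.

```python
def within_block_ranks(values):
    """Ranks 1..n of the observations in one block (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0] * len(values)
    for rank, k in enumerate(order, start=1):
        ranks[k] = rank
    return ranks


def centered_scaled_ranks(values):
    """Return (centered, scaled) within-block ranks for one block.

    Centering subtracts (n + 1) / 2, the mean of the n ranks in the block.
    Scaling then multiplies the centered ranks by 1 / (n + 1).
    """
    n = len(values)
    ranks = within_block_ranks(values)
    centered = [r - (n + 1) / 2 for r in ranks]
    scaled = [c / (n + 1) for c in centered]
    return centered, scaled


small_block = [2.1, 0.4, 3.3]                      # n = 3
large_block = [5.0, 1.2, 9.9, 3.4, 7.7, 0.1, 6.5]  # n = 7

for block in (small_block, large_block):
    centered, scaled = centered_scaled_ranks(block)
    print(max(centered), max(scaled))
```

In this example the largest centered rank grows with the block size (1 versus 3), whereas the scaled versions stay on a comparable scale (0.25 versus 0.375), which is the motivation for scaling when block sizes differ widely.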
Define the function $u(x) = 1$ if $x \ge 0$ and $u(x) = 0$ if $x < 0$, and note that the overall rank of $X_{ijn}$ can be written as
$$R_{ijn} = \sum_{i'} \sum_{j'} \sum_{n'=1}^{n_{i'j'}} u(X_{ijn} - X_{i'j'n'}).$$
Thus the overall rankings do not adjust for different treatment or block levels in unbalanced designs. To address this problem, we define the notion of a weighted rank.
Definition 1. Let $\Omega = \{X_{ijn},\ i = 1,\ldots,I,\ j = 1,\ldots,J,\ n = 1,\ldots,n_{ij}\}$ be a collection of random variables. The weighted rank of $X_{ijn}$ within this set is
$$R^*_{ijn} = \frac{N}{IJ} \sum_{i'j'} \frac{1}{n_{i'j'}} \left( \sum_{n'} u(X_{ijn} - X_{i'j'n'}) \right), \quad \text{where } N = \sum_{ij} n_{ij}.$$
The weighted rank is a sum of indicator functions weighted by the reciprocal of the number of replicates in each cell. When the $n_{ij}$'s are equal, the weighted rank reduces to the usual definition of rank, denoted by $R_{ijn}$.
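As a small numerical illustration of Definition 1, the sketch below computes overall and weighted ranks directly from the indicator function u. The data layout (a dictionary mapping each cell (i, j) to its list of observations) and the function names are assumptions of this sketch; every cell is assumed to be nonempty, so the number of cells equals IJ.

```python
def u(x):
    """Indicator u(x) = 1 if x >= 0 and 0 otherwise."""
    return 1 if x >= 0 else 0


def overall_rank(cells, x):
    """Usual rank of x among all observations in all cells."""
    return sum(u(x - y) for obs in cells.values() for y in obs)


def weighted_rank(cells, x):
    """Weighted rank: (N / IJ) * sum over cells of (count of y <= x) / n_cell."""
    n_total = sum(len(obs) for obs in cells.values())  # N
    n_cells = len(cells)                               # IJ, all cells nonempty
    return (n_total / n_cells) * sum(
        sum(u(x - y) for y in obs) / len(obs) for obs in cells.values()
    )


# Balanced layout: every cell has two replicates.
balanced = {
    (0, 0): [1.0, 4.0],
    (0, 1): [2.0, 5.0],
    (1, 0): [3.0, 6.0],
    (1, 1): [0.5, 7.0],
}
# Unbalanced layout: one replicate versus three.
unbalanced = {(0, 0): [1.0], (0, 1): [2.0, 3.0, 4.0]}

for name, cells in (("balanced", balanced), ("unbalanced", unbalanced)):
    for obs in cells.values():
        for x in obs:
            print(name, x, overall_rank(cells, x), weighted_rank(cells, x))
```

In the balanced layout the two ranks coincide for every observation. In the unbalanced layout the single observation in the small cell receives weighted rank 2 rather than overall rank 1, because its cell carries the same total weight as the larger cell.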
The asymptotic normality of linear rank statistics based on traditional ranks is well established in the literature. However, any linear rank statistic based on weighted ranks needs justification for its asymptotic normality to hold. We show later that a class of WLRSs can be reformulated in terms of the CLRSs introduced by Gao and Alvo (2005). A CLRS is a sum of correlated simple linear rank statistics defined on different, possibly overlapping sets, and it provides a convenient tool for proving the asymptotic normality of WLRSs.
To demonstrate the connection between CLRS and WLRS, it is necessary to recall the definition of CLRS in Gao and Alvo (2005). Let $\Omega$ be a set consisting of $N$ random variables $X_1, X_2, \ldots, X_N$ with continuous distribution functions $F_1, \ldots, F_N$. Let $\mathcal{A}$ be a collection of subsets of $\Omega$, not necessarily disjoint. Let the score-generating function $\alpha_N(x)$ be generated by a real-valued function $\phi(x)$ with a bounded second derivative, usually defined either as $\alpha_N(i) = \phi(i/(N + 1))$ or as $\alpha_N(i) = E\phi(U_N^{(i)})$, where the $\{U_N^{(i)}\}$ are the order statistics from a uniform distribution. For a set $A \in \mathcal{A}$, let $R_A(X_i)$ be the rank of $X_i$ within $A$, let $C_A(X_i)$ be a constant coefficient associated with index $i$ and set $A$, let $\bar{C}_A$ be the average of $C_A(X_i)$ over $A$, and let $n_A$ be the cardinality of $A$.
Let $S_A$ be a simple linear rank statistic defined on set $A$: $S_A = \sum_{X_i \in A} C_A(X_i)\, \alpha_{n_A}(R_A(X_i))$. Define the average distribution function $H_A(x) = \frac{1}{n_A} \sum_{X_i \in A} F_i(x)$. Let $\mu_A = \sum_{X_i \in A} C_A(X_i) \int \phi(H_A(x))\, dF_i(x)$ and $\mu = \sum_{A \in \mathcal{A}} \mu_A$. The projection of $S_A$ onto $X_i$ is given by
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (2)
It is known that the distribution of $S_A$ converges to a normal distribution with mean $\mu_A$ and variance $\sum_{X_i \in A} \mathrm{var}(L_A(X_i))$ (Hajek 1968, thm. 2.1). The following theorem in Gao and Alvo (2005) provides a general extension to this result and establishes the asymptotic normality of sums of correlated simple linear rank statistics.
Theorem 1. Let the CLRS be
$$S = \sum_{A \in \mathcal{A}} S_A = \sum_{A \in \mathcal{A}} \sum_{X_i \in A} C_A(X_i)\, \alpha_{n_A}(R_A(X_i)).$$
Let
$$\sigma^2 = \sum_{i=1}^{N} \mathrm{var}\left( \sum_{A \in \mathcal{A}} L_A(X_i) \right).$$
If
$$\lim_{\min_{A \in \mathcal{A}}(n_A) \to \infty} \frac{\sup_{A \in \mathcal{A}} \sup_{X_i \in A} (C_A(X_i) - \bar{C}_A)^2}{\sigma^2} = 0, \tag{3}$$
then
$$\frac{S - ES}{\sigma} \xrightarrow{d} N(0, 1).$$
The conclusion still holds if $\sigma$ is replaced by $\mathrm{var}(S)^{1/2}$. If $\sup_{A \in \mathcal{A}} \sup_{X_i \in A} (C_A(X_i))^2 \le K \sup_{A \in \mathcal{A}} \sup_{X_i \in A} (C_A(X_i) - \bar{C}_A)^2$ for some constant $K$, then $ES$ can be replaced by $\mu$ in the conclusion.
By the projection method, we obtain a closed-form expression for $\sigma^2$ that serves as a good approximation to $\mathrm{var}(S)$. The latter is more difficult to calculate in practice.
Using Theorem 1, we now prove the asymptotic normality for a class of WLRSs. For the two-way unbalanced layout, define the set $\Omega = \{X_{ijn};\ i = 1,\ldots,I,\ j = 1,\ldots,J,\ n = 1,\ldots,n_{ij}\}$. We assume throughout this article that as $\min n_{ij} \to \infty$, $\lim n_{ij}/N = \rho_{ij}$ with $0 < \rho_{ij} < 1$. Let $A(i, j, i', j')$ denote the set of variables in $\Omega$ contained in the distinct cells $(i,j)$ and $(i',j')$. Let $\mathcal{A} = \{A(i,j,i',j') : \text{all unordered pairs } (i,j) \ne (i',j')\}$.
Theorem 2. Let $R^*(X_{ijn})$, $\Omega$, $N$, $A$, and $\mathcal{A}$ be as defined earlier. Let $S^*$ be a WLRS of linear scores defined on set $\Omega$ that takes the form
$$S^* = \sum_{ijn} d_{ij} R^*_{ijn}, \tag{4}$$
where the $d_{ij}$'s are fixed constants associated with cell $(i, j)$. Then $S^*$ can be reexpressed as
$$S^* = c + \sum_{A \in \mathcal{A}} \sum_{X_{ijn} \in A} C_A(X_{ijn}) \frac{R_A(X_{ijn})}{n_{ij} + n_{i'j'} + 1}, \tag{5}$$
where $C_A(X_{ijn}) = d_{ij} N (n_{ij} + n_{i'j'} + 1)/(IJ\, n_{i'j'})$, and the constant $c$ is of order $N^2 \sum_{ij} d_{ij}$.
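The identity between (4) and (5) can be checked numerically on a small example. In the sketch below, the explicit expression used for the constant c is not quoted from the article; it is derived here by splitting each pairwise rank $R_A$ into a within-cell rank plus a count from the other cell, and it assumes no ties. The data layout, cell sizes, coefficients $d_{ij}$, and function names are all hypothetical.

```python
import random
from itertools import combinations


def u(x):
    return 1 if x >= 0 else 0


def rank_within(values, x):
    """Rank of x within the list `values` (which contains x)."""
    return sum(u(x - y) for y in values)


def wlrs_direct(cells, d):
    """Form (4): sum of d_ij times the weighted rank of each observation."""
    n_total = sum(len(obs) for obs in cells.values())
    scale = n_total / len(cells)  # N / IJ
    total = 0.0
    for key, obs in cells.items():
        for x in obs:
            r_star = scale * sum(rank_within(o, x) / len(o) for o in cells.values())
            total += d[key] * r_star
    return total


def wlrs_via_clrs(cells, d):
    """Form (5): constant c plus a sum over all unordered pairs of cells."""
    n_total = sum(len(obs) for obs in cells.values())
    n_cells = len(cells)
    scale = n_total / n_cells
    # Constant c as derived for this sketch (assumes no ties).
    total = 0.0
    for key, obs in cells.items():
        n = len(obs)
        inv_others = sum(1.0 / len(o) for k, o in cells.items() if k != key)
        total += d[key] * scale * (n * (n + 1) / 2) * (1.0 / n - inv_others)
    for a, b in combinations(cells, 2):
        pooled = cells[a] + cells[b]  # the set A(i, j, i', j')
        m = len(pooled) + 1           # n_ij + n_i'j' + 1
        for own, other in ((a, b), (b, a)):
            coef = d[own] * n_total * m / (n_cells * len(cells[other]))  # C_A
            for x in cells[own]:
                total += coef * rank_within(pooled, x) / m
    return total


rng = random.Random(1)
sizes = {(0, 0): 3, (0, 1): 5, (1, 0): 2, (1, 1): 4}
cells = {key: [rng.gauss(0.0, 1.0) for _ in range(n)] for key, n in sizes.items()}
d = {(0, 0): 1.0, (0, 1): -0.5, (1, 0): 2.0, (1, 1): 0.25}

direct = wlrs_direct(cells, d)
via_clrs = wlrs_via_clrs(cells, d)
print(abs(direct - via_clrs) < 1e-9)
```

The two computations agree up to floating-point error, which illustrates how a WLRS with linear scores decomposes into simple linear rank statistics on pairs of cells plus a data-independent constant.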
Consequently, the asymptotic normality of $S^*$ with linear scores in the form of (4) is established if the CLRS in (5) satisfies condition (3). It is known that for a given error distribution, different score functions have different efficiency properties. Thus it is imperative to extend the asymptotic result of WLRS from linear scores to nonlinear score functions. Define weighted rank scores $\alpha_N(R^*_{ijn})$ as scores generated by a function $\phi(t)$, $0 < t < 1$, with a bounded second derivative, as $\alpha_N(R^*_{ijn}) = \phi(R^*_{ijn}/(N + 1))$. Let $S^*$ be a WLRS with nonlinear scores, which takes the form
$$S^* = \sum_{ijn} d_{ij}\, \alpha_N(R^*_{ijn}).$$
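A WLRS with a general score function can be sketched in the same way. The code below is an illustration only: the data layout and function names are hypothetical, and the cosine score is simply one convenient smooth function with a bounded second derivative on (0, 1), not a score function recommended in the article.

```python
import math


def u(x):
    return 1 if x >= 0 else 0


def weighted_rank(cells, x):
    """Weighted rank of x; assumes every cell is nonempty."""
    n_total = sum(len(obs) for obs in cells.values())
    return (n_total / len(cells)) * sum(
        sum(u(x - y) for y in obs) / len(obs) for obs in cells.values()
    )


def wlrs_with_scores(cells, d, phi):
    """S* = sum over observations of d_ij * phi(R*_ijn / (N + 1))."""
    n_total = sum(len(obs) for obs in cells.values())
    return sum(
        d[key] * phi(weighted_rank(cells, x) / (n_total + 1))
        for key, obs in cells.items()
        for x in obs
    )


def linear_score(t):
    return t


def cosine_score(t):
    # Illustrative nonlinear score; its second derivative is bounded on (0, 1).
    return -math.cos(math.pi * t)


cells = {(0, 0): [1.0], (0, 1): [2.0, 3.0, 4.0]}
d = {(0, 0): 1.0, (0, 1): -1.0}

s_linear = wlrs_with_scores(cells, d, linear_score)
s_cosine = wlrs_with_scores(cells, d, cosine_score)
print(s_linear, s_cosine)
```

With the linear score, the statistic is the linear-score WLRS in (4) divided by N + 1. Because a weighted rank lies strictly between 0 and N + 1, the argument passed to the score function always falls inside (0, 1).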
Define the average distribution function $H(x) = \frac{1}{IJ} \sum_{ij} F_{ij}(x)$ and the average regression constant $\bar{d} = \frac{1}{IJ} \sum_{ij} \rho_{ij} d_{ij}$. For variance approximation, define the projection variable
$$Z^*_{ijn} = \frac{1}{IJ\, n_{ij}} \sum_{i'j'n'} \Big( d_{i'j'} - \frac{\rho_{ij}}{\rho_{i'j'}} d_{ij} \Big) \int (u(x - X_{ijn}) \ldots$$