Computational Statistics and Data Analysis 60 (2013) 157–168


Most powerful rank tests for perfect rankings

Jesse Frey ∗, Le Wang

Department of Mathematics and Statistics, Villanova University, Villanova, PA 19085, United States

Article info

Article history: Received 6 May 2012; Received in revised form 14 November 2012; Accepted 15 November 2012; Available online 22 November 2012.

Keywords: Imperfect rankings; Neyman–Pearson Lemma; Ranked-set sampling

Abstract

We consider the problem of testing for perfect rankings in ranked set sampling (RSS). By using a new algorithm for computing the probability that specified independent random variables have a particular ordering, we find most powerful rank tests of the null hypothesis of perfect rankings against fully specified alternatives. We compare the power of these most powerful rank tests to that of existing rank tests in the literature, and we find that the existing tests are surprisingly close to optimal over a wide range of alternatives to perfect rankings. This finding holds both for balanced RSS and for unbalanced RSS cases where the different ranks are not equally represented in the sample. We find that the best of the existing tests is the test that rejects when the null probability of the observed ranks is small, and we provide a new, more efficient R function for computing the test statistic. © 2012 Elsevier B.V. All rights reserved.

1. Introduction

Ranked-set sampling (RSS), proposed by McIntyre (1952, 2005), is a sampling scheme that typically outperforms simple random sampling in settings where small sets of units can be accurately ranked at a cost that is negligible compared to the cost of making actual measurements. RSS may be either balanced or unbalanced. To do balanced RSS, one first chooses a set size m. This set size is usually chosen to be in the range two to five so that sets of size m can be ranked easily. One then draws m independent samples (sets) of size m and ranks the units in each set from smallest to largest. The ranking can be done either by judgment or according to an easily available covariate, and it need not match the true ranking according to the variable of interest. One then selects for measurement the unit ranked smallest in the first set, the unit ranked second-smallest in the second set, and so on. This process yields a sample of m independent values, with one value from each of the m possible in-set ranks. To obtain a larger sample, one repeats this process for n > 1 independent cycles. The measured sample then consists of nm independent values, with n values from each of the m possible ranks. If the rankings are perfect so that the ranking matches the true ranking according to the variable of interest, then the nm values are independent order statistics. Otherwise, the values are independent judgment order statistics. To do unbalanced RSS, one removes the constraint that each in-set rank must appear the same number of times. Instead, one chooses a set size m and a vector n ≡ (n1, . . . , nm) such that ni gives the number of independent values with in-set rank i that are to be selected for measurement. The measured sample then consists of N = Σ_{i=1}^{m} ni independent values. Both in the balanced case and in the unbalanced case, we will write X[i]j for the jth sampled value with in-set rank i.
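The balanced scheme just described is easy to simulate. The sketch below is our own illustration (not from the paper): `draw_unit` is a hypothetical interface returning a pair (x, y), where x is the variable of interest and y is the ranking covariate, so that judgment ranking by y need not match the true x-ordering.

```python
def balanced_rss(draw_unit, m, n):
    """Draw a balanced ranked-set sample: n cycles, set size m.

    draw_unit() returns one unit as (x, y); each set is ranked by the
    covariate y (a hypothetical interface, not from the paper).
    Returns a list of (in-set rank i, measured value X_[i]j) pairs,
    with n measured values for each of the m in-set ranks.
    """
    sample = []
    for _ in range(n):                        # n independent cycles
        for i in range(1, m + 1):             # one fresh set per in-set rank
            units = sorted((draw_unit() for _ in range(m)), key=lambda u: u[1])
            sample.append((i, units[i - 1][0]))   # measure the ith-ranked unit
    return sample
```

Note that each cycle consumes m sets of m units but yields only m measurements; the ranking effort is assumed cheap relative to measurement.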
McIntyre’s original work on RSS was motivated by the problem of estimating mean yields in agriculture, and subsequent work has used RSS to solve a wide range of statistical problems in a wide variety of application areas. Statistical problems that have been treated using RSS include variance estimation (MacEachern et al., 2002), estimation of the distribution function (Stokes and Sager, 1988), parametric inference (Stokes, 1995), testing for a shift in means (Bohn and Wolfe, 1992), and interval estimation of quantiles (Ozturk and Deshpande, 2005). Application areas where RSS has been used include forestry



Corresponding author. E-mail addresses: [email protected] (J. Frey), [email protected] (L. Wang).

0167-9473/$ – see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2012.11.012


(Halls and Dell, 1966), medicine (Chen et al., 2005), environmental monitoring (Kvam, 2003), and entomology (Howard et al., 1982). Important theoretical contributions to RSS were made by Takahasi and Wakimoto (1968) and Dell and Clutter (1972), and recent summaries of the RSS literature appear in two survey articles by Wolfe (2004, 2010) and a monograph by Chen et al. (2004). Statistical procedures based on RSS may be parametric or nonparametric, and they may be either robust or non-robust to imperfect rankings. When the rankings are perfect, the measured values are independent order statistics from the distribution of interest. As a result, results from the theory of order statistics (see David and Nagaraja, 2003) can be used to generate RSS-based statistical procedures. These procedures that assume perfect rankings can be very efficient when the rankings are actually perfect, but they may perform poorly when the rankings are not perfect. Procedures that do not assume perfect rankings are usually somewhat less efficient in cases where the rankings are actually perfect, but they typically outperform procedures that assume perfect rankings when the rankings are not perfect. Some examples of both types of procedures were given by Frey et al. (2007, Section 1). When deciding whether to use a procedure that assumes perfect rankings or a procedure that is robust to violations of perfect rankings, it is helpful to first test whether the rankings are perfect or not. In particular, one might test the null hypothesis of perfect rankings against the general alternative that the rankings are not perfect. Nonparametric tests for these hypotheses were proposed by Frey et al. (2007), Li and Balakrishnan (2008), and Vock and Balakrishnan (2011). In each of the three papers, the focus was on tests that are appropriate for the case of balanced RSS, and power comparisons were carried out via simulation. In this paper, we again consider nonparametric tests for perfect rankings.
We show that by using a new algorithm for computing the probability that specified independent random variables have a particular ordering, it is possible to obtain most powerful simple-versus-simple rank tests for perfect rankings. We obtain these tests for a variety of small-sample settings and a variety of models for imperfect rankings, and we compare the power of these most powerful rank tests to that of the nonparametric tests currently available in the literature. We find that the existing tests perform surprisingly well, with the best performance being that of the test that rejects the null hypothesis of perfect rankings when the null probability of the observed set of ranks is small. We explain how to obtain most powerful rank tests in Section 2, and we describe the needed algorithm in Section 3. We review existing tests for perfect rankings and models for imperfect rankings in Section 4, and we do our power comparisons in Section 5. We conclude in Section 6 with a discussion, and in the Appendix, we provide a new, more efficient R function for computing the recommended test statistic.

2. Most powerful rank tests for perfect rankings

Suppose that we test H0: Perfect rankings against Ha: Imperfect rankings using data {X[i]j, i = 1, . . . , m, j = 1, . . . , ni}. If the population distribution has continuous distribution function F(·) and if H0 holds, then each value F(X[i]j) has a Beta(i, m + 1 − i) distribution by the probability-integral transform and standard results on order statistics. Suppose that we let R[i]j be the rank of X[i]j among the N = Σ_{i=1}^{m} ni values {X[i]j}. The value R[i]j is then also the rank of F(X[i]j) among the values {F(X[i]j)}. Thus, the distribution of the ranks {R[i]j} is distribution-free under H0. This means that any test based only on the ranks {R[i]j} is a distribution-free test. To evaluate the performance of a particular rank test, we need to know the joint distribution of the ranks {R[i]j}.
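Because the null distribution of the ranks is distribution-free, a Monte Carlo check should give the same rank-sequence frequencies under H0 for any continuous population. The sketch below is our own illustration (the `draw` interface is ours); for m = 4 and n = 1 it approximately reproduces the null probability 0.4028 of the most likely sequence 1234 reported in Table 1.

```python
import random
from collections import Counter

def sequence_freqs(draw, m, trials, seed=0):
    """Monte Carlo frequencies of the sequence k1...km (in-set ranks sorted
    by measured value) for balanced RSS with n = 1 under perfect rankings.

    draw(rng) returns one observation; this interface is our own sketch.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        # X_[i]1 is the ith order statistic of an independent set of size m
        xs = [sorted(draw(rng) for _ in range(m))[i - 1] for i in range(1, m + 1)]
        seq = tuple(sorted(range(1, m + 1), key=lambda i: xs[i - 1]))
        counts[seq] += 1
    return {s: c / trials for s, c in counts.items()}
```

Running this with a uniform and with an exponential population gives frequencies that agree up to Monte Carlo error, illustrating the distribution-free property.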
For a fixed choice of {R[i]j }, let ki be the in-set rank for the value with rank i among the N measured values. It then follows that the probability of seeing that fixed choice of {R[i]j } is P (W1 < · · · < WN ), where W1 , . . . , WN are independent random variables such that Wi is distributed like a ki th judgment order statistic. When H0 holds, we may take Wi ∼ Beta(i, m + 1 − i), and probabilities of the form P (W1 < · · · < WN ) can be computed using the algorithm given in Frey (2007). Given a fully-specified model for the imperfect rankings, the distributions of W1 , . . . , WN are no longer Beta distributions, but the alternative probability P (W1 < · · · < WN ) can still be computed using the new algorithm described in Section 3. Suppose that we have identified a fully-specified model for imperfect rankings, say Ha . It then follows from the Neyman–Pearson Lemma (Casella and Berger, 2002, p. 388) that one obtains the most powerful rank test of H0 against Ha by rejecting when the ratio r ≡ Pa (W1 < · · · < WN )/P0 (W1 < · · · < WN ) between the alternative and null probabilities is large. If the total sample size N is relatively small, then it is possible to obtain the most powerful test by listing out all possible choices for the ranks {R[i]j }, computing the probability of seeing each possible choice under both the null and the alternative hypotheses, and determining the appropriate critical region. In general, the most powerful level-α test will vary from one choice of Ha to another. As an illustration, we consider a model for imperfect rankings which is similar to the models considered by Dell and Clutter (1972). We take X to be the variable of interest, and we assume that the rankings are done according to a related covariate Y . We also assume that the joint distribution of X and Y is bivariate normal with X standard normal, Y standard normal, and the correlation set at ρ = 0.7. 
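Given the full lists of null and alternative sequence probabilities, the Neyman–Pearson construction of the randomized level-α test can be sketched as follows. This is our own Python illustration (`mp_rank_test` and its interface are not from the paper):

```python
def mp_rank_test(p0, pa, alpha=0.05):
    """Most powerful level-alpha randomized rank test (Neyman-Pearson).

    p0, pa: null and alternative probabilities of each rank sequence.
    Rejects when the ratio r = pa/p0 exceeds a cutoff c, and with
    probability gamma when r equals c.  Returns (c, gamma, phi, power),
    where phi[s] is the rejection probability for sequence s.
    """
    r = [a / q for a, q in zip(pa, p0)]
    for c in sorted(set(r), reverse=True):            # candidate cutoffs
        above = sum(q for q, ri in zip(p0, r) if ri > c)
        at = sum(q for q, ri in zip(p0, r) if ri == c)
        if above + at >= alpha:                       # level alpha reached here
            gamma = (alpha - above) / at              # randomize on the boundary
            phi = [1.0 if ri > c else (gamma if ri == c else 0.0) for ri in r]
            power = sum(a * f for a, f in zip(pa, phi))
            return c, gamma, phi, power
```

Applied to the P0 and Pa columns of Table 1 below, this construction reproduces (up to the table's rounding) the cutoff 2.11, the randomization probability ≈ 0.148, and the power 0.2245 discussed in the text.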
Table 1 provides the numbers needed to obtain the most powerful rank test of H0 against this particular alternative when m = 4 and n = 1. The sequences listed in Table 1 are the 4! = 24 different ways in which the independent values with in-set ranks 1, 2, 3, and 4 might be ordered. Each sequence is listed in the form k1 k2 k3 k4 so that the sequence 1234 indicates that X[1]1 < X[2]1 < X[3]1 < X[4]1 . The probabilities P0 and Pa are the probabilities of the various sequences under the null and the alternative hypotheses, and r is the ratio Pa /P0 between the two probabilities. Since the ratios r were computed


Table 1
Null probabilities, alternative probabilities, ratios, and rejection probabilities for all possible sequences k1 k2 k3 k4 for the example discussed in Section 2.

Sequence   P0       Pa       Ratio r   φ
1234       0.4028   0.1864   0.46      0.000
1243       0.1269   0.1069   0.84      0.000
1324       0.1628   0.1271   0.78      0.000
1342       0.0240   0.0506   2.11      0.148
1423       0.0255   0.0506   1.99      0.000
1432       0.0128   0.0354   2.77      1.000
2134       0.1269   0.1069   0.84      0.000
2143       0.0388   0.0608   1.57      0.000
2314       0.0255   0.0506   1.99      0.000
2341       0.0020   0.0144   7.33      1.000
2413       0.0041   0.0209   5.06      1.000
2431       0.0012   0.0106   8.75      1.000
3124       0.0240   0.0506   2.11      0.148
3142       0.0031   0.0196   6.24      1.000
3214       0.0128   0.0354   2.77      1.000
3241       0.0009   0.0098   10.71     1.000
3412       0.0006   0.0079   12.44     1.000
3421       0.0004   0.0057   14.90     1.000
4123       0.0020   0.0144   7.33      1.000
4132       0.0009   0.0098   10.71     1.000
4213       0.0012   0.0106   8.75      1.000
4231       0.0003   0.0052   15.59     1.000
4312       0.0004   0.0057   14.90     1.000
4321       0.0002   0.0041   17.88     1.000

before doing any rounding, the tabled ratios do not in general coincide with the ratios between the tabled probabilities, which have been rounded to four decimal places. By summing appropriate null probabilities from Table 1, we find that rejecting H0 whenever r > 2.11 yields a test with α level 0.042921 and that rejecting whenever r > 1.99 (the next highest available value) yields a test with α level 0.090899. As a result, it follows from the Neyman–Pearson Lemma that the most powerful level-0.05 test is a randomized test in which we reject with probability (0.05 − 0.042921)/(0.090899 − 0.042921) ≈ 0.148 when r = 2.11, with probability 1 when r > 2.11, and with probability 0 when r < 2.11. The column labeled φ gives the rejection probability for each of the possible sequences. Having found the most powerful test, we can also obtain the power for that test by summing the product of φ and the alternative probability Pa over all of the sequences. For the test featured in Table 1, this power is 0.2245.

3. An algorithm for computing alternative probabilities

This section develops a method for computing the alternative probabilities needed for obtaining the most powerful rank tests described in Section 2. Specifically, this section develops an algorithm for computing P(W1 < · · · < WN) when W1, . . . , WN are independent random variables with known distribution functions. We first develop an algorithm that is exact when the distribution functions satisfy a certain condition. We then explain how the exact algorithm may be used to compute the probability even when the distributions do not satisfy the condition. We also present some numerical results that illustrate the accuracy of the algorithm. Let W1, . . . , WN be independently distributed random variables with continuous distribution functions F1, . . . , FN, respectively.
We can compute the probability P(W1 < · · · < WN) exactly if there exists a sequence of real numbers a1 < · · · < aK such that the random variables W1, . . . , WN and the sequence a1 < · · · < aK satisfy the following condition.

Condition 1. Given a1 < · · · < aK, define disjoint intervals I0 = (−∞, a1], I1 = (a1, a2], . . . , IK−1 = (aK−1, aK], and IK = (aK, ∞). The random variables W1, . . . , WN and the sequence a1 < · · · < aK satisfy the equal conditionals condition if for each j ∈ {0, 1, . . . , K}, the conditional distributions [Wi | Wi ∈ Ij] are the same for all i for which P(Wi ∈ Ij) is positive.

Assume that the random variables W1, . . . , WN and the sequence of constants a1 < · · · < aK satisfy Condition 1. To compute P(W1 < · · · < WN) exactly, we first develop an alternate representation for P(W1 < · · · < WN). We can write that

P(W1 < · · · < WN) = ∫_{w1=−∞}^{∞} · · · ∫_{wN=−∞}^{∞} I(w1 < · · · < wN) dF1(w1) · · · dFN(wN).  (1)

Since the disjoint intervals I0, . . . , IK have the real line as their union, we can also decompose the full integral (1) into a sum

Σ_{i1=0}^{K} · · · Σ_{iN=0}^{K} ∫_{w1∈Ii1} · · · ∫_{wN∈IiN} I(w1 < · · · < wN) dF1(w1) · · · dFN(wN)  (2)


of simpler integrals. It is immediately apparent, moreover, that because of the indicator function, the summands on the right side of (2) can be nonzero only if i1 ≤ i2 ≤ · · · ≤ iN. If we define S = {(i1, . . . , iN) ∈ {0, . . . , K}^N | i1 ≤ · · · ≤ iN}, then we can write P(W1 < · · · < WN) as

Σ_{(i1,...,iN)∈S} ∫_{w1∈Ii1} · · · ∫_{wN∈IiN} I(w1 < · · · < wN) dF1(w1) · · · dFN(wN).  (3)

We now consider the computation of individual terms in the sum (3). Suppose that the vector (i1, . . . , iN) ∈ S satisfies the strict inequality i1 < i2 < · · · < iN. Then the indicator function in the expression

∫_{w1∈Ii1} · · · ∫_{wN∈IiN} I(w1 < · · · < wN) dF1(w1) · · · dFN(wN)  (4)

is always 1, meaning that we can write (4) as

∫_{w1∈Ii1} · · · ∫_{wN∈IiN} dF1(w1) · · · dFN(wN) = (∫_{w1∈Ii1} dF1(w1)) · · · (∫_{wN∈IiN} dFN(wN)) = Π_{r=1}^{N} [Fr(a_{ir+1}) − Fr(a_{ir})] = Π_{r=1}^{N} p(r, ir),

where p(i, j) is defined by p(i, j) ≡ P(Wi ∈ Ij) = Fi(aj+1) − Fi(aj). Now suppose that the vector (i1, . . . , iN) ∈ S includes repeated values. To be specific, suppose that (i1, . . . , iN) includes si ≥ 0 copies of each value i for i = 0, . . . , K. Then, noting that (4) is P(W1 ∈ Ii1, W2 ∈ Ii2, . . . , WN ∈ IiN, and W1 < · · · < WN), we can use conditioning to rewrite (4) as a product

P(W1 ∈ Ii1, W2 ∈ Ii2, . . . , WN ∈ IiN) P(W1 < · · · < WN | W1 ∈ Ii1, . . . , WN ∈ IiN).  (5)

By the independence of W1, . . . , WN, the first factor in (5) is

Π_{r=1}^{N} p(r, ir).

Then, using the assumption that W1, . . . , WN and a1 < · · · < aK satisfy Condition 1, we may write the second factor as (Π_{i=0}^{K} si!)^{−1}. This follows from noting that for any j, i, and l,

P(Wi < · · · < Wi+l−1 | Wi, . . . , Wi+l−1 ∈ Ij) = 1/l!.

Thus, (4) can be written as

(Π_{r=1}^{N} p(r, ir)) (Π_{i=0}^{K} si!)^{−1},  (6)

and we can write P(W1 < · · · < WN) as

P(W1 < · · · < WN) = Σ_{(i1,...,iN)∈S} (Π_{r=1}^{N} p(r, ir)) (Π_{i=0}^{K} si!)^{−1},  (7)

where si is the number of the indices i1 , . . . , iN that are equal to i. The formula (6) suggests a Markov property that makes it possible to sequentially compute the probability (4) corresponding to any vector (i1 , . . . , iN ) ∈ S. Starting with a value of 1, we think of moving from left to right through the indices i1 , . . . , iN . If the tth index is it , then we multiply our current value by p(t , it ), in this way accounting for the first factor in (6). If the tth index is appearing for the jth time, then we divide by j, in this way accounting for the second factor in (6). At the tth stage, the only information needed to carry out the computation is the index it and the number of times that the index it appeared among the previous indices i1 , . . . , it −1 . This sequential computation method leads to a method for computing the full probability (7). To compute (7) efficiently, we define an appropriate set of probabilities and compute them using a recursive scheme. For t = 1, . . . , N , i = 0, . . . , K , and j = 1, . . . , t, define st (i, j) to be the probability that W1 < W2 < · · · < Wt , that Wt −j+1 , . . . , Wt ∈ Ii , and that Wt −j ̸∈ Ii . Then st (i, j) is the probability that the first t random variables are in increasing


Table 2
Estimated probabilities obtained by using the algorithm from Section 3 with different numbers of intervals. The true probability is 1.53027 · 10^−14.

Intervals   Estimated probability   Relative error
10          1.73321 · 10^−14        1.33 · 10^−1
20          1.58209 · 10^−14        3.39 · 10^−2
50          1.53862 · 10^−14        5.46 · 10^−3
100         1.53236 · 10^−14        1.37 · 10^−3
200         1.53080 · 10^−14        3.41 · 10^−4
500         1.53036 · 10^−14        5.46 · 10^−5
1000        1.53029 · 10^−14        1.37 · 10^−5

order and that exactly the last j of the first t random variables fall in the interval Ii. The values {st(i, j)} for t = 1 are given by

s1(i, j) = p(1, i) if j = 1, and s1(i, j) = 0 otherwise,  (8)

and the values {st+1(i, j)} can be computed from the values {st(i, j)} using

st+1(i, j) = p(t + 1, i) Σ_{r<i} Σ_{l=1}^{t} st(r, l) for j = 1, and st+1(i, j) = p(t + 1, i) st(i, j − 1)/j for j > 1.  (9)

Once we obtain the values {sN(i, j)}, the probability P(W1 < · · · < WN) is computed as

P(W1 < · · · < WN) = Σ_{i=0}^{K} Σ_{j=1}^{N} sN(i, j).  (10)

The full algorithm is summarized below, and code for implementing the algorithm using the statistical software package R is available from the authors.

Algorithm 1. Compute P(W1 < · · · < WN) as follows.
Step 1. Compute {s1(i, j)} using Eq. (8).
Step 2. For t = 1, . . . , N − 1, compute {st+1(i, j)} from {st(i, j)} using Eq. (9).
Step 3. Obtain P(W1 < · · · < WN) using Eq. (10).

In the context described in Section 2, we typically cannot find values a1 < · · · < aK such that the values a1 < · · · < aK and the distributions F1, . . . , FN satisfy Condition 1. However, if we choose progressively finer grids of values a1 < · · · < aK, the probability that we obtain from Algorithm 1 becomes closer and closer to the true value. This occurs both because there are fewer cases where multiple random variables show up in the same interval and because as we restrict to finer and finer intervals, the conditional distributions on the intervals become more and more uniform, thus making Condition 1 approximately true. To illustrate the general method, we compute the probability of seeing the sequence k1 · · · k15 = 123451234512345 when m = 5, n = 3, and the rankings are perfect. Since the rankings are perfect, this same probability can be obtained using the exact algorithm from Frey (2007). Thus, we have a gold standard against which to compare our results here. If we take the population distribution to be standard uniform, then an ith order statistic has a Beta(i, m + 1 − i) distribution. Thus, the distributions F1, . . . , F15 are known, but they do not satisfy Condition 1 for any choice of a1 < · · · < aK. For a fixed number of intervals K + 1, we take the values a1 < · · · < aK to be given by ai ≡ i/(K + 1) for i = 1, . . . , K, and we explore how the accuracy of Algorithm 1 changes as the number of intervals increases. Results are given in Table 2, where the relative error values were computed by dividing the absolute error by the true probability.
We see from Table 2 that when there are only 10 intervals, the estimated probability differs from the true probability 1.53027 · 10−14 by more than 13%. As the number of intervals increases, however, the accuracy of the estimated probability improves rapidly, and by the time the number of intervals is 1000, the relative error is on the order of 10−5 . In our power calculations in Section 5, we use 1000 intervals, and the results that we obtain can be expected to be at least as accurate as those shown in Table 2. This follows for several reasons. First, in the power study, the total sample size N is never much larger than the sample size 15 that is used here. Second, perfect rankings is the case where the distributions F1 , . . . , FN depart most from Condition 1. When the rankings are imperfect, the distributions F1 , . . . , FN will tend to be more similar to each other, and the approximation involved in Algorithm 1 will be more accurate. Third, in the power study, the set size m is never larger than the set size 5 that we have used here. For a fixed total sample size, larger set sizes present a greater challenge for Algorithm 1 because the distributions F1 , . . . , FN tend to be more similar when m is small than when m is large.
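The authors make an R implementation available; the following Python translation of Algorithm 1 is our own sketch. One convenient exact check: for uniform Wi and an equally spaced grid, Condition 1 holds exactly, so the algorithm returns 1/N! up to floating-point error.

```python
def order_prob(cdfs, grid):
    """Algorithm 1: P(W1 < ... < WN) for independent Wi with the given CDFs,
    using the intervals cut at grid = [a_1, ..., a_K].  Exact under
    Condition 1; otherwise a finer grid improves the approximation."""
    N, K = len(cdfs), len(grid)
    # p[t][i] = P(W_{t+1} in I_i), with I_0 = (-inf, a_1] and I_K = (a_K, inf)
    p = []
    for F in cdfs:
        vals = [0.0] + [F(a) for a in grid] + [1.0]
        p.append([vals[i + 1] - vals[i] for i in range(K + 1)])
    # s[i][j-1] holds s_t(i, j): first t values increasing, last j of them in I_i
    s = [[p[0][i]] for i in range(K + 1)]                        # Eq. (8)
    for t in range(1, N):
        new = [[0.0] * (t + 1) for _ in range(K + 1)]
        lower = 0.0             # running sum of s_t(r, l) over r < i and all l
        for i in range(K + 1):
            new[i][0] = p[t][i] * lower                          # Eq. (9), j = 1
            for j in range(2, t + 2):
                new[i][j - 1] = p[t][i] * s[i][j - 2] / j        # Eq. (9), j > 1
            lower += sum(s[i])
        s = new
    return sum(map(sum, s))                                      # Eq. (10)
```

At stage t the only state carried forward is, for each interval, how long the current run of values in that interval is, which is exactly the Markov property noted after Eq. (6).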


4. Models for imperfect rankings and tests for perfect rankings

To compare the performance of the existing tests for perfect rankings to that of the most powerful rank tests in different scenarios, we computed powers for these tests for a variety of set sizes, sample size vectors (n1, . . . , nm), and alternatives to perfect rankings. For each fully specified alternative, we computed the power by considering all possible values for the ranks {R[i]j}, computing the probability of each possibility under the alternative, and summing the alternative probabilities over the entire rejection region. We used level 0.05 in each case, and we used randomized tests so that each test has true level exactly 0.05. We considered four different models for imperfect rankings, and at each alternative, we computed the power for four different tests. The four models that we considered are the same four that were considered by Vock and Balakrishnan (2011). The first of these models was the bivariate normal model, due to Dell and Clutter (1972), in which the rankings are done according to a random variable Y that has correlation ρ with the random variable of interest X. The second model was the fraction-of-random-rankings model, used by Frey et al. (2007), in which each set is either ranked perfectly, or, with probability λ ∈ [0, 1], is ranked randomly. Under this model, the distribution function for an ith judgment order statistic is given by F[i](t) = (1 − λ)F(i)(t) + λF(t), where F(i)(t) is the distribution function for a true ith order statistic and F is the distribution function for the population distribution. The third model was the fraction-of-backwards-rankings model, used by Frey et al. (2007), in which each set is either ranked perfectly, or, with probability λ ∈ [0, 1], is ranked backwards. Under this model, the distribution function for an ith judgment order statistic is given by F[i](t) = (1 − λ)F(i)(t) + λF(m+1−i)(t).
The final model was the fraction-of-neighbors model, due to Vock and Balakrishnan (2011), in which, for λ ∈ [0, 1], the distribution function for an ith judgment order statistic is given by

F[i](t) = (1 − λ)F(i)(t) + (λ/2)(F(i+1)(t) + F(i−1)(t)),

where F(0)(t) is taken to be F(1)(t) and F(m+1)(t) is taken to be F(m)(t). For each of the last three models, the distribution functions F[i](t) may be computed as linear combinations of Beta distribution functions since we may take the population distribution to be uniform on the interval [0, 1]. Thus, applying Algorithm 1 from Section 3 is straightforward. For the first model, however, a bit of additional calculation is needed to obtain the distribution function for an ith judgment order statistic. Suppose that ρ > 0, and assume that both X and Y are standard normal. We may then write an ith judgment order statistic X[i] as

X[i] = ρ Y(i) + √(1 − ρ²) Zi,

where Y(i) is a true ith order statistic from the standard normal distribution and Zi is an independent standard normal random variable. Thus,

F[i](t) = P(X[i] ≤ t) = P(ρ Y(i) + √(1 − ρ²) Zi ≤ t) = E[ P(ρ Y(i) + √(1 − ρ²) Zi ≤ t | Zi) ] = E[ P(Y(i) ≤ (t − √(1 − ρ²) Zi)/ρ | Zi) ].

Since Y(i) is an ith order statistic from the standard normal distribution, the distribution function for Y(i) is B(Φ(t); i, m + 1 − i), where Φ is the standard normal distribution function and B(·; α, β) is the distribution function for the Beta distribution with parameters α and β. Thus,

F[i](t) = ∫_{z=−∞}^{∞} B(Φ((t − √(1 − ρ²) z)/ρ); i, m + 1 − i) (1/√(2π)) e^{−z²/2} dz,

and this integral can be easily computed using numerical integration. For each fully-specified imperfect rankings model, we computed the power for four tests. Specifically, we computed the power for the most powerful rank test, which changes from one alternative to another, for the rank test that rejects when the null probability NP of the observed set of ranks {R[i]j} is too small, for the W⋆ test developed by Frey et al. (2007), and for the J test developed by Vock and Balakrishnan (2011). The test statistic NP can be computed using the algorithm given in Frey (2007), and the last two test statistics are computed as

W⋆ = Σ_{i=1}^{m} Σ_{j=1}^{ni} i R[i]j
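The integral for F[i](t) under the bivariate normal model, displayed earlier in this section, is straightforward to evaluate numerically. The sketch below is our own illustration (the paper does not specify an integration method); it uses the trapezoid rule, and it computes the Beta distribution function via the binomial-sum identity, which is valid for the integer parameters that arise here.

```python
from math import comb, erf, exp, pi, sqrt

def beta_cdf(x, a, b):
    # Regularized incomplete Beta for integer a, b via the binomial identity:
    # B(x; a, b) = P(Binomial(a + b - 1, x) >= a)
    n = a + b - 1
    return sum(comb(n, k) * x**k * (1 - x)**(n - k) for k in range(a, n + 1))

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def judgment_cdf(t, i, m, rho, steps=2000, lim=8.0):
    """F_[i](t) under the bivariate normal ranking model, by the trapezoid
    rule on z over [-lim, lim]; the truncated Gaussian tails are negligible."""
    h = 2.0 * lim / steps
    total = 0.0
    for k in range(steps + 1):
        z = -lim + k * h
        u = norm_cdf((t - sqrt(1.0 - rho**2) * z) / rho)
        f = beta_cdf(u, i, m + 1 - i) * exp(-z**2 / 2.0) / sqrt(2.0 * pi)
        total += f if 0 < k < steps else 0.5 * f
    return h * total
```

A useful sanity check: averaging F[1](t), . . . , F[m](t) must recover Φ(t), since each unit is equally likely to receive each in-set rank.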


Table 3
Powers of level-0.05 tests for different set sizes m, numbers of cycles n, and correlations ρ when the imperfect rankings are done according to the bivariate normal model.

m   n   ρ     MP       NP       W⋆       J
2   8   0.9   0.1019   0.1018   0.1000   0.1000
        0.8   0.1722   0.1720   0.1676   0.1676
        0.7   0.2566   0.2563   0.2490   0.2490
        0.6   0.3496   0.3493   0.3394   0.3394
        0.5   0.4458   0.4456   0.4337   0.4337
3   4   0.9   0.1317   0.1316   0.1294   0.1289
        0.8   0.2401   0.2400   0.2346   0.2336
        0.7   0.3596   0.3594   0.3509   0.3496
        0.6   0.4777   0.4776   0.4669   0.4655
        0.5   0.5867   0.5866   0.5748   0.5734
4   2   0.9   0.1421   0.1420   0.1403   0.1372
        0.8   0.2555   0.2553   0.2511   0.2448
        0.7   0.3720   0.3718   0.3653   0.3568
        0.6   0.4819   0.4818   0.4737   0.4640
        0.5   0.5808   0.5806   0.5718   0.5617
5   1   0.9   0.1366   0.1363   0.1358   0.1283
        0.8   0.2335   0.2332   0.2316   0.2174
        0.7   0.3287   0.3287   0.3260   0.3071
        0.6   0.4182   0.4182   0.4147   0.3929
        0.5   0.5003   0.5003   0.4962   0.4731

and ni  nh m   

m−1

J =

I (R[i]j > R[h]l ).

i=1 h=i+1 j=1 l=1
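Both statistics are direct to compute from the observed ranks. A small sketch of our own, taking a dict that maps each in-set rank i to the list of overall ranks R[i]j:

```python
def w_star(ranks):
    """W* statistic; ranks maps in-set rank i -> list of overall ranks R_[i]j."""
    return sum(i * r for i, rs in ranks.items() for r in rs)

def j_stat(ranks):
    """J statistic: number of pairs where a lower in-set rank has the larger
    overall rank (empty lists handle unbalanced cases with n_i = 0)."""
    m = max(ranks)
    return sum(1
               for i in range(1, m)
               for h in range(i + 1, m + 1)
               for ri in ranks.get(i, [])
               for rh in ranks.get(h, [])
               if ri > rh)
```

For m = 4 and n = 1, perfectly concordant ranks give W⋆ = 30 and J = 0, while fully backwards ranks give W⋆ = 20 and J = 6, consistent with rejecting for small W⋆ and large J.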

The test based on W⋆ rejects H0 when W⋆ is too small, and the test based on J rejects H0 when J is too large. A variety of tests other than the tests based on NP, W⋆, and J were considered by Frey et al. (2007), Li and Balakrishnan (2008), and Vock and Balakrishnan (2011), but those tests were found to be less powerful. Thus, in our Section 5 power comparisons, we restricted our attention to the four tests just mentioned. In the m = 2 case, the test based on W⋆ is equivalent to the test based on J. Thus, the powers for those tests differ only for m > 2.

5. Power comparisons

In our power comparisons, we considered both balanced RSS and unbalanced RSS. For balanced RSS, we considered the

(m, n) pairings (2, 8), (3, 4), (4, 2), and (5, 1), and for unbalanced RSS, we considered the m and n ≡ (n1 , . . . , nm ) pairings 2 and (3, 6), 3 and (1, 2, 3), 4 and (1, 2, 3, 1), and 5 and (2, 0, 2, 2, 1). The balanced RSS (m, n) pairings were chosen to make nm as large as possible subject to the constraint that the total number of possible sequences k1 . . . kN was less than 40,000, and the unbalanced RSS vectors n were chosen to be highly unbalanced. For the bivariate normal model, we considered correlations ρ from 0.5 to 0.9, and for the fraction-of-random-rankings and fraction-of-backwards-rankings models, we considered fractions λ from 0.1 to 0.5. For the fraction-of-neighbors model, we considered fractions λ from 0.2 to 1.0. Power results for the balanced RSS cases are given in Tables 3–6, and power results for the unbalanced RSS cases are given in Tables 7–10. In each table, the column labeled MP shows the power for the most powerful rank test at the particular alternative, and the columns labeled NP, W ⋆ , and J show the powers for the three existing tests described in Section 4. Table 3 gives power results for balanced RSS and the bivariate normal model. We see from the table that under this model for imperfect rankings, the NP test tends to have nearly the same power as the most powerful test at each particular alternative. The test based on W ⋆ is slightly less powerful than the test based on NP, and the test based on J has the least power. However, even the test based on J never falls more than a few percentage points short of the maximum possible power in the scenarios considered. Table 4 gives power results for balanced RSS and the fraction-of-random-rankings model. We see from the table that the NP test is usually quite close in power to the maximum possible power at each particular alternative. 
The test based on W⋆ and the test based on J tend to be less powerful, but also always come within a few percentage points of the maximum possible power in all of the scenarios considered. Similar results hold in Table 5, which gives power results for the fraction-of-backwards-rankings model. Table 6 gives power results for balanced RSS and the fraction-of-neighbors model. Here the NP test tends to outperform the W⋆ and J tests, and the test based on J tends to outperform the test based on W⋆. However, all three of the tests are always within a few percentage points of the maximum possible power. The biggest departures from maximal power occur

Table 4
Powers of level-0.05 tests for different set sizes m, numbers of cycles n, and fractions λ when the imperfect rankings are done according to the fraction-of-random-rankings model.

m   n   λ     MP       NP       W⋆       J
2   8   0.1   0.1068   0.1018   0.0961   0.0961
        0.2   0.1775   0.1706   0.1590   0.1590
        0.3   0.2598   0.2530   0.2363   0.2363
        0.4   0.3499   0.3444   0.3243   0.3243
        0.5   0.4436   0.4398   0.4183   0.4183
3   4   0.1   0.1569   0.1386   0.1326   0.1269
        0.2   0.2635   0.2418   0.2321   0.2224
        0.3   0.3712   0.3518   0.3402   0.3284
        0.4   0.4766   0.4620   0.4497   0.4378
        0.5   0.5764   0.5669   0.5549   0.5441
4   2   0.1   0.1570   0.1486   0.1496   0.1368
        0.2   0.2634   0.2518   0.2528   0.2327
        0.3   0.3670   0.3554   0.3557   0.3324
        0.4   0.4657   0.4561   0.4554   0.4317
        0.5   0.5580   0.5512   0.5493   0.5272
5   1   0.1   0.1419   0.1417   0.1418   0.1284
        0.2   0.2309   0.2309   0.2307   0.2083
        0.3   0.3169   0.3169   0.3164   0.2885
        0.4   0.3995   0.3994   0.3985   0.3679
        0.5   0.4781   0.4781   0.4765   0.4454

Table 5
Powers of level-0.05 tests for different set sizes m, numbers of cycles n, and fractions λ when the imperfect rankings are done according to the fraction-of-backwards-rankings model.

m   n   λ     MP       NP       W⋆       J
2   8   0.1   0.1775   0.1706   0.1590   0.1590
        0.2   0.3499   0.3444   0.3243   0.3243
        0.3   0.5369   0.5345   0.5134   0.5134
        0.4   0.7066   0.7062   0.6897   0.6897
        0.5   0.8377   0.8377   0.8275   0.8275
3   4   0.1   0.3025   0.2482   0.2399   0.2241
        0.2   0.5161   0.4574   0.4516   0.4285
        0.3   0.6913   0.6429   0.6419   0.6195
        0.4   0.8215   0.7879   0.7902   0.7731
        0.5   0.9084   0.8886   0.8920   0.8813
4   2   0.1   0.2858   0.2613   0.2675   0.2340
        0.2   0.4857   0.4528   0.4619   0.4179
        0.3   0.6488   0.6165   0.6262   0.5849
        0.4   0.7755   0.7487   0.7573   0.7251
        0.5   0.8682   0.8489   0.8555   0.8338
5   1   0.1   0.2354   0.2306   0.2322   0.2078
        0.2   0.4003   0.3921   0.3943   0.3574
        0.3   0.5445   0.5343   0.5366   0.4957
        0.4   0.6679   0.6573   0.6592   0.6204
        0.5   0.7707   0.7609   0.7624   0.7293

in the (m, n) = (3, 4) case with λ = 1.0, and these departures are on the order of eight percentage points for the least powerful of the three tests, which is the W⋆ test in this case.

Neither the test based on W⋆ nor the test based on J was explicitly designed for use with unbalanced RSS. However, taking the two test statistics to be defined exactly as given in Section 4, we studied their performance in settings where the sample sizes n1, . . . , nm were highly unbalanced. The results of our power comparisons appear in Tables 7–10.

We see from Table 7 that under the bivariate normal model, the power of the NP test is consistently very close to the maximum possible. The powers for the tests based on W⋆ and J are also high, with the test based on W⋆ tending to be the more powerful of the two. We see a similar pattern in Table 8, which shows results for the fraction-of-random-rankings model. The NP test tends to be very close in power to the maximum possible, while the test based on W⋆ is slightly less powerful, and the test based on J is the least powerful of the three.

In Table 9, which gives results for the fraction-of-backwards-rankings model, we see that the NP test is consistently within a few percentage points of the maximum possible power. We also see that the test based on W⋆ is comparable in power to the NP test. It slightly outperforms the NP test in the cases shown in the bottom half of Table 9, but is outperformed by the NP test in the cases shown in the top half of Table 9. The J test is less powerful than the NP and W⋆ tests in all of the


Table 6
Powers of level-0.05 tests for different set sizes m, numbers of cycles n, and fractions λ when the imperfect rankings are done according to the fraction-of-neighbors model.

m   n   λ     MP       NP       W⋆       J
2   8   0.2   0.1775   0.1706   0.1590   0.1590
        0.4   0.3499   0.3444   0.3243   0.3243
        0.6   0.5369   0.5345   0.5134   0.5134
        0.8   0.7066   0.7062   0.6897   0.6897
        1.0   0.8377   0.8377   0.8275   0.8275
3   4   0.2   0.1285   0.1248   0.1189   0.1201
        0.4   0.2371   0.2266   0.2122   0.2173
        0.6   0.3679   0.3475   0.3230   0.3345
        0.8   0.5085   0.4771   0.4432   0.4622
        1.0   0.6453   0.6048   0.5640   0.5899
4   2   0.2   0.0976   0.0956   0.0932   0.0928
        0.4   0.1581   0.1514   0.1447   0.1456
        0.6   0.2307   0.2159   0.2034   0.2070
        0.8   0.3136   0.2875   0.2678   0.2757
        1.0   0.4041   0.3643   0.3367   0.3500
5   1   0.2   0.0785   0.0772   0.0766   0.0757
        0.4   0.1109   0.1066   0.1051   0.1040
        0.6   0.1473   0.1381   0.1354   0.1346
        0.8   0.1878   0.1713   0.1673   0.1673
        1.0   0.2218   0.2061   0.2007   0.2016

Table 7
Powers of level-0.05 tests for different sample size vectors n and correlations ρ when the imperfect rankings are done according to the bivariate normal model.

n                ρ     MP       NP       W⋆       J
(3, 6)           0.9   0.0835   0.0835   0.0826   0.0826
                 0.8   0.1245   0.1245   0.1223   0.1223
                 0.7   0.1717   0.1717   0.1681   0.1681
                 0.6   0.2240   0.2240   0.2190   0.2190
                 0.5   0.2803   0.2803   0.2738   0.2738
(1, 2, 6)        0.9   0.1062   0.1061   0.1047   0.0997
                 0.8   0.1743   0.1743   0.1710   0.1602
                 0.7   0.2490   0.2490   0.2439   0.2270
                 0.6   0.3262   0.3262   0.3195   0.2970
                 0.5   0.4032   0.4032   0.3951   0.3680
(1, 2, 3, 1)     0.9   0.1177   0.1177   0.1163   0.1140
                 0.8   0.1953   0.1953   0.1921   0.1869
                 0.7   0.2752   0.2752   0.2702   0.2619
                 0.6   0.3535   0.3535   0.3470   0.3359
                 0.5   0.4282   0.4282   0.4207   0.4071
(2, 0, 2, 2, 1)  0.9   0.1598   0.1594   0.1561   0.1435
                 0.8   0.2887   0.2885   0.2813   0.2558
                 0.7   0.4138   0.4138   0.4038   0.3692
                 0.6   0.5260   0.5260   0.5145   0.4750
                 0.5   0.6227   0.6227   0.6108   0.5699

m > 2 scenarios shown in Table 9. Even so, the test based on J is within ten percentage points of the maximum possible power for all of the scenarios considered. The results in Table 10, which are for the fraction-of-neighbors model, show a similar pattern, except that the NP test is consistently more powerful than the W⋆ test. None of the three tests falls more than a few percentage points short of the maximum possible power in any of the scenarios considered.

6. Discussion

By using a new algorithm for computing the probability that specified independent random variables have a particular ordering, we have obtained most powerful simple-versus-simple rank tests of the null hypothesis of perfect rankings against fully specified alternatives. We have also compared the power of these most powerful tests to that of existing rank tests in the literature, finding that the existing tests are surprisingly close to optimal across a wide range of alternatives to perfect rankings. Among the three existing tests that we studied in the power comparison, the test with the best overall performance was the NP test that rejects the null hypothesis when the null probability of the observed set of ranks {R[i]j} is small.

Table 8
Powers of level-0.05 tests for different sample size vectors n and fractions λ when the imperfect rankings are done according to the fraction-of-random-rankings model.

n                λ     MP       NP       W⋆       J
(3, 6)           0.1   0.0840   0.0827   0.0802   0.0802
                 0.2   0.1239   0.1224   0.1174   0.1174
                 0.3   0.1691   0.1683   0.1610   0.1610
                 0.4   0.2198   0.2196   0.2104   0.2104
                 0.5   0.2754   0.2754   0.2648   0.2648
(1, 2, 6)        0.1   0.1108   0.1048   0.1016   0.0944
                 0.2   0.1787   0.1687   0.1623   0.1487
                 0.3   0.2509   0.2391   0.2301   0.2108
                 0.4   0.3250   0.3134   0.3026   0.2787
                 0.5   0.3990   0.3895   0.3777   0.3500
(1, 2, 3, 1)     0.1   0.1193   0.1139   0.1144   0.1080
                 0.2   0.1895   0.1821   0.1825   0.1716
                 0.3   0.2602   0.2530   0.2529   0.2389
                 0.4   0.3311   0.3253   0.3245   0.3085
                 0.5   0.4022   0.3979   0.3961   0.3789
(2, 0, 2, 2, 1)  0.1   0.1690   0.1647   0.1641   0.1447
                 0.2   0.2827   0.2773   0.2760   0.2438
                 0.3   0.3903   0.3856   0.3834   0.3434
                 0.4   0.4916   0.4874   0.4843   0.4406
                 0.5   0.5849   0.5815   0.5772   0.5331

Table 9
Powers of level-0.05 tests for different sample size vectors n and fractions λ when the imperfect rankings are done according to the fraction-of-backwards-rankings model.

n                λ     MP       NP       W⋆       J
(3, 6)           0.1   0.1239   0.1224   0.1174   0.1174
                 0.2   0.2198   0.2196   0.2104   0.2104
                 0.3   0.3346   0.3346   0.3230   0.3230
                 0.4   0.4586   0.4586   0.4468   0.4468
                 0.5   0.5832   0.5832   0.5726   0.5726
(1, 2, 6)        0.1   0.1954   0.1708   0.1635   0.1450
                 0.2   0.3553   0.3131   0.3012   0.2702
                 0.3   0.5083   0.4609   0.4473   0.4119
                 0.4   0.6417   0.6007   0.5883   0.5554
                 0.5   0.7518   0.7230   0.7136   0.6881
(1, 2, 3, 1)     0.1   0.1981   0.1815   0.1873   0.1693
                 0.2   0.3399   0.3135   0.3243   0.2976
                 0.3   0.4730   0.4420   0.4565   0.4277
                 0.4   0.5949   0.5636   0.5801   0.5535
                 0.5   0.7036   0.6751   0.6917   0.6698
(2, 0, 2, 2, 1)  0.1   0.2931   0.2764   0.2806   0.2449
                 0.2   0.4913   0.4673   0.4749   0.4258
                 0.3   0.6497   0.6241   0.6337   0.5850
                 0.4   0.7716   0.7488   0.7586   0.7180
                 0.5   0.8618   0.8439   0.8525   0.8229

This NP test was proposed by Frey et al. (2007), who pointed out that it is the most powerful rank test of perfect rankings against perfectly random rankings. However, those authors recommended use of the W⋆ test because of its greater ease of use. Given what we have found in our power study here, we recommend use of the NP test rather than the W⋆ test, and to facilitate use of this test, we have created a more efficient R function for computing the test statistic. This function, which improves on the earlier function given in Frey (2007) by replacing "for" loops with vector-based calculations, is available in the Appendix. Limited tables of critical values for the NP test in the case of balanced RSS appeared in Frey et al. (2007).

Since the approach that we have used here involves enumerating sequences, we have used only small total sample sizes, and we have not used set sizes that exceed five. This is a limitation. However, these small-sample-size, small-set-size cases are very important in practice. First, it is with small sample sizes that it is most difficult to distinguish perfect and imperfect rankings. When the total sample size is large, even suboptimal tests for perfect rankings are likely to have good power to detect significant departures from perfect rankings. Second, it is small set sizes in the range two to five that are most practical in applications of RSS. McIntyre (1952, 2005), for example, notes that because of logistical difficulties that increase


Table 10
Powers of level-0.05 tests for different sample size vectors n and fractions λ when the imperfect rankings are done according to the fraction-of-neighbors model.

n                λ     MP       NP       W⋆       J
(3, 6)           0.2   0.1239   0.1224   0.1174   0.1174
                 0.4   0.2198   0.2196   0.2104   0.2104
                 0.6   0.3346   0.3346   0.3230   0.3230
                 0.8   0.4586   0.4586   0.4468   0.4468
                 1.0   0.5832   0.5832   0.5726   0.5726
(1, 2, 6)        0.2   0.1046   0.1010   0.0994   0.0975
                 0.4   0.1712   0.1640   0.1610   0.1566
                 0.6   0.2489   0.2359   0.2317   0.2239
                 0.8   0.3343   0.3138   0.3085   0.2958
                 1.0   0.4252   0.3945   0.3882   0.3691
(1, 2, 3, 1)     0.2   0.0832   0.0818   0.0800   0.0785
                 0.4   0.1226   0.1181   0.1134   0.1101
                 0.6   0.1682   0.1581   0.1497   0.1441
                 0.8   0.2198   0.2016   0.1884   0.1798
                 1.0   0.2771   0.2478   0.2291   0.2166
(2, 0, 2, 2, 1)  0.2   0.0875   0.0861   0.0839   0.0799
                 0.4   0.1322   0.1284   0.1228   0.1145
                 0.6   0.1899   0.1762   0.1664   0.1534
                 0.8   0.2416   0.2288   0.2140   0.1961
                 1.0   0.3043   0.2854   0.2653   0.2423

with the set size, "it is unlikely in practice that there will be any appreciable improvement in efficiency ... through increasing the number of ranks beyond five". Thus, our findings are very relevant for researchers using RSS in practice.

Acknowledgments

The authors thank the two anonymous reviewers for helpful comments and suggestions.

Appendix. A new R function

########################################################
# This function computes the null probability of seeing a
# particular set of ranks. The vector k contains the
# sequence of in-set ranks, and m is the set size.
# The output lprob is the log of the null probability.
########################################################
np
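The core quantity behind this appendix function is the probability that specified independent order statistics fall in a particular order. The following sketch is a hedged Python illustration of that computation, not the paper's R implementation, and the function names are my own. It assumes a Uniform(0,1) parent, which costs nothing because the rank sequence under perfect rankings is distribution-free, so the order statistic with in-set rank k from a set of size m has a Beta(k, m - k + 1) density, a polynomial in x. Since all the densities are polynomials, the ordering probability can be computed exactly by the recursion G_j(t) = ∫_0^t G_{j-1}(s) f_j(s) ds, where G_j(t) = P(Y_1 < ... < Y_j and Y_j ≤ t) and the answer is G_N(1). Note that the paper's exact NP statistic may include an additional multiplicity adjustment when several measured values share the same in-set rank; this sketch computes only the ordering probability itself.

```python
import math

import numpy as np
from numpy.polynomial import polynomial as P


def order_stat_density(k, m):
    """Coefficients (ascending powers of x) of the density of the k-th
    order statistic of m iid Uniform(0,1) variables:
    f(x) = m! / ((k-1)! (m-k)!) * x^(k-1) * (1-x)^(m-k)."""
    c = math.factorial(m) / (math.factorial(k - 1) * math.factorial(m - k))
    # x^(k-1) has coefficient vector [0, ..., 0, 1]; expand (1-x)^(m-k) by polypow
    poly = P.polymul([0.0] * (k - 1) + [1.0], P.polypow([1.0, -1.0], m - k))
    return c * poly


def ordering_prob(ks, m):
    """P(Y_1 < Y_2 < ... < Y_N) for independent Y_j, where Y_j is the
    order statistic of rank ks[j] from a Uniform(0,1) sample of size m."""
    g = np.array([1.0])  # G_0(t) = 1
    for k in ks:
        # G_j(t) = integral from 0 to t of G_{j-1}(s) f_j(s) ds;
        # polyint uses integration constant 0, so G_j(0) = 0 as required
        g = P.polyint(P.polymul(g, order_stat_density(k, m)))
    return P.polyval(1.0, g)  # G_N(1)
```

For example, with set size m = 2, ordering_prob([1, 2], 2) returns 5/6, the probability that a sample minimum from one set of two falls below an independent sample maximum from another set of two; under perfect rankings this is exactly the chance of observing the rank sequence 1, 2 after sorting those two measured values.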