Sequential statistical methods

Donald A. Berry
Department of Biostatistics
The University of Texas
M. D. Anderson Cancer Center
November 2000

Introduction

Statistics plays two fundamental roles in empirical research. One is in determining the data collection process: the experimental design. The other is in analyzing the data once it has been collected. For the purposes of this article I distinguish two types of experimental designs: sequential and nonsequential. In a sequential design the data that accrue in an experiment can affect the future course of the experiment. For example, an observation made on one experimental unit treated in a particular way may determine the treatment used for the next experimental unit. The term adaptive is commonly used as an alternative to sequential. In a nonsequential design the investigator can carry out the entire experiment without knowing any of the interim results.

The distinction between sequential and nonsequential is murky. An investigator’s ability to carry out an experiment exactly as planned is uncertain, as information that becomes available from within and outside the experiment may lead the investigator to amend the design. Also, a nonsequential experiment may give results that encourage the investigator to run a second experiment, one that might even be simply a continuation of the first. Considered separately, both experiments are nonsequential, but the larger experiment that consists of the two separate experiments is sequential.

In a typical nonsequential design, 20 patients suffering from depression are administered a drug and their improvements are assessed. A sequential variation is the following. Patients’ improvements are recorded “in sequence” during the experiment. The experiment stops if at least 9, or no more than 1, of the first 10 patients improve. If between 2 and 8 of the first 10 patients improve, then sampling continues to a second set of 10 patients, making the total sample size 20 in that case. Another sequential variation is to increase the dose of the drug for the second 10 patients if fewer than 4 of the first 10 improve.
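
As a concrete illustration, the following minimal Python sketch simulates the two-stage rule just described. The function name, the use of simulation, and the assumed improvement probability of 0.5 are illustrative choices, not part of the original example.

    import random

    def run_two_stage_trial(p_improve, seed=None):
        """Simulate one trial under the two-stage rule; returns (improvements, sample size)."""
        rng = random.Random(seed)
        first_stage = sum(rng.random() < p_improve for _ in range(10))
        if first_stage >= 9 or first_stage <= 1:
            return first_stage, 10          # stop early: the result is already clear-cut
        second_stage = sum(rng.random() < p_improve for _ in range(10))
        return first_stage + second_stage, 20

    # Average sample size when the true improvement rate is 0.5 (early stopping is then rare)
    sizes = [run_two_stage_trial(0.5, seed=i)[1] for i in range(10000)]
    print(sum(sizes) / len(sizes))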

Much more complicated sequential designs are possible. For example, the first patient may be assigned a dose in the middle of a range of possible doses. If the patient improves then the next patient is assigned the next lower dose and if the first patient does not improve then the next patient is assigned the next higher dose. This process continues, always dropping the dosage if the immediately preceding patient improved and increasing the dosage if the immediately preceding patient did not improve. This is called an “up-and-down” design.
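
The up-and-down rule is easy to state as an algorithm. The following Python sketch is one illustrative implementation; the dose grid, the number of patients, and the assumed dose-response relationship are hypothetical.

    import random

    def up_and_down(doses, n_patients, prob_improve, seed=0):
        """Assign doses sequentially: one level lower after an improvement,
        one level higher after no improvement."""
        rng = random.Random(seed)
        level = len(doses) // 2                      # start in the middle of the dose range
        assigned = []
        for _ in range(n_patients):
            dose = doses[level]
            assigned.append(dose)
            if rng.random() < prob_improve(dose):    # improvement observed
                level = max(level - 1, 0)            # drop to the next lower dose
            else:
                level = min(level + 1, len(doses) - 1)   # raise to the next higher dose
        return assigned

    # Example: five doses, improvement assumed more likely at higher doses
    print(up_and_down([5, 10, 20, 40, 80], 15, lambda dose: dose / 100))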

Procedures in which batches of experimental units (such as groups of 10 patients each) are analyzed before proceeding to the next stage of the experiment are called group-sequential. Designs such as the up-and-down design, in which the course of the experiment can change after each experimental unit responds, are called fully sequential. So a fully sequential design is a group-sequential design in which the group size is 1. Designs in which the decision of when to stop the experiment depends on the accumulating results are said to use sequential stopping. Using rules that depend on the accumulating results to determine which treatment to assign to the next experimental unit or batch of units is called sequential allocation.

Designs of most scientific experiments are sequential, although perhaps not formally so. Investigators usually want to conserve time and resources. In particular, they do not want to continue an experiment if they have already learned what they set out to learn, whether the conclusion is positive or negative, or if finding a conclusive answer would be prohibitively expensive. (An example of the latter is an experiment in which the investigator discovers that the standard deviation of the observations is much larger than originally thought, so that the sample size required for a conclusive answer would be impracticably large.)

Sequential designs are difficult or impossible to use in some investigations. For example, results might take a long time to obtain, and waiting for them would mean delaying other aspects of the experiment. Suppose one is interested in whether grade-schoolers diagnosed with attention deficit hyperactivity disorder (ADHD) should be prescribed Ritalin. The outcome of interest is whether children on Ritalin will be addicted to drugs as adults. Consider assigning groups of ten children to Ritalin and ten to placebo and waiting to observe their outcomes before deciding whether to assign an additional group of ten children to each treatment. The delay in observation means that it would probably take hundreds of years to get an answer to the overall question. The long-term nature of the endpoint means that any reasonable experiment addressing this question would necessarily be nonsequential, with large numbers of children assigned to the two groups before any information at all would become available about the endpoint.

Analyzing data from sequential experiments—Frequentist case

Consider an experiment of a particular type, say one to assess ESP ability. A subject claiming to have ESP is asked to choose between two colors. The null hypothesis of no ability is that the subject is only guessing, in which case the probability of choosing the correct color is 1/2. Suppose the subject gets 13 correct out of 17 tries. How should these results be analyzed and reported?

The answer depends on one’s statistical philosophy. Frequentists and Bayesians take different approaches. Frequentist analyses depend on whether the experiment’s design is sequential, and if it is sequential the conclusions will differ depending on the actual design used.

In the nonsequential case the subject is given exactly 17 tries. The frequentist P-value is the probability of results as extreme as or more extreme than those observed. The results are said to be statistically significant if the P-value is less than 5 percent. A convention is to include both 13 or more successes and 13 or more failures. (This “two-sided” case allows for the possibility that the subject has ESP but has inverted the “extra sensory” signals.) Under the null hypothesis, and assuming the tries are independent, the number of successes has a binomial distribution. Binomial probabilities can be approximated using the normal distribution. The z-score for 13 out of 17 is about 2, so the probability of 13 or more successes or 13 or more failures is about 0.05 (the exact binomial probability is 0.049). The results are therefore statistically significant at the 5 percent level and the null hypothesis is rejected.
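
The following Python sketch carries out this fixed-sample-size calculation with the exact binomial distribution rather than the normal approximation.

    from math import comb

    n, k = 17, 13
    p_upper = sum(comb(n, j) for j in range(k, n + 1)) / 2**n   # 13 or more successes
    p_two_sided = 2 * p_upper                                   # plus 13 or more failures
    print(round(p_upper, 3), round(p_two_sided, 3))             # 0.025 and 0.049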

Now suppose the experiment is sequential. The frequentist significance level is now different, and it depends on the actual design used. Suppose the design is to sample until the subject gets at least 4 successes and at least 4 failures—same data, different design. Results as extreme as or more extreme than those observed are now 13 or more successes (with exactly 4 failures) or 13 or more failures (with exactly 4 successes). The total probability of these extreme results is 0.021—less than 0.049—and so the results are now more highly significant than if the experiment’s design had been nonsequential.
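
A sketch of the corresponding calculation for this design: under the null hypothesis, the probability of stopping with exactly 4 failures and s successes (the last try being the fourth failure) is C(s+3, 3)(1/2)^(s+4), and the symmetric event swaps successes and failures. The truncation point of the infinite sum below is arbitrary.

    from math import comb

    def prob_as_extreme(min_successes, max_successes=500):
        # stop with exactly 4 failures and s successes, the last try being the 4th failure
        return sum(comb(s + 3, 3) * 0.5 ** (s + 4)
                   for s in range(min_successes, max_successes))

    print(round(2 * prob_as_extreme(13), 3))   # about 0.021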

Consider another sequential design, one of a type of group-sequential design commonly used in clinical trials. The plan is to stop at 17 tries if 13 or more are successes or 13 or more are failures, so the observed data (13 successes in 17 tries) would stop the experiment at this first analysis. But if after 17 tries the number of successes is between 5 and 12 then the experiment continues to a total of 44 tries. If at that time 29 or more are successes or 29 or more are failures then the null hypothesis is rejected. To set the context, suppose the experiment is nonsequential, with the sample size fixed at 44 and no possibility of stopping at 17; then the exact significance level is again 0.049. When using a sequential design, one must consider all possible ways of rejecting the null hypothesis in calculating a significance level. In the group-sequential design there are more ways to reject than in the nonsequential design with the sample size fixed at 17 (or fixed at 44). The overall probability of rejecting is greater than 0.049 but somewhat less than 0.049 + 0.049, because some sample paths that reject the null hypothesis at sample size 17 would also reject it at sample size 44. The total probability of rejecting the null hypothesis under this design is about 0.085. Therefore, even though the results beyond the first 17 observations are never observed, the fact that they might have been observed makes 13 successes in 17 tries no longer statistically significant (0.085 is greater than 0.05).
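
The overall significance level of such a two-stage design can be computed by summing over all sample paths. The Python sketch below does this for symmetric boundaries given as counts of successes or failures; the function name and its default arguments are illustrative.

    from math import comb

    def overall_level(n1=17, n2=44, cut1=13, cut2=29):
        """Null probability of rejecting at the interim analysis (n1 tries) or at the
        final analysis (n2 tries), with symmetric boundaries cut1 and cut2."""
        pmf1 = [comb(n1, s) / 2**n1 for s in range(n1 + 1)]
        pmf2 = [comb(n2 - n1, s) / 2**(n2 - n1) for s in range(n2 - n1 + 1)]
        total = 0.0
        for s1, prob1 in enumerate(pmf1):
            if s1 >= cut1 or n1 - s1 >= cut1:            # reject at the interim analysis
                total += prob1
                continue
            for s2, prob2 in enumerate(pmf2):            # otherwise continue to n2 tries
                s = s1 + s2
                if s >= cut2 or n2 - s >= cut2:          # reject at the final analysis
                    total += prob1 * prob2
        return total

    print(round(overall_level(), 3))   # about 0.085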

To preserve a 0.05 significance level in group-sequential or fully sequential designs, investigators must adopt more stringent requirements for stopping and rejecting the null hypothesis. That is, they must include fewer outcomes in the region where the null hypothesis is rejected. For example, the investigator in the above study might drop 13 successes or failures in 17 tries and 29 successes or failures in 44 tries from the rejection region. The investigator would stop and claim significance only if there are at least 14 successes or at least 14 failures in the first 17 tries, and claim significance after 44 tries only if there are at least 30 successes or at least 30 failures. The nominal significance levels (those that would apply had the experiment been nonsequential) at n=17 and n=44 are 0.013 and 0.023, and the overall (or adjusted) significance level of rejecting the null hypothesis is 0.032. (Among symmetric rejection regions, this is the largest whose overall significance level remains below 0.05.) With this design, 13 successes in 17 tries is not statistically significant (as indicated above) because this outcome is not in the rejection region.
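
Rerunning the sketch above with these more stringent boundaries gives the adjusted level:

    print(round(overall_level(cut1=14, cut2=30), 3))   # about 0.032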

The above discussion is in the context of significance testing. But the same issues apply in all types of frequentist inferences, including confidence intervals.

The implications of the need to modify rejection regions depending on the design of an experiment are profound. Because of the penalty in significance level that repeated analyses of accumulating data impose, investigators strive to minimize the number of such analyses. They shy away from using sequential designs and so may miss opportunities to stop or otherwise modify the experiment in response to accumulating results.

What happens if investigators fail to reveal that other analyses did occur or that the experiment might have continued had other results been observed? Any frequentist conclusion that fails to take the other analyses into account is meaningless. Strictly speaking, this is a breach of scientific ethics when carrying out frequentist analyses. But it is difficult to find fault with investigators who do not understand the subtleties of frequentist reasoning and who fail to make necessary adjustments to their inferences.

For more information about the frequentist approach to sequential experimentation, see Armitage (1975), Lan and DeMets (1989), O’Brien and Fleming (1979), and Whitehead (1992).

Analyzing data from sequential experiments—Bayesian case

When taking a Bayesian approach (or a likelihood approach), conclusions are based only on the observed experimental results and do not depend on the experiment’s design. So the murky distinction that exists between sequential and nonsequential designs is irrelevant in a Bayesian approach. In the example considered above, 13 successes of 17 tries will give rise to the same inference in each of the designs considered. Bayesian conclusions depend only on the data actually observed and not otherwise on the experimental design (Berger and Wolpert 1984, Berry 1985a, 1987, 1988).

The Bayesian paradigm is inherently sequential. Bayes’ theorem prescribes the way learning takes place under uncertainty: it specifies how an observation modifies one’s state of knowledge (Berry 1996). Moreover, each planned observation has a probability distribution. After 13 successes in 17 tries, the probability of success on the next try can be found. This requires a distribution, called a prior distribution, for the probability of success on the first of the 17 tries. Suppose the prior distribution is uniform from 0 to 1. (This is symmetric about the null hypothesis of 1/2, but it is unlikely to be anyone’s actual prior distribution in the case of ESP because it gives essentially all the probability to some ESP ability.) The predictive probability of a success on the 18th try is then (13+1)/(17+2)=0.737, an instance of Laplace’s rule of succession (Berry 1996, p. 204). Whether to take this 18th observation can be evaluated by weighing the additional knowledge gained (14 successes out of 18, with probability 0.737, or 13 successes out of 18, with probability 0.263) against the costs associated with the observation.
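
In code, the updating amounts to little more than adding the observed counts to the parameters of the prior. A minimal Python sketch, assuming the uniform (that is, Beta(1,1)) prior just described:

    successes, failures = 13, 4
    a, b = 1 + successes, 1 + failures        # posterior is Beta(14, 5)
    predictive_next = a / (a + b)             # (13 + 1) / (17 + 2), Laplace's rule
    print(round(predictive_next, 3))          # 0.737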

Predictive probabilities are fundamental in a Bayesian approach to sequential experimentation. They indicate how likely the various possibilities for future data are, given the data observed so far. Suppose that after 13 successes in 17 tries one is entertaining taking an additional 27 observations. One may be interested in getting at least 30 successes out of the total of 44 observations, which means at least 17 of the additional 27 observations are successes. The predictive probability of this is about 83%. Or one may be interested in getting successes in at least half (22) of the 44 tries. The corresponding predictive probability is about 99.8%.
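
These predictive probabilities are tail probabilities of a beta-binomial distribution for the number of future successes. The following Python sketch computes them exactly; the helper names beta_fn and beta_binomial_pmf are illustrative.

    from math import comb, factorial

    def beta_fn(x, y):
        """Beta function for positive integer arguments."""
        return factorial(x - 1) * factorial(y - 1) / factorial(x + y - 1)

    def beta_binomial_pmf(k, m, a, b):
        """P(k successes in m future tries), the success probability having a Beta(a, b) distribution."""
        return comb(m, k) * beta_fn(a + k, b + m - k) / beta_fn(a, b)

    a, b, m = 14, 5, 27                       # Beta(14, 5) posterior; 27 additional tries
    p_30_of_44 = sum(beta_binomial_pmf(k, m, a, b) for k in range(17, m + 1))
    p_22_of_44 = sum(beta_binomial_pmf(k, m, a, b) for k in range(9, m + 1))
    print(round(p_30_of_44, 2), round(p_22_of_44, 3))   # about 0.83 and 0.998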

The ability to use Bayes’ theorem for updating one’s state of knowledge and the use of predictive probabilities make the Bayesian approach appealing to researchers in the sequential design of experiments. As a consequence, many researchers who prefer a frequentist perspective use the Bayesian approach in the context of sequential experimentation. If they are interested in finding the frequentist operating characteristics (such as significance level and power), these can be calculated by simulation.

The next section considers a special type of sequential experiment. The goals of the section are to describe some of the calculational issues that arise in solving sequential problems and to convey some of the interesting aspects of sequential problems. It takes a Bayesian perspective.

Sequential allocation of experiments: Bandit problems

In many types of experiments, including many clinical trials, experimental units are randomized in a balanced fashion to the candidate treatments. The advantage of a balanced design is that it gives maximal information about the differences between treatments. In some types of experiments, including some clinical trials, it may also be important to obtain good results for the units that are part of the experiment. Treatments—or arms—are then assigned based on accumulating results; that is, assignment is sequential. The goal is to maximize overall effectiveness, for the units in the experiment and perhaps also for units not actually in the experiment whose treatment might benefit from information gained in the experiment.

Specifying a design is difficult. The first matter to be considered is the arm selected for the initial unit. Suppose that the first observation is X1. The second component of the design is the arm selected next, given X1 and also given the first arm selected. The third component depends on X1 and the second observation X2 and on the corresponding arms selected. And so on. A design is optimal if it maximizes the expected number of successes. An arm is optimal if it is the first selection of an optimal design.

Temporarily consider an experiment with n units and two available arms. Outcomes are dichotomous; arm 1 has success probability p1 and arm 2 has success probability p2. The goal is to maximize the expected number of successes among the n units. Arm 1 is a standard treatment and its success probability p1 is known. Arm 2 has unknown efficacy. Uncertainty about p2 is expressed in terms of a prior probability distribution. To be specific, suppose that this is uniform on the interval from 0 to 1.

If n=1 then the design requires only an initial selection, arm 1 or arm 2. Choosing arm 1 gives expected number of successes p1. Choosing arm 2 gives conditional expected number of successes p2, and unconditional expected number of successes equal to the prior mean of p2, which is 1/2. Therefore arm 1 is optimal if p1 > 1/2 and arm 2 is optimal if p1 < 1/2. (Both arms—and any randomization between them—are optimal when p1=1/2.)

The problem is more complicated for n ≥ 2. Consider n=2. There are two initial choices and two choices depending on the result of the first observation. There are 8 possible designs. One can write a design as {a; aS, aF}, where a is the initial selection, aS is the next selection should the first observation be a success, and aF is the next selection should the first observation be a failure. To find the expected number of successes for a particular design one needs to know such quantities as the probability of a success on arm 2 after a success on arm 2 (which is 2/3) and the probability of a success on arm 2 after a failure on arm 2 (which is 1/3). The possible designs and their associated expected numbers of successes are given in the following table.
 
Design       Expected # successes
{1; 1, 1}    2p1
{1; 1, 2}    p1 + p1² + (1 - p1)/2
{1; 2, 1}    p1 + (1/2)p1 + (1 - p1)p1
{1; 2, 2}    p1 + 1/2
{2; 1, 1}    1/2 + p1
{2; 1, 2}    1/2 + (1/2)p1 + (1/2)(1/3)
{2; 2, 1}    1/2 + (1/2)(2/3) + (1/2)p1
{2; 2, 2}    2(1/2) = 1

It is easy to check that only three of these expected numbers of successes (those for designs {1; 1, 1}, {2; 2, 1}, and {2; 2, 2}) are candidates for the maximum. If p1 > 5/9 then {1; 1, 1} is optimal; if 1/3 < p1 < 5/9 then {2; 2, 1} is optimal; and if p1 < 1/3 then {2; 2, 2} is optimal. For example, if p1=1/2 then it is optimal to use the unknown arm 2 initially. If the outcome is a success, then a decision is made to “stay with a winner” and use arm 2 again. If a failure occurs, then the decision is made to switch to the known arm 1.
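
The enumeration can also be automated. The following Python sketch evaluates all eight designs for a given value of p1 and reports the best one; the helper function and the example values of p1 are illustrative.

    def expected_successes(design, p1):
        """Expected number of successes for n = 2 under a design (a, aS, aF); arm 2 has a uniform prior."""
        first, after_success, after_failure = design
        def mean2(s, f):                         # posterior mean for arm 2 after s successes, f failures
            return (1 + s) / (2 + s + f)
        if first == 1:
            p_first = p1
            p_next_s = p1 if after_success == 1 else mean2(0, 0)
            p_next_f = p1 if after_failure == 1 else mean2(0, 0)
        else:
            p_first = mean2(0, 0)
            p_next_s = p1 if after_success == 1 else mean2(1, 0)
            p_next_f = p1 if after_failure == 1 else mean2(0, 1)
        return p_first + p_first * p_next_s + (1 - p_first) * p_next_f

    designs = [(a, s, f) for a in (1, 2) for s in (1, 2) for f in (1, 2)]
    for p1 in (0.2, 0.5, 0.7):                   # one value in each optimality region
        best = max(designs, key=lambda d: expected_successes(d, p1))
        print(p1, best, round(expected_successes(best, p1), 3))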

Enumeration of designs is tedious for large n. Most designs can be dropped from consideration on the basis of theoretical results (Ross 1983, Berry 1985b, Berry and Fristedt 1985). For example, there is a break-even value of p1, say p1*, such that arm 1 is optimal for p1 > p1*. Also, one need consider only those designs that continue to use arm 1 once it has been selected. But many designs remain. Backward induction can be used to find an optimal design (Berry and Fristedt 1985). The following table gives the optimal expected proportion of successes for selected values of n, with p1 fixed at 1/2:
 
 
n            1      2      5      10     20     50     100    200    500    1000   10000
Prop. succ.  0.500  0.542  0.570  0.582  0.596  0.607  0.613  0.617  0.621  0.622  0.624

Asymptotically, for large n, the maximal expected proportion of successes is 5/8, which is the expected value of the maximum of p1 and p2.
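
These values can be computed with a short dynamic program. The sketch below is one way to code the backward induction in Python, assuming the uniform prior on p2; the function name and the memoization scheme are implementation choices rather than part of the original presentation.

    from functools import lru_cache

    def optimal_expected_successes(n, p1):
        """Maximal expected number of successes in n pulls; arm 1 known (p1), arm 2 uniform prior."""
        @lru_cache(maxsize=None)
        def value(s, f, m):                      # s, f: arm-2 successes and failures; m: pulls left
            if m == 0:
                return 0.0
            q = (s + 1) / (s + f + 2)            # current posterior mean for arm 2
            use_arm1 = p1 + value(s, f, m - 1)
            use_arm2 = q * (1 + value(s + 1, f, m - 1)) + (1 - q) * value(s, f + 1, m - 1)
            return max(use_arm1, use_arm2)
        return value(0, 0, n)

    # Should reproduce the tabled proportions for these values of n
    for n in (1, 2, 5, 10, 20, 50, 100):
        print(n, round(optimal_expected_successes(n, 0.5) / n, 3))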

Both arms offer the chance of success on the current unit, but only arm 2 gives information that can help in choosing between the arms for treating later units. The following table gives the break-even values p1* for selected values of n:
 
 
n     1      2      5      10     20     50     100    200    500    1000   10000
p1*   0.500  0.556  0.636  0.698  0.758  0.826  0.869  0.902  0.935  0.954  0.985

This table shows that information is more important for larger n. For example, if p1=0.75 then arm 1 would be optimal for n=10, but it would be advisable to test arm 2 when n=100; this is so even though arm 1 has probability 0.75 of being better than arm 2.
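
The break-even values can be found from the same recursion by searching over p1: arm 1 is optimal initially exactly when the optimal expected number of successes equals n times p1. The sketch below, which reuses optimal_expected_successes from the previous sketch, does this by bisection; the tolerance is arbitrary.

    def break_even(n, tol=1e-4):
        """Approximate p1* for horizon n: the largest p1 at which starting with arm 2 still pays."""
        lo, hi = 0.5, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if optimal_expected_successes(n, mid) > n * mid + 1e-12:   # arm 2 still worth trying
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    print(round(break_even(10), 3))   # should be close to the tabled value 0.698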

When there are several arms with unknown characteristics, the problem is still more complicated. Optimal designs may well indicate selection of an arm that was used previously and set aside in favor of another arm because of inadequate performance. For methods and theory for solving such problems, see Berry (1972), and Berry and Fristedt (1985). The optimal designs are generally difficult to describe. Berry (1978) provides easy-to-use sequential designs that are not optimal but that perform reasonably well.

Suppose the n units in the experiment are a subset of the N units on which arms 1 and 2 can be applied. Berry and Eick (1995) consider the case of two arms with dichotomous response and show how to incorporate all N units into the design problem. They find the optimal Bayes design when p1 and p2 have independent uniform prior distributions. They compare this with various other sequential designs and with a particular nonsequential design: balanced randomization to arms 1 and 2. The Bayes design performs best on average, of course, but it is robust in the sense that it outperforms the other designs for essentially all pairs of p1 and p2.

Further reading

The pioneers in sequential statistical methods were Abraham Wald (Wald 1947, 1950) and George Barnard (Barnard 1944). They put forth the sequential probability ratio test (SPRT), which is of fundamental importance in sequential stopping problems. The study of the SPRT dominated the theory and methodology of sequential experimentation for decades.

For further reading about Bayesian versus frequentist issues in sequential design, see Anscombe (1963), Berger (1986), Berger and Berry (1988a, 1988b), Berger and Wolpert (1984), and Berry (1985a, 1987, 1988, 1989, 1993). For further reading about the frequentist perspective, see Armitage (1975), Chernoff (1967), Chow, Robbins and Siegmund (1971), Ferguson (1967), Ghosh (1970), O’Brien and Fleming (1979), Sen (1981, 1985), Siegmund (1985), Wetherill (1975), Whitehead (1992), and Woodroofe (1982, 1983). For further reading about Bayesian design issues, see Anscombe (1963), Berry (1991a, 1991b, 1993, 1995), Berry and Ho (1988), Berry and Stangl (1996), Berry, Wolff and Sack (1994), Chernoff (1967, 1968), Chernoff and Ray (1965), Cornfield (1966), DeGroot (1970), Lewis and Berry (1994, 2000) and Lindley and Barnett (1965). For further reading about bandit problems, see Bellman (1956), Berry (1972, 1978, 1985b), Berry and Eick (1995), Berry and Fristedt (1985), Bradt, Johnson and Karlin (1956), Eick (1987, 1988), Friedman, Padilla and Gelfand (1964), Gittins (1979), Horowitz (1978), Murphy (1965), Murray (1971), Rapoport (1966, 1967, 1970), Robbins (1952), Rothschild (1974), Viscusi (1979), Weitzman (1979), Whittle (1982, 1983) and Woods (1959). There is a journal called Sequential Analysis that is dedicated to the subject of this article.

References:

Anscombe FJ (1963). Sequential Medical Trials. J Am Stat Assoc 58:365-383.
Armitage P (1975). Sequential Medical Trials, John Wiley and Sons, New York.
Barnard GA (1944). Statistical Methods and Quality Control Report No. QC/R/7, British Ministry of Supply, London, England.
Bellman R (1956). A problem in the sequential design of experiments. Sankhya A 16:221-229.
Berger JO (1986). Statistical Decision Theory and Bayesian Analysis, 2nd ed, Springer-Verlag, New York.
Berger JO, Berry DA (1988a). Statistical analysis and the illusion of objectivity.  Am Sci 76:159-165.
Berger JO, Berry DA (1988b). The Relevance of Stopping Rules in Statistical Inference. In Statistical Decision Theory and Related Topics IV, 1, 29-72, Springer-Verlag, New York.
Berger JO, Wolpert RL (1984). The Likelihood Principle, Institute of Mathematical Statistics, Hayward, CA.
Berry DA (1972). A Bernoulli two-armed bandit.  Ann Math Stat 43:871-897.
Berry DA (1978). Modified two-armed bandit strategies for certain clinical trials. J  Am Stat Assoc 73:339-345.
Berry DA (1985a). Interim analysis in clinical trials: Classical vs. Bayesian approaches. Stat Med 4:521-526.
Berry DA (1985b). One- and Two-armed Bandit Problems. Encyclopedia of Statistical Sciences, Vol. VI, 418-422, eds. Kotz S and Johnson NL, John Wiley and Sons, New York.
Berry DA (1987). Interim analysis in clinical trials: The role of the likelihood principle. Am Stat 41:117-122.
Berry DA (1988). Interim analysis in clinical research. Cancer Invest 5:469-477.
Berry DA (1989). Monitoring accumulating data in a clinical trial. Biometrics 45:1197-1211.
Berry DA (1991a). Experimental design for drug development: A Bayesian approach. J Biopharm Stat 1:81-101.
Berry DA (1991b). Bayesian methodology in phase III trials. Drug Inf J 25:345-368.
Berry DA (1993). A case for Bayesianism in clinical trials (with discussion). Stat Med 12:1377-1404.
Berry DA (1995). Decision Analysis and Bayesian Methods in Clinical Trials. In Recent Advances in Clinical Trial Design and Analysis, 125-154, ed. Thall P, Kluwer Press, New York.
Berry DA (1996). Statistics: A Bayesian Perspective, Duxbury Press, Belmont, CA.
Berry DA, Eick SG (1995). Adaptive assignment versus balanced randomization in clinical trials: A decision analysis. Stat Med 14:231-246.
Berry DA, Fristedt B (1985). Bandit Problems: Sequential Allocation of Experiments, Chapman-Hall, London.
Berry DA, Ho C-H (1988). One-sided sequential stopping boundaries for clinical trials: A decision-theoretic approach. Biometrics 44:219-227.
Berry DA, Stangl DK (1996). Bayesian Methods in Health-Related Research. In Bayesian Biostatistics, 1-66, eds. Berry DA and Stangl DK, Marcel Dekker, New York.
Berry DA, Wolff MC, Sack D (1994). Decision making during a phase III randomized controlled trial. Control Clin Trials 15:360-379.
Bradt RN, Johnson SM, Karlin S (1956). On sequential designs for maximizing the sum of n observations. Ann Math Stat 27:1060-1070.
Chernoff H (1967). Sequential models for clinical trials. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Volume IV, 805-812, University of California Press, Berkeley, CA.
Chernoff H (1968). Optimal stochastic control. Sankhya A 30:221-252.
Chernoff H, Ray SN (1965). A Bayes sequential sampling inspection plan. Ann Math Stat 36:1387-1407.
Chow YS, Robbins H, Siegmund D (1971). Great Expectations: The Theory of Optimal Stopping, Houghton Mifflin, Boston, MA.
Cornfield J (1966). Sequential trials, sequential analysis and the likelihood principle. Am Stat 20(2):18-23.
DeGroot MH (1970). Optimal Statistical Decisions, McGraw-Hill, New York.
Eick SG (1987). The two-armed bandit with delayed responses. Ann Stat 16:254-264.
Eick SG (1988). Gittins procedures for bandits with delayed responses. J R Statist Soc B 50:125-132.
Ferguson TS (1967). Mathematical Statistics, Academic Press, New York.
Friedman MP, Padilla G, Gelfand H (1964). The learning of choices between bets. J Math Psychol 1:375-385.
Gittins J C (1979). Bandit processes and dynamic allocation indices. J R Statist Soc B 41:148-177.
Ghosh BK (1970). Sequential Tests of Statistical Hypotheses, Addison-Wesley, Reading, MA.
Horowitz AD (1978). Experimental study of the two-armed bandit problem. Ph.D. Thesis, University of North Carolina at Chapel Hill.
Lan KK, DeMets DL (1989). Changing frequency of interim analysis in sequential monitoring. Biometrics 45(3):1017-1020.
Lewis RJ, Berry DA (1994). Group sequential clinical trials: A classical evaluation of Bayesian decision-theoretic designs. J Am Stat Assoc 89:1528-1534.
Lewis RJ, Berry DA (2000). Decision Theory. In Encyclopedia of Biostatistics, eds. Armitage P and Colton T, John Wiley and Sons, New York. (To appear)
Lindley DV, Barnett BN (1965). Sequential sampling: two decision problems with linear losses for binomial and normal random variables. Biometrika 52:507-532.
Murphy RE Jr (1965). Adaptive Processes in Economic Systems, Academic Press, New York and London.
Murray FS (1971). Multiple probable situation: A study of five one-armed bandit problems. Psychon Sci 22:247-249.
O’Brien PC, Fleming TR (1979). A multiple testing procedure for clinical trials. Biometrics 35:549-556.
Rapoport A (1966). A study of multistage decision task with an unknown duration. Human Factors 8:54-61.
Rapoport A (1967). Dynamic programming models for multistage decision making tasks. J Math Psychol 4:48-71.
Rapoport A (1970). Minimization of risk and maximization of expected utility in multistage betting games. Acta Psychologica 34:375-386.
Robbins H (1952). Some aspects of the sequential design of experiments. B Am Math Soc 58:527-536.
Ross SM (1983). Introduction to Stochastic Dynamic Programming. Academic Press, New York.
Rothschild M (1974). A two-armed bandit theory of market pricing. J Econ Theory 9:185-202.
Sen PK (1981). Sequential Nonparametrics, John Wiley and Sons, New York.
Sen PK (1985). Theory and Applications of Sequential Nonparametrics, SIAM, Philadelphia, PA.
Siegmund D (1985). Sequential Analysis, Springer-Verlag, New York.
Viscusi WK (1979). Employment Hazards: An Investigation of Market Performance, Harvard University Press, Cambridge, MA.
Wald A (1947). Sequential Analysis, John Wiley and Sons, New York.
Wald A (1950). Statistical Decision Functions, John Wiley and Sons, New York.
Weitzman ML (1979). Optimal search for the best alternative. Econometrica 47:641-654.
Wetherill GB (1975). Sequential Methods in Statistics, John Wiley and Sons, New York.
Whitehead J (1992). The Design and Analysis of Sequential Clinical Trials, Horwood, Chichester, UK.
Whittle P (1982, 1983). Optimization Over Time, Vols I and II, John Wiley and Sons, New York.
Woodroofe M (1982). Nonlinear Renewal Theory in Sequential Analysis, SIAM, Philadelphia, PA.
Woods PJ (1959). The effects of motivation and probability of reward on two-choice learning. J Exp Psych 57:380-385.