Sample Size for Multiple Regression: Obtaining Regression

Coefficients That Are Accurate, Not Simply Significant

Ken Kelley and Scott E. Maxwell

University of Notre Dame

An approach to sample size planning for multiple regression is presented that

emphasizes accuracy in parameter estimation (AIPE). The AIPE approach yields

precise estimates of population parameters by providing necessary sample sizes in

order for the likely widths of confidence intervals to be sufficiently narrow. One

AIPE method yields a sample size such that the expected width of the confidence

interval around the standardized population regression coefficient is equal to the

width specified. An enhanced formulation ensures, with some stipulated probabil-

ity, that the width of the confidence interval will be no larger than the width

specified. Issues involving standardized regression coefficients and random pre-

dictors are discussed, as are the philosophical differences between AIPE and the

power analytic approaches to sample size planning.

Sample size estimation from a power analytic per-

spective is often performed by mindful researchers in

order to have a reasonable probability of obtaining

parameter estimates that are statistically significant.

In general, the social sciences have slowly become

more aware of the problems associated with under-

powered studies and their corresponding Type II er-

rors, which can yield misleading results in a given

domain of research (Cohen, 1994; Muller & Benig-

nus, 1992; Rossi, 1990; Sedlmeier & Gigerenzer,

1989). The awareness of underpowered studies in the

literature has led vigilant researchers attempting to

curtail this problem in their investigations to perform

a power analysis (PA) prior to data collection. Re-

searchers who have used various power analytic pro-

cedures have undoubtedly strengthened their own re-

search findings and added meaningful results to their

respective research areas. However, even with PA be-

coming more common, it is known that null hypoth-

eses of point estimates are rarely exactly true in

nature (Cohen, 1994). Therefore, performing sample

size planning solely for the purpose of obtaining sta-

tistically significant parameter estimates may often be

improved by planning sample sizes that lead to accu-

rate parameter estimates, not merely statistically sig-

nificant ones.

The zeitgeist of null hypothesis significance testing

seems to be losing ground in the behavioral sciences

as the generally more informative confidence interval

begins to gain widespread usage. Instead of simply

testing whether a given parameter estimate is some

exact and specified value, typically zero, forming a

100(1 − α) percent confidence interval around the

parameter of interest frequently provides more mean-

ingful information. Although null hypothesis signifi-

cance tests and confidence intervals can be thought of

as complementary techniques, confidence intervals

can provide researchers with a high degree of assur-

ance that the true parameter value is within some

confidence limits. Understanding the likely range of

the parameter value typically provides researchers

with a better understanding of the phenomenon in

question than does simply inferring that the parameter

is or is not statistically significant. With regard to

accuracy in parameter estimation (AIPE), all other

things being equal, the narrower the confidence inter-

val, the more certain one can be that the observed

parameter estimate closely approximates the corresponding population parameter. Accuracy in this sense is a measure of the discrepancy between an estimated value and the parameter it represents.

Editor’s Note. Samuel B. Green served as action editor for this article.—SGW

Correspondence concerning this article should be addressed to Ken Kelley or Scott E. Maxwell, Department of Psychology, University of Notre Dame, 118 Haggar Hall, Notre Dame, Indiana 46556. E-mail: [email protected] or [email protected]

2003, Vol. 8, No. 3, 305–321 1082-989X/03/$12.00 DOI: 10.1037/1082-989X.8.3.305

One position that can be taken is that AIPE leads to

a better understanding of the effect in question and is

more important for a productive science than a di-

chotomous decision from a null hypothesis signifi-

cance test. Many times obtaining a statistically sig-

nificant parameter estimate provides a research

community with little new knowledge of the behavior

of a given system. However, obtaining confidence

intervals that are sufficiently narrow can help lead to

a knowledge base that is more valuable than a collec-

tion of null hypotheses that have been rejected or that

failed to reach significance, given that the desire is to

understand a particular phenomenon, process, or sys-

tem.

If we assume that the correct model is fit, observations are randomly sampled, and the appropriate assumptions are met, (1 − α) is the probability that any given confidence interval from a collection of confidence intervals calculated under the same circumstances will contain the population parameter of interest. However, it is not true that a specific confidence interval is correct with (1 − α) probability, as a computed confidence interval either does or does not contain the parameter value. The meaning of a 100(1 − α) percent confidence interval for some unknown parameter was summarized by Hahn and Meeker (1991) as follows: “If one repeatedly calculates such [confidence] intervals from many [technically an infinite number of] independent random samples, 100(1 − α)% of the intervals would, in the long run, correctly bracket the true value of [the parameter of interest]” (p. 31). It is important to realize that the probability level refers to the procedures for constructing a confidence interval, not to a specific confidence interval (Hahn & Meeker, 1991).

Many of the arguments in the present article regarding the use and utility of confidence intervals echo long-standing recommendations, more recent discussions in Wilkinson and the American Psychological Association Task Force on Statistical Inference (1999), essentially an entire issue of Educational and Psychological Measurement (Thompson, 2001) devoted to confidence intervals and measures of effect size, Algina and Olejnik (2000), and Steiger and Fouladi (1997), and the still salient views offered by Cohen (1990, 1994). In fact, Cohen (1994) argued

that the reason confidence intervals have previously

seldom been reported in behavioral research is be-

cause the widths of the intervals are often “embar-

rassingly large” (p. 1002). The AIPE approach pre-

sented here attempts to curtail the problem of

embarrassingly large confidence intervals and pro-

vides sample size estimates that lead to confidence

intervals that are sufficiently precise and thereby pro-

duce results that are presumably more meaningful

than simply being statistically significant.

In the context of multiple regression, sample size

can be approached from at least four different per-

spectives: (a) power for the overall fit of the model,

(b) power for a specific predictor, (c) precision of the

estimate for the overall fit of the model, and (d) pre-

cision of the estimate for a specific predictor. The

goal of the first perspective is to estimate the neces-

sary sample size such that the null hypothesis of the

population multiple correlation coefficient equaling

zero can be correctly rejected with some specified

probability (e.g., Cohen, 1988, chapter 13; Gatsonis & Sampson, 1989; S. B. Green, 1991; Mendoza & Stafford, 2001). With the second perspective, sample size is computed on the basis of the desired power for the test of a specific predictor rather than the desired power for the test of the overall fit of the model (Cohen, 1988, chapter 13; Maxwell, 2000).

The formal definition of accuracy is given by the square root of the mean square error and can be expressed by the following formulation:

$$\mathrm{RMSE} = \sqrt{E[(\hat{\theta} - \theta)^2]} = \sqrt{E[(\hat{\theta} - E[\hat{\theta}])^2] + (E[\hat{\theta} - \theta])^2},$$

where E is the expectation operator and $\hat{\theta}$ is an estimate of θ, the value of the parameter of interest (Hellmann & Fowler, 1999; Rozeboom, 1966, p. 500). The first component under the second radical sign represents precision, whereas the second component represents bias. Thus, when the expected value of a parameter is equal to the parameter value it represents (i.e., when it is unbiased), accuracy and precision are equivalent concepts and the terms can be used interchangeably.

It should be noted that the interpretation of confidence intervals given in the present article follows a frequentist interpretation. The Bayesian interpretation of a confidence interval was well summarized by Carlin and Louis (1996), who stated that “the probability that [the parameter of interest] lies in [the computed interval] given the observed data y is at least (1 − α)” (p. 42). Thus, the Bayesian framework allows for a probabilistic statement to be made about a specific interval. However, when a Bayesian confidence interval is computed with a noninformative prior distribution (which uses only information obtained from the observed data), the computed confidence interval will exactly match that of a frequentist confidence interval; the interpretation is what differs. Regardless of whether one approaches confidence intervals from a frequentist or a Bayesian perspective, the suggestions provided in this article are equally informative and useful.

The precision of the overall fit of the model leads to

another reason for planning sample size. One alterna-

tive within this perspective provides the necessary

sample size such that the width of the one-sided

(lower bound) confidence interval of the population

multiple correlation coefficient is sufficiently precise

(Darlington, 1990, section 15.3.4). Another alterna-

tive within this perspective provides the sample size

such that the total width of the confidence interval

around the population multiple correlation squared is

specified by the researcher (Algina & Olejnik, 2000).

The final perspective for sample size estimation

within the multiple regression framework provides the

main purpose of the present article. Necessary sample

size from this perspective is obtained such that the

confidence interval around a regression coefficient is

sufficiently narrow. Oftentimes confidence intervals

are computed at the conclusion of a study, and only

then is it realized the sample size used was not large

enough to yield precise estimates. The AIPE approach

to sample size planning allows researchers to plan

necessary sample size, a priori, such that the com-

puted confidence interval is likely to be as narrow as

specified.

Figure 1 illustrates the relation between confidence

intervals and null hypothesis significance testing as

they relate to the issue of sample size for AIPE and

PA. Specifically, the figure shows the limits of a con-

fidence interval for a standardized regression coeffi-

cient in each of four hypothetical studies with a dif-

ferent predictor variable in each instance. In all four

studies the null hypothesis that the regression coeffi-

cient equals zero is false.

From a purely power analytic perspective, Study 1

is considered a “success.” The confidence interval in

this study shows that the parameter is not likely to be

zero and is thus judged to be statistically significant.

However, the confidence interval is wide, and thus the

parameter is not accurately estimated. In this study

little information about the population parameter is

learned other than it is likely to be some positive

value, a “failure” according to the goals of AIPE. This

study had an adequate sample size from the perspec-

tive of power, but a larger sample is needed in order

to obtain a more precise estimate.

Study 2, on the other hand, not only indicates that

the null hypothesis should be rejected but also pro-

vides precise information about the size of the popu-

lation parameter. Here the confidence interval is nar-

row, and thus the population parameter is precisely

estimated. Study 2 is a success according to both the

PA and AIPE frameworks.

Study 3 shows a nonsignificant effect that is accompanied by a wide confidence interval, illustrating a failure by both methods. Had a larger sample size been used and had the effect been of approximately the same magnitude, the width of the confidence interval would have likely been smaller, leading to a potential rejection of the null hypothesis. Thus, the sample size of Study 3 was inadequate from both perspectives.

Figure 1. Illustration of possible scenarios in which planned sample size was considered a “success” or “failure” according to the accuracy in parameter estimation and the power analysis frameworks. Parentheses are used to indicate the width of the confidence interval.

Study 4 illustrates a case in which the confidence

interval contains zero, yet the parameter is estimated

precisely. Study 4 exemplifies a failed PA but a suc-

cessful application of AIPE, as the population param-

eter is bounded by a narrow confidence interval. Of

course, one could argue that this study is not literally

a failure from a PA perspective, because as a condi-

tional probability, power depends on the population

effect size. In this study the population effect size may

be smaller than the minimal effect size of theoretical

or practical importance.

The goals for PA and AIPE are fundamentally dif-

ferent. The goal of PA is to obtain a confidence in-

terval that correctly excludes the null value, thus mak-

ing the direction of the effect unambiguous. The

necessary sample size from this perspective clearly

depends on the value of the effect itself. On the other

hand, the goal of AIPE is to obtain an accurate esti-

mate of the parameter, regardless of whether the in-

terval happens to contain the null value. Thus, sample

size from the AIPE perspective does not depend on

the value of the effect itself. However, these two

methods of sample size planning are not rivals; rather

they can be viewed as complementary. In general, the

most desirable study design is one in which there is

enough power to detect some minimally important

effect while also being able to accurately estimate the

size of the effect. In this sense, designing a study can

entail selecting a sample size based on whichever perspective implies the need for the larger sample size

for the desired power and precision. We revisit this

possibility in the Power Analysis Versus Accuracy in

Parameter Estimation section, in which AIPE and PA

are formally compared in a multiple regression frame-

work.

For the moment let us suppose that a researcher has

decided to adopt the AIPE perspective. Provided the

input population parameters are correct, the tech-

niques that are presented in this article allow research-

ers to plan sample size in a multiple regression frame-

work such that the confidence interval around the

regression coefficient of interest is sufficiently nar-

row.

One approach provides the necessary sample

size such that the expected width of the confidence

interval will be the value specified. However, achiev-

ing an interval no larger than the specified width will

be realized only (approximately) 50% of the time. A

reformulation provides the necessary sample size such

that there is a specified degree of assurance that the

computed confidence interval will be no larger than

the specified width. The precision of the confidence

interval and the degree of assurance of this precision

depend on the goals of the researcher. Not surpris-

ingly, all other things being equal, greater precision

and greater assurance of the precision necessitate a

larger sample size. It is believed that if AIPE were

widely applied, it would facilitate the accumulation of

a more meaningful knowledge base than does a col-

lection of studies reporting only parameters that are

statistically significant but which do not precisely

bound the value of the parameter of interest.

Sample Size Estimation for

Regression Coefficients

In order to develop a general set of procedures for

determining the sample size needed to obtain a de-

sired degree of precision for confidence intervals in

multiple regression analysis, we use standardized re-

gression coefficients.

Standardized regression coef-

ficients are used for two reasons in developing pro-

cedures for determining sample size using an AIPE

approach. First, due to the arbitrary nature of the

many measurement scales used in the behavioral sci-

ences, standardized coefficients are more directly in-

terpretable. Second, standardized coefficients provide

a more general framework in that variances and co-

variances need not be estimated when planning an

appropriate sample size.

Although the present article illustrates AIPE in a mul-

tiple regression framework, the extension to other applica-

tions of the general linear model is not difficult, many of

which can be thought of as special cases of multiple regres-

sion.

The use of standardized regression coefficients may

give rise to technical issues that are addressed in a later

section of this article. Standardizing regression coefficients

in the presence of random predictors has many appealing

characteristics with regard to interpretability, but under cer-

tain circumstances problems can develop when using this

popular technique.

If the desire is to form confidence intervals around un-

standardized regression coefficients, the techniques pre-

sented here are equally useful. The desired width of the

computed confidence interval is measured in terms of the


The formula for a 100(1 − α) percent symmetric confidence interval for a single population standardized regression coefficient, $\beta_j$, can be written as follows:

$$\hat{\beta}_j \pm t_{(1-\alpha/2;\,N-p-1)} \sqrt{\frac{1 - R^2}{(1 - R^2_{XX_j})(N - p - 1)}}, \tag{1}$$

where $\hat{\beta}_j$ is the observed standardized regression coefficient, j represents a specific predictor (j = 1, ..., p), p is the number of predictors (independent or concomitant variables, covariates, or regressors), $R^2$ is the observed squared multiple correlation coefficient of the model, $R^2_{XX_j}$ represents the observed squared multiple correlation coefficient predicting the jth predictor ($X_j$) from the remaining p − 1 predictors, and N is the sample size (Cohen & Cohen, 1983; Harris, 1985). The value that is added to and subtracted from $\hat{\beta}_j$ to define the upper and lower bounds of a symmetric confidence interval is defined as w, which is the half-width of the entire confidence interval. Thus, the total width of a confidence interval is 2w. The value of w is of great importance for accuracy in estimation, because the width of the interval determines the precision of the estimated parameter.
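As a rough illustration of Equation 1, the following Python sketch computes the half-width w and the resulting interval for one standardized coefficient. The function name, argument names, and input values are hypothetical and chosen only for illustration. The sketch also substitutes the standard normal critical value for the t critical value, an approximation that is reasonable only when N − p − 1 is large, because the Python standard library provides no t quantile function.

```python
from math import sqrt
from statistics import NormalDist


def standardized_ci(beta_hat, r2, r2_xxj, n, p, alpha=0.05):
    """Approximate CI for a standardized coefficient (Equation 1).

    Assumption: the normal critical value stands in for the t critical
    value with n - p - 1 degrees of freedom (adequate for large samples).
    """
    df = n - p - 1
    if df <= 0:
        raise ValueError("n must exceed p + 1")
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error implied by Equation 1.
    se = sqrt((1 - r2) / ((1 - r2_xxj) * df))
    w = crit * se  # half-width of the interval
    return beta_hat - w, beta_hat + w, w


# Hypothetical sample values, for illustration only.
lower, upper, w = standardized_ci(beta_hat=0.30, r2=0.25, r2_xxj=0.20, n=367, p=5)
print(f"w = {w:.4f}, interval = ({lower:.4f}, {upper:.4f})")
```

With these illustrative inputs the half-width comes out just under .10, so the full interval is roughly .20 wide.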

In the procedure for planning sample size, the critical value for $t_{(1-\alpha/2;\,N-p-1)}$ is replaced by the critical $z_{(1-\alpha/2)}$ value. Justification for this can be made because precise estimates generally require a relatively large sample size, and replacing the critical $t_{(1-\alpha/2;\,N-p-1)}$ value with the critical $z_{(1-\alpha/2)}$ value has virtually no impact on the outcome for the sample size in most cases. The formula used to determine the planned sample size, such that confidence intervals around a particular population regression coefficient, $\beta_j$, will have an expected value of the width specified, is obtained by solving for N in Equation 1 and by making use of the presumed knowledge of the population multiple correlation coefficients:

$$N = \left(\frac{z_{(1-\alpha/2)}}{w}\right)^2 \left(\frac{1 - \boldsymbol{R}^2}{1 - \boldsymbol{R}^2_{XX_j}}\right) + p + 1, \tag{2}$$

where $\boldsymbol{R}^2$ represents the population squared multiple correlation coefficient predicting the criterion (dependent) variable Y from the p predictor variables and $\boldsymbol{R}^2_{XX_j}$ represents the population squared multiple correlation coefficient predicting the jth predictor from the remaining p − 1 predictors. The calculated N should be rounded to the next larger integer for sample size. The w in the above equation is the desired half-width of the confidence interval. It should be kept in mind that this procedure yields a planned sample size that leads to a confidence interval width for a specific predictor. In practice, both $\boldsymbol{R}^2$ and $\boldsymbol{R}^2_{XX_j}$ must be estimated prior to data collection, a complication we address momentarily. Although not frequently acknowledged in the behavioral literature on regression analysis, Equation 1 is derived assuming predictors are fixed and unstandardized. Equation 2 is a reformulation of Equation 1 and thus is based on the same assumptions. Results from a Monte Carlo study are provided later in the article indicating that sample size estimates based on Equation 2 are reasonably accurate when predictors are random and have been standardized.
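Equation 2 can be evaluated directly once values for the two population squared multiple correlations are assumed. The sketch below is a minimal Python illustration; the function name, the argument names, and the input values are hypothetical and are not taken from the article.

```python
from math import ceil
from statistics import NormalDist


def planned_n(r2, r2_xxj, p, w, alpha=0.05):
    """Sample size from Equation 2, rounded up to the next integer."""
    if not (0 <= r2 < 1 and 0 <= r2_xxj < 1):
        raise ValueError("squared multiple correlations must lie in [0, 1)")
    if w <= 0:
        raise ValueError("w must be positive")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n = (z / w) ** 2 * (1 - r2) / (1 - r2_xxj) + p + 1
    return ceil(n)


# Hypothetical planning values: five predictors, desired half-width of .10.
print(planned_n(r2=0.25, r2_xxj=0.20, p=5, w=0.10))
```

For these illustrative inputs the unrounded value is about 366.1, so the planned sample size is 367. Halving w roughly quadruples the first term, which is why narrow intervals become expensive quickly.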

Equation 2 is intended to determine N such that the

expected half-width of an interval is under the re-

searcher’s control. However, there is approximately

only a 50% chance that the interval will be no larger

than specified. The reason for this can be seen from

Equation 1. Notice that the width of an interval will

depend in part on $R^2$ and $R^2_{XX_j}$, both of which will vary

from sample to sample. Thus, for a fixed sample size,

the interval width will also vary over replications.

However, it is possible to modify Equation 2 in order

to increase the likelihood that the obtained interval

will be no wider than desired.

ratio of the standard deviation of Y to the standard deviation

of $X_j$. Thus, following the methods presented for standardized

regression coefficients, application to unstandardized

coefficients is straightforward.

We introduce the notational system used throughout the article. A boldface italicized R ($\boldsymbol{R}$) denotes the population multiple correlation coefficient, while a standard-print italicized R ($R$) is used for its corresponding sample value. A population correlation matrix is denoted by a nonitalicized, boldface, nonserif-font R ($\mathbf{R}$). A population zero-order correlation coefficient is denoted as a lowercase rho (ρ), whereas a vector of population zero-order correlation coefficients is denoted as a boldface lowercase rho ($\boldsymbol{\rho}$).

The z approximation is poor if the correlations between

the predictors and the criterion are large and the correlations

among the predictors are small. In this case, the standard

error of $\hat{\beta}_j$ is small, producing a relatively small estimated

sample size. Under these conditions, the degrees of freedom

of the critical t value are small, and thus the critical t value

will not closely match the critical z value. We do not believe

that this occurs frequently in behavioral research. The al-


If γ is the desired degree of uncertainty of the computed confidence interval being the specified width, Equation 2 can be modified with a multiplicative factor that will provide a modified N such that a researcher can have approximately 100(1 − γ) percent assurance that a computed confidence interval will be of the specified width or less. For example, if there were a desire to be 80% confident that the obtained w would be no larger than the desired half-width, γ would be defined as 0.20 and there would be only a 20% chance that the half-width of the confidence interval around $\beta_j$ would be larger than the specified w.

Hahn and Meeker (1991, section 8.3) showed how to plan sample size for confidence intervals when a specified width around the mean of a normal distribution is desired, as well as how to modify that formula to obtain 100(1 − γ) percent confidence that the interval will be of the desired width or less. Taking similar logic and applying it to multiple regression leads to the creation of a formula for a modified N, $N_M$. This modified formulation provides the necessary sample size in order for researchers to be 100(1 − γ) percent confident that the $\beta_j$ of interest will have a corresponding confidence interval width that is no larger than specified. The formula for $N_M$ is given as follows:

$$N_M = \left(\frac{z_{(1-\alpha/2)}}{w}\right)^2 \left(\frac{1 - \boldsymbol{R}^2}{1 - \boldsymbol{R}^2_{XX_j}}\right) \left(\frac{\chi^2_{(1-\gamma;\,N-1)}}{N - p - 1}\right) + p + 1, \tag{3}$$

where N is the value obtained in Equation 2 and $\chi^2_{(1-\gamma;\,N-1)}$ is the critical value from a chi-square distribution at the 1 − γ quantile having N − 1 degrees of freedom. Like N, $N_M$ should also be rounded to the next larger integer.

To compute $N_M$, Equation 3 uses the upper bound of the 100(1 − γ) percent confidence interval for the variance of $\hat{\beta}_j$, rather than the parameter value of that variance as was done in the calculation of N. Recall that in any given sample the obtained variance of $\hat{\beta}_j$ will be either larger or smaller than the parameter value specified in Equation 2. Equation 3 uses the maximum value expected for the variance of $\hat{\beta}_j$ at the 100(1 − γ) percent confidence level. This value is substituted into Equation 2 for the variance of $\hat{\beta}_j$ and thus leads to Equation 3. Because the only random variable in Equation 2 is the variance of $\hat{\beta}_j$, use of Equation 3 provides probabilistic assurance that the obtained confidence interval of interest around $\beta_j$ will have a half-width no larger than the specified w with 100(1 − γ) percent confidence.
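The modification in Equation 3 can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements from the article: the chi-square critical value is obtained from the Wilson-Hilferty approximation (the standard library has no exact chi-square quantile), and the rounded N from Equation 2 is used for the degrees of freedom and the denominator. All function names, argument names, and input values are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def chi2_quantile_wh(prob, df):
    """Wilson-Hilferty approximation to a chi-square quantile.

    Assumption: used only because the standard library has no exact
    chi-square quantile; it is accurate for the large df typical here.
    """
    z = NormalDist().inv_cdf(prob)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * sqrt(c)) ** 3


def planned_n_modified(r2, r2_xxj, p, w, alpha=0.05, gamma=0.20):
    """Return (N, N_M): Equation 2 and its Equation 3 modification."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    base = (z / w) ** 2 * (1 - r2) / (1 - r2_xxj)
    n = ceil(base + p + 1)  # Equation 2, rounded up
    # Assumption: the rounded N from Equation 2 supplies the degrees of
    # freedom and the denominator in Equation 3.
    chi2 = chi2_quantile_wh(1 - gamma, n - 1)
    n_m = ceil(base * chi2 / (n - p - 1) + p + 1)
    return n, n_m


# Hypothetical planning values with 80% assurance (gamma = .20).
n, n_m = planned_n_modified(r2=0.25, r2_xxj=0.20, p=5, w=0.10, gamma=0.20)
print(n, n_m)
```

Where SciPy is available, `scipy.stats.chi2.ppf(1 - gamma, n - 1)` could replace the approximation to obtain the exact quantile. In either case the modified sample size exceeds N whenever γ is below .50, reflecting the price of the added assurance.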

With regard to choosing a 100(1 − γ) percent confidence interval for estimation, when compared with a 100(1 − α) percent confidence interval for hypothesis testing, important distinctions arise. The most obvious difference in the present context is that γ represents the probability of obtaining a confidence interval with an observed w that is larger than the specified w, whereas alpha is the probability of rejecting a null hypothesis that is true. When making use of Equation 3, a researcher is expected to obtain a w that is larger than the value specified only 100γ percent of the time, regardless of whether or not the null hypothesis is true. Whereas alpha is typically thought of as one of two essentially constant values, .05 or .01, γ is chosen by the researcher in order to achieve some desired degree of assurance that the precision of the estimated parameter will be realized. Thus, confidence intervals formed in the realm of hypothesis testing represent an attempt to accomplish a different goal than those formed when a researcher’s interest is in obtaining a precise estimate of the parameter of interest.

Specifying Population Parameters as

Input Values

As illustrated in the last section, determining sample size through an AIPE approach requires one to know, or anticipate, $\boldsymbol{R}^2$ and $\boldsymbol{R}^2_{XX_j}$. This is by no means an easy task, but with some careful planning and sound theoretical judgment, it is possible to develop appropriate estimates of the two parameters. In the remainder of this section we suggest different methods for anticipating the values of $\boldsymbol{R}^2$ and $\boldsymbol{R}^2_{XX_j}$, such that sample size planning can be accomplished.

Given that estimates are available for the p(p + 1)/2 zero-order population correlation coefficients, the squared multiple correlation coefficient predicting Y from the p predictors can be calculated using the following equation:

$$\boldsymbol{R}^2 = \boldsymbol{\rho}'_{XY} \mathbf{R}_{XX}^{-1} \boldsymbol{\rho}_{XY}, \tag{4}$$

where $\boldsymbol{\rho}_{XY}$ is the population p × 1 column vector of correlations of each $X_j$ regressor with Y (and $\boldsymbol{\rho}'_{XY}$, its transpose), and $\mathbf{R}_{XX}$ is the p × p population intercorrelation matrix of all of the predictor variables with one another.

ternative method is to solve for the appropriate sample size iteratively, which generally adds unnecessary complications.

The squared multiple correlation coefficient of variable j from the other p − 1 predictors can be readily computed from $\mathbf{R}_{XX}$ in two steps. The first step is to calculate $r^{jj}$, which for the jth predictor variable is defined as the jth principal diagonal element of $\mathbf{R}_{XX}^{-1}$ (Harris, 1985). In the second step, $\boldsymbol{R}^2_{XX_j}$ for the jth predictor variable is found from the following expression:

$$\boldsymbol{R}^2_{XX_j} = 1 - \frac{1}{r^{jj}}. \tag{5}$$

The inverse of $r^{jj}$ is known as the tolerance of variable j with the other p − 1 predictors. The tolerance ($1 - \boldsymbol{R}^2_{XX_j}$) is the proportion of variance of a predictor that cannot be explained by the remaining p − 1 predictor variables included in the model. As the tolerance of $X_j$ approaches zero, $X_j$ becomes highly correlated with the remaining predictor variables and $\boldsymbol{R}^2_{XX_j}$ becomes larger, which means there is more predictability, or collinearity, of predictor $X_j$ from the other p − 1 predictors (Darlington, 1990, p. 128).
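The two computations in Equations 4 and 5 can be sketched in plain Python. The small Gauss-Jordan inverse below is a convenience written for this illustration so that the example needs no external libraries; the function names and the two-predictor correlation values are hypothetical.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry into place.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def multiple_correlations(r_xx, rho_xy):
    """Return (R^2, [R^2_XXj for each j]) from Equations 4 and 5."""
    inv = invert(r_xx)
    p = len(rho_xy)
    # Equation 4: quadratic form rho' R_XX^{-1} rho.
    r2 = sum(rho_xy[i] * inv[i][j] * rho_xy[j] for i in range(p) for j in range(p))
    # Equation 5: one minus the reciprocal of each diagonal element.
    r2_xxj = [1.0 - 1.0 / inv[j][j] for j in range(p)]
    return r2, r2_xxj


# Hypothetical inputs: two predictors correlated .50 with each other
# and correlated .40 and .30 with the criterion.
r2, r2_xxj = multiple_correlations([[1.0, 0.5], [0.5, 1.0]], [0.4, 0.3])
print(round(r2, 4), [round(v, 4) for v in r2_xxj])
```

With two predictors, each $\boldsymbol{R}^2_{XX_j}$ reduces to the squared correlation between the predictors (.25 here), which gives a quick sanity check on the inverse.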

The second method of finding R

is a variation of

the first method and depends on the notion of ex-

changeability. An exchangeable structure (Maxwell,

2000) is one in which the intercorrelations of the pre-

dictors are all the same and the correlations of the

predictors with the criterion variable are all the same

(but $\rho_{XX}$ and $\rho_{XY}$ need not be equal to one another,

where ρ represents a population zero-order correlation

coefficient). Thus, instead of estimating the p(p + 1)/2

zero-order correlations, it is necessary to estimate

only two correlations, one for the correlation of each

of the predictors with one another and another corre-

lation for each of the predictors with the criterion

variable. The two zero-order correlations used in exchangeable structures should be of the same general magnitude as the set of correlations they represent. Since

B. F. Green (1977) showed that “many linear com-

posites [that is, predicted scores] are barely different

from using equal weights” (p. 274), the exchangeable

structure offers a potentially useful tool when plan-

ning necessary sample size (see Maxwell, 2000, for a

thorough treatment and rationale of the exchangeable

structure, as well as a similar correlational structure

that is somewhat relaxed). Many times an exchange-

able structure may be a sensible place to start when

planning sample size for a multiple regression analy-

sis, unless there are obvious theoretical reasons not to

do so (B. F. Green, 1977; Raju, Bilgic, Edwards, &

Fleer, 1999; Wainer, 1976).

If a researcher does not have a good idea of the relationship of the zero-order correlations, conventions such as Cohen’s (1988, section 3.2) small (ρ = .10), medium (ρ = .30), and large (ρ = .50) effect sizes for correlations can be used. These correlations can be used directly in Equation 4 or used in an exchangeable structure. For example, if exchangeability seems reasonable and the predictor variables are moderately or highly correlated with one another, a researcher could fill the off-diagonal elements of the intercorrelation matrix with values of .30, .40, or .50. Further, suppose that it is reasonable to expect that the correlations of the predictors with the criterion are, in general, small or medium. In this case the vector $\boldsymbol{\rho}_{XY}$ can be filled with correlations of .10, .20, or .30. Once acceptable estimates for the two types of correlations have been determined, the multiple correlations can be obtained from Equations 4 and 5.
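Under an exchangeable structure, Equations 4 and 5 collapse to simple closed forms, because the predictor intercorrelation matrix is compound symmetric. The closed forms used in the sketch below are standard results for such matrices and are not stated in the article; the function name and the input values are hypothetical. The sketch also rejects correlation values that could not arise from a valid correlation matrix.

```python
def exchangeable_inputs(p, rho_xx, rho_xy):
    """Return (R^2, R^2_XXj) for an exchangeable correlation structure.

    Assumption: closed forms for a compound-symmetric R_XX, equivalent
    to applying Equations 4 and 5 to that matrix.
    """
    if p < 2:
        raise ValueError("need at least two predictors")
    # R_XX is positive definite only for -1/(p - 1) < rho_xx < 1.
    if not (-1.0 / (p - 1) < rho_xx < 1.0):
        raise ValueError("rho_xx does not give a valid correlation matrix")
    r2 = p * rho_xy ** 2 / (1 + (p - 1) * rho_xx)
    if r2 >= 1.0:
        raise ValueError("not a usable set of correlations: R^2 >= 1")
    r2_xxj = (p - 1) * rho_xx ** 2 / (1 + (p - 2) * rho_xx)
    return r2, r2_xxj


# Hypothetical values: five predictors, intercorrelations of .40,
# predictor-criterion correlations of .20.
r2, r2_xxj = exchangeable_inputs(p=5, rho_xx=0.40, rho_xy=0.20)
print(round(r2, 4), round(r2_xxj, 4))
```

Because only two correlations are needed, it is easy to rerun the function over a small grid of plausible values and see how strongly the two squared multiple correlations respond.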

The third way to determine values for $\boldsymbol{R}^2$ and $\boldsymbol{R}^2_{XX_j}$ is to consult previous literature in order to determine

likely values for these two parameters or for likely

values of the zero-order correlation coefficients

(whether the data follow an exchangeable structure or

not). Meta-analytic studies may be of help when es-

timating the required population parameters; how-

ever, in many domains of research, meta-analytic

studies have not yet been conducted or the construct

of interest may differ from those previously examined.

The final method is presented here more as a warn-

ing than a recommendation. This method is based on

the commonly recommended approach of sample size

planning based on parameter estimates obtained from

pilot studies. Pilot studies are sometimes undertaken

when literature reviews provide little or no informa-

tion about the population parameter(s) necessary for

sample size planning. However, a potential problem

with pilot studies is that these small-scale investiga-

tions may yield parameter estimates that do not

closely correspond with the parameter values they

represent. Thus, basing Equations 2 and 3 on parameter estimates obtained from pilot studies may yield inappropriate estimates of the required sample size if the obtained estimates do not closely approximate their corresponding parameter values.

Footnote: A caution is warranted when estimating the p(p + 1)/2 zero-order correlation coefficients, as it is feasible to estimate an impossible set of correlations. If an impossible set is estimated, the multiple correlation coefficient can be greater than one. If this were to occur, adjustments to $\mathbf{R}_{XX}$ and/or $\boldsymbol{\rho}_{XY}$ must be made, such that a realistic set of parameter values can be used for estimating N and $N_M$.

When planning an appropriate sample size, regard-

less of whether it is for an application of PA or AIPE,

it is typically unrealistic to proceed as if the values of

the necessary population parameters are known ex-

actly. Given that, a researcher who uses methods of

sample size planning should conduct a sensitivity

analysis. A sensitivity analysis involves calculating

appropriate sample sizes using a range of realistic

values of the necessary population parameters. In the context of the present article, a researcher would specify likely values of $R^2$ and $R^2_{XX_j}$ in order to determine their effects on N and $N_M$. For the values of N and $N_M$ computed with the various parameter values

in the sensitivity analysis, the most appropriate esti-

mate of sample size is chosen given what is deemed to

be the most appropriate input parameter values. It is

also advantageous to triangulate planned sample sizes

from multiple methods, rather than focusing only on a

single technique. The suggestion of a sensitivity

analysis and multiple methods of obtaining estimates

of sample size are provided in order for the researcher

to have a firm grasp on the nonlinear relationship

between the required sample size and the unknown

parameter values.
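A sensitivity analysis of this kind can be sketched in a few lines of Python. The sketch assumes that Equation 2 has the form $N = (z_{(1-\alpha/2)}/w)^2 (1 - R^2)/(1 - R^2_{XX_j}) + p + 1$, a form that reproduces the value of 453.98 reported in the worked example of this article; the function name and the grid of input values are hypothetical and chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist


def aipe_n(r2, r2_xxj, p, w, alpha=0.05):
    """Sample size giving an expected CI half-width of w
    (assumed form of Equation 2)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (z / w) ** 2 * (1 - r2) / (1 - r2_xxj) + p + 1


p, w = 5, 0.10
r2_values = (0.10, 0.17, 0.25)       # plausible values of R^2
r2_xxj_values = (0.20, 0.29, 0.40)   # plausible values of R^2_XXj

# One row per R^2 value, one column per R^2_XXj value.
for r2 in r2_values:
    row = [ceil(aipe_n(r2, r2_xxj, p, w)) for r2_xxj in r2_xxj_values]
    print(f"R2={r2:.2f}:", row)
```

Reading across a row shows how quickly the planned sample size grows as the predictor of interest becomes more collinear with the others; reading down a column shows the more modest effect of $R^2$.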

Although the particular value of w is arbitrary and

depends only on the desired width for the confidence

interval, researchers should keep in mind the likely range of $\beta_j$ when choosing w, even though the value of $\beta_j$ itself need not be known. Although there have

been conventions established regarding the magnitude

of particular effect sizes (e.g., Cohen’s, 1988, conven-

tions for the standardized mean difference and the

zero-order correlation coefficient), no such conven-

tions have been established for standardized regres-

sion coefficients. For example, a medium standard-

ized regression coefficient might be viewed as

resulting from medium zero-order correlations. In reality, however, the population $\beta_j$ will depend greatly

on the number of predictors, even when all zero-order

correlations are medium. In such multiparameter situ-

ations, it becomes very difficult to develop a mean-

ingful scale for small, medium, and large effect sizes.

Even though effect size conventions do not exist for

the relative size of the standardized regression coefficient, the likely value of $\beta_j$ is in the interval [−1, 1]. In the special case in which there is only one predictor, $\beta_j$ is literally the population correlation coefficient between the predictor and the criterion variable. However, if there is more than one predictor variable, the $\beta_j$s are not confined to the interval [−1, 1], as they

do not represent correlations. Thus, the choice of w is

not necessarily obvious, in large part because of the

interpretation of the standardized regression coeffi-

cient and its interrelatedness with the other predictors

in the model. Not surprisingly, all other things being

equal, the smaller the specified w, the larger the re-

quired sample size.

Example and Application of the Procedures

Suppose that a researcher is interested in perform-

ing an analysis using multiple regression. Further sup-

pose that the researcher is interested in obtaining a

precise estimate of a particular population standardized regression coefficient. In particular, rather than having an embarrassingly large confidence interval around the estimated $\beta_j$ of interest, the researcher decides that a confidence interval with an expected width of 0.20 will provide a sufficiently precise estimate of $\beta_j$; thus, w is defined as 0.10. The researcher is also interested in calculating $N_M$, such that there will be an 80% chance that the $\beta_j$ of interest will have a corresponding confidence interval that has a half-width no larger than the specified w of 0.10.

Suppose that after consulting past research and in

line with theory, the researcher determines that an

exchangeable correlational structure seems reason-

able, and the five predictor variables that are to be

used in the analysis are hypothesized to correlate with

one another at .40. Further, suppose there is reason to

believe that there is likely to be a medium effect, a

correlation of .30, between each of the predictor vari-

ables and the criterion.

Footnote: Cohen (1988) even acknowledged the difficulties and inconsistencies in conventions for effect size measures in the context of multiple regression. These inconsistencies are due to the interrelatedness of p, the multiple correlation coefficients, and the zero-order correlation coefficients (Cohen, 1988, p. 413; see also Maxwell, 2000, p. 438).

Following Equation 4, the $R^2$ can be shown to equal .17, and from Equation 5, the $R^2_{XX_j}$ predicting the jth regressor from the remaining p − 1 predictors equals .29. The researcher then solves for the estimated N by use of Equation 2, which yields a value of 453.98. When rounded to the next largest integer, the estimated N from Equation 2 provides the researcher with an estimated sample size of 454. Accordingly, if the input parameter values were correct, using a sample size of 454 will yield a confidence interval around $\beta_j$ that has an expected half-width of 0.10.
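The arithmetic of this example can be retraced with the short Python sketch below. It assumes the exchangeable-structure closed forms for the two multiple correlations and the form of Equation 2 coded in the comments; both are assumptions made for illustration, although they reproduce the values of .17, .29, and 453.98 reported above.

```python
from math import ceil
from statistics import NormalDist

p, rho_xx, rho_xy = 5, 0.40, 0.30   # inputs from the example
w, alpha = 0.10, 0.05               # desired half-width and alpha

# Closed forms for an exchangeable structure
# (assumed equivalent to Equations 4 and 5 in this special case).
r2 = p * rho_xy ** 2 / (1 + (p - 1) * rho_xx)
r2_xxj = (p - 1) * rho_xx ** 2 / (1 + (p - 2) * rho_xx)

# Assumed form of Equation 2:
# N = (z / w)^2 * (1 - R^2) / (1 - R^2_XXj) + p + 1
z = NormalDist().inv_cdf(1 - alpha / 2)
n_exact = (z / w) ** 2 * (1 - r2) / (1 - r2_xxj) + p + 1
n = ceil(n_exact)

print(f"R2 = {r2:.2f}, R2_XXj = {r2_xxj:.2f}")
print(f"N = {n_exact:.2f}, rounded up to {n}")
```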

To compute $N_M$, such that there is an 80% chance of obtaining a confidence interval for $\beta_j$ with a half-width no larger than 0.10, the researcher uses Equation 3. Implicit in Equation 3 for this example is the fact that the sample variance of $\beta_j$ is expected to be less than the parameter value 80% of the time. Because the obtained w will be less than the w specified if the variance of $\beta_j$ is smaller in the sample than the

parameter value used to estimate sample size, the ob-

tained w will be no greater than the specified w with

a probability of .80.

The .80 quantile of the chi-square distribution with

N − 1 degrees of freedom is 478.12. This critical

chi-square value is then divided by N − p − 1, yielding

a variance correction factor of 1.07. Following Equation 3, $N_M$ is estimated at 484.10 and after being rounded up to the next largest integer yields a value of 485. If the parameter values estimated by the researcher were correct, using an $N_M$ of 485 will provide the researcher with approximately an 80% chance of obtaining a w of 0.10 or less for the confidence interval around the beta weight of interest. Notice that sample size increases by only 31 (or 6.83%) when specifying 80% confidence that the obtained w would be less than the specified width. Typically $N_M$ is not considerably greater than N and should be considered for the added assurance it provides for a precise estimate with what generally amounts to a relatively small cost.
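The modified sample size can be approximated with the following Python sketch. Two assumptions should be noted: Equation 3 is taken to multiply the variance term of Equation 2 by the chi-square quantile divided by N − p − 1, as the description above suggests, and the chi-square quantile is obtained from the Wilson-Hilferty approximation rather than from exact tables, because the Python standard library has no chi-square quantile function. The function names and the argument `gamma` (the desired degree of assurance) are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def chi2_quantile_wh(prob, df):
    """Wilson-Hilferty approximation to a chi-square quantile (not exact)."""
    z = NormalDist().inv_cdf(prob)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * sqrt(c)) ** 3


def aipe_n_modified(r2, r2_xxj, p, w, gamma, alpha=0.05):
    """Assumed form of Equation 3: the Equation 2 variance term is inflated
    by chi2(gamma; N - 1) / (N - p - 1), with N the rounded Equation 2 value."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    core = (z / w) ** 2 * (1 - r2) / (1 - r2_xxj)
    n = ceil(core + p + 1)
    correction = chi2_quantile_wh(gamma, n - 1) / (n - p - 1)
    return core * correction + p + 1, correction


r2, r2_xxj = 0.45 / 2.6, 0.64 / 2.2   # exchangeable example: about .17 and .29
n_m, correction = aipe_n_modified(r2, r2_xxj, p=5, w=0.10, gamma=0.80)
print(f"correction = {correction:.2f}, N_M = {n_m:.2f}, "
      f"rounded up to {ceil(n_m)}")
```

With an exact quantile routine from a statistical library, only `chi2_quantile_wh` would need to be replaced.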

When the assumption of exchangeability does not

hold, generally a different sample size will be estimated for each of the p predictors. In the following example, suppose a researcher hypothesizes the following population parameters for the $\mathbf{R}_{XX}$ intercorrelation matrix and the $\boldsymbol{\rho}_{XY}$ vector, respectively:

$$\mathbf{R}_{XX} = \begin{bmatrix} 1 & & \\ .40 & 1 & \\ .60 & .05 & 1 \end{bmatrix}, \qquad \boldsymbol{\rho}_{XY} = \begin{bmatrix} .50 \\ .30 \\ .10 \end{bmatrix}.$$

Further suppose the desired half-width and alpha were set to 0.15 and .05, respectively. In this scenario, the planned sample sizes would be estimated as 237, 154, and 201 for Predictors 1, 2, and 3, respectively. Furthermore, if the researcher wanted to have 90% confidence that the obtained w would be less than or equal to 0.15, $N_M$ would be 268, 180, and 229 for

Predictors 1, 2, and 3, respectively. Thus, when ex-

changeability does not hold, planning sample size for

a specific predictor may provide expected ws nar-

rower or wider than the specified value for the re-

maining p − 1 predictors, depending on the tolerance

of the predictor for which sample size was calculated.
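The predictor-specific sample sizes in this example can be retraced with the Python sketch below. It assumes the usual matrix expressions for the two multiple correlations ($R^2 = \boldsymbol{\rho}_{XY}'\mathbf{R}_{XX}^{-1}\boldsymbol{\rho}_{XY}$, and a tolerance for predictor j equal to the reciprocal of the jth diagonal element of $\mathbf{R}_{XX}^{-1}$) together with the assumed form of Equation 2 used for the exchangeable example; the function names are hypothetical, and the small matrix inverse is written out only to keep the sketch free of external libraries.

```python
from math import ceil
from statistics import NormalDist


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def aipe_n_per_predictor(r_xx, rho_xy, w, alpha=0.05):
    """Assumed form of Equation 2, applied to each predictor in turn."""
    p = len(rho_xy)
    inv = invert(r_xx)
    # R^2 = rho' R_xx^{-1} rho
    r2 = sum(rho_xy[i] * inv[i][j] * rho_xy[j]
             for i in range(p) for j in range(p))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    sizes = []
    for j in range(p):
        tolerance = 1.0 / inv[j][j]   # 1 - R^2_XXj
        sizes.append(ceil((z / w) ** 2 * (1 - r2) / tolerance + p + 1))
    return sizes


r_xx = [[1.00, 0.40, 0.60],
        [0.40, 1.00, 0.05],
        [0.60, 0.05, 1.00]]
rho_xy = [0.50, 0.30, 0.10]
sizes = aipe_n_per_predictor(r_xx, rho_xy, w=0.15)
print(sizes)        # one planned N per predictor
print(max(sizes))   # N that keeps every expected half-width at or below w
```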

When interest lies in the w for a specific predictor,

no problems arise regardless of whether the correla-

tional structure is or is not exchangeable. Sample size

is calculated for the specific predictor regardless of

whether the tolerance for the predictor of interest is

smaller or larger than any of the remaining p − 1

predictors. Under this strategy, researchers are concerned foremost with the width of the confidence interval for the beta of interest and less so for the remaining p − 1 predictors. For example, in the scenario in the previous paragraph, a researcher whose question pertains specifically to estimating the relationship between $X_3$ and Y controlling for $X_1$ and $X_2$ should choose an N of 201 or an $N_M$ of 229.

Another strategy in situations in which exchangeability does not hold leads to the expected widths of all of the confidence intervals being as narrow as or narrower than the specified w. In this approach the sample size used for the study is the largest of the p different sample sizes. Thus, the expected half-width for the predictor with the lowest tolerance is w, whereas the expected half-widths for the remaining p − 1 confidence intervals will be less than w; to what degree depends on the tolerance of the other predictors. For example, given $N_M$ values of 268, 180, and 229 for the three predictors, respectively, a researcher interested in a narrow confidence interval for each and every predictor should choose an $N_M$ of 268.

Power Analysis Versus Accuracy in Parameter Estimation

Estimating sample size from a PA perspective is

conceptually different than estimating sample size to

achieve AIPE. This conceptual difference can poten-

tially translate into very different practical implica-

tions. This section considers the relative sample sizes

required by the two approaches. Maxwell (2000)

showed that sample size could be estimated for a

given predictor to obtain a specified power using the

following formula:

$$N = \left(\frac{\lambda}{\beta_j^2}\right)\left(\frac{1 - R^2}{1 - R^2_{XX_j}}\right) + p - 1, \tag{6}$$

where $\lambda$ is a noncentrality parameter from an F distribution with 1 numerator and N − p − 1 denominator degrees of freedom. The $\lambda$ value in Equation 6 is a tabled critical value that determines the power of a given statistical test for a predictor of interest. The required value of $\lambda$ for a specified degree of power can be obtained from Cohen’s (1988, pp. 448–455) tables or from the appropriate noncentral F distribution.

The relative sample size required for AIPE versus PA can be compared by the following two multiplicative ratios found in Equations 2 and 6, respectively:

$$\left(\frac{z_{(1-\alpha/2)}}{w}\right)^2 \quad \text{versus} \quad \left(\frac{\lambda}{\beta_j^2}\right).$$

Unless p is very large, the ratio of required sample size for AIPE compared with PA is approximately $(z^2_{(1-\alpha/2)}\beta_j^2)/(\lambda w^2)$ to 1. Note that the population standardized regression coefficient is the only one of the four values beyond the researcher’s control. Whereas $\alpha$, $\lambda$, and w are chosen to coincide with the goals of

the research project, the PA approach requires that the

parameter value or the minimally important value of

the standardized regression coefficient be specified.

Note that a value for the standardized regression co-

efficient is not necessary when planning sample size

for precision. For this reason, planning sample size

from the AIPE perspective is actually easier than ap-

proaching sample size planning from the PA perspec-

tive.

Unless p is very large, sample size for PA is approximately

$$N = M_{PA}\left(\frac{1 - R^2}{1 - R^2_{XX_j}}\right), \tag{7}$$

where $M_{PA} = \lambda/\beta_j^2$, which is the multiplier used for the PA approach. Similarly, sample size for AIPE is approximately

$$N = M_{AIPE}\left(\frac{1 - R^2}{1 - R^2_{XX_j}}\right), \tag{8}$$

where $M_{AIPE} = (z_{(1-\alpha/2)}/w)^2$, which is the multiplier used in the AIPE approach. Figure 2 depicts the relationship of the multipliers for PA and AIPE for population betas for various values of power and precision ($\alpha = .05$). As Equations 7 and 8 show, multiplying the corresponding value on the ordinate for either power or precision in Figure 2 by the ratio $(1 - R^2)/(1 - R^2_{XX_j})$ yields an approximate sample size. More generally, the relative elevation of a curve or line represents the relative sample size required to achieve a desired level of power or precision.

Figure 2. Relationship of the relative planned sample size for the accuracy in parameter estimation (AIPE) and the power analytic (PA) approaches to sample size planning as a function of the population beta weight (approximate sample size in the special case when $R^2 = R^2_{XX_j}$).

Several practical implications emerge from Figure 2. First, as the curves and lines show, as the population $\beta_j$ becomes larger, sample size for power can be much smaller than it is for precision. Conversely, when the $\beta_j$ is small, sample size for power can be much larger than is required for precision. For example, when the $\beta_j$ equals 0.30, the sample size required to obtain a confidence interval with an expected half-width of 0.10 is just over 4 times as large as the sample size needed to obtain a power of .80. However, when $\beta_j$ is 0.08, the sample size needed for a power of .80 is more than 3 times larger than that needed to obtain a confidence interval with an expected half-width of 0.10. Note that these relationships hold true regardless of the values of $R^2$ and $R^2_{XX_j}$, as both of these values play the same role in Equations 7 and 8. Second, for constant values of $R^2$ and $R^2_{XX_j}$, sample size for precision is independent of the value of $\beta_j$, whereas smaller samples can provide adequate power for larger values of $\beta_j$. Third, implicit in Equation 8 and as depicted in Figure 2, halving the width of a confidence interval for $\beta_j$ requires approximately a fourfold increase in sample size. Fourth, in the special case in which $R^2$ is equal to $R^2_{XX_j}$ (that is, $(1 - R^2)/(1 - R^2_{XX_j}) = 1.00$), the values on the ordinate based on the curve for power and the line for precision are approximately the required sample sizes.
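The comparison of the two multipliers can be sketched in Python as follows. One assumption deserves emphasis: instead of reading $\lambda$ from Cohen’s (1988) tables or from the noncentral F distribution, the sketch uses the large-sample normal approximation $\lambda \approx (z_{(1-\alpha/2)} + z_{\text{power}})^2$, so the resulting ratios are approximate. The function names are hypothetical.

```python
from statistics import NormalDist

_norm = NormalDist()


def multiplier_aipe(w, alpha=0.05):
    """M_AIPE = (z_{1 - alpha/2} / w)^2."""
    return (_norm.inv_cdf(1 - alpha / 2) / w) ** 2


def multiplier_pa(beta_j, power, alpha=0.05):
    """M_PA = lambda / beta_j^2, with lambda taken from a large-sample
    normal approximation rather than from tables (an assumption)."""
    lam = (_norm.inv_cdf(1 - alpha / 2) + _norm.inv_cdf(power)) ** 2
    return lam / beta_j ** 2


for beta_j in (0.08, 0.30):
    m_aipe = multiplier_aipe(w=0.10)
    m_pa = multiplier_pa(beta_j, power=0.80)
    print(f"beta_j={beta_j:.2f}  M_AIPE={m_aipe:.1f}  M_PA={m_pa:.1f}  "
          f"AIPE/PA={m_aipe / m_pa:.2f}")
```

Because w enters $M_{AIPE}$ as $1/w^2$, halving w in `multiplier_aipe` quadruples the multiplier, which is the fourfold increase noted in the third implication above.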

Thus, it is clear that the two methods are different

from the outset and can yield very different estimates

of sample size in the same study. Each is designed to

answer a different question, and as can be seen, they

do just that. The two approaches differ on a philo-

sophical level, one designed to achieve a narrow in-

terval and one designed to obtain an interval that does

not contain the specified null value. The point is that

depending on what the researcher’s question is and

the desired outcome, a different approach to sample

size estimation will be needed. Neither approach is

necessarily “right” or “wrong” for a given problem;

these approaches are merely different in the questions

that they attempt to answer. It is recommended that

the two approaches be used in conjunction with one

another in order to achieve reasonable statistical

power while obtaining confidence intervals that are

sufficiently narrow.

Random Versus Fixed Predictors and the Issue of Standardization

In the present article it was assumed that the pre-

dictor variables were random and that all variables

were standardized. The reason that standardized val-

ues were discussed exclusively is because correlations

tend to be easier to hypothesize and work with than

variances and covariances, which would be necessary

to carry out AIPE in the unstandardized case. Another

reason why standardized regression coefficients are

beneficial is because of the arbitrariness of most

scales of measurement used in the behavioral sci-

ences. Furthermore, a widely used convention of the

magnitude of effect is available for correlations in

psychology (Cohen, 1988, section 3.2). It should be

clear, however, that if the hypothesized values are correct when finding N and $N_M$ for standardized values, they will provide the same relative degree of precision around the unstandardized regression coefficients. The relative degree of precision regarding w is scaled in terms of the ratio of the standard deviation of the criterion to the standard deviation of the jth predictor ($s_Y/s_{X_j}$).

With regard to random and fixed predictor values in

the unstandardized case, Sampson (1974) showed that

regardless of the predictors being fixed or random,

“we obtain the same estimates for the regression co-

efficients and the variance of the error” (p. 684 from

Theorem 1). There is, however, a difference between

the two cases. Note that if the population $R^2 = 0$, then the distribution of the sample $R^2$ is identical in both cases and follows a central F distribution. However, the distribution of the sample $R^2$ is different for the two cases when the population $R^2 \neq 0$ (Stuart, Ord, & Arnold, 1999, section 28.29). In fact, the distribution of the sample $R^2$ is a noncentral F distribution in the

case of fixed predictors, whereas it is not in the case

of random predictors (Rencher, 2000, pp. 240–241).

Accordingly, the distribution of the test statistic under

the null hypothesis is the same for the fixed as well as

the random X case, but the power functions for the test

statistic are different for the two cases (Rencher,

2000, chapter 10). Gatsonis and Sampson (1989)

showed that Cohen’s (1988) power tables for deter-

mining sample size are approximations, because Co-

hen treated random predictors as though they were

fixed. However, Gatsonis and Sampson concluded

that “Cohen’s approximation works quite well in


many situations” (p. 519). Thus, practically speaking,

random versus fixed X values have little effect on

applied research because the consequences, in most

cases, are trivial. The issue of standardization, how-

ever, is quite different, especially when standardiza-

tion is performed on random predictor variables.

Even though multiple regression using standardized

random predictors is common practice in behavioral

research, as well as in many other fields, there are

nuances associated with this strategy that are not

widely known and are potentially problematic. As

previously stated, the formula (see Equation 1) for the

standard error of a regression coefficient that is ran-

dom and standardized is approximate. The formula, as

given explicitly in sources such as Cohen and Cohen

(1983) and Harris (1985) and implicitly in many oth-

ers, treats the standard deviation of each predictor as

a constant value. This is obviously not the case when

the predictors are random, as the standard deviation of

the predictor is itself a random variable. This is con-

trasted with the situation in which the values of the

predictor variables are preset in advance and thus the

standard deviation of those predictors would not vary

across replications of the study.

In order to transform an unstandardized regression coefficient to a standardized regression coefficient, one can multiply the raw score regression coefficient by $s_{X_j}/s_Y$, so as to remove the (generally arbitrary) scaling of Y and $X_j$. Likewise, this same procedure is commonly done in order to obtain the standard error of the standardized regression coefficient. However, “standard errors of standardized parameters, in general, are not a simple rescaling of the standard errors of the original parameter estimates” (Jamshidian & Bentler, 2000, p. 74). The problem with scaling the standard error of a standardized regression coefficient in the random predictor case can be seen by a well-known property of variances. If C is a constant and V is a random variable, $\mathrm{Var}(CV) = C^2\,\mathrm{Var}(V)$, where $\mathrm{Var}(\cdot)$ represents the variance of the quantity in parentheses. However, if C is itself a random variable, then $\mathrm{Var}(CV) \neq C^2\,\mathrm{Var}(V)$. Common formulas for the standard error of standardized regression coefficients (e.g., Equation 1) assume that the standard deviation of the predictor is fixed. In the case of random predictor variables, such an assumption implies that $\mathrm{Var}(CV) = C^2\,\mathrm{Var}(V)$. Because this assumption is false, the variability of $X_j$ is not taken into consideration when calculating the standard error of standardized regression coefficients from the random X case, which generally leads to incorrect standard errors.
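The variance property invoked here can be checked numerically with the small Python simulation below. It is a generic illustration under assumptions chosen for convenience (a normally distributed V and a multiplier drawn independently of V); in regression the multiplier and the coefficient are not independent, so the sketch shows only that the equality fails when the multiplier is random, not the direction or size of the error in the regression setting. All names and distributional choices are hypothetical.

```python
import random


def variance(values):
    """Population variance computed with plain floats."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


random.seed(2003)
n_draws = 100_000

v = [random.gauss(0.0, 1.0) for _ in range(n_draws)]          # Var(V) = 1
c_constant = 2.0                                              # fixed multiplier
c_random = [random.gauss(2.0, 0.5) for _ in range(n_draws)]   # random, mean 2

var_v = variance(v)
var_cv_constant = variance([c_constant * x for x in v])
var_cv_random = variance([c * x for c, x in zip(c_random, v)])

print(f"C^2 Var(V)          = {c_constant ** 2 * var_v:.3f}")
print(f"Var(CV), C constant = {var_cv_constant:.3f}")
print(f"Var(CV), C random   = {var_cv_random:.3f}")
```

The first two printed values agree, whereas the third differs because the multiplier contributes its own variability.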

In structural equation modeling (SEM), which can

be viewed as a generalization of multiple regression,

several authors have illustrated the potential problems

of analyzing a correlation matrix as if it were a covariance matrix (e.g., Babakus, Ferguson, & Jöreskog,

1987; Browne, 1982; Cudeck, 1989; Jamshidian &

Bentler, 2000). Steiger (2001) concluded that SEM

parameter estimates based on a correlation matrix

(analogous to standardized coefficients in multiple re-

gression) may be correct, whereas their standard er-

rors are incorrect (see also Lawley & Maxwell, 1971,

chapter 7, for technical details). MacCallum and Aus-

tin (2000) stated that when a correlation matrix is

analyzed as if it were a covariance matrix in SEM, “in

all cases, standard errors of parameter estimates as

well as confidence intervals and test statistics for pa-

rameter estimates will be incorrect,” and they further

emphasized that the “correct standard errors will gen-

erally be smaller than the incorrect values which re-

sults in narrower confidence intervals and larger test

statistics” (p. 217). For the reasons outlined in this

section regarding the approximate nature of Equation

1, a simulation study was conducted to verify the

integrity of the procedures suggested throughout the

article.

Results of Monte Carlo Simulations

If Equation 1 were exact, the assumptions were met, and the multiple correlation coefficients were correctly specified, the sample size estimation procedures presented here would yield correct estimates of required sample size. However, whenever the values of

the predictors are random and standardized, rather

than being fixed, Equation 1 is an approximation. In

applications of multiple regression to observational

studies in the behavioral sciences, predictors are typically random, not fixed. Further, standardization often occurs in the behavioral sciences because of the interpretational problems associated with arbitrary scales of measurement. Under these circumstances, it was unclear whether basing planned sample size on Equation 2 would produce an interval with the desired width. In addition to ensuring that Equation 2 consistently yields accurate estimates of sample size, a Monte Carlo study was necessary because Equation 3 implicitly assumes Equation 2 is correct.

Footnote: The reason that multiplying the standard error of the unstandardized regression coefficient by $s_{X_j}/s_Y$ removes the scaling of the jth predictor can be seen by the formula for the standard error of the unstandardized regression coefficient: $(s_Y/s_{X_j})\sqrt{(1 - R^2)/[(1 - R^2_{XX_j})(N - p - 1)]}$. Multiplying this formula by $s_{X_j}/s_Y$ removes the scaling of Y and $X_j$ from the standard error and is commonly, yet inappropriately, assumed to be the correct standard error for the jth standardized regression coefficient when the predictor is random.
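The link between the approximate standard error and the planned sample size can be checked with a short Python sketch. It assumes that the approximate standard error of a standardized coefficient is the footnoted unstandardized formula with the $s_Y/s_{X_j}$ scaling removed, and it uses a normal rather than a t critical value; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def se_standardized_beta(r2, r2_xxj, n, p):
    """Approximate standard error of a standardized coefficient: the
    unstandardized formula with the s_Y / s_Xj scaling removed."""
    return sqrt((1 - r2) / ((1 - r2_xxj) * (n - p - 1)))


r2, r2_xxj = 0.45 / 2.6, 0.64 / 2.2   # exchangeable example: about .17 and .29
se = se_standardized_beta(r2, r2_xxj, n=454, p=5)
half_width = NormalDist().inv_cdf(0.975) * se
print(f"SE = {se:.4f}, half-width = {half_width:.3f}")
```

At the planned sample size of 454 the implied half-width is essentially the specified w of 0.10, which is what Equation 2 is designed to deliver if the approximation holds.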

One scenario studied in the Monte Carlo simulation was the aforementioned exchangeable structure with five predictors and where $\rho_{XX} = .40$ and $\rho_{XY} = .30$.

The simulation revealed that Equations 2 and 3 pro-

duced very accurate results in this situation. Recall

that when w is specified as 0.10 for this scenario,

Equation 2 dictates a necessary sample size of 454.

The mean w for the five betas, each based on 10,000

replications, using a sample size of 454, was 0.101,

with a standard deviation of 0.003; the median w was

also 0.101. Recall that having an 80% chance of ob-

taining a w no larger than the specified value of 0.10

requires a necessary sample size of 485 based on

Equation 3. The mean and the median confidence in-

terval half-width using a sample size of 485 was

0.098, with a standard deviation of 0.003. Most im-

portant, 81.64% of the obtained ws were no larger

than the specified value of 0.10. Further, the 80th

percentile for the empirical distribution of the ob-

tained ws was 0.10. In summarizing the results for this

scenario, the suggested procedures yielded an original

sample size such that the mean of the ws was 0.101

and a modified sample size that led to just over 80%

of the confidence intervals being no larger than speci-

fied.

This example was selected because we thought it

was reasonably typical of a behavioral research sce-

nario. However, this single scenario cannot address

the extent to which the approximation is accurate for

other situations. To investigate the general accuracy

of the procedures, we undertook a large Monte Carlo

simulation study to address the appropriateness of

Equation 2. In the simulation study 166 different con-

ditions were examined. In the different conditions a

variety of correlational structures were used. The ws

were specified to be 0.025, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.35, using ps of 2, 5, and 10. Presumably

the simulations encompass the likely ranges of w and p that are commonly of interest to behavioral researchers, combined with a variety of correlation structures

to show generality. Each condition in the simulation

study was based on 10,000 replications. The results

showed that the suggested procedures generally per-

formed very well. Because of the large number of

conditions that were studied, the tabled results could

not be presented; however, detailed descriptions of

the results follow.

The mean, median, and standard deviation of the

percentage of error were determined for each of the

166 conditions that were examined. The percentage of

error was determined by subtracting the specified w

from the mean of the obtained ws, dividing this dif-

ference by the specified w, and then multiplying by

100. For example, if the mean of the obtained ws was

0.204 when the specified w was 0.20, the percentage

of error would be computed as follows: 100(0.204 − 0.20)/0.20 = 2.00. Thus, in this condition the mean of

the obtained ws was 2.00% larger than the specified w.
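The percentage of error is simple enough to express as a one-line function; the Python sketch below uses a hypothetical function name and applies it to the worked value above.

```python
def percentage_of_error(mean_obtained_w, specified_w):
    """Percentage by which the mean obtained half-width departs
    from the specified half-width."""
    return 100 * (mean_obtained_w - specified_w) / specified_w


# Worked value from the text: mean obtained w of 0.204, specified w of 0.20.
print(round(percentage_of_error(0.204, 0.20), 2))
```

A positive value indicates that the obtained intervals were, on average, wider than planned.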

In the simulation conditions in which p was 2, all

combinations of small, medium, and large correla-

tions among the predictors as well as the criterion (27

total) were completely crossed with ws of 0.05, 0.10,

and 0.20. Thus, a total of 81 different conditions were

examined for p = 2. The mean and median of the

percentage of error were 0.33 and 0.17, respectively,

with a standard deviation of 0.34. The minimum per-

centage of error was 0.01 for a case in which w was

0.05, and the maximum percentage of error was 1.85

for a case in which w was 0.20. Thus, in the worst

case out of the 81 different conditions for p = 2, the

mean of the obtained w was less than 0.01 units larger

than expected.

In the case in which p was 5, the results are re-

ported separately for two different types of correla-

tional structures. In the first type of correlational

structure, 25 different exchangeable structures were

examined. In any single one of the 25 combinations,

all predictors correlated equally among themselves

and each correlated equally with the criterion variable. Correlations among predictors consisted of $\rho_{XX}$ values of .10, .20, .30, .40, and .50. Correlations of the predictors with the criterion consisted of $\rho_{XY}$ values of .10, .20, .30, .40, and .50. Thus, $\rho_{XX}$ and $\rho_{XY}$ each varied from small to large by .10 and yielded a 5 × 5

factorial design.

Two combinations of correlations are excluded from the following descriptive statistics because their multiple correlations between the predictors and criterion are greater than .80 and not representative of most psychological research. The mean and median percentage of error for the remaining 23 ws were 1.87 and 1.03, respectively, with a standard deviation of 2.22. The minimum percentage of error was 0.22, and the maximum was 10.00. This worst case occurred when the correlations among the predictors were .10 and the correlations between the predictors and criterion were .40. This correlational structure is unlikely in most behavioral research because R = .76. However, even this condition had a mean w that was only 0.01 units larger than expected.

Footnote: The complete set of simulation results is available in tabular format from Ken Kelley or Scott E. Maxwell. The code, which was written in R/S-PLUS, is also available on request. Note that the anonymous reviewers were provided with the simulation results as part of their assessment of our procedures.

The other simulations that were conducted for p = 5 were based on two published correlational structures. The first was a subset of a correlation matrix

obtained from the developmental literature (Smari,

Petursdottir, & Porsteinsdottir, 2001), and the other

was obtained from an example given in an SEM text

(Table 7.1 in Loehlin, 1998). The mean and median of

the absolute percentage of error for the 30 conditions

(15 from each example) were 0.55 and 0.23, respec-

tively, with a standard deviation of 0.76. The mini-

mum of the absolute percentage of error was 0.01 in

a condition in which w was 0.025, and the maximum

was 2.75 in a condition in which w was 0.35. Thus,

the worst condition in this situation produced a mean

w of 0.36 when the specified w was 0.35.

For p = 10, the correlation matrix used was a

subset of one obtained from the clinical–counseling

literature that had previously been cited in an SEM

text (Worland, Weeks, Janes, & Strock, 1984, as cited

in Kline, 1998, p. 254). The mean and median of the

percentage of error for the 30 conditions that were

examined were 0.18 and 0.09, respectively, with a

standard deviation of 0.19. The smallest absolute per-

centage of error was less than 0.01 for a case in which

w was 0.05, and the largest percentage of error was

0.67 for a condition in which w was 0.20. Thus, the

condition with the largest discrepancy had a percent-

age of error less than 1%.

Recall the cited SEM literature in which it has been

shown that the standard errors of parameter estimates

are generally inflated when a correlation matrix is

treated as a covariance matrix. Because ordinary least

squares (OLS) multiple regression is a special case of

SEM, it follows that the standard errors of OLS mul-

tiple regression are often inflated when predictor vari-

ables are random and standardized. In 130 of the 166

conditions investigated (78.31%), the confidence in-

terval coverage was greater than 95% (the nominal

alpha was set to .05). The mean and median percent-

age of coverage were 95.53 and 95.24, respectively,

with a standard deviation of 0.78. Whereas the small-

est percentage of coverage was 94.34, the largest per-

centage of coverage was 97.89. Thus, the results of

the simulations have shown empirically the approxi-

mate nature of Equation 1 and the fact that OLS mul-

tiple regression tends to have inflated standard errors

when predictor variables are random and have been

standardized.
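The tendency toward coverage above the nominal level can be illustrated with a small Monte Carlo sketch. It is not the authors' simulation code. It assumes that Equation 1 uses the standard error sqrt((1 − R²) / ((1 − R²_XXj)(N − p − 1))), it substitutes a normal quantile for the t quantile, and the population correlations, the sample size of 294, and the number of replications are hypothetical choices made only for illustration.

```python
import math
import random
from statistics import NormalDist


def pearson(u, v):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def simulate(rho_y1, rho_y2, rho_12, n, reps, seed=1):
    """Return (coverage, SD of estimates, mean assumed Equation 1 SE) for slope 1."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)  # normal stand-in for the t quantile
    # Population standardized slopes and residual SD for two predictors.
    beta1 = (rho_y1 - rho_y2 * rho_12) / (1 - rho_12 ** 2)
    beta2 = (rho_y2 - rho_y1 * rho_12) / (1 - rho_12 ** 2)
    resid_sd = math.sqrt(1 - (beta1 * rho_y1 + beta2 * rho_y2))
    mix = math.sqrt(1 - rho_12 ** 2)
    estimates, std_errors, hits = [], [], 0
    for _ in range(reps):
        # Random (not fixed) predictors drawn from a trivariate normal model.
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        x2 = [rho_12 * a + mix * rng.gauss(0, 1) for a in x1]
        y = [beta1 * a + beta2 * b + resid_sd * rng.gauss(0, 1)
             for a, b in zip(x1, x2)]
        # Standardized slopes follow directly from the sample correlations.
        r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
        b1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)
        b2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2)
        r2 = b1 * r_y1 + b2 * r_y2
        se = math.sqrt((1 - r2) / ((1 - r_12 ** 2) * (n - 3)))  # p = 2
        estimates.append(b1)
        std_errors.append(se)
        hits += abs(b1 - beta1) <= z_crit * se
    mean_b = sum(estimates) / reps
    sd_b = math.sqrt(sum((b - mean_b) ** 2 for b in estimates) / (reps - 1))
    return hits / reps, sd_b, sum(std_errors) / reps


coverage, sd_b1, mean_se = simulate(0.50, 0.10, 0.10, n=294, reps=1000)
print(round(coverage, 3), round(sd_b1, 3), round(mean_se, 3))
```

With these hypothetical inputs, the empirical standard deviation of the standardized slope should fall below the average formula-based standard error, which is the pattern that pushes coverage above 95%.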

The fact that Equation 1 is approximate and gen-

erally provides confidence intervals wider than nec-

essary raises some questions regarding its use as well

as the use of Equations 2 and 3 in the context of

sample size planning for precise estimates of stan-

dardized regression coefficients. For example, in the

case in which the largest confidence discrepancy oc-

curred, 97.89% of the computed confidence intervals

bracketed the population parameter. Applying Equation 1 to this condition (w = 0.10, ρ = .50, ρ = .10, ρ = .10, p = 2), we found that the population correlations would suggest that the standard

error was 0.051. A simulation based on 1,000,000

replications showed that, consistent with the SEM literature, the standard deviation of the regression coefficients was 0.044, a value smaller than implied by Equation 1. This result suggests that the sample size calculated from Equation 2, which assumes the standard error from Equation 1 is correct, is approximate and in this particular case somewhat negatively biased. Unfortunately, no exact formula for the standard error is known to exist when predictors are random and standardized. Thus, given the current state of knowledge, researchers need to continue to use Equation 1 for forming confidence intervals around regression coefficients for predictors that are random and standardized. Equations 2 and 3 can then be used in the research design phase in order to determine approximate sample sizes for precise estimates of the regression coefficients of interest.

The two excluded cases consisted of unlikely scenarios for much behavioral research. The first excluded scenario consisted of correlations among the predictors of .10 and correlations between the predictor and the criterion of .50. Such a combination of correlations leads to an R of .95, a value at which the requirement of a positive definite correlation matrix is nearly violated. In this case the mean w was 0.151 when it was specified to be 0.10. Poor performance of the technique in this particular scenario is not surprising, given that many statistical procedures fail when parameters approach their theoretical bounds. The second excluded case is similar to the first and consisted of correlations among the predictors of .20 and correlations between the predictor and the criterion of .50. This combination of correlations leads to an R of .83. In this second excluded scenario, the mean w was 0.112 when it was specified to be 0.10.

Many behavioral scientists would see no problem with an empirical alpha smaller than the nominal alpha level and thus with being more conservative. However, a toxicologist or bioscientist working with chemical agents or medicine would likely argue that a Type II error may be more costly than a Type I error, as concluding that there is “no effect” of a noxious substance could be a harmful mistake. Further, power and precision will be sacrificed if the actual Type I error rate is smaller than the nominal alpha level.
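The R values quoted for the two excluded scenarios can be approximated from the correlations alone. The sketch below uses the standard closed form for equicorrelated predictors, R² = p·r_xy² / (1 + (p − 1)·r_xx). The number of predictors is not stated in the footnote, so p = 5 is purely an assumption, and the function name is illustrative.

```python
import math


def multiple_r_equicorrelated(p, r_xx, r_xy):
    """Population multiple R when every pair of predictors correlates r_xx
    and every predictor correlates r_xy with the criterion."""
    r_squared = p * r_xy ** 2 / (1 + (p - 1) * r_xx)
    return math.sqrt(r_squared)


# p = 5 is an assumption; the footnote gives only the correlations.
r_first = multiple_r_equicorrelated(5, 0.10, 0.50)
r_second = multiple_r_equicorrelated(5, 0.20, 0.50)
print(round(r_first, 3), round(r_second, 3))
```

Under the assumed p = 5, the two values come out close to the reported .95 and .83, and the first leaves only about a tenth of the criterion variance unexplained, which is consistent with a correlation matrix near the edge of positive definiteness.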

Limitations of the Procedure

Although the distribution of $R^2$ is asymptotically normal throughout most of its domain (Stuart et al., 1999, section 28.33), this is not the case as the population $R^2$ approaches its limits. When the population $R^2$ begins to approach zero, the distribution of the observed $R^2$ values becomes positively skewed because of the lower bound at zero. The converse is true as the population $R^2$ begins to approach one, and thus the distribution of the observed $R^2$ values will be negatively skewed.

The fact that the distribution of $R^2$ becomes negatively or positively skewed affects sample size estimation in two ways. Recall from Equations 2 and 3 that there are two multiple correlations in the equations for determining sample size, the model $R^2$ in the numerator and $R^2_{XX_j}$ (from predicting the jth predictor from the remaining predictors) in the denominator. As $R^2$ approaches zero in the population, the estimated sample size for a planned study based on Equation 2 or Equation 3 will, with everything else held constant, tend to be larger than necessary. One way to understand why overestimation occurs is to inspect Equation 1. On the basis of this equation, a confidence interval becomes narrower as $1 - R^2$ becomes smaller. As the population $R^2$ approaches zero and thus the distribution of the observed $R^2$ becomes more positively skewed, the mean observed $R^2$ tends to be greater than the population $R^2$, implying that the mean observed $1 - R^2$ tends to be less than its population counterpart. Accordingly, the observed confidence intervals will tend to be narrower than expected based on the population value of $R^2$. The estimated sample size from Equation 2 or Equation 3 is a function of the population $R^2$; thus, confidence intervals based on sample size estimates from these equations will tend to be narrower than specified when the model $R^2$ approaches zero. In other words, for a desired degree of precision, sample size estimates become inflated as $R^2$ approaches zero. The opposite pattern of results occurs when $R^2$ begins to approach one. In this case the proportion of variance unaccounted for is, on average, larger in the sample than is implied by the population $R^2$. Consequently, the use of Equation 2 or Equation 3 will tend to underestimate sample size.
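The skew described here is easy to see in a small simulation. The sketch below is a simplification, not the article's procedure: it uses a single predictor (so the observed R² is the squared sample correlation), a sample size of 30, and two hypothetical population R² values chosen only to sit near each bound.

```python
import math
import random


def sample_r2(rho, n, rng):
    """Squared sample correlation from n bivariate normal pairs (p = 1)."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        xs.append(x)
        ys.append(rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1))
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy ** 2 / (sxx * syy)


def mean_and_skew(values):
    """Mean and moment-based skewness of a list of numbers."""
    m = sum(values) / len(values)
    m2 = sum((v - m) ** 2 for v in values) / len(values)
    m3 = sum((v - m) ** 3 for v in values) / len(values)
    return m, m3 / m2 ** 1.5


rng = random.Random(2003)
results = {}
for pop_r2 in (0.01, 0.95):  # hypothetical values near each bound
    draws = [sample_r2(math.sqrt(pop_r2), 30, rng) for _ in range(4000)]
    results[pop_r2] = mean_and_skew(draws)
    print(pop_r2, round(results[pop_r2][0], 3), round(results[pop_r2][1], 2))
```

Near zero, the observed values pile up against the lower bound, so the skew is positive and the mean sits above the population value. Near one, the skew turns negative.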

The same phenomenon happens in the denominator with $R^2_{XX_j}$ as it does in the numerator with $R^2$; the only difference is that the relationship is the exact opposite. Because $R^2_{XX_j}$ is in the denominator of Equation 2, the sample size is over- or underestimated in a reverse fashion as was illustrated for $R^2$.

For simplicity, the discussion has been limited to

regression models that include only main effects and

no interaction or other higher order (polynomial)

terms, as there are certain nuances associated with

multiplicative terms that have been scaled in multiple

regression models (see chapter 3 of Aiken & West,

1991, for details regarding multiplicative effects in

multiple regression). Furthermore, the procedures

given here assume that all predictors are included in

the regression model and that no selection of predic-

tors occurs (as would be the case in, e.g., a stepwise

regression analysis).

Discussion

Approaching sample size estimation from a per-

spective of AIPE rather than one exclusively empha-

sizing power is beneficial for a productive science.

Although planning sample size through PA studies is

important and undeniably improves research findings,

the accuracy in those parameter estimates should be at

least as much of a concern as their probability value,

perhaps even more so. An optimal experimental de-

sign consists of an adequate sample size from an

AIPE perspective as well as an adequate sample size

from the PA perspective. Ensuring that sample size is

adequate from both perspectives leads to parameter

estimates that will likely be accurate as well as sta-

tistically significant.

A special case in which precision is especially im-

portant occurs when the goal is to provide evidence in

support of the null hypothesis. If a confidence interval

is sufficiently narrow and power is of sufficient

strength (say, power > .90), at times it may be appro-

priate to show support for the null hypothesis, in the

sense that the value of the parameter is not meaning-

fully different from the null value. Note that this is not

“accepting the null hypothesis” but is merely showing

support for it (Greenwald, 1975).


The simulation study showed that the procedures

presented here were effective in accomplishing their

respective goals. The mean and median of the ob-

served ws were very close to their specified values

when the estimated N (Equation 2) was used to select

sample size. Researchers using N are reminded that it provides the sample size at which the expected half-width of the confidence interval equals the specified width. However, this does not ensure that the observed w will equal the specified width in any given sample. The modified sample size (Equation 3) takes into consideration the variability of the standard error of β and adjusts the sample size accordingly, such that one can be approximately 100(1 − γ) percent confident that the confidence interval around a particular β will have a w that is no larger than the specified w.
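The link between N from Equation 2 and the expected half-width can be made concrete with a short sketch. It assumes that Equation 2 has the form N = (z/w)²(1 − R²)/(1 − R²_XXj) + p + 1 and that Equation 1 uses the standard error sqrt((1 − R²)/((1 − R²_XXj)(N − p − 1))). Both forms are restated here as assumptions, the function names are illustrative, and the population correlations are a hypothetical reading of the w = 0.10, p = 2 condition discussed earlier.

```python
import math
from statistics import NormalDist


def aipe_n(w, r2, r2_xxj, p, alpha=0.05):
    """Sample size from the assumed form of Equation 2 (w is the half-width)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z / w) ** 2 * (1 - r2) / (1 - r2_xxj) + p + 1)


def eq1_standard_error(r2, r2_xxj, n, p):
    """Standard error from the assumed form of Equation 1."""
    return math.sqrt((1 - r2) / ((1 - r2_xxj) * (n - p - 1)))


# Hypothetical two-predictor population: criterion correlations of .50 and .10
# and a predictor intercorrelation of .10.
r_y1, r_y2, r_12 = 0.50, 0.10, 0.10
r2 = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
r2_xxj = r_12 ** 2

n = aipe_n(0.10, r2, r2_xxj, p=2)
se = eq1_standard_error(r2, r2_xxj, n, p=2)
z = NormalDist().inv_cdf(0.975)
print(n, round(se, 3), round(z * se, 3))
```

Because N is rounded up, the implied half-width z × SE lands just under the specified 0.10, and the implied standard error is about 0.051 whichever way the three correlations are assigned.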

A caution is given because of the problems that can

arise when using standardized variables from random

X values in the context of multiple regression. Al-

though there are numerous reasons to use standard-

ized values as input into multiple regression models,

and thus make use of their corresponding estimates

for interpretational reasons, the standard errors of

such estimates are generally not exact. Even though

the simulations show that the common method of

standardizing random predictors produces confidence

intervals for standardized regression coefficients that

are generally wider than they should be, the sample

size procedures we present typically produce the de-

sired degree of precision.

In conclusion, the AIPE procedures presented here

are applicable to researchers working within the

framework of OLS multiple regression who want to

determine sample size a priori in order to obtain accurate parameter estimates. Given reasonably accurate input parameters, use of these procedures provides researchers with confidence intervals around regression coefficients whose expected widths equal the values specified or, alternatively, whose widths are no larger than the values specified with a stipulated degree of probabilistic assurance. As with all sample size planning, the AIPE procedures will be less accurate to the

extent that the input parameters deviate from their true

values. However, the problem with the choice of input

parameters should not be used as a reason to avoid

sample size planning. In addition, we have shown that

planning sample size for precise estimates of stan-

dardized regression coefficients requires less a priori

knowledge (i.e., fewer input parameters) than the cor-

responding planning necessary to obtain sufficient

statistical power. We believe that obtaining accurate

parameter estimates, not merely statistically signifi-

cant ones, leads to a more productive science and

yields research findings that are more beneficial to a

given area of inquiry.

References

Aiken, L. S., & West, S. G. (1991). Multiple regression:

Testing and interpreting interactions. Newbury Park,

CA: Sage.

Algina, J., & Olejnik, S. (2000). Determining sample size

for accurate estimation of the squared multiple correla-

tion coefficient. Multivariate Behavioral Research, 35,

119–137.

Babakus, E., Ferguson, C. E., & Jöreskog, K. G. (1987). The sensitivity of confirmatory maximum likelihood factor analysis to violations of measurement scale and distribution assumptions. Journal of Marketing Research, 24, 222–228.

Browne, M. W. (1982). Covariance structures. In D. M. Hawkins (Ed.), Topics in applied multivariate analysis (pp. 72–141). New York: Cambridge University Press.

Carlin, B. P., & Louis, T. A. (1996). Bayes and empirical

Bayes methods for data analysis. New York: Chapman &

Hall.

Cohen, J. (1988). Statistical power analysis for the behav-

ioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Cohen, J. (1990). Things I have learned (so far). American

Psychologist, 45, 1304–1312.

Cohen, J. (1994). The earth is round (p < .05). American

Psychologist, 49, 997–1003.

Cohen, J., & Cohen, P. (1983). Applied multiple regression/

correlation analysis for the behavioral sciences (2nd ed.).

Hillsdale, NJ: Erlbaum.

Cudeck, R. (1989). Analysis of correlation matrices using

covariance structure models. Psychological Bulletin, 105,

317–327.

Darlington, R. B. (1990). Regression and linear models.

New York: McGraw-Hill.

Gatsonis, C., & Sampson, A. R. (1989). Multiple correla-

tion: Exact power and sample size calculations. Psycho-

logical Bulletin, 106, 516–524.

Green, B. F. (1977). Parameter sensitivity in multivariate

methods. Multivariate Behavioral Research, 12, 263–

288.

Green, S. B. (1991). How many subjects does it take to do

a regression analysis? Multivariate Behavioral Research,

26, 499–510.

Greenwald, A. G. (1975). Consequences of prejudice

against the null hypothesis. Psychological Bulletin, 82,

1–20.


Hahn, G. J., & Meeker, W. Q. (1991). Statistical intervals:

A guide for practitioners. New York: Wiley.

Harris, R. J. (1985). A primer of multivariate statistics (2nd

ed.). New York: Academic Press.

Hellmann, J. J., & Fowler, G. W. (1999). Bias, precision,

and accuracy of four measures of species richness. Eco-

logical Applications, 9, 824–834.

Jamshidian, M., & Bentler, P. M. (2000). Improved stan-

dard errors of standardized parameters in covariance

structure models: Implications for construct explication.

In R. D. Goffin & E. Helmes (Eds.), Problems and solu-

tions in human assessment (pp. 73–94). Dordrecht, the

Netherlands: Kluwer Academic.

Kline, R. B. (1998). Principles and practice of structural

equation modeling. New York: Guilford Press.

Lawley, D. N., & Maxwell, A. E. (1971). Factor analysis as

a statistical method (2nd ed.). London: Butterworth.

Loehlin, J. C. (1998). Latent variable models: An introduc-

tion to factor, path, and structural analysis (3rd ed.).

Mahwah, NJ: Erlbaum.

MacCallum, R. C., & Austin, J. T. (2000). Applications of

structural equation modeling in psychological research.

Annual Review of Psychology, 51, 201–226.

Maxwell, S. E. (2000). Sample size and multiple regression

analysis. Psychological Methods, 5, 434–458.

Mendoza, J. L., & Stafford, K. L. (2001). Confidence inter-

vals, power calculation, and sample size estimation for

the squared multiple correlation coefficient under the

fixed and random regression models: A computer pro-

gram and useful standard tables. Educational and Psy-

chological Measurement, 61, 650–667.

Muller, K. E., & Benignus, V. A. (1992). Increasing scien-

tific power with statistical power. Neurotoxicology and

Teratology, 14, 211–219.

Raju, N. S., Bilgic, R., Edwards, J. E., & Fleer, P. F. (1999).

Accuracy of population validity and cross-validity esti-

mation: An empirical comparison of formula-based, tra-

ditional empirical, and equal weights procedures. Applied

Psychological Measurement, 23, 99–115.

Rencher, A. C. (2000). Linear models in statistics. New

York: Wiley.

Rossi, J. C. (1990). Statistical power of psychological re-

search: What have we gained in 20 years? Journal of

Consulting and Clinical Psychology, 58, 646–656.

Rozeboom, W. W. (1966). Foundations of the theory of

prediction. Homewood, IL: Dorsey Press.

Sampson, A. R. (1974). A tale of two regressions. Journal

of the American Statistical Association, 69, 682–689.

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of sta-

tistical power have an effect on the power of studies?

Psychological Bulletin, 105, 309–316.

Smari, J., Petursdottir, G., & Porsteinsdottir, V. (2001). So-

cial anxiety and depression in adolescents in relation to

perceived competence and situational appraisal. Journal

of Adolescence, 24, 199–207.

Steiger, J. H. (2001). Driving fast in reverse: The relation-

ship between software development, theory, and educa-

tion in structural equation modeling. Journal of the

American Statistical Association, 96, 331–338.

Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality inter-

val estimation and the evaluation of statistical methods.

In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.),

What if there were no significance tests? (pp. 221–257).

Mahwah, NJ: Erlbaum.

Stuart, A., Ord, J. K., & Arnold, S. (1999). Kendall’s ad-

vanced theory of statistics (Vol. 2A, 6th ed.). New York:

Oxford University Press.

Thompson, B. (Ed.). (2001). Confidence intervals around

effect sizes [Special issue]. Educational and Psychologi-

cal Measurement, 61 (4).

Wainer, H. (1976). Estimating coefficients in linear models:

It don’t make no nevermind. Psychological Bulletin, 83,

213–217.

Wilkinson, L., & the American Psychological Association

Task Force on Statistical Inference. (1999). Statistical

methods in psychology journals: Guidelines and expla-

nations. American Psychologist, 54, 594–604.

Received December 11, 2001

Revision received March 18, 2003

Accepted April 23, 2003
