How the Size of a Random Sample Affects How Accurately It Represents a Population

Ruben D. Cohen
Department of Mechanical Engineering and Materials Science, Rice University, Houston, TX 77251

In a recent article¹ it was shown why finite random samples represent populations accurately. In that work, however, no mention was made of the errors introduced when the samples are small compared with the size of the population. This work addresses the issue, and concludes that the errors are of O(1/(2n)), with n being the size of the sample.

Introduction

In obtaining random samples from a large population, the main objective is to select the samples so that they mimic as closely as possible the statistical properties (mean, standard deviation, etc.) of the population. There are basically two methods for minimizing errors inherent to sampling. Of the two methods, the more commonly used is obtaining large samples at random, which is based on the idea that, as sample size approaches the size of the population, the errors diminish to zero. In many situations, however, measurement facilities or associated costs do not allow one to obtain samples of unlimited sizes. Thus, typical random samples are usually far smaller than desired.

To improve accuracy in such cases, therefore, samples of a given size, n, are selected many times (followed by replacement), and the statistical characteristics of the most observable group are chosen to represent those of the population. Thus, this method relies on increasing the frequency of sample selection instead of sample size, hopefully to reduce the errors introduced by "small" samples.

The objective of this work is to examine sampling errors from the perspective of the second method mentioned above, that is, to extract finite samples many times and to relate the statistical properties of the most probable observations (those that appear most frequently) to the characteristics of the population. It will be shown that even when the number of sampling events approaches infinity, there still remains a small error due to the finite size of the samples.
Problem Formulation and Analysis

The simplest approach to studying this problem is by way of example. The one chosen here, which is similar to that in Cohen,¹ involves a system containing a very large number (or population), N, of different-sized clusters of particles. It is important to mention that, in practical situations, N is extremely large, usually on the order of 10^8 or even higher. The clusters in the population can be counted and grouped according to size: N_1 clusters of size i = 1, N_2 clusters of size i = 2, N_3 clusters of size i = 3, and so on. Therefore, the actual probability, P_i, of finding a cluster of size i in the population is simply

$$P_i = \frac{N_i}{N} \qquad (1)$$

in which

$$N = \sum_{i=1}^{i_{\max}} N_i$$

with i_max being the size of the largest cluster. It follows from the two previous equations that

$$\sum_{i=1}^{i_{\max}} P_i = 1 \qquad (2)$$
¹Cohen, R. D. "Why Do Random Samples Represent Populations So Accurately?" J. Chem. Educ. 1991, 68, 902.
²Kreyszig, E. Advanced Engineering Mathematics, 6th ed.; Wiley: New York, 1988.
Thus, within eq 1 lie all the statistical properties of the population. One such property, for example, is the average cluster size, ī, which can be written using the previous equations as

$$\bar{i} = \sum_{i=1}^{i_{\max}} i\, P_i$$
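For instance, for the four-size distribution used later in Example 1 (P_1 = 0.1, P_2 = 0.2, P_3 = 0.3, P_4 = 0.4), this gives ī = 1(0.1) + 2(0.2) + 3(0.3) + 4(0.4) = 3.0.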
The best way to obtain information about the size distribution probability P_i of the population, without measuring the sizes of all ~10^8 clusters, is to select samples that are significantly smaller than the size of the population. These samples still may be moderate to large in size, that is, with n ≥ 20. This, evidently, makes counting much easier. A sample of size n containing n_i clusters of size i must satisfy the constraint

$$\sum_{i=1}^{i_{\max}} n_i = n \qquad (3)$$

A random selection of this kind might acquire statistical properties that differ significantly from those of the population, because any size distribution n_i of clusters in the sample is attainable as long as n_i satisfies the constraint described by eq 3; this latitude is due to randomness in the sampling.
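This repeated-sampling procedure is easy to simulate. The sketch below is an illustration added here, not part of the original article; it assumes numpy and a hypothetical four-size population. It draws many random samples of fixed size n (with replacement between draws) and reports the composition observed most frequently.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

P = [0.1, 0.2, 0.3, 0.4]   # assumed population probabilities P_i (hypothetical values)
n = 50                     # fixed, finite sample size (eq 3: sum of n_i equals n)
trials = 200_000           # number of repeated sample extractions

# Each trial draws n clusters at random and records the resulting
# composition (n_1, n_2, n_3, n_4); the tally counts how often each appears.
tally = Counter(tuple(rng.multinomial(n, P)) for _ in range(trials))

most_probable, count = tally.most_common(1)[0]
print("most frequently observed composition:", most_probable)
print("its fractions n_i/n:", [ni / n for ni in most_probable])
```

With these assumed values the most frequent composition hovers near (5, 10, 15, 20), i.e., fractions close to, but not exactly, the P_i; the analysis that follows quantifies that residual difference.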
Furthermore, if the sample is selected purely at random, the probability of it having any arbitrary size distribution, n_i (i = 1, 2, 3, ...), can be shown from simple combinatoric arguments to follow the multinomial distribution²

$$f(n_i, n) = \frac{n!}{n_1!\, n_2!\, n_3! \cdots}\, P_1^{n_1} P_2^{n_2} P_3^{n_3} \cdots \qquad (4)$$
where P_i is the actual size distribution probability of the population, as defined earlier in eq 1. Note that the binomial distribution, as used in the reference cited in the first footnote, is just eq 4 simplified. Consequently, selecting a random sample of size n so that it comprises any arbitrary size distribution, n_i, can be done with a probability of f(n_i, n: i = 1, 2, 3, ...), as given above. It is important to know that f(n_i, n: i = 1, 2, 3, ...) is a probability distribution function that satisfies

$$\sum_{\text{all possible } n_i\text{'s}} f(n_i, n) = 1 \qquad (5)$$

by virtue of eq 2.
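As a minimal check of eq 5 (illustrative only; plain Python 3.8+ is assumed for math.prod), one can sum eq 4 over every composition of a small sample and recover exactly 1:

```python
import math
from itertools import product

# Hypothetical three-size population and a tiny n, so the full
# enumeration of compositions n_1 + n_2 + n_3 = n stays cheap.
P = [0.2, 0.3, 0.5]
n = 6

total = sum(
    math.factorial(n) * math.prod(Pi**ni / math.factorial(ni) for ni, Pi in zip(ns, P))
    for ns in product(range(n + 1), repeat=len(P))
    if sum(ns) == n
)
print(total)  # ~1.0, up to floating-point rounding, by virtue of eq 2
```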
Upon closely examining f(n_i, n: i = 1, 2, 3, ...), as given by eq 4, one can conclude that if random samples of size n are to be selected many times (with replacement after each sample extraction and measurement), certain size distributions in the samples will appear more frequently than others. To illustrate, one can search for a most probable size distribution, n_i*, if any, in the collection of the sample distribution data. This can be done by maximizing eq 4 with respect to n_i, while satisfying the constraint of eq 3. Note that eq 5 is not a constraint, but a result of f(n_i, n: i = 1, 2, 3, ...) being a probability distribution function. Carrying out this maximization with the help of Lagrange multipliers,³ we write

$$\frac{\partial}{\partial n_i}\left[\ln f(n_i, n) + \lambda\left(\sum_j n_j - n\right)\right] = 0 \qquad (6)$$
where λ is the Lagrange multiplier, which shall be evaluated shortly. Substituting eq 4 into 6 yields

$$\ln P_i - \ln n_i^* - \frac{1}{2 n_i^*} + \lambda = 0 \qquad (7)$$

after using Stirling's approximation for ln n! and ln n_i!. In general, this approximation is given by

$$\ln x! \approx \left(x + \frac{1}{2}\right)\ln x - x + \frac{1}{2}\ln 2\pi \qquad (8)$$

and is fairly accurate for x ≥ 4. For example, 4! = 24 and its Stirling approximation is about 23.5, resulting in about 1.4% error. Equation 7 can be recast into

$$K\, n_i^*\, e^{1/(2 n_i^*)} = P_i \qquad (9)$$

where K ≡ exp(−λ). Since the exponential term in the above equation can be expanded in a Taylor series, we can keep the first two leading terms, while still maintaining reasonable accuracy for moderate values of n_i*, that is, n_i* ≥ 3. For example, we get about 1.2% error for n_i* = 3. Subsequently, eq 9 reduces to

$$K\left(n_i^* + \frac{1}{2}\right) = P_i$$

or simply

$$n_i^* = \frac{P_i}{K} - \frac{1}{2} \qquad (10)$$

To evaluate K, which contains the Lagrange multiplier λ, we make use of the fact that n_i* must also satisfy eq 3. Thus, after substituting eq 10 into 3 (noting that n_i* = 0 whenever P_i = 0, which is where the sign function enters), we get

$$K = \frac{1}{n + \Lambda/2} \qquad (11)$$

where

$$\Lambda = \sum_{i=1}^{\infty} \operatorname{sgn}(P_i)$$

is the "range" of the population, and sgn is the "sign function" defined by

$$\operatorname{sgn}(x) = \begin{cases} 1, & x > 0 \\ 0, & x = 0 \end{cases} \qquad (12)$$

In discontinuous populations, therefore, as illustrated in Figure 1, Λ is just the sum of the piecewise contributions since sgn(0) = 0. Finally, upon inserting eq 11 into 10 and introducing

$$p_i^* \equiv \frac{n_i^*}{n} \qquad (13)$$

to denote the most probable size distribution probability in the group of samples, one obtains

$$p_i^* = P_i\left(1 + \frac{\Lambda}{2n}\right) - \frac{1}{2n} \qquad (14)$$

which relates the sample with the population. It is easy to see from this that for a certain population having a certain range, Λ, and characteristic distribution, P_i, the most observable sample will acquire a distribution given by p_i*, as predicted by eq 14. Also, due to finite sample sizes, errors between p_i* and P_i become immediately apparent upon examining eq 14. However, we note that by increasing the sample size n, the distribution in the sample, p_i*, approaches that of the population, P_i. If we now define the percent relative error, E_i, as a measure of the difference between p_i* and P_i, that is

$$E_i \equiv 100\, \frac{p_i^* - P_i}{P_i} \qquad (15)$$

then it follows from eq 14 that

$$E_i = \frac{100}{2n}\left(\Lambda - \frac{1}{P_i}\right) \qquad (16)$$
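As a quick numerical sanity check (added here for illustration; not part of the original article), one can enumerate every composition of a small sample and confirm that the composition maximizing eq 4 sits within a single count of the continuous prediction of eq 14. The population values below are the ones used in Example 1, and only the Python standard library is assumed.

```python
import math
from itertools import product

P = [0.1, 0.2, 0.3, 0.4]   # population probabilities (values from Example 1)
n = 20                     # kept small so the full enumeration stays cheap

def f(ns):
    """Multinomial probability (eq 4) of the composition ns = (n_1, ..., n_4)."""
    prob = float(math.factorial(n))
    for ni, Pi in zip(ns, P):
        prob *= Pi**ni / math.factorial(ni)
    return prob

# Enumerate all compositions with n_1 + ... + n_4 = n and keep the likeliest.
best = max((ns for ns in product(range(n + 1), repeat=len(P)) if sum(ns) == n),
           key=f)

Lam = sum(1 for Pi in P if Pi > 0)                            # range of the population
pred = [Pi * (1 + Lam / (2 * n)) - 1 / (2 * n) for Pi in P]   # eq 14

print("brute-force argmax of eq 4, n_i/n:", [ni / n for ni in best])
print("eq 14 prediction, p_i*:           ", [round(p, 3) for p in pred])
# Rounding n * p_i* to whole clusters recovers the brute-force maximizer,
# as expected, since eq 14 is a continuous (Stirling-based) approximation.
```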
³Hildebrand, F. B. Advanced Calculus for Applications, 2nd ed.; Prentice-Hall: New Jersey, 1976.
Figure 1. An example of a discontinuous probability distribution showing how Λ can be obtained.
Equation 16 simply implies that in this type of sampling technique, the errors incurred are of O(1/(2n)). To illustrate the use of eqs 14 and 16, two examples are given below.

Example 1

Consider a dispersion of clusters in a system, such as the one shown schematically in Figure 2a. Suppose that the number of clusters in the population (system) is "infinitely" large (≥ ~10^8), and that the actual size distribution, P_i, with range Λ = 4, has the form of the bar graph shown in Figure 2b, with P_1 = 0.1, P_2 = 0.2, P_3 = 0.3, and P_4 = 0.4. The objective now is to select random samples from the population to try to gain information about P_i.
Figures 2a and 2b. Schematic diagram of a dispersion of different-sized clusters, and its corresponding size distribution probability bar graph with Λ = 4.
Assume, however, that the measuring apparatus cannot count more than 50 clusters at a time. Then the procedure would be to obtain a sample of 50 clusters, measure its size distribution, and return the sample to the system. Then keep repeating the process until many sample measurements, in terms of bar graphs, are taken. The next appropriate step is to choose the distribution that appears most frequently, and take it to represent the actual population. As mentioned in the previous section, this is just p_i*, as predicted by eq 14. Of course, it is expected that the two distributions, p_i* and P_i, should differ due to the finite sample size. These are compared on the same graph in Figure 3a, where the difference becomes evident. The difference is even more obvious when the relative errors between the (most probable) sample and the population, using eq 16, are evaluated. The error analysis, which is displayed in Figure 3b, indicates sampling errors of up to 6%.
Figure 3a. Comparing the size distribution of the population with the most probable one of the sample. For this particular example, n = 50 and Λ = 4. Figure 3b. Percent error between the distribution probabilities of the population and sample.
Example 2

Consider a bag containing a large number, or population, of black and white marbles. The probability of randomly selecting a black marble from this bag is given by P_B, and that of a white one is denoted by P_W. Also, suppose that, for this particular example, P_B = 0.2 and P_W = 0.8. Thus, Λ is equal to 2. We are to obtain information about P_B and P_W by selecting samples at random. Assume that we have two sampling machines available: one that can take samples of size n = 25 at a time, and another that can select samples of size n = 50.
Due to this limited sampling capability, we must follow a procedure to improve accuracy: take samples, measure their distributions of black and white marbles, return the samples to the bag, and then repeat the process many times to produce distribution graphs. Finally, we choose the distribution that appears most frequently to be representative of the population. This distribution is, of course, p_B* and p_W* (as given by eq 14), which correspond to the most probable, or observable, for the black marbles and the white marbles, respectively. With the relationship between the p*'s and P's available from eq 14 and plotted in Figure 4a, the errors introduced by different finite sample sizes become apparent. These are depicted in Figure 4b, in which it is clear that larger sample sizes lead to smaller errors. As mentioned earlier, the rate of convergence of these errors is of O(1/(2n)).

Figure 4a. Comparing the distribution of the population with the most probable ones in the two samples. For this particular example, n = 25 and 50, and Λ = 2. Figure 4b. Percent error between the distribution probabilities of the population and sample.
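Evaluating eq 16 for both machines makes the convergence rate explicit (an added illustration; the values follow from the stated P_B, P_W, and Λ):

```python
# Percent errors (eq 16) for Example 2: Lambda = 2, P_B = 0.2, P_W = 0.8,
# evaluated for both available sample sizes.
Lam = 2
P = {"black": 0.2, "white": 0.8}

for n in (25, 50):
    for color, Pi in P.items():
        E = 100 / (2 * n) * (Lam - 1 / Pi)  # eq 16
        print(f"n = {n:2d}, {color}: E = {E:+.2f}%")
```

This prints −6.00% (black) and +1.50% (white) for n = 25, and −3.00% and +0.75% for n = 50: doubling n halves every error, consistent with the O(1/(2n)) rate.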
Discussion and Conclusion

The errors introduced when obtaining finite random samples from large populations are investigated. Simply, the method of analysis is based on increasing the frequency of sampling indefinitely, while keeping the sample size finite. It follows that if we choose the most frequently observed data to represent the population, then a most probable sample distribution can be predicted (eq 14). Of course, because of finite sample sizes, the measurements or data will deviate slightly from the actual population, thus leading to a certain inherent sampling error (eq 16). The error, however, converges relatively fast as sample sizes become large. In fact, the rate of convergence is inversely proportional to twice the sample size.