I’ll Give You a Definite Maybe
An Introductory Handbook for Probability, Statistics, and Excel

[This text has been prepared by Ian Johnston of Malaspina University-College (now Vancouver Island University), in Nanaimo, BC for use in Liberal Studies.  The text is in the public domain and may be used by anyone, in whole or in part, without permission and without charge, provided the source is acknowledged, released May 2000.  Minor editorial and formatting changes were made in November 2004]


Section Five: A Normal Distribution and The Normal Curve  

A. Introduction

In the previous sections we have talked about frequency distributions.  These refer, you will recall, to the particular scores in a set of results.  Some results are more frequent than others, and the density of the distribution may vary (e.g., clustered near to the mean, spread out widely on either side of the mean, bunched at either end of the values, and so on).  There are innumerable ways in which the scores in a set of results can be distributed.

Of particular interest to us in the remainder of this module is a frequency distribution in which the dispersion of the scores is symmetrical about the central mean, that is, in which the frequency of results above the average matches exactly the frequency of results below the average and in which the most frequent result falls at the average (see Histogram D in the sample histograms in Section Two).  In such a distribution, the high point (the most frequent result) will be in the middle of the distribution, and each side of the high point will be a mirror image of the other.

For such a distribution, the histogram would be perfectly symmetrical around the centre; in other words, the tallest column (i.e., the most frequent value) will occur exactly in the centre of the diagram and the other columns (frequencies) will fall away equally on either side of the central value.  Such a perfectly symmetrical, bell-shaped distribution is called a normal distribution (in popular language the shape of this frequency distribution is commonly called a bell curve).

Notice that a normal distribution may come with very different dimensions (tall and skinny, short and wide), but the characteristics mentioned above hold in all cases (the high point, i.e., the most frequent value, is always in the centre, and the two sides of the curve are perfectly symmetrical).  In other words, the characteristic bell shape is always present.

Here are a few examples of histograms illustrating a normal distribution.  These histograms illustrate the probability distribution for success in various coin tosses.  The x-axis here indicates the number of heads in a particular sequence of coin tosses; the y-axis represents the theoretical frequency of that result in the given number of fair tosses.

The histograms will have different sizes and shapes, because the frequency distribution changes with the number of tosses.  But notice that all the histograms are perfectly symmetrical around the centre (the tallest and therefore most frequent value). 

Remember, once again, that these diagrams represent probability distributions, or the frequency of results theoretically calculated.  And since the total of all the probabilities for an event equals 1, the shaded area contained in all the columns equals 1.

 

 

This diagram indicates that in a three-coin-toss sequence (or three coins tossed simultaneously) there are four possible results: 0 heads, 1 head, 2 heads, and 3 heads (the values on the X-Axis).  The percentage frequency of these four possibilities we read off the Y-Axis.  We read the following diagrams in the same way: the number of heads on the X-Axis, and the percent probability on the Y-Axis.  Notice the perfect symmetry in these distributions.  

 

In the above histogram for a six-coin-toss sequence, there are seven possible results (from 0 heads to 6 heads).  The most frequent result is in the centre (3 heads), and the frequencies decline as one moves away from the centre (indicated by the decreasing height of the columns).

 

Notice in the above histogram (for 20 coin tosses) how at the extremes (0, 1, 2, 18, 19, 20) the percent probability is so small that the value does not show on the graph.  Virtually all the results in a 20 coin-toss sequence will fall between 3 and 17, with the most frequent value in the centre (at 10).  The frequencies on either side of 10 are perfectly symmetrical (we can see that by the equal heights of 9 and 11, of 8 and 12, of 7 and 13, of 6 and 14, of 5 and 15, of 4 and 16, and of 3 and 17).
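For readers who would like to check these theoretical frequencies for themselves, here is a minimal sketch in Python (not part of the original handbook, which relies on Excel).  The function name heads_distribution is our own, chosen for illustration.  The sketch assumes fair coins and simply counts the number of ways of obtaining each number of heads.

```python
from math import comb

def heads_distribution(n_tosses):
    """Return the probability of 0, 1, ..., n_tosses heads in n_tosses fair tosses."""
    total_outcomes = 2 ** n_tosses  # every sequence of heads and tails is equally likely
    return [comb(n_tosses, k) / total_outcomes for k in range(n_tosses + 1)]

for n in (3, 6, 20):
    probs = heads_distribution(n)
    print(f"{n} tosses: {len(probs)} possible results, "
          f"total probability {sum(probs):.4f}")
    for k, p in enumerate(probs):
        # Percent probability, as read off the Y-axis of the histograms
        print(f"  {k:2d} heads: {100 * p:7.3f}%")
```

Each list reads the same forwards and backwards, which is the symmetry visible in the histograms, and the probabilities in each list add up to 1, which is the total shaded area.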

B. The Normal Curve

Notice how in the diagrams above, as the number of columns increases, the entire shape of the histogram begins to approximate a curve, with the shaded areas all under the top line.  And, in fact, we can readily convert these histograms (using rectangles) to a curve by joining up the central points on the top of each column.   

 

What we have when we do this is exactly the same frequency distribution picture as we had with the columns, except that we have filled in the gaps between columns.  Now we do not have the body of the columns, but that does not matter, because the important part of the histogram picture is the line defined by the top centre points of the columns (which indicates the percent probability of any particular value along the x-axis).  In such a diagram, the important factor is the area under the curve, for that graphically presents the total frequencies.  Equal areas under such a curve will represent equal frequencies (more on this later).

When we join up the columns in the histogram in this way, we produce a particularly useful statistical shape, the normal curve (1).

The normal curve or the normal distribution is an extremely important statistical concept, as important in many areas of enquiry as the right-angle triangle is in Euclidean geometry, and for the remainder of our short study of statistics we shall be dealing only with this frequency distribution.  So understand clearly what the normal distribution means.

When we say that a particular population characteristic is normally distributed, we mean the following:

1.      The normal frequency curve shows that the highest frequency falls in the centre (i.e., at the mean of the values in the distribution) with an equal and exactly similar curve on either side of that centre.  Thus, the most frequent value in a normal distribution is the average, with half the values falling below the average and half above it.

2.      The normal curve, often called a bell curve, is perfectly symmetrical.  Therefore the mean (the arithmetical average), the mode (the most frequent value), and the median (the middle value) will coincide at the centre of the curve (the high point).  Make sure you understand this point.

3.      The further away any particular value is from the average (above or below), the less frequent that value will be (i.e., the frequencies will diminish on either side of the high central point).

4.      Because the two halves on either side of the centre are exactly symmetrical, the frequency of values above the mean will match exactly the frequencies of values below the mean, provided the distances between the values and mean are identical.  Thus, the frequency of a value 3 units to the right of the mean will be identical to the frequency of the value 3 units to the left of the mean.  This is a key idea; please make sure you understand it.

5.      The total frequency of all values in the population will be contained by the area under the curve.  This is obvious enough, since the total area under the curve represents all the possible occurrences of that characteristic. 

6.      Various areas under the curve will therefore indicate the percentage of the total frequency.  For instance, 50 percent of the area under the curve lies to the left of the mean (i.e., half of all normally distributed results will fall in this area), and 50 percent of the area under the curve lies to the right of the mean.  Therefore, 50 percent of all scores will lie to the left and 50 percent to the right of the mean.  Equal areas under the curve represent equal numbers in the frequency.  Again, please make sure you understand this important idea.

7.      Normal curves may have different shapes (i.e., tall and skinny, short and wide, and so on).  What will determine the overall shape of the symmetrical curve will be the values of the mean and the standard deviation in the population (these will define the shape in the same way the centre point and the radius define a circle).  But the general characteristics listed above will remain the same.

Please make very sure you understand each one of the above points, because much of what we do from this point on assumes that you are quite familiar with the properties of the normal curve.

Normal distributions are particularly important for a number of reasons (as we shall see), not the least of which is that many of the important characteristics we wish to study (including all inherited characteristics) are normally distributed.  What that means is that if we gather a very large number of samples of a particular measurement (e.g., height) and construct a frequency distribution, the result will be normal, that is, will manifest the characteristics listed above.

Note carefully that the normal curve is a theoretical depiction of the distribution of frequencies of the values.  It does not tell us that in any particular series of measurements of a normally distributed item half must lie above and half below the mean.  It indicates that, in any series of values, there is a .5 probability that any particular score will lie above the mean and a .5 probability that it will lie below the mean, and that the average will fall in the centre of the distribution.  Or, put another way, in any measurement of a heritable characteristic (height, intelligence, weight, and so on) 50 percent of the population will be below the arithmetical average (the mean), because such characteristics are normally distributed.  It is not the case that in every distribution exactly 50 percent of the population will fall below the mean, but that must be the case if the frequency distribution is a normal curve.

Not all values are normally distributed (please remember that): for example, the salaries of those working at Malaspina University-College, the responses to a public opinion questionnaire, levels of contaminant in the Georgia Strait.  But what makes this particular frequency distribution so important is that a great many things in our world are normally distributed (e.g., population heights, mortality rates, stock market fluctuations, yearly temperature averages, girth of trees, all repeated human measurements of a single natural phenomenon, heritable characteristics, and so on).  It is an enormously useful and important analytical concept (2).

C. Further Properties of the Normal Curve

We have noted above some of the properties of the normal curve (most frequent value is at the centre, symmetry about the central value, diminishing frequency with the distance from the centre).  However, there are many more important features.  

You may have noticed that the shape of the curve in a normal distribution has a clear point on each side where the slope goes from concave (bulging outward) to convex (bulging inwards).  If you were walking up the curve you would notice that at first the slope increases, but at a particular point it would begin to decrease as you approach the summit.  The point at which this occurs is called the point of inflection.

If one draws a perpendicular line from the points of inflection, one on either side of the mean, to the base line (the X-axis) then the distance from that point to the value of the mean on the X-axis (in the centre) is equal to the standard deviation.  Make sure you understand this very important property of the normal curve.
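This property can be checked numerically.  The following Python sketch is only an illustration under stated assumptions: it uses the standard formula for the height of the normal curve (which this handbook does not otherwise require), an arbitrarily chosen mean of 10 and standard deviation of 3, and function names of our own invention.  It scans along the curve and reports where the curvature changes sign.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    """Height of the normal curve at x (standard textbook formula)."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

def curvature(x, mean, sd, h=1e-3):
    """Numerical estimate of the second derivative of the curve at x."""
    return (normal_pdf(x + h, mean, sd) - 2 * normal_pdf(x, mean, sd)
            + normal_pdf(x - h, mean, sd)) / h ** 2

def inflection_points(mean, sd, step=0.01):
    """Scan from 4 SD below to 4 SD above the mean; return where curvature flips sign."""
    points = []
    start = mean - 4 * sd
    n_steps = int(round(8 * sd / step))
    prev_x, prev_c = start, curvature(start, mean, sd)
    for i in range(1, n_steps + 1):
        x = start + i * step
        c = curvature(x, mean, sd)
        if (prev_c > 0) != (c > 0):
            # The sign change lies between prev_x and x; report the midpoint
            points.append(round((prev_x + x) / 2, 2))
        prev_x, prev_c = x, c
    return points

mean, sd = 10.0, 3.0  # illustrative values only
print(inflection_points(mean, sd))
```

With these values the two points reported should lie very close to 7 and 13, that is, one standard deviation on either side of the mean.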

Note that these two perpendicular lines drawn from the points of inflection on either side of the mean divide the area under the curve further, so that we now have four separate areas, as follows (see the diagram below):

1.  The area between the mean and one standard deviation above the mean (Area A);

2.  The area between the mean and one standard deviation below the mean (Area B);

3.  The area to the right of one standard deviation above the mean (Area C);

4.  The area to the left of one standard deviation below the mean (Area D).

 

Since the normal curve is perfectly symmetrical, Area A will equal Area B, and Area C will equal Area D.  And the total of A, B, C, and D will equal the total area under the curve (i.e., the entire population).  Since the curve never quite touches the X-axis at either end, there may be a value beyond the tails (a highly improbable value), but its frequency will be so low that we can virtually ignore it.

Mathematical calculations indicate that in any normal distribution, no matter what its height or width, about 68 percent of all the observations fall within one standard deviation from the mean (i.e., in Areas A and B combined).  Thus, 34 percent will lie between the mean and 1 standard deviation above the mean (in Area A), and 34 percent between the mean and 1 standard deviation below the mean (in Area B).  Hence, in a normal distribution 32 percent of the observations will fall outside 1 standard deviation, 16 percent on either side (i.e., 16 percent of the population will fall in Area C and 16 percent in area D).

We may express this, more appropriately, in the language of probability, as follows: in any normal distribution, there is approximately a .68 probability that a particular value will fall within 1 standard deviation (SD) of the mean; there is approximately a .34 probability that a particular value will lie between the mean and 1 SD above the mean (in Area A) and approximately a .34 probability that a particular value will lie between the mean and 1 SD below the mean (in Area B).  Similarly, there is approximately a .16 probability that a particular value will lie higher than 1 SD from the mean (in Area C), and approximately a .16 probability that a particular value will lie lower than 1 SD below the mean (in Area D).

The diagram below illustrates the areas under the normal curve for one and two standard deviations above and below the mean (i.e., this is the same as the previous diagram, except that the vertical lines indicating two standard deviations from the mean have been added to it, thus creating six areas under the curve).

 

The vertical lines represent the mean (at the centre), and distances of 1 and 2 standard deviations on either side of the mean.  As before, Area A and Area B are equal, each defined by the mean and 1 standard deviation on either side of it.  Each of these areas (A and B) contains approximately 34 percent of all the values in a normal distribution.

Area C and Area D, which are also equal, are defined by the vertical lines representing 1 and 2 standard deviations from the mean (on either side).  Each of these areas will contain approximately 13.5 percent of all the values in a normal distribution.

Areas E and F, at the extreme ends of the curve, are defined as the areas marked off by the vertical lines representing 2 standard deviations and the tail ends of the curve.  Each of these areas will contain approximately 2.5 percent of all the values in a normal distribution (i.e., in a normal distribution, 5 percent of the population will be beyond 2 standard deviations: 2.5 above the mean, and 2.5 below the mean).

If we continued to draw standard deviation vertical lines to mark off three standard deviations from the mean (not shown on the diagram), we would have two very small areas at the extreme tips of the curve to indicate the values lying more than three standard deviations from the mean.  Together these two areas contain approximately .3 percent of all the values in the normal distribution.

The same information given in the above paragraphs in terms of percentages can be restated in the language of probability as follows:

  1. In any normal distribution, there is a .34 probability that any particular value will fall between the mean and 1 standard deviation above the mean (in Area A), a .34 probability that any particular value will fall between the mean and 1 standard deviation below the mean (Area B); furthermore, there is a .135 probability that any particular value will fall between 1 and 2 standard deviations above the mean (Area C) and a .135 probability that any particular value will fall between 1 and 2 standard deviations below the mean (Area D).  Finally, there is a .475 probability that any particular value will fall within 2 standard deviations above the mean (somewhere in Areas A and C) and a .475 probability that any particular value will fall within 2 standard deviations below the mean (somewhere within Areas B and D).

  2. Further analysis of the mathematics of normal curves reveals that the area contained by the perpendicular lines representing 3 standard deviations from the mean contains 99.7 percent of the area under the curve and thus represents 99.7 percent of all the scores in the data set.  In other words, there is a 99.7 percent chance (or p = .997) that in any normal distribution, any particular value will fall within 3 standard deviations from the mean (3).  

  3. Thus, the areas beyond three standard deviations contain only .30 percent of the total area.  This means that in a normally distributed characteristic, the probability of a value lying more than three standard deviations from the mean is .003, or .0015 at the top end (above the mean) and .0015 at the bottom end (below the mean).  Thus, it is very rare indeed (but not impossible) for an observed value in a normal distribution to occur more than 3 standard deviations from the mean.
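The areas quoted in this section can be reproduced with a few lines of Python.  This is a sketch rather than the method used to prepare the handbook: it relies on the error function erf from Python's standard math module, and the function name area_within is our own.

```python
from math import erf, sqrt

def area_within(z):
    """Fraction of a normal distribution lying within z standard deviations of the mean."""
    return erf(z / sqrt(2))

for z in (1, 2, 3):
    inside = area_within(z)
    one_side = inside / 2        # e.g. Area A alone when z is 1
    each_tail = (1 - inside) / 2  # the area beyond z SD on one side only
    print(f"within {z} SD: {inside:.4f}   "
          f"mean to {z} SD on one side: {one_side:.4f}   "
          f"beyond {z} SD on one side: {each_tail:.4f}")
```

The printed figures (about .6827, .9545, and .9973 for 1, 2, and 3 standard deviations) show where the rounded values of .68, .95, and .997 come from.  Note that they do not depend on the mean or the standard deviation of the particular curve.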

D. A Simple Application of the Mathematical Properties of the Normal Curve

This mathematical information about a normal curve provides enormously valuable information.  For if we know that a population is normally distributed (i.e., that the frequency distribution in the population follows a normal curve), then if we know the mean of that curve and the standard deviation, we know the probabilities of any particular value falling within specified areas of the curve.  We can thus make some important predictions about that population.

For instance, suppose we know that the height of men in a population (say, in Prince George) is normally distributed, that the mean height (from a sample we collect) is 68 in., and the standard deviation is 4 in.  We then know the probabilities for the distribution of heights in Prince George, as follows:

Approximately 34 percent of the men will be between 68 in. (the mean) and 72 in. (1 SD above the mean, 68 + 4); approximately 34 percent will be between 68 in. (the mean) and 64 in. (1 SD below the mean, 68 - 4); approximately 13.5 percent will be between 72 in. and 76 in. (between 1 SD above the mean and 2 SD above the mean); approximately 13.5 percent will be between 64 in. and 60 in. (between 1 SD and 2 SD below the mean); approximately 2.35 percent will be between 76 in. and 80 in. (between 2 SD and 3 SD above the mean); and approximately 2.35 percent will be between 60 in. and 56 in. (between 2 SD and 3 SD below the mean).

Thus, if a child of yours informs you that she is getting married to some man from Prince George, you already know some important things about your prospective son-in-law, even though you have never met.

There is a .34 probability that his height will be between 68 in. and 72 in.; there is a .34 probability that his height will be between 68 in. and 64 in.; or, putting these two together, that there is a .68 probability that his height is between 64 in. and 72 in.

We could obviously continue this analysis to take into account all the percentage frequencies indicated by the normal curve.

Now, this mathematical analysis of the normal curve holds for the frequencies of any value which is normally distributed.  Once we know the mean and the standard deviation, we are able to predict the probability of the value for any particular member of the population.  And this process is possible, to repeat the point, for any measurable factor whose frequencies are normally distributed (e.g., mortality rates, some test scores, volume of wood in trees, and so on).  Thus, once we know that a characteristic is normally distributed, what the values are for the mean and the standard deviation, we are in a position to make a number of conclusions about the probable distribution of the entire population.
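The Prince George example can be worked through in Python as a check on the arithmetic.  The sketch below assumes the figures given above (a mean of 68 in. and a standard deviation of 4 in.) and uses the NormalDist class from Python's standard statistics module; the helper name share_between is our own.

```python
from statistics import NormalDist

# Heights of men in the example: mean 68 in., standard deviation 4 in.
heights = NormalDist(mu=68, sigma=4)

def share_between(low, high):
    """Probability that a height falls between low and high (in inches)."""
    return heights.cdf(high) - heights.cdf(low)

# Bands one standard deviation wide, from 3 SD below to 3 SD above the mean
bands = [(56, 60), (60, 64), (64, 68), (68, 72), (72, 76), (76, 80)]
for low, high in bands:
    print(f"{low} in. to {high} in.: {100 * share_between(low, high):5.2f}%")

# The two central bands together: within 1 SD of the mean
print(f"64 in. to 72 in.: {100 * share_between(64, 72):5.2f}%")
```

The exact percentages differ slightly from the rounded figures used in the text (for instance, about 34.1 rather than 34 percent for each central band), but the pattern is the same.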

E. Normal Curve: Summary

It is vitally important for an initial understanding of statistics to grasp the point that the features of the normal curve apply to all distribution frequencies of normally distributed items.  Normal curves may have many different heights and widths, but in all cases, these characteristics apply:

1.      The mean, median, and mode coincide at the high point of the curve and divide the results into two equal and perfectly symmetrical halves.

2.      Of all the scores in a perfectly normal distribution, approximately 34 percent will lie between the mean and 1 Standard Deviation above the mean, and approximately 34 percent will lie between the mean and 1 Standard Deviation below the mean.

3.      Of all the scores in a perfectly normal distribution, approximately 95 percent will fall between the lines representing 2 Standard Deviations from the mean (i.e., about 27 percent of all scores will fall between 1 and 2 standard deviations, with 13.5 percent on either side of the curve).

4.      Of all the scores, approximately 99.7 percent will lie between the lines indicating 3 standard deviations from the mean (i.e., approximately 4.7 percent of the sample will fall between 2 and 3 standard deviations, or approximately 2.35 percent on either side of the mean).

Note that these characteristics hold for any normal distribution regardless of the height or width of the normal curve.  Thus, once we know that the frequencies of a particular mathematical measurement are normally distributed, we know that the above groupings of the results should occur in any very large sample.

F. Self-Test on Normal Distribution Curve

1.      The duration times of a certain brand of battery are normally distributed, with a mean of 80 hours and a standard deviation of 10 hours.  As a marketing gimmick, the manufacturer decides to guarantee to replace any battery which fails prior to a certain time.  Approximately how long a guarantee should the company provide so that no more than 2.5 percent of the batteries fail prior to the guaranteed time?     

2.      You have a contract to make one thousand uniforms for the Canadian navy.  The heights of sailors are normally distributed, with a mean of 69 inches and a standard deviation of 2 inches.  What percentage of the uniforms will have to fit sailors shorter than 67 inches?  What percentage will have to be suitable for sailors taller than 73 inches?

3.      Let us assume the results from all large tests are normally distributed.  In the final results for Subject A, the mean percentage score is 80 and the Standard Deviation 5.  In Subject B, the mean percentage score is 70 and the Standard Deviation 2.5.  Suppose you score 75 percent in both courses.  What percentage of students received results better than you in Subject A and in Subject B?  What is your percentile rank in each subject?

The answers to these questions are given in Section I below.

G. A Normal Distribution and Bernoulli’s Theorem

It is important to grasp the point that the bell-like shape of a normal distribution only occurs with a great many samples from normally distributed data.  In fact, in any quality normally distributed (e.g., any heritable quality, like height), as Bernoulli’s Theorem tells us, the frequency distribution of the results will get closer and closer to the shape of a normal distribution as we increase the number of measurements (i.e., data in the sample).

To follow this point more clearly, consider the following diagrams.  They represent the frequency distributions of random numbers taken from a population of numbers which is known to be normally distributed (Excel generated the numbers and produced the charts).  In this case, the mean of the total population is 10 and the standard deviation 3 (chosen arbitrarily).

The first diagram illustrates the frequency distribution for a sample of 100 numbers.  You will notice that it does not look very bell-like.  The second diagram illustrates the frequency distribution for a sample of 1000 numbers.  You can see that the characteristic shape of the normal distribution is beginning to emerge. 

 

 

The final two diagrams illustrate the frequency distributions for samples of 2000 and 3000 numbers respectively.  Clearly, the final diagram, although still not a perfect bell curve, approximates much more closely than any of the others the characteristic shape of the normal distribution.  A larger sample (say, 10,000) would look even closer to the symmetrical bell shape.

 

 

 

 

When we are dealing with random number generation from a population which is not normally distributed but which is uniformly random, then increasing the number in the sample is not going to bring the histogram any closer to a bell shape.

Below, for example, are histograms for 1000 and for 2000 numbers between 1 and 400 randomly generated, but this time from a population which is not normally distributed.  Notice that there is no emerging bell curve shape as one increases the number of samples from 1000 to 2000.
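The experiment in this section can be imitated without Excel.  The Python sketch below is an approximation of it, not a reproduction: the random numbers will differ from those behind the charts, the seed is arbitrary, and we assume the non-normal population is uniform between 1 and 400.  Instead of drawing histograms, it measures what share of each sample lies within one standard deviation of the sample mean.

```python
import random
from statistics import mean, pstdev

def share_within_one_sd(values):
    """Fraction of the values lying within 1 standard deviation of their own mean."""
    m = mean(values)
    s = pstdev(values)
    inside = sum(1 for v in values if abs(v - m) <= s)
    return inside / len(values)

rng = random.Random(2000)  # arbitrary fixed seed so that runs are repeatable

for size in (100, 1000, 2000, 3000):
    # Normally distributed population: mean 10, standard deviation 3
    normal_sample = [rng.gauss(10, 3) for _ in range(size)]
    # Assumed uniform population between 1 and 400
    uniform_sample = [rng.uniform(1, 400) for _ in range(size)]
    print(f"n = {size:4d}   normal: {share_within_one_sd(normal_sample):.3f}   "
          f"uniform: {share_within_one_sd(uniform_sample):.3f}")
```

As the sample grows, the share for the normal sample should settle near .68, the figure given earlier for Areas A and B combined.  The share for the uniform sample should settle near .58 instead, which is one way of seeing that no bell shape is emerging there.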

 

 

   

H. A Final Word

It is particularly important that you take away from this section and the previous sections a clear sense of the meanings of the following key terms: mean, standard deviation, z-score (positive and negative), normal distribution, normal curve.

In addition, you must retain a clear sense that knowing the standard deviation and the mean of a certain normal curve enables one to ascertain the probability that certain results will fall within a certain distance of the mean.

Furthermore, from now on we assume that students are all familiar with the concept that the area under the normal curve indicates the theoretical distribution of frequencies in any normally distributed data.  Various areas under the curve represent the various probabilities that any one score will fall within the designated area.  Thus, the smaller the area for any group of scores, the smaller the probability that any score in that group will occur.  The tails of the curve (beyond 3 standard deviations) contain very small areas, and thus the probabilities of scores within those areas are very low (less than .01).

As a rough guide, remember that the majority (approximately 68 percent) of all scores in a normal distribution should fall within 1 standard deviation of the mean (or between a z-score of +1 and -1); almost all (approximately 95 percent of the scores) should fall within 2 standard deviations of the mean (or between a z-score of +2 and -2); and the probability of a score falling within 3 standard deviations of the mean is approximately 100 percent.

This does not mean that it is impossible for a score in a normal distribution to fall further than 3 SD from the mean, simply that such a result is very rare (the value of p is close to 0).

Remember, too, that these characteristics refer only to data which is normally distributed.  These figures do not apply in other sorts of distributions (in which the shape of the frequency curve will be different).

You will understand very little of what comes in the next sections if you have not grasped clearly the above information.

I. Answers to Self-Test on the Normal Distribution Curve

1.      The manufacturer does not want to replace more than 2.5 percent of his batteries.  Since the lifetime of the batteries is normally distributed, we know that 95 percent of them will fall within 2 standard deviations of the mean, that is between 80 + 2SD and 80 - 2SD, or 80 + 20 and 80 - 20, or between 100 hr and 60 hr.  Thus, 5 percent of the population of batteries will fall outside this range, 2.5 percent above and 2.5 percent below.  We are not worried about the batteries above this range, because owners are not going to complain about batteries lasting longer; the area of the population we are concerned with is the 2.5 percent below 2 standard deviations (i.e., below 60 hr).  Therefore, the manufacturer should set his guarantee at 60 hr.

2.      Sailors shorter than 67 inches fall into an area of the normal curve from the lower extremity to the line marking 1 SD below the mean (since the mean is 69 in. and the Standard Deviation 2 in.).  In a normal distribution, the area to the left of 1 SD below the mean is approximately 16 percent of the total population.  Similarly, sailors taller than 73 in. fall into an area more than 2 SD to the right of the mean.  In a normal distribution, the area more than 2 SD to the right of the mean is equal to approximately 2.5 percent of the total population.

3.      In Subject A your score of 75 is 5 marks below the mean (of 80).  Since the Standard Deviation is 5, this is equivalent to 1 Standard Deviation below the mean (or a z-score of -1).  Since the marks are normally distributed, the percentage of students getting better marks than you includes the entire population to the right of one Standard Deviation below the mean, or 84 percent.  In Subject B your mark of 75 is 5 marks above the mean (or a z-score of 2, since the Standard Deviation is 2.5).  Thus, the students who did better than you are those in the area to the right of two Standard Deviations above the mean, or 2.5 percent.  The percentile rank is the percentage of students who fared worse than you.  Thus, in the first test, you have a percentile score of 16; in the second test you have a percentile score of 97.5.
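If you wish to check these answers by computer, the Python sketch below recalculates all three using the NormalDist class from the standard statistics module.  The helper name z_score is our own.  Because the program uses the more exact areas under the curve rather than the rounded values adopted in this section, its figures will differ slightly from those above.

```python
from statistics import NormalDist

def z_score(value, mean, sd):
    """Number of standard deviations by which value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

standard = NormalDist()  # the standard normal curve: mean 0, standard deviation 1

# Question 1: share of batteries failing before the 60 hr guarantee
batteries = NormalDist(mu=80, sigma=10)
print(f"Batteries failing before 60 hr: {100 * batteries.cdf(60):.2f}%")

# Question 2: sailors shorter than 67 in. and taller than 73 in.
sailors = NormalDist(mu=69, sigma=2)
print(f"Sailors shorter than 67 in.: {100 * sailors.cdf(67):.2f}%")
print(f"Sailors taller than 73 in.: {100 * (1 - sailors.cdf(73)):.2f}%")

# Question 3: a mark of 75 in each subject
for subject, mean, sd in (("A", 80, 5), ("B", 70, 2.5)):
    z = z_score(75, mean, sd)
    below = standard.cdf(z)  # share of students with lower marks
    print(f"Subject {subject}: z = {z:+.1f}, "
          f"better than you: {100 * (1 - below):.2f}%, "
          f"percentile rank: {100 * below:.2f}")
```

The z-scores come out as -1 and +2, as in the answer to Question 3, and the percentages agree with the rounded answers to within about half a percentage point.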


Notes to Section Five  

(1) The adjective normal does not mean "usual" or "customary" (although such a distribution is, in fact, quite common), but comes from "normative," meaning ideal.

(2) The credit for first recognizing and developing the properties of the normal curve is generally given to the French-born mathematician Abraham de Moivre, 1667 to 1754, who worked in England, an acquaintance of Newton's and a member of the Royal Society, who used as his statistical laboratory the London coffee houses where all sorts of gambling went on.  The basic principle underlying Normal Distribution is that any data which are influenced by many small and unrelated random effects (like, for example, weight) are going to be normally distributed (at least to a very near approximation).  This principle is called the Central Limit Theorem.  See Appendix E for an illustration of how combining independent random effects produces a normal distribution.

(3) These percentage figures are approximate.  The more exact figures are as follows: the area between the mean and one standard deviation contains 34.13 percent of all results on either side of the mean; the area between the mean and two standard deviations contains 47.72 percent of all results on either side of the mean; the area between the mean and three standard deviations contains 49.87 percent of all results on either side of the mean.  For a complete layout of the area under the normal curve at different standard deviations see Table A in Appendix B.  For the purpose of our exercises we will use the approximate values given above, except where noted.


