So far we've discussed two of the three characteristics used to describe distributions; now we need to discuss the remaining one: variability. Notice that in our distributions not every score is the same (e.g., not everybody gets the same score on the exam). So what we need to do is describe how varied the results are, which roughly means describing the width of the distribution.
Variability provides a quantitative measure of the degree to which scores in a distribution are spread out or clustered together.
Consider the two distributions to the right. They have the same shape (unimodal and symmetric) and the same center. However, they differ in how dispersed the scores are around that center. The blue distribution has a lot of scores that are very far from the center, while most of the scores in the red distribution are very near the center. This difference is a difference in variability.
In other words, variability refers to the degree of "differentness" of the scores in the distribution. High variability means that the scores differ by a lot, while low variability means that the scores are all similar (homogeneous).
We'll concentrate on three measures of variability: the quartiles, the interquartile range, and the standard deviation.
Another example: height and weight of baby boys
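Think back to percentiles. The 50th percentile (the median) is the point at which exactly half of the distribution falls on one side and the other half on the other side. The quartiles work the same way: the first quartile (Q1) is the 25th percentile and the third quartile (Q3) is the 75th percentile.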
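| X | f | % | c% |
|---|---|---|---|
| 24 | 1 | 1.67 | 100 |
| 23 | 1 | 1.67 | 98.33 |
| 22 | 1 | 1.67 | 96.67 |
| 21 | 2 | 3.33 | 95.0 |
| 20 | 2 | 3.33 | 91.67 |
| 19 | 2 | 3.33 | 88.33 |
| 18 | 3 | 5.0 | 85.0 |
| 17 | 3 | 5.0 | 80.0 |
| 16 | 3 | 5.0 | 75.0 |
| 15 | 3 | 5.0 | 70.0 |
| 14 | 4 | 6.67 | 65.0 |
| 13 | 5 | 8.33 | 58.33 |
| 12 | 5 | 8.33 | 50.0 |
| 11 | 4 | 6.67 | 41.67 |
| 10 | 3 | 5.0 | 35.0 |
| 9 | 3 | 5.0 | 30.0 |
| 8 | 3 | 5.0 | 25.0 |
| 7 | 3 | 5.0 | 20.0 |
| 6 | 2 | 3.33 | 15.0 |
| 5 | 2 | 3.33 | 11.67 |
| 4 | 2 | 3.33 | 8.33 |
| 3 | 1 | 1.67 | 5.0 |
| 2 | 1 | 1.67 | 3.33 |
| 1 | 1 | 1.67 | 1.67 |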
A related measure of variability is the interquartile range (IQR).
The interquartile range is the distance between the first quartile and the third quartile. So this corresponds to the middle 50% of the scores of our distribution.
So for the above distribution, the interquartile range is IQR = Q3 - Q1 = 8.0.
Note that the interquartile range is often transformed into the semi-interquartile range which is 0.5 of the interquartile range.
$$\text{SIQR} = \frac{Q_3 - Q_1}{2}$$

So for our example the semi-interquartile range is (8.0)(0.5) = 4.0.
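As a rough illustration, the quartiles, IQR, and SIQR can be read off the frequency table with a few lines of Python. This is only a sketch: the helper name `score_at_percentile` is made up for this example, and it assumes one simple convention (the quartile is the lowest score whose cumulative percentage reaches 25 or 75). Other conventions, such as interpolating or using real limits, can give slightly different quartile values.

```python
# Frequency table from the text: score -> frequency (N = 60).
freq = {
    24: 1, 23: 1, 22: 1, 21: 2, 20: 2, 19: 2, 18: 3, 17: 3,
    16: 3, 15: 3, 14: 4, 13: 5, 12: 5, 11: 4, 10: 3, 9: 3,
    8: 3, 7: 3, 6: 2, 5: 2, 4: 2, 3: 1, 2: 1, 1: 1,
}


def score_at_percentile(freq, pct):
    """Lowest score whose cumulative percentage reaches pct (assumed convention)."""
    n = sum(freq.values())
    cumulative = 0
    for score in sorted(freq):
        cumulative += freq[score]
        # Integer comparison avoids floating-point rounding at exactly 25% or 75%.
        if cumulative * 100 >= pct * n:
            return score
    raise ValueError("pct must be at most 100")


q1 = score_at_percentile(freq, 25)   # first quartile
q3 = score_at_percentile(freq, 75)   # third quartile
iqr = q3 - q1                        # interquartile range
siqr = iqr / 2                       # semi-interquartile range
print(q1, q3, iqr, siqr)
```

Under this convention the quartiles line up with the rows of the table where c% is 25.0 and 75.0.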
So the interquartile range focuses on the middle half of all of the scores in the distribution. Thus it is more representative of the distribution as a whole than the range, and extreme scores (i.e., outliers) will not influence the measure (a property sometimes referred to as being robust). However, this still means that 1/2 of the scores in the distribution are not represented in the measure.
A boxplot is a graphic depiction of the five-number summary. The center line represents the median, the red bars at the top and bottom are the quartiles, and the lines extend to the largest and smallest points that are not considered outliers (the rule varies by statistics package; typically points more than about 1.5 IQRs beyond the quartiles are treated as outliers). In SPSS you will get the boxplot of a single distribution in the Descriptive Stats - Explore submenu.
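To make the outlier cutoff concrete, here is a minimal sketch of the common 1.5 IQR rule. The function name `outlier_fences` and the multiplier default are assumptions for illustration, since packages differ in the exact rule. The quartile values 8 and 16 are taken from the rows of the frequency table where c% is 25.0 and 75.0, and the score of 30 is a hypothetical value added only to show a point being flagged.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) cutoffs; points outside them are flagged as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles read from the frequency table (c% = 25.0 at X = 8, c% = 75.0 at X = 16).
lower, upper = outlier_fences(8, 16)

# 30 is a made-up score, not part of the distribution in the text.
candidates = [1, 12, 24, 30]
flagged = [x for x in candidates if x < lower or x > upper]
print(lower, upper, flagged)
```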
In essence, the standard deviation measures how far off all of the individuals in the distribution are from a standard, where that standard is the mean of the distribution.
So to get a measure of the deviation we need to subtract the population mean from every individual in our distribution.
Example: consider the following data set, a distribution of heights (in inches):
69, 67, 72, 74, 63, 67, 64, 61, 69, 65, 70, 60, 75, 73, 63, 63, 69, 65, 64, 69, 65
mean = $\mu$ = 67

$$
\begin{aligned}
\sum (X - \mu) &= (69 - 67) + (67 - 67) + \dots + (65 - 67) \\
&= 2 + 0 + 5 + 7 + (-4) + 0 + (-3) + (-6) + 2 + (-2) + 3 + (-7) + 8 + 6 + (-4) + (-4) + 2 + (-2) + (-3) + 2 + (-2) \\
&= 0
\end{aligned}
$$
Notice that if you add up all of the deviations, they must equal 0. Think about it at a conceptual level. Scores above the mean give positive deviations and scores below the mean give negative deviations, and because the mean is the balance point of the distribution, the two sides cancel each other out when you add them together.
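A quick check of this in Python, using the height data from the example (the variable names are just illustrative):

```python
# Heights (in inches) from the example; treated as a whole population.
heights = [69, 67, 72, 74, 63, 67, 64, 61, 69, 65, 70,
           60, 75, 73, 63, 63, 69, 65, 64, 69, 65]

mu = sum(heights) / len(heights)          # population mean: 1407 / 21
deviations = [x - mu for x in heights]    # X - mu for every individual

print(mu)                # the mean
print(sum(deviations))   # positive and negative deviations cancel
```

Because the deviations always sum to zero, their plain sum cannot serve as a measure of spread, which is why the next step squares them.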
So what we have to do is get rid of the negative signs. We do this by squaring the deviations, averaging the squared deviations, and then taking the square root of that average. The first step is the sum of the squared deviations.
$$
\begin{aligned}
\text{Sum of Squares} = SS = \sum (X - \mu)^2 &= (69 - 67)^2 + (67 - 67)^2 + \dots + (65 - 67)^2 \\
&= 4 + 0 + 25 + 49 + 16 + 0 + 9 + 36 + 4 + 4 + 9 + 49 + 64 + 36 + 16 + 16 + 4 + 4 + 9 + 4 + 4 \\
&= 362
\end{aligned}
$$
The equation that we just used ($SS = \sum (X - \mu)^2$) is referred to as the definitional formula for the Sum of Squares. However, there is another way to compute the SS, referred to as the computational formula. The two equations are mathematically equivalent, but sometimes one is easier to use than the other. The advantage of the computational formula is that it works with the X values directly.
The computational formula for SS is:
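$$SS = \sum X^2 - \frac{\left(\sum X\right)^2}{N}$$

So for our example:

$$
\begin{aligned}
SS &= \left[(69)^2 + (67)^2 + \dots + (69)^2 + (65)^2\right] - \frac{(69 + 67 + \dots + 69 + 65)^2}{21} \\
&= 94631 - \frac{(1407)^2}{21} \\
&= 94631 - 94269 \\
&= 362
\end{aligned}
$$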
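The equivalence of the two formulas can be checked numerically. The sketch below computes SS both ways for the height data; the variable names are illustrative only.

```python
# Heights (in inches) from the example; treated as a whole population.
heights = [69, 67, 72, 74, 63, 67, 64, 61, 69, 65, 70,
           60, 75, 73, 63, 63, 69, 65, 64, 69, 65]

n = len(heights)
mu = sum(heights) / n

# Definitional formula: sum of squared deviations from the mean.
ss_definitional = sum((x - mu) ** 2 for x in heights)

# Computational formula: sum of X squared, minus (sum of X) squared over N.
sum_x = sum(heights)
sum_x_squared = sum(x ** 2 for x in heights)
ss_computational = sum_x_squared - sum_x ** 2 / n

print(sum_x, sum_x_squared)
print(ss_definitional, ss_computational)
```

With real-valued data the two results can differ by tiny rounding errors, so comparing them with a tolerance is safer than testing for exact equality.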
Now we have the sum of squares (SS). However, the SS depends on the number of individuals in the population, so what we really want is the population variance, which is simply the average of the squared deviations. To get that average, we divide the SS by the number of individuals in the population:

$$\sigma^2 = \frac{SS}{N} = \frac{362}{21} \approx 17.24$$

The population standard deviation is the square root of the population variance:

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{SS}{N}} \approx 4.15$$
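The same steps in Python, as a small sketch. The hand computation is cross-checked against `statistics.pvariance` and `statistics.pstdev` from the standard library, which treat the data as an entire population.

```python
import math
import statistics

# Heights (in inches) from the example; treated as a whole population.
heights = [69, 67, 72, 74, 63, 67, 64, 61, 69, 65, 70,
           60, 75, 73, 63, 63, 69, 65, 64, 69, 65]

n = len(heights)
mu = sum(heights) / n
ss = sum((x - mu) ** 2 for x in heights)   # sum of squares

pop_variance = ss / n                      # average squared deviation
pop_sd = math.sqrt(pop_variance)           # back in the original units (inches)

print(round(pop_variance, 2), round(pop_sd, 2))

# Cross-check with the standard library's population versions.
print(statistics.pvariance(heights), statistics.pstdev(heights))
```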
To review:
- for a sample, use the sample mean instead of $\mu$ in the computation of SS
- need to adjust the computation to take into account that a sample will typically be less variable than the corresponding population.
- if you have a good, representative sample, then your sample and population means should be very similar, and the overall shape of the two distributions should be similar. However, notice that the variability of the sample is smaller than the variability of the population.
- to account for this, the sample variance is computed by dividing SS by n - 1 rather than just n

$$\text{sample variance} = s^2 = \frac{SS}{n - 1}$$
- and the same is true for sample standard deviation
So what we're doing when we subtract 1 from n is using degrees of freedom to adjust our sample deviations so that the sample variance is an unbiased estimate of the population variance.
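One way to see why dividing by n - 1 gives an unbiased estimate is to enumerate every possible sample from a small population and average the results. The sketch below does this exactly, using fractions rather than floats. The setup is an assumption made for illustration: a made-up population of eight scores, samples of size 2, and sampling with replacement.

```python
from fractions import Fraction
from itertools import product

# A small made-up population (assumption for illustration only).
population = [1, 2, 3, 4, 4, 5, 6, 7]
N = len(population)
mu = Fraction(sum(population), N)
pop_variance = sum((x - mu) ** 2 for x in population) / N


def sum_of_squares(sample):
    """SS of a sample, computed around the sample's own mean."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample)


n = 2
# Every possible ordered sample of size n, drawn with replacement (8 * 8 = 64).
samples = list(product(population, repeat=n))

# Average the two candidate estimates over all possible samples.
avg_dividing_by_n = sum(sum_of_squares(s) / n for s in samples) / len(samples)
avg_dividing_by_n_minus_1 = sum(sum_of_squares(s) / (n - 1) for s in samples) / len(samples)

print(pop_variance, avg_dividing_by_n, avg_dividing_by_n_minus_1)
```

Dividing by n underestimates the population variance on average, while dividing by n - 1 recovers it exactly in this setup.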
What are degrees of freedom? Think of it this way. You know what the sample mean is ahead of time (you have to, in order to figure out the deviations). So you can vary all but one item in the distribution, but the last item is fixed: there is only one value it can take for the mean to come out to what it is. So n - 1 means that all the values but one can vary.
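A tiny sketch of that idea, using eight illustrative scores: once the mean and the first n - 1 values are known, the last value can be computed rather than chosen.

```python
# Eight illustrative sample scores.
sample = [1, 2, 3, 4, 4, 5, 6, 7]
n = len(sample)
mean = sum(sample) / n

# Pretend we only know the mean and the first n - 1 scores.
free_values = sample[:-1]

# The total must equal n * mean, so the last score is forced.
last_value = n * mean - sum(free_values)

print(mean, last_value)
```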
Example:
Okay, so let's do an example of computing the standard deviation of a sample. The sample scores are 1, 2, 3, 4, 4, 5, 6, 7 (n = 8).
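step 1: compute the SS

You can use the definitional formula, with the sample mean $\bar{X}$ in place of $\mu$:

$$SS = \sum (X - \bar{X})^2$$

-- OR --

You can still use the computational formula to get SS (with the sample size n in place of N):

$$
\begin{aligned}
SS &= \sum X^2 - \frac{\left(\sum X\right)^2}{n} \\
&= (1 + 4 + 9 + 16 + 16 + 25 + 36 + 49) - \frac{(1 + 2 + 3 + 4 + 4 + 5 + 6 + 7)^2}{8} \\
&= 156 - 128 = 28.0
\end{aligned}
$$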
step 2: determine the variance of the sample (remember it is a sample, so we need to take this into account)
$$\text{sample variance} = s^2 = \frac{SS}{n - 1} = \frac{28}{8 - 1} = \frac{28}{7} = 4.0$$

step 3: determine the standard deviation of the sample

$$s = \sqrt{s^2} = \sqrt{4.0} = 2.0$$
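The three steps can be reproduced in Python as a sketch. The hand computation is cross-checked against `statistics.variance` and `statistics.stdev` from the standard library, which divide by n - 1.

```python
import math
import statistics

# Sample scores from the worked example.
sample = [1, 2, 3, 4, 4, 5, 6, 7]
n = len(sample)

# Step 1: sum of squares around the sample mean.
mean = sum(sample) / n
ss = sum((x - mean) ** 2 for x in sample)

# Step 2: sample variance divides by n - 1 (degrees of freedom), not n.
sample_variance = ss / (n - 1)

# Step 3: sample standard deviation is the square root of the variance.
sample_sd = math.sqrt(sample_variance)

print(ss, sample_variance, sample_sd)

# Cross-check with the standard library's sample versions.
print(statistics.variance(sample), statistics.stdev(sample))
```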
Consider the following data set:
3, 4, 4, 5, 5, 6, 8
1) Compute the sum of squares for this data set.
2) Assume that the data are all the scores in the population. Compute the standard deviation.
3) Assume that the data are a sample. Compute the standard deviation.
4) How do your answers to (2) and (3) compare? Is that what you expected?
5) Add 2 to every score in the data set. How does your standard deviation change? Why?