Consider the final round scores in the 2002 NEC World Golf Championship:
65
69
68
68
68
67
70
71
71
72
69
69
71
72
70
70
71
74
67
67
68
69
72
71
72
72
72
70
70
71
71
71
72
73
65
71
74
72
74
70
75
72
72
75
71
70
73
72
70
70
78
74
74
71
73
71
68
74
73
70
69
68
77
72
70
70
74
73
70
69
78
74
73
69
84
75
73
These are the final round scores of all 77 golfers who participated. In other words, this is the distribution of final round scores.
It is difficult to get a sense of the overall distribution by just looking at the raw scores. Instead, we use several descriptive statistical methods to summarize, simplify, and describe the distribution.
Three characteristics completely describe a distribution:
shape, central tendency, and variability. We'll be talking
about central tendency (roughly, the center of the distribution) and
variability (how spread out the distribution is) in future chapters.
In a symmetrical distribution, it is
possible to draw a vertical line through the middle so that one side of
the distribution is an exact mirror image of the other.
In a skewed distribution, the scores tend to pile up toward
one end of the scale and taper off gradually at the other end.
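The section where the scores taper off towards one end of a distribution
is called the tail of the distribution.
In a negatively skewed distribution, the tail points to the left, toward the low end of the scale.
In a positively skewed distribution, the tail points to the right, toward the high end of the scale.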
Kurtosis is a relative measure of the body and tail portions of the distribution.
Distributions that are "flat" are platykurtic.
Distributions that are "peaked" are leptokurtic.
In addition to the shapes mentioned above, one should also look for whether a distribution is uni-modal or multi-modal.
If there are two (or more) clear peaks, then the distribution is bi-modal (or multi-modal if more than two).
Central tendency is a statistical measure that identifies a single score as representative of an entire distribution. The goal of central tendency is to find the single score that is most typical or most representative of the entire group.
We will focus on three measures of central tendency: the mean, the median, and the mode. All are measures of central tendency, but for some distributions, some are more meaningful or appropriate than the others.
Variability provides a quantitative measure of the degree to which scores in a distribution are spread out or clustered together.
In other words, variability refers to the degree of "differentness" of the scores in the distribution. High variability means that the scores differ by a lot, while low variability means that the scores are all similar ("homogeneity").
We'll concentrate on three measures of variability: the range, the interquartile range, and the standard deviation.
1) A frequency distribution table is an organized tabulation of the number of individuals located in each category on the scale of measurement.
Notice that if you add up the frequency column, you get the total number of
observations:
Σf = N
_____________________________
  X     f      %     c%
 84     1    1.3    100
 83     0    0     98.7
 82     0    0     98.7
 81     0    0     98.7
 80     0    0     98.7
 79     0    0     98.7
 78     2    2.6   98.7
 77     1    1.3   96.1
 76     0    0     94.8
 75     3    3.9   94.8
 74     8   10.4   90.9
 73     7    9.1   80.5
 72    12   15.6   71.4
 71    12   15.6   55.8
 70    13   16.9   40.3
 69     7    9.1   23.4
 68     6    7.8   14.3
 67     3    3.9    6.5
 66     0    0      2.6
 65     2    2.6    2.6
_____________________________
       77  100
If you wanted to know what the total of all of the X's was, how would you
do it? The easiest way would be to multiply the X and f columns
and then add (sum) the results:
Σ(Xf)
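As a rough check on the frequency-column total and the sum of X times f described above, here is a minimal Python sketch using the golf frequency table from these notes. The names `freq`, `n`, and `sum_xf` are only illustrative.

```python
# Frequency table for the golf scores: score X -> frequency f.
# Scores with f = 0 are left out; they add nothing to either sum.
freq = {84: 1, 78: 2, 77: 1, 75: 3, 74: 8, 73: 7, 72: 12,
        71: 12, 70: 13, 69: 7, 68: 6, 67: 3, 65: 2}

# Sum of f: adding up the frequency column gives N.
n = sum(freq.values())

# Sum of Xf: multiply each score by its frequency, then add the results.
sum_xf = sum(x * f for x, f in freq.items())

print(n)       # 77
print(sum_xf)  # 5493
```

Adding the 77 raw scores one at a time gives the same total; the table just groups identical scores first.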
Percentages. What percent of the group got this value for X? How
do you get this?
f / N * 100
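The % and c% (cumulative percent) columns of the golf table can be rebuilt the same way. This is a small sketch, assuming the c% column counts everyone at or below a given score; the variable names are made up for the example.

```python
# Frequency table for the golf scores: score X -> frequency f.
freq = {84: 1, 78: 2, 77: 1, 75: 3, 74: 8, 73: 7, 72: 12,
        71: 12, 70: 13, 69: 7, 68: 6, 67: 3, 65: 2}
n = sum(freq.values())

rows = []
running_total = 0
for x in sorted(freq):                      # work up from the lowest score
    f = freq[x]
    running_total += f                      # golfers at or below this score
    pct = f / n * 100                       # f / N * 100
    cum_pct = running_total / n * 100       # cumulative percent
    rows.append((x, f, round(pct, 1), round(cum_pct, 1)))

# Print the highest score first, the way the table is laid out.
for x, f, pct, cum_pct in reversed(rows):
    print(f"{x:>3} {f:>3} {pct:>5.1f} {cum_pct:>6.1f}")
```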
For a histogram, vertical bars are drawn above each score so
that 1) the height of the bar corresponds to the frequency, and 2) the width
of the bar extends to the real limits of the score. A histogram is
used when the data are measured on an interval or a ratio scale.
For a bar graph, a vertical bar is drawn above each score
(or category) so that 1) the height of the bar corresponds to the
frequency, and 2) there is a space separating each bar from the next. A bar
graph is used when the data are measured on a nominal or an ordinal
scale.
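To get a quick look at the shape without any graphing software, one option is a text histogram turned on its side, with one row of asterisks per score. This is only a rough sketch; the function name `histogram_lines` is made up for the example, and a real histogram would have vertical bars that touch.

```python
# Frequency table for the golf scores: score X -> frequency f.
freq = {84: 1, 78: 2, 77: 1, 75: 3, 74: 8, 73: 7, 72: 12,
        71: 12, 70: 13, 69: 7, 68: 6, 67: 3, 65: 2}

def histogram_lines(freq):
    """Return one text row per score, highest score first."""
    lines = []
    for x in range(max(freq), min(freq) - 1, -1):
        bar = "*" * freq.get(x, 0)          # bar length = frequency f
        lines.append(f"{x} | {bar}".rstrip())
    return lines

print("\n".join(histogram_lines(freq)))
```

Scores that nobody shot (such as 79 through 83) still get an empty row, so the gap before the 84 stays visible.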
Stem and leaf displays - These displays break each number
down into a left part called the stem and a right part called the leaf. If
numbers are two digits, then the left digit is the stem and the right
digit is the leaf. The display gives you a picture of the distribution, and you can still recover all of the individual
data points.
8 |
8 | 4
7 | 555788
7 | 0000000000000111111111111222222222222333333344444444
6 | 557778888889999999
6 |
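The display above uses each stem twice: one row for leaves 5 through 9 and one row for leaves 0 through 4. A minimal Python sketch of that layout, assuming two-digit scores and rebuilding the raw scores from the frequency table, might look like this (all names are illustrative):

```python
from collections import defaultdict

# Frequency table for the golf scores: score X -> frequency f.
freq = {84: 1, 78: 2, 77: 1, 75: 3, 74: 8, 73: 7, 72: 12,
        71: 12, 70: 13, 69: 7, 68: 6, 67: 3, 65: 2}

# Rebuild the 77 individual scores, sorted from lowest to highest.
scores = sorted(x for x, f in freq.items() for _ in range(f))

# Group leaves by (stem, half): half 0 holds leaves 0-4, half 1 holds leaves 5-9.
leaves = defaultdict(str)
for x in scores:
    stem, leaf = divmod(x, 10)      # left digit is the stem, right digit is the leaf
    leaves[(stem, leaf // 5)] += str(leaf)

# One row per half-stem, highest first.
stems = range(max(scores) // 10, min(scores) // 10 - 1, -1)
display = [f"{stem} | {leaves[(stem, half)]}".rstrip()
           for stem in stems for half in (1, 0)]
print("\n".join(display))
```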
There are a number of different measures of center. Which one is appropriate largely depends on the kind of variable and the shape of the distribution. So consider these three distributions:
Where is the single value that is most representative of the entire distribution? For the first, it is 5. For the second, is it 7 or 5? (This one is negatively skewed.) For the third, is it 5? Nobody is at 5. This one is bimodal, so it may be most appropriate to talk about it having two middles - more on this in a bit.
The most commonly known measure of central tendency is the arithmetic average, or the mean. We've already talked about how you would go about figuring this out from the data in a frequency distribution table.
The mean for a distribution is the sum of the scores divided by the number of scores.
The formula for the mean is:
mean = sum of all scores (X's) divided by the total number (N)
We can think of the mean in a couple of different ways.
Weighted means
The weighted mean of two (or more) groups is found by adding the groups' sums and dividing by the sum of the sample sizes, e.g., weighted mean = (ΣX1 + ΣX2) / (n1 + n2)
So suppose that I were to decide to make up my grading scale collapsing over all of my sections of stats. If I know that one section (n = 20) had a mean of 5 and the other (n = 30) had a mean of 6, how would I figure out the weighted mean? Each section's sum is its n times its mean, so:
[(20)(5) + (30)(6)] / (20 + 30) = (100 + 180) / 50 = 280 / 50 = 5.6
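The two-section example can be checked with a few lines of Python. This is a sketch only; `weighted_mean` is a helper written for this example, and it takes each group as an (n, mean) pair.

```python
def weighted_mean(groups):
    """Combine group means; `groups` is a list of (n, mean) pairs."""
    total_sum = sum(n * mean for n, mean in groups)   # each group's sum is n * mean
    total_n = sum(n for n, _ in groups)               # sum of the sample sizes
    return total_sum / total_n

# One section with n = 20 and mean 5, another with n = 30 and mean 6.
print(weighted_mean([(20, 5), (30, 6)]))  # 5.6
```

Notice that the answer is closer to 6 than to 5 because the larger section counts for more; simply averaging the two means would give 5.5.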
2) If you add (or subtract) a constant to each score, then the mean changes by that same constant. - Suppose that you want to factor out the fact that each girl spent $2 buying supplies for the bake sale. So you want to subtract 2 from each amount. Now the total is $180, so the mean is 180/10 = $18. But notice you could have just subtracted $2 from the previous mean of $20 and arrived at the same answer.
3) If you multiply (or divide) each score by a constant, then the mean is multiplied (or divided) by that constant. - Suppose that the troop sponsor agreed to match the money made by each Girl Scout. That is, they agree to give each Girl Scout an additional amount of money equal to however much she made on the sale. So now the total is $400, and the mean for each girl is 400/10 = $40, which is just the previous mean of $20 multiplied by 2.
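Both properties are easy to confirm numerically. The sketch below assumes the bake sale amounts are the ten dollar values listed in the median example in these notes (they total $200, which matches the mean of $20 used here).

```python
from statistics import mean

# Bake sale amounts in dollars, one per Girl Scout.
amounts = [8, 10, 12, 15, 15, 18, 18, 19, 25, 60]

original = mean(amounts)                        # 200 / 10
after_supplies = mean([x - 2 for x in amounts]) # subtract $2 from every score
after_matching = mean([x * 2 for x in amounts]) # sponsor doubles every score

print(original, after_supplies, after_matching)
```

The three results are 20, 18, and 40 dollars: subtracting 2 from every score lowers the mean by 2, and doubling every score doubles the mean.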
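So how do we find the median? Let's start by assuming that we have discrete categories.
1) With an odd number of scores, list them in order from lowest to highest. The median is the middle score in the list.
3, 4, 4, 5, 5, 5, 6, 6, 7
There are nine scores, so the median is the fifth one: 5.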
2) With an even number of scores, just list them in order from lowest to
highest. Then find the middle two scores and determine the point
exactly midway between them. To do this add them together and
divide by two.
So what is the median for our Girl Scouts?
$8, 10, 12, 15, 15, 18, 18, 19, 25, 60
The middle two are 15 and 18, so (15 + 18) / 2 = 33 / 2 = 16.5.
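The procedure for both cases (an odd or an even number of scores) can be written out in a few lines. This is a minimal sketch; `middle` is a name made up for this example, and Python's own `statistics.median` is used only as a cross-check.

```python
from statistics import median

def middle(scores):
    """Median of a list of scores, following the steps in these notes."""
    ordered = sorted(scores)                 # list from lowest to highest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                           # odd: take the middle score
        return ordered[mid]
    # even: point midway between the two middle scores
    return (ordered[mid - 1] + ordered[mid]) / 2

nine_scores = [3, 4, 4, 5, 5, 5, 6, 6, 7]
amounts = [8, 10, 12, 15, 15, 18, 18, 19, 25, 60]

print(middle(nine_scores))  # 5
print(middle(amounts))      # 16.5
print(median(amounts))      # 16.5
```

Notice that the $60 girl pulls the mean up to $20, but the median stays at $16.50.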
In a frequency distribution, the mode is the score or category that has the greatest frequency.
So the mode is 5.
However, be aware that a frequency distribution may have more than one mode.
So the modes are 2 and 8.
If one were bigger than the other, it would be called the major mode and the other would be the minor mode.
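Finding the mode from a frequency table just means looking for the largest f. The sketch below does this for the golf table, then uses `statistics.multimode` (available in Python 3.8 and later) on a small list. That second list is hypothetical: it is not data from these notes, just scores invented to have two equal peaks at 2 and 8.

```python
from statistics import multimode

# Frequency table for the golf scores: score X -> frequency f.
freq = {84: 1, 78: 2, 77: 1, 75: 3, 74: 8, 73: 7, 72: 12,
        71: 12, 70: 13, 69: 7, 68: 6, 67: 3, 65: 2}

# The mode is the score (or scores) with the greatest frequency.
top = max(freq.values())
golf_modes = [x for x, f in freq.items() if f == top]
print(golf_modes)  # [70]

# Hypothetical scores with two equal peaks, at 2 and at 8.
two_peaks = [1, 2, 2, 2, 3, 5, 7, 8, 8, 8, 9]
print(multimode(two_peaks))  # [2, 8]
```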
- You cannot find a mean or median for a nominal scale; however, you can find a mode for a nominal scale.
- Use the median if:
2) There are undetermined values - if for some reason you don't know the value of one (or more) of your items (e.g., the person died before answering your question).
3) Your distribution is "open-ended" - by this we mean that there is no upper or lower limit on the possible values of your variable (e.g., the top answer on your questionnaire is "5 or more").
4) Your data are on an ordinal scale (rankings).
symmetric distribution: mean = median = mode
positively skewed distribution: mode < median < mean
negatively skewed distribution: mean < median < mode
bimodal distribution: mean = median, 2 modes
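The golf scores from the start of these notes make a handy test of these orderings. The sketch below rebuilds the 77 scores from the frequency table and compares the three measures; the variable names are illustrative.

```python
from statistics import mean, median, mode

# Frequency table for the golf scores: score X -> frequency f.
freq = {84: 1, 78: 2, 77: 1, 75: 3, 74: 8, 73: 7, 72: 12,
        71: 12, 70: 13, 69: 7, 68: 6, 67: 3, 65: 2}

# Rebuild the 77 individual scores from the table.
scores = [x for x, f in freq.items() for _ in range(f)]

golf_mode = mode(scores)
golf_median = median(scores)
golf_mean = round(mean(scores), 2)

print(golf_mode, golf_median, golf_mean)  # 70 71 71.34
```

Here mode < median < mean, which is the pattern for a positively skewed distribution. That fits the data: most scores sit in the low 70s, with a tail reaching up to the single 84.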