In the last lab we learned to use the inferential statistical procedure of
estimation. In this lab (and the next) we'll learn about a related
inferential procedure: hypothesis testing.
Hypothesis testing is an inferential procedure that uses
sample data to evaluate the credibility of a hypothesis about a population.
Hypothesis testing - the big picture view (more details will
follow)
Step 1: State assumptions, make a hypothesis, and select a criterion for
the decision
- the assumptions are related to the stat test you'll do and we'll
talk more about those as we discuss each individual test
- your hypothesis is an educated guess/prediction about the effect of
particular events/treatments/factors (which result in differences
between populations)
- your hypothesis may be general (e.g., this course will change
comprehension abilities), or specific (e.g., this course will
improve comprehension abilities by at least 10%).
Step 2: Collect a sample
- randomly select individuals from a population
- randomly assign selected individuals to specific treatment groups
- after the treatment, the question that we have is, roughly: are all of
our individuals still in the same population, or do some individuals
now belong to a new population because of our treatment?
Step 3: Compute a test statistic
- things like z-scores, t-tests, F-tests (ANOVA)
Step 4: Compare the test statistic to a distribution to make an
inference about the parameter and hence draw a conclusion about the
population
- roughly, how likely is it that this difference is due to sampling error?
Given this probability, what should we conclude?
The reasoning of statistical tests, like that of confidence intervals, is
based on asking what would happen if we repeated the experiment over and
over again.
Let's look at each of these steps in more detail.
Step 1: Make a hypothesis and select a criterion for the decision
The standard logic that underlies hypothesis testing is that there are
always (at least) two hypotheses: the null hypothesis and the
alternative hypothesis.
The null hypothesis (H0) predicts that the independent variable
(treatment) has no effect on the dependent variable for the population.
The alternative hypothesis (Ha) predicts that the independent variable
will have an effect on the dependent variable for the population.
The hypothesis testing procedure assumes we are trying to reject the
null hypothesis, not trying to prove the alternative hypothesis.
Why?
Generally, it is easier to show that something isn't true, than
to prove that it is. This is especially true when we are
dealing with samples. Remember that we aren't testing
every individual in the population, only a subset.
Think about it this way. Suppose we had a hypothesis that all
dogs have 4 legs. To reject this hypothesis, we'd need to have a
sample which includes 1 or more dogs with more or fewer than 4 legs.
To accept it, we'd need to examine every dog in the
population and count their legs. It's much easier to get a sample to
show it's wrong than to test the whole population to show that it's
correct.
Example: Suppose that we know that in the US, on average, 30% of
registered voters vote in each election. We want to increase that number
with an ad campaign encouraging more people to vote. So we conduct the
ad campaign before a major election and then record the percentage of
registered voters who vote in that election.
What will our hypotheses be in this case? H0 states that the
independent variable will have no effect, so our H0 is that μ = 30%
(indicating no effect of the ad campaign). Our Ha is the opposite: that
μ will not equal 30%.
Alternatively, we could make a specific alternative hypothesis if we
chose. This would change our H0 too. Let's consider the specific case
above where we expect that the ad campaign will INCREASE voting. This
means that we expect a higher voting rate for our sample than in the
population (30%). Here our Ha is that μ > 30%. That means that our H0 is
μ ≤ 30%.
Try some on your own. Each of the following situations calls for a
significance test for a population mean μ. State the null hypothesis H0
and the alternative hypothesis Ha in each case.
(1) The diameter of a spindle in a small motor is supposed to be 5 mm. If
the spindle is either too small or too large, the motor will not work
properly. The manufacturer measures the diameter in a sample of motors to
determine whether the mean diameter has moved away from the target.
(2) Census Bureau data show that the mean household income in the area
served by a shopping mall is $52,500 per year. A market research firm
questions shoppers at the mall. The researchers suspect the mean
household income of mall shoppers is higher than that of the general
population.
(3) The examinations in a large psychology class are scaled after grading
so that the mean score is 50. The professor thinks that one teaching
assistant is a poor teacher and suspects that his students have a lower
mean than the class as a whole. The TA's students this semester can be
considered a sample from the population of all students in the course, so
the professor compares their mean score with 50.
So part of the first step is to set up your null hypothesis and your
alternative hypothesis (which we did above).
The other part of this step is to decide what criterion you are going
to use to either reject or fail to reject (not accept) the null
hypothesis. This is sometimes referred to as setting your α level
(that's the alpha level).
So consider the problem that we have. We have a sample and its
descriptive statistics are different from the population's parameters. How do we
decide whether the difference that we see is due to a "real" difference
(which reflects a difference between two populations) or is due to
sampling error?
To deal with this problem the researcher must set a criterion in advance.
For example, think of the kinds of questions we were doing in earlier
labs. Given a population with μ = 65 and σ = 10, what is the probability
that our sample (of size n = 25) will have a mean of 70 or more?
To figure this out we computed the standard error and then a z-score:
$\sigma_{\bar{X}} = \sigma / \sqrt{n} = 10 / \sqrt{25} = 2$
$z = (\bar{X} - \mu) / \sigma_{\bar{X}} = (70 - 65) / 2 = 2.5$
$p(\bar{X} \ge 70) = p(z \ge 2.5) = 0.0062$
We're going to be asking the same questions here, but taking it a step
further and saying things like, "Gee, the probability that my sample has a
mean of 70 or higher is 0.0062. That's pretty small. I'll bet that my
sample isn't really from this population, but is instead from another population."
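The standard error, z-score, and tail probability from this example can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the function name `upper_tail_probability` and its parameter names are made up for illustration and are not part of the lab.

```python
from math import sqrt
from statistics import NormalDist


def upper_tail_probability(sample_mean, mu, sigma, n):
    """Return (standard_error, z, p), where p = P(sample mean >= sample_mean)."""
    standard_error = sigma / sqrt(n)           # spread of the sample means
    z = (sample_mean - mu) / standard_error    # distance from mu in standard errors
    p = 1 - NormalDist().cdf(z)                # area above z under the standard normal
    return standard_error, z, p


# Values from the example: mu = 65, sigma = 10, n = 25, sample mean of 70 or more
se, z, p = upper_tail_probability(sample_mean=70, mu=65, sigma=10, n=25)
print(f"standard error = {se:.2f}, z = {z:.2f}, p = {p:.4f}")
# prints: standard error = 2.00, z = 2.50, p = 0.0062
```

The unit normal table and `NormalDist().cdf` give the same area, so either can be used to look up the probability.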
Setting a criterion in advance is concerned with this part about saying
"that's pretty small". When we set the criterion in advance, we are
essentially asking: how small a chance is small enough to reject the
null hypothesis? Or in other words, how big a difference do I need to
have to reject the null hypothesis? This cutoff p value is called
alpha (α).
Note: often alpha is determined by convention within your own
discipline. For example, some fields may say that p < 0.05 is low enough
to reject the H0, while other fields may choose p ≤ 0.01 as the
criterion.
Now let's look at some examples of this procedure (like the ones
we did in lab13)
with our new context of how small p is.
(4) A bottling company uses a filling machine to fill plastic bottles with
cola. The bottles are supposed to contain 300 milliliters (ml). In fact
the contents vary according to a normal distribution with a mean μ = 298 ml
and a standard deviation σ = 3 ml. What is the probability that the mean
contents of the bottles in a six-pack is less than 295 ml? How small is
this probability (i.e., do you think it is very likely that a sample of
6 bottles would have an average contents of less than 295 ml)?
(5) IQ scores for the general population form a normal distribution with
μ = 100 and σ = 15. However, there are data that indicate that
children's intelligence can be affected if their mothers have German
measles during pregnancy. Using hospital records, a researcher obtained a
sample of n = 20 school children whose mothers all had German measles
during their pregnancies. The average IQ for this sample was 97.3. If
this sample came from the general population described in the first
sentence, what is the probability of obtaining a sample mean this low
or lower [Hint: you're looking for $p(\bar{X} < 97.3)$]?
Assume that p < alpha is low enough to reject H0 and to conclude that
the sample is different because the children's mothers had German
measles. Use alpha = .05. Do you think that there is enough evidence
here to decide that the sample came from a different population than
the general one described above?
That's the big picture of setting the criteria, now let's look at the
details.
What are the possible real world situations?
- H0 is correct
- H0 is wrong
What are the possible conclusions?
- H0 is correct
- H0 is wrong
So this sets up four possibilities (2 * 2):
- 2 ways of making mistakes
- 2 chances to be correct
| Experimenter's conclusion | Actual situation:     | Actual situation:      |
|                           | H0 is correct         | H0 is wrong            |
|---------------------------|-----------------------|------------------------|
| Reject H0                 | Type I error (oops!)  | correct (Yay!)         |
| Fail to reject H0         | correct (Yay!)        | Type II error (oops!)  |
The two kinds of errors each have their own name, because they really
are reflecting different things.
Type I error (α, alpha) - the H0 is actually correct, but the
experimenter rejected it
- e.g., there really is only one population; even though the
probability of getting your sample was really small, you just got
one of those rare samples
Type II error (β, beta) - the H0 is really wrong, but the experiment
didn't give us the evidence we need to reject it
- e.g., your sample really does come from another population, but your
sample mean is so close to the original population mean that you
can't rule out the possibility that there is only one population
In scientific research, we typically take a conservative approach and set
our criterion so that we minimize the chance of making a Type I error
(concluding that there is an effect of something when there really
isn't). In other words, scientists focus on setting an acceptable alpha
level (α), or level of significance.
The alpha level (α), or level of significance, is a probability value
that defines the very unlikely sample outcomes when the null hypothesis
is true. Whenever an experiment produces very unlikely data (as
defined by alpha), we will reject the null hypothesis. Thus, the
alpha level also defines the probability of a Type I error - that is,
the probability of rejecting H0 when it is actually true.
Note: In psychology α is usually set at 0.05.
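One way to see what the two error rates mean is to repeat a made-up experiment many times and count the wrong decisions. The sketch below is only an illustration under stated assumptions: it reuses μ = 65, σ = 10 and n = 25 from the earlier example, uses a two-tailed z-test with alpha = .05, and the "true" mean of 68 for the case where H0 is wrong is a hypothetical value chosen here, not something from the lab. The function name `rejection_rate` is also made up.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def rejection_rate(true_mu, null_mu, sigma, n, alpha, trials, seed=0):
    """Fraction of simulated experiments in which a two-tailed z-test rejects H0."""
    rng = random.Random(seed)                           # fixed seed: repeatable runs
    standard_error = sigma / sqrt(n)
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)    # about 1.96 when alpha = .05
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        z = (mean(sample) - null_mu) / standard_error
        if abs(z) >= critical_z:                        # sample mean is in a critical region
            rejections += 1
    return rejections / trials


# H0 is correct (true mean equals the null mean): every rejection is a Type I error.
type_i_rate = rejection_rate(true_mu=65, null_mu=65, sigma=10, n=25,
                             alpha=0.05, trials=5000)

# H0 is wrong (hypothetical true mean of 68): every failure to reject is a Type II error.
type_ii_rate = 1 - rejection_rate(true_mu=68, null_mu=65, sigma=10, n=25,
                                  alpha=0.05, trials=5000)

print(f"Type I error rate:  {type_i_rate:.3f}")
print(f"Type II error rate: {type_ii_rate:.3f}")
```

The Type I error rate should land close to alpha, because alpha is exactly the share of samples that fall in the critical regions when H0 is true. The Type II error rate depends on how far the true mean is from the null mean, which is why it cannot be set in advance the same way.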
Let's look at pictures of distributions to try and connect this
with what we've been talking about so far.
Consider the following sample mean distributions.
[Figure: distribution of sample means with the critical regions shaded;
α = probability of making a Type I error]
General alternative hypothesis (two-tailed test)
H0: no difference
Ha: there is a difference
α = 0.05, so this is 0.025 in each tail (0.025 + 0.025 = 0.05)
[Figure: specific alternative hypothesis]
So how do we interpret these graphs?
If our sample mean falls into the shaded areas then we reject the H0.
On the other hand, if our sample mean falls outside of the shaded areas,
then we fail to reject the H0. These shaded regions are called the
critical regions. This is the same thing as comparing p with alpha,
since the area of the shaded regions equals the proportion set by alpha.
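The edges of the critical regions can be found by asking which z-score cuts off the area set by alpha. The following is a minimal sketch using Python's standard library; the function names `critical_values` and `in_critical_region` are made up for illustration, and the one-tailed case assumes the critical region is in the upper tail.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1


def critical_values(alpha, tails):
    """Return the z cutoffs that bound the critical region(s)."""
    if tails == 2:
        cutoff = standard_normal.inv_cdf(1 - alpha / 2)   # alpha/2 in each tail
        return (-cutoff, cutoff)
    cutoff = standard_normal.inv_cdf(1 - alpha)           # all of alpha in the upper tail
    return (cutoff,)


def in_critical_region(z, alpha, tails):
    """True if z falls in a critical region (one-tailed means upper tail here)."""
    if tails == 2:
        lower, upper = critical_values(alpha, 2)
        return z <= lower or z >= upper
    (upper,) = critical_values(alpha, 1)
    return z >= upper


two_tailed = critical_values(0.05, 2)
one_tailed = critical_values(0.05, 1)
print(f"two-tailed cutoffs: {two_tailed[0]:.2f} and {two_tailed[1]:.2f}")
print(f"one-tailed cutoff:  {one_tailed[0]:.3f}")
```

With alpha = .05 the two-tailed cutoffs come out near ±1.96 and the one-tailed cutoff near 1.645, so a z-score such as 1.8 is in the critical region for the one-tailed test but not for the two-tailed test.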
(6) Suppose we think that listening to classical
music will affect the amount of time it takes a person to fall asleep so
we conduct a study to test this idea.
(a) Suppose that the average person in the population falls asleep in
15 minutes (without listening to classical music) with σ = 6 min. State
the null and alternative hypotheses for this study.
(b) Assume that the amount of time it takes people in the population to
fall asleep is normally distributed. In the study we have a sample of
people listen to classical music and then we measure how long it takes
them to fall asleep. Suppose the sample of 36 people falls asleep in an
average of 12 minutes. What is the probability of obtaining a sample
mean of 12 minutes or smaller? Assuming α = .05, is your calculated p
value in the critical region (Hint: remember to consider two critical
regions)?
(c) Using your answer to part (b), what decision should be made about the
null hypothesis you stated in part (a)?
(d) Assume now that in reality classical music does not affect how long it
takes people to fall asleep. In this case, what kind of decision (correct,
Type I error or Type II error) have you made in part (c)?
(7) A developmental psychologist believes that a new technique can help
kids learn math skills faster than the current technique. He measures math
skills from a standardized math skills test. It is known that the
population of 5th graders in the US scores an average of 80 on this
test. The psychologist uses the new technique on a sample of 5th graders
for one year and then has them take the standardized test at the end of
the
year to compare their scores with the population mean.
(a) What are the researcher's null and alternative hypotheses (Hint:
remember that he believes the new technique will increase
scores)?
(b) Suppose the psychologist calculated the z score for his sample mean
and found that there is a .0890 chance of getting a sample mean that large
or larger. If his alpha level is .05, is his sample in the critical
region?
If his new technique really does have an effect, what kind of decision
will he make for his test?
One-sample z test
Assumptions of the test (and most hypothesis testing)
1) Random sample - the samples must be representative of the
populations. Random sampling helps to ensure the representativeness.
2) Independent observations - also related to the representativeness
issue, each observation should be independent of all of the other
observations. That is, the probability of a particular observation
happening should remain constant.
3) σ is known and is constant - the standard deviation of the original
population must stay constant. Why? More generally, the treatment is
assumed to be adding (or subtracting) a constant from every individual
in the population. So the mean of that population may change as a
result of the treatment; however, recall that adding (or subtracting) a
constant from every individual does not change the standard deviation.
4) The sampling distribution is relatively normal - either because the
distribution of the raw observations is relatively normal, or because
of the Central Limit Theorem (or both).
Violations of any of these assumptions can severely compromise any
conclusions that you make about the population based on your sample
(basically, you need to use other kinds of inferential statistics that
can deal with violations of various assumptions).
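Assumption 4 can be illustrated with a small simulation: even when the raw observations are clearly not normal, the means of reasonably large samples behave much like a normal distribution centered on μ with spread σ/√n. The sketch below is only an illustration. The exponential population (mean 1, standard deviation 1), the sample size of 36, the seed, and the number of repetitions are all assumptions chosen here, not values from the lab.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(1)      # fixed seed so the run is repeatable
n = 36                      # size of each sample
population_mu = 1.0         # mean of an exponential population with rate 1
population_sigma = 1.0      # its standard deviation is also 1

# Draw many samples from the skewed population and keep each sample mean
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(4000)
]

expected_spread = population_sigma / sqrt(n)    # the standard error, sigma / sqrt(n)
observed_center = mean(sample_means)
observed_spread = stdev(sample_means)

# Share of sample means within 1.96 standard errors of mu (about .95 if normal)
share_within = sum(
    abs(m - population_mu) <= 1.96 * expected_spread for m in sample_means
) / len(sample_means)

print(f"center of sample means: {observed_center:.3f} (mu = {population_mu})")
print(f"spread of sample means: {observed_spread:.3f} (sigma/sqrt(n) = {expected_spread:.3f})")
print(f"share within 1.96 standard errors: {share_within:.3f}")
```

If the three printed values come out close to μ, σ/√n and .95, the z-score procedure is a reasonable approximation for samples of this size from this population.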
So far we've been discussing the logic of our Hypothesis Testing
procedure. In this lab, we're going to cover one type of test statistic
and
put all the steps of hypothesis testing together. We're going to conduct
hypothesis testing using the one-sample z-test. We've already covered the
logic of how this works, but now we'll make it more formal as an
inferential test statistic.
Let's quickly recall the decision tree that we saw earlier in the
semester.
Find the string of decisions that lead to a 1-sample z-test.
The one-sample z-test is used to compare a single sample to a known
population mean μ when we know:
(a) the distribution of sample means is normal
AND
(b) the population standard deviation σ.
Let's look at a complete example using our hypothesis testing steps:
Suppose we were interested in whether the number of hours students spend
studying differs by class status (i.e., freshmen, sophomores, juniors,
seniors). Specifically, we want to know if seniors spend more time
studying than the average college student of any year. We know that the
general population of college students in the US spends an average of 4
hours a day studying for their courses, with σ = 1 hour. To answer our
question, we'll ask a sample of 50 seniors how much time they spend
studying per day.
In this study we are comparing a sample to a population with known
μ and σ.
We also know that the distribution of sample means will be approximately
normal because our sample size is greater than 30. That means we can use
our z-score procedure to conduct this test.
Step 1: State hypotheses and decision criterion.
For Ha: μ_seniors > 4 (because we are predicting that seniors study MORE)
For H0: μ_seniors ≤ 4 (because these are the other possibilities for the
comparison)
Since our Ha is a directional hypothesis (we're only predicting an
increase in study hours), we'll have a one-tailed test because we'll
only need to consider the critical region above the null population
mean.
We also need to set our alpha level for our decision criterion. We'll
use α = .05. This is the largest probability of getting our sample
result when the null is true that we'll still accept as evidence
against the null.
Step 2: Collect sample data
In this step, we actually collect our data. Suppose we asked the 50
seniors we selected for our sample how many hours a day they study and
the mean reported for the sample was 4.4 hours. With our sample mean,
we're ready to move on to Step 3 and calculate our z-score test statistic.
Step 3: Calculate test statistic
The first thing we need to do here is to calculate the standard error.
We'll have:
$\sigma_{\bar{X}} = \sigma / \sqrt{n} = 1 / \sqrt{50} = 1 / 7.07 = .14$
Then we're ready to calculate z:
$z = (\bar{X} - \mu) / \sigma_{\bar{X}} = (4.4 - 4) / .14 = 2.86$
This is our test statistic. We'll need to know the probability of
getting a sample mean this large or larger so we need to find z = 2.86 in
the unit normal table.
We find that the probability of a sample mean this large or larger is
.0021. Now it's time to do our last step to make our decision.
Step 4: Make a decision about H0
We need to determine where our test statistic (z = 2.86) falls in the
distribution by comparing its p value with alpha. Is it in the critical
region? In this case it is, because its p value (.0021) is lower than
.05, so it would fall in the shaded region.
This means there is only a .0021 chance that we'd get a sample mean of
4.4 or larger if seniors are the same as the general population. Since
that's less than .05 (our decision criterion), that's low enough that we
can decide that seniors come from a different population with a
different mean than the general population of college students. This
also means we have enough evidence to reject H0 and accept Ha that
seniors study more than the general population of college students. So
our decision is:
Reject the null hypothesis and accept the alternative hypothesis.
Of course, remember that there's still a chance that we made a Type I
error (we set that risk at 5% when we chose alpha), but we're reasonably
sure we made the right decision.
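The whole seniors example can be sketched in Python as a check on the arithmetic. This is a minimal sketch, not a required part of the lab: the function name `one_sample_z_test` and its parameters are made up for illustration, and it uses only the standard library. Because the code does not round the standard error to .14, it gives a z of about 2.83 rather than 2.86 and a slightly different p value; the decision is the same.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu, sigma, n, alpha=0.05, tail="upper"):
    """Return (z, p, reject_h0) for a one-sample z-test.

    tail is "upper" (Ha: mean > mu), "lower" (Ha: mean < mu), or "two".
    """
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu) / standard_error
    area_below = NormalDist().cdf(z)
    if tail == "upper":
        p = 1 - area_below
    elif tail == "lower":
        p = area_below
    elif tail == "two":
        p = 2 * min(area_below, 1 - area_below)
    else:
        raise ValueError("tail must be 'upper', 'lower', or 'two'")
    return z, p, p < alpha      # reject H0 when p is smaller than alpha


# Seniors example: mu = 4 hours, sigma = 1 hour, n = 50, sample mean = 4.4 hours
z, p, reject_h0 = one_sample_z_test(sample_mean=4.4, mu=4, sigma=1, n=50,
                                    alpha=0.05, tail="upper")
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {reject_h0}")
```

Writing the test as a function with a `tail` argument keeps Step 1 (the direction of Ha and the alpha level) explicit, since those choices have to be made before looking at the data.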
One last note about statistical significance: Remember from the last lab
that the larger the sample size, the more likely we are to detect an
effect that exists, because we're more likely to reject the null
hypothesis. However, this also means that with large sample sizes, even
if the effect is really small, we're more likely to reject the null and
decide there's an effect. Therefore, in the case of very large samples,
we may detect effects that lack practical significance because they are
too small to be important. When we find statistical significance, we
must check that our findings also have practical significance, meaning
the effect of the treatment is large enough to matter.
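The effect of sample size can be seen by holding a small difference fixed and changing only n. The sketch below is an illustration with made-up numbers: it keeps the population values from the studying example (μ = 4 hours, σ = 1 hour) but assumes a hypothetical sample mean of 4.05 hours, a difference of only 3 minutes a day. The function name `upper_tail_p` is also made up.

```python
from math import sqrt
from statistics import NormalDist


def upper_tail_p(sample_mean, mu, sigma, n):
    """p value for a one-tailed (upper) one-sample z-test."""
    z = (sample_mean - mu) / (sigma / sqrt(n))
    return 1 - NormalDist().cdf(z)


# Same tiny difference (4.05 vs. 4 hours), three hypothetical sample sizes
p_values = {n: upper_tail_p(4.05, 4, 1, n) for n in (50, 500, 5000)}

for n, p in p_values.items():
    decision = "reject H0" if p < 0.05 else "fail to reject H0"
    print(f"n = {n:4d}: p = {p:.4f} -> {decision}")
```

Only the largest sample leads to rejecting H0, even though the difference is the same 3 minutes in every case. Whether 3 minutes of studying matters is a question about practical significance that the p value cannot answer.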
Ok, now try some on your own. For problems (8) - (9), write out each step
of hypothesis testing.
(8) A psychologist examined the effect of chronic alcohol abuse on memory.
In this study, a standardized memory test was used. Scores on this test
for the general population form a normal distribution with μ = 50 and
σ = 6. A sample of n = 22 alcohol abusers had a mean score of 47. Is
there evidence for memory impairment among alcoholics? Use α = .05 for
a one-tailed test.
(9) On a vocational interest inventory that measures interest in several
categories, a very large standardization group of adults (i.e., a
population) has an average score of μ = 22 and σ = 4. Scores are
normally distributed. A researcher would like to determine if scientists
differ from the general population in terms of writing interests. A
random sample of scientists is selected from the directory of a national
science society. The scientists are given the inventory, and their test
scores on the literary scale are as follows: 21, 20, 23, 28, 30, 24,
23, 19. Do scientists differ from the general population in their
writing interests? Test at the .05 level of significance for two tails.