Outline

randomness
basic probability
samples versus populations
simple random sampling
sampling error: variability and bias

Lab 5: Sampling & Basic Probability

Why and how we use samples

Most of our research questions concern large groups of individuals. However, we usually can't test all of these individuals, typically because of resource limitations such as not enough time or money. So, while we're interested in the large group as a whole (our population), we typically look at only a subset of individuals (our sample). As a result, the conclusions we draw about populations from samples are grounded in probability.

Today's lab focuses on some of the basic differences between populations and samples, and the impact of these differences on our research questions. We will start with a bit of discussion about samples and populations, then discuss some basic probability theory, and then bring these two topics together.

Populations and Samples

Let's begin by looking at a picture of a population.
Let's assume that each circle is an individual, and that all of the individuals together constitute our population. To protect their identities, each individual is identified by a number (00 - 99) rather than by name. The shaded circles are individuals who support a proposal to legalize gambling, while the unshaded circles are individuals who do not support the proposal.

Suppose that our local congressperson wants to know how the population feels about the proposal.

1) One way to go about this is to ask all 100 individuals what their view is. Go ahead and count up how many people support the proposal. Enter this answer on your Lab Worksheet. To find the proportion of individuals who support the proposal, divide the number of individuals supporting the position by the total number of individuals in the population:

proportion = p = (number of individuals supporting the position) / (total number of individuals in the population)

Now let's suppose that the congressperson needed to know the answer that afternoon, and didn't have enough staff to locate and contact all of the individuals in time. So instead, the Statistical Advisor to Congress (I don't think that such a position actually exists, but perhaps one should) suggests that sampling techniques should be used. That is, rather than contact all of the individuals in the population, the staff should contact a subset of the individuals (take a sample from the population). Let's consider some different possible samples.

A simple random sample - the staff selects some number of individuals (n) at random, contacts those n individuals, and asks for their opinion.
To the right is one such sample of ten individuals (n = 10)

Now remember the original question: what proportion of the individuals in the population support the proposal? How do we use the sample to answer this question?
We use the proportion from the sample as an estimate of the population proportion.
The formula is pretty much the same, except now instead of p we use p̂ (pronounced "p-hat"). The "hat" means that it is an estimated proportion.

p̂ = (number of individuals in the sample supporting the position) / (total number of individuals in the sample)

So for this sample, p̂ = 4/10 = .40
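If you'd like to see the same idea on a computer, here is a minimal Python sketch. It assumes a made-up population in which 45 of the 100 individuals support the proposal (the real count comes from the population picture), and the names population_ids, supporters, and proportion are invented for illustration. It computes the population proportion p and then p-hat from one simple random sample of n = 10.

```python
import random

# Individuals are labeled 0-99, matching the lab's 00 - 99 labels.
# ASSUMPTION: individuals 0-44 are the supporters (45 in total). This is a
# made-up count for illustration, not the count from the lab's picture.
population_ids = list(range(100))
supporters = set(range(45))

def proportion(ids, supporters):
    """Proportion of the given individuals who support the proposal."""
    return sum(1 for i in ids if i in supporters) / len(ids)

# Population proportion: ask all 100 individuals.
p = proportion(population_ids, supporters)

# Simple random sample: draw n = 10 individuals without replacement.
rng = random.Random(1)  # fixed seed so the run is repeatable
sample = rng.sample(population_ids, 10)

# Sample proportion: our estimate of p.
p_hat = proportion(sample, supporters)

print("population p =", p)
print("sample IDs   =", sorted(sample))
print("p-hat        =", p_hat)
```

Changing the seed in random.Random(1) draws a different sample, and usually a different p-hat.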

[Figure: sample 1]

There are many different possible samples of n = 10. Consider two more below.

[Figures: sample 2 and sample 3]

Notice that the three samples we just considered give different estimates of the population proportion. The difference between a sample's estimate and the population proportion is called sampling error. Notice too that not every sample has the same amount of sampling error, so there is variability in how much error we have. At first glance this seems like a BIG problem: how do we know which sample gives us the best estimate of the population proportion? Luckily for us (and the congressperson), there are statistical procedures to help with the problem.
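Here is a small Python sketch of sampling error. It again assumes a made-up population with 45 supporters out of 100 (not the lab's actual picture). It draws three simple random samples of n = 10 and reports each sample's p-hat together with its sampling error, computed as p-hat minus p.

```python
import random

# ASSUMPTION: a made-up population, 1 = supports the proposal, 0 = does not.
population = [1] * 45 + [0] * 55
p = sum(population) / len(population)  # population proportion

rng = random.Random(2)  # fixed seed so the run is repeatable
errors = []
for k in range(3):
    sample = rng.sample(population, 10)   # simple random sample, n = 10
    p_hat = sum(sample) / len(sample)     # sample estimate of p
    error = p_hat - p                     # sampling error for this sample
    errors.append(error)
    print(f"sample {k + 1}: p-hat = {p_hat:.2f}, sampling error = {error:+.2f}")
```

A positive error means the sample overestimated the population proportion, and a negative error means it underestimated it.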

Bring up the picture of a population again.

2) Using the population picture above (go back up and click the population button again), randomly select a sample of 5 individuals (choose 5 circles randomly). Compute p̂ for this sample. Now do the same for 9 more samples of size n = 5. Type all of these estimated proportions into your worksheet.

On your worksheet you'll see the axes for a graph. Using X's, plot the estimated proportions from your 10 samples on the graph. While the proportions in your samples vary (and thus so does the sampling error), you may begin to notice a pattern. What is the most common proportion among your 10 samples?

To the right is an example of what the graph for your samples might look like.

3) Now let's repeat the process, taking 10 more samples and plotting them on your graph. As we add more samples, a pattern should start to emerge. What do you notice about the pattern of proportions as we add more and more samples? (If you aren't sure, ask one of your lab neighbors if you can add their samples to your graph, doubling your number to 40 samples.)
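The graphing exercise can also be sketched in Python. The example below assumes the same made-up population of 45 supporters out of 100, so its graph will not match the one from the lab's picture. The helper names supporter_counts and dot_plot are invented for this sketch. It plots 10 samples of n = 5 as rows of X's, then adds 30 more samples and plots all 40.

```python
import random
from collections import Counter

# ASSUMPTION: a made-up population with 45 supporters (1) and 55 others (0).
population = [1] * 45 + [0] * 55
rng = random.Random(3)  # fixed seed so the run is repeatable

def supporter_counts(num_samples, n):
    """Count the supporters in each of num_samples simple random samples of size n."""
    return [sum(rng.sample(population, n)) for _ in range(num_samples)]

def dot_plot(counts, n):
    """Build a text graph: one row per possible p-hat, one X per sample."""
    tally = Counter(counts)
    rows = [f"p-hat = {k / n:.1f} | " + "X" * tally[k] for k in range(n + 1)]
    return "\n".join(rows)

counts = supporter_counts(10, 5)    # the first 10 samples of n = 5
print(dot_plot(counts, 5))

counts += supporter_counts(30, 5)   # add 30 more samples, for 40 in total
print()
print(dot_plot(counts, 5))
```

With only 10 samples the rows tend to look ragged. As more samples are added, the X's usually pile up around the p-hat values closest to the population proportion.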

Biased sampling

Suppose that the congressperson's staff decided not to do random sampling, but instead decided to just call individuals from the same area code (trying to save money on those long distance charges). This sampling method is an example of convenience sampling (selecting the individuals of the population that are easiest to reach). So let's reconsider our population of individuals along with an overlaid area code map.

Suppose that the congressional offices are located in the 217 area code. Samples taken from this sub-part of the population will typically give a higher estimate of the population proportion than samples drawn using random sampling.

4) Go ahead and select 5 samples of size n = 5 from this area code. Compute their estimated population proportions.

Stop and consider why the estimates are consistently different from those that we got from the random sample.

The reason is that these convenience samples are biased. Bias is a systematic difference between the population parameter (in this case the proportion of those agreeing with the proposal) and the sample estimates (called statistics) resulting from your method of sampling. This is different from the variability that we saw earlier, because the variability comes from random differences, while bias comes from the systematic (or consistent) differences.

5) Suppose that the congressional offices are in the 204 area code. Would samples drawn from that area code have the same or different bias (or no bias) as those from the 217 area code? Why?

Convenience samples are typically biased, even if the researcher doesn't know how.
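To see the difference between variability and bias in a simulation, consider the following Python sketch. All of the numbers are made up for illustration: it assumes that 25 individuals live in the 217 area code and 18 of them support the proposal, while 27 of the other 75 individuals support it, so the population proportion is .45. The helper name mean_p_hat is also invented. The sketch compares the average p-hat from many random samples with the average p-hat from many convenience samples.

```python
import random

# ASSUMPTION: a made-up layout, 1 = supports the proposal, 0 = does not.
area_217 = [1] * 18 + [0] * 7       # 25 individuals in the 217 area code
elsewhere = [1] * 27 + [0] * 48     # 75 individuals in other area codes
population = area_217 + elsewhere
p = sum(population) / len(population)   # population proportion

rng = random.Random(4)  # fixed seed so the run is repeatable

def mean_p_hat(pool, n, num_samples):
    """Average p-hat over many samples of size n drawn at random from pool."""
    p_hats = [sum(rng.sample(pool, n)) / n for _ in range(num_samples)]
    return sum(p_hats) / num_samples

# Random sampling draws from the whole population.
random_mean = mean_p_hat(population, 5, 2000)
# Convenience sampling draws only from the 217 area code.
convenience_mean = mean_p_hat(area_217, 5, 2000)

print(f"population p                 = {p:.2f}")
print(f"mean p-hat, random samples   = {random_mean:.2f}")
print(f"mean p-hat, 217 samples only = {convenience_mean:.2f}")
```

Individual samples vary under both methods. Under these assumptions, though, only the convenience samples are off in the same direction on average, which is what makes the difference systematic rather than random.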

Random events

What the above exercise attempts to show is that random events are somewhat unpredictable in the short term (e.g., if you take only one or two samples), but in the long term, across many repeated samples, they usually have a regular and predictable pattern.

Random (chance) behavior is unpredictable in the short term, but has a regular and predictable pattern in the long run.
The probability of any outcome of a random phenomenon is a number between 0 and 1 that describes the proportion of times the outcome would occur in a very long series of repetitions.
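The long-run idea can be illustrated with a simulated coin. The Python sketch below assumes a fair coin (probability of heads = .5) and records the proportion of heads after 10, 100, 1,000, 10,000, and 100,000 flips. The variable names are invented for illustration.

```python
import random

rng = random.Random(5)  # fixed seed so the run is repeatable
checkpoints = (10, 100, 1_000, 10_000, 100_000)

heads = 0
running = {}  # number of flips -> proportion of heads so far
for flip in range(1, checkpoints[-1] + 1):
    if rng.random() < 0.5:   # ASSUMPTION: a fair coin, P(heads) = .5
        heads += 1
    if flip in checkpoints:
        running[flip] = heads / flip
        print(f"after {flip:>7} flips: proportion of heads = {running[flip]:.4f}")
```

The early proportions can be far from .5, while the later ones typically settle close to it. This is the sense in which probability describes a very long series of repetitions.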

This is why gambling casinos do so well. While the casino operators can't know how particular rolls of the dice will turn out or which hands of cards will be dealt, they do know the odds of dice rolls and card hands based on long-term outcomes. So they set their payouts based on these odds and, over the long term, make their profits.

How is this relevant to our statistics course? The logic of inferential statistics is to estimate the probability of getting particular outcomes and make decisions based on these probabilities. Why do we need to estimate probabilities? Largely because we make observations on samples but want to make or test claims about populations. Inferential statistical procedures are used to estimate the amount of sampling error in our samples, and to use that estimate to determine the probability that our results are due to random chance. Later in the course we will go through the logical steps in this process. For now we will start by discussing the basics of probability.

Basic probability

We deal with probabilities every day.

- lotto tickets, weather forecasts, medical reports on the news (e.g., risks of cancer)

In a situation where several different outcomes are possible and each is equally likely, we define the probability of any particular outcome as a fraction or proportion. If the possible outcomes are identified as A, B, C, D, and so on, then:

Probability of A = (number of outcomes classified as A) / (total number of possible outcomes)

The set of all possible outcomes is called the sample space S. The bottom part of the equation is the number of outcomes in S.

Some Rules of probability

Any probability is a number between 0 and 1.0 (some people find it easier to think in terms of percentages, so 0 to 100%)
All possible outcomes together must have a probability equal to 1.0 (again, all possible outcomes add together to make 100% of the outcomes)
The probability that an event does not occur is 1 minus the probability that it does occur. (so if there is a 20% chance of something happening, then there is an 80% chance that it doesn't happen)
If two events have no common outcomes, then the probability that one OR the other occurs is the sum of their individual probabilities. (so if you roll a four-sided die, there is a 25% chance of getting each outcome 1, 2, 3, or 4. The chance of getting a 1 or a 4 is 25% (for a 1) plus 25% (for a 4) = 50%).
The probability of two independent outcomes both happening (A AND B) is the product of their two probabilities, p(A) * p(B). (using our 4-sided die example again: if we rolled the die twice, what is the probability of first rolling a 1 AND then rolling a 4? 25% (rolling a 1) times 25% (rolling a 4) = 6.25%)
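The rules above can be checked with the four-sided die example. The Python sketch below assumes a fair die and uses exact fractions so there is no rounding. The variable names are invented for illustration. As a cross-check on the multiplication rule, it also lists every ordered pair of two rolls and counts how many are a 1 followed by a 4.

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4]
# ASSUMPTION: a fair die, so each face has probability 1/4.
p_face = {face: Fraction(1, len(faces)) for face in faces}

# All possible outcomes together have probability 1.
total = sum(p_face.values())

# Complement rule: P(not 1) = 1 - P(1).
p_not_1 = 1 - p_face[1]

# Addition rule for events with no common outcomes: P(1 or 4) = P(1) + P(4).
p_1_or_4 = p_face[1] + p_face[4]

# Multiplication rule for independent events: P(1 then 4) = P(1) * P(4).
p_1_then_4 = p_face[1] * p_face[4]

# Cross-check: list all 16 ordered pairs of two rolls and count (1, 4).
two_rolls = list(product(faces, repeat=2))
p_enumerated = Fraction(two_rolls.count((1, 4)), len(two_rolls))

print("total      =", total)        # 1
print("P(not 1)   =", p_not_1)      # 3/4
print("P(1 or 4)  =", p_1_or_4)     # 1/2
print("P(1 then 4) =", p_1_then_4)  # 1/16, i.e. 6.25%
print("enumerated =", p_enumerated) # 1/16
```

The enumerated value agrees with the product because each of the 16 ordered pairs is equally likely and only one of them is (1, 4).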

Consider a concrete example:

You are playing War (the card game) with your kid sister, and each of you has your own deck of 52 cards. She picks the Queen of hearts from her deck. What is the probability that you'll pick the Queen of hearts from your deck?

There are 52 different cards in a deck, so the sample space contains 52 possible outcomes. There is only one Queen of hearts. So:

prob of Q-hearts = (number of ways of picking the Queen of hearts) / (total number of possible cards picked) = 1/52

Notationally we can express this probability as: p(Q♥) = f / N = .019
f = the frequency of queen of hearts in a standard deck of cards
N = the total number of possible cards picked
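The f / N calculation can be written as a short Python sketch. It builds a standard 52-card deck as the sample space and counts the cards that match an event. The helper name probability and the card labels such as "Q♥" are invented for this example, and the sketch assumes every card is equally likely to be picked.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["♠", "♥", "♦", "♣"]

# The sample space: every card in a standard deck, e.g. "Q♥".
deck = [rank + suit for rank, suit in product(ranks, suits)]

def probability(event, sample_space):
    """f / N: outcomes matching the event, divided by all possible outcomes.

    ASSUMPTION: every outcome in sample_space is equally likely.
    """
    f = sum(1 for outcome in sample_space if event(outcome))
    N = len(sample_space)
    return Fraction(f, N)

p_queen_hearts = probability(lambda card: card == "Q♥", deck)
print(p_queen_hearts)                   # 1/52
print(round(float(p_queen_hearts), 3))  # 0.019
```

Passing a different event function to probability counts a different set of cards against the same 52-card sample space.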

Now you try a few.

6) What is the probability of selecting a red card from a standard deck of playing cards (remember that there are two red suits: ♥ and ♦)?

7) What is the probability of selecting a club (♣) from a standard deck of playing cards?

8) What is the probability of selecting a club (♣) or a heart (♥) from a standard deck of playing cards?

9) What is the probability of selecting a 7♣ or 7♥ from a standard deck of cards?

10) In each of the following situations, describe the sample space (i.e., possible outcomes) for the random phenomenon.

a) A seed is planted in the ground. It either germinates or fails to grow.

b) A patient with a usually fatal form of cancer is given a new treatment. The response variable is the length of time that the patient lives after treatment.

c) A student enrolls in a statistics course and at the end of the semester receives a letter grade.

d) A basketball player shoots four free throws. You record the sequence of hits and misses.

e) A basketball player shoots four free throws. You record the number of baskets she makes.

Note: (d) and (e) are different! In one case we are taking the "sequence" into account. Think about how that changes the question. Hint: the sequence takes the order into account, but the number of baskets does not. So the sequence Hit-Miss (made the first, missed the second) is different from Miss-Hit (missed the first, hit the second). But if we were just counting the number of made baskets, they'd both be the same (1 Hit).
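To make the hint concrete, here is a Python sketch for a smaller case of two free throws. The two-throw case is an assumption chosen for illustration; it is not the four-throw question above. The sketch lists the sample space when we record the sequence, and then the sample space when we record only the number of hits.

```python
from itertools import product

shots = 2  # ASSUMPTION: a smaller two-throw case, for illustration only

# Recording the sequence: order matters, so Hit-Miss and Miss-Hit differ.
sequences = list(product(("Hit", "Miss"), repeat=shots))

# Recording only the number of hits: order is ignored.
hit_counts = sorted({sequence.count("Hit") for sequence in sequences})

print("sequences :", ["-".join(sequence) for sequence in sequences])
print("hit counts:", hit_counts)
```

Hit-Miss and Miss-Hit are separate outcomes in the first sample space, but both map to the single outcome "1 hit" in the second.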