Sampling Strategies and Principles of Experimental Design
Observational studies and Experiments
Observational studies:
- Collect data in a way that does not directly interfere with how the data arise.
- Only correlation, not causation, can be inferred.
- They can only establish an association between the explanatory and response variables.
Experiment:
- Randomly assign subjects to various treatments, so that causation can be inferred.
- Example) A study is designed to evaluate whether people read text faster in Arial or Helvetica font. A group of volunteers who agreed to be a part of the study are randomly assigned to two groups: one where they read some text in Arial, and another where they read the same text in Helvetica. At the end, average reading speeds from the two groups are compared.
Random sampling: Subjects are selected at random from the population, which helps the results generalize to that population.
Random assignment:
- It occurs only in experimental settings where subjects are being assigned to various treatments.
- It helps infer causation from results.
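As a minimal sketch of random assignment for the font study above, the following base R snippet splits a hypothetical roster of 20 volunteers into two equal groups by shuffling the treatment labels. The roster size, the `volunteers` data frame, and the seed are assumptions made for illustration; the text does not say how many people took part.

```r
# Hypothetical roster: the real number of volunteers is not given in the text.
set.seed(42)  # arbitrary seed, used only so the sketch is repeatable
volunteers <- data.frame(id = 1:20)

# Shuffle a balanced vector of labels so that each font gets 10 readers.
volunteers$font <- sample(rep(c("Arial", "Helvetica"), each = 10))

# Group sizes are fixed, but who lands in which group is decided by chance.
table(volunteers$font)
```

Because chance alone decides group membership, other characteristics of the volunteers should be roughly balanced across the two fonts, which is what supports a causal reading of any difference in reading speed.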
Sampling Strategies: simple random sampling, stratified sampling, and cluster sampling.
- Simple Random Sampling: Randomly select cases from the population, such that each case is equally likely to be selected.
- Stratified Sampling: First divide the population into homogeneous groups, called strata, and then randomly sample from within each stratum.
Here is a good example of stratified sampling, with zip codes as the strata: A consulting company is planning a pilot study on marketing in Boston. They identify the zip codes that make up the greater Boston area, then sample 50 randomly selected addresses from each zip code and mail a coupon to these addresses. They then track whether the coupon was used in the following month.
- Cluster Sampling: Divide the population into clusters, randomly sample a few clusters, and then sample all observations within these clusters. The clusters, unlike strata in stratified sampling, are heterogeneous within themselves and each cluster is similar to the others, such that we can get away with sampling from just a few of the clusters.
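Cluster sampling as described in the last bullet can be sketched in base R as follows. The toy population, its `cluster` and `household` columns, and the choice of 2 clusters out of 6 are all assumptions made for illustration and do not come from a real data set.

```r
set.seed(1)  # arbitrary seed, used only so the sketch is repeatable

# Toy population: 6 hypothetical clusters (towns) with 5 households each.
population <- data.frame(
  cluster   = rep(paste0("town_", 1:6), each = 5),
  household = 1:30
)

# Step 1: randomly choose 2 of the 6 clusters.
chosen <- sample(unique(population$cluster), size = 2)

# Step 2: keep every observation inside the chosen clusters.
cluster_sample <- population[population$cluster %in% chosen, ]

# Each chosen cluster contributes all of its households.
table(cluster_sample$cluster)
```

Note the contrast with stratified sampling: here only some groups appear in the sample, whereas a stratified sample draws from every group.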
# install.packages("openintro")
library(openintro)
library(tidyverse)
data(county)
glimpse(county)
## Observations: 3,143
## Variables: 10
## $ name <fct> Autauga County, Baldwin County, Barbour County, ...
## $ state <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Ala...
## $ pop2000 <dbl> 43671, 140415, 29038, 20826, 51024, 11714, 21399...
## $ pop2010 <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 20947...
## $ fed_spend <dbl> 6.068095, 6.139862, 8.752158, 7.122016, 5.130910...
## $ poverty <dbl> 10.6, 12.2, 25.0, 12.6, 13.4, 25.3, 25.0, 19.5, ...
## $ homeownership <dbl> 77.5, 76.7, 68.0, 82.9, 82.0, 76.9, 69.0, 70.7, ...
## $ multiunit <dbl> 7.2, 22.6, 11.1, 6.6, 3.7, 9.9, 13.7, 14.3, 8.7,...
## $ income <dbl> 24568, 26469, 15875, 19918, 21070, 20289, 16916,...
## $ med_income <dbl> 53255, 50147, 33219, 41770, 45549, 31602, 30659,...
levels(county$state)
## [1] "Alabama" "Alaska" "Arizona"
## [4] "Arkansas" "California" "Colorado"
## [7] "Connecticut" "Delaware" "District of Columbia"
## [10] "Florida" "Georgia" "Hawaii"
## [13] "Idaho" "Illinois" "Indiana"
## [16] "Iowa" "Kansas" "Kentucky"
## [19] "Louisiana" "Maine" "Maryland"
## [22] "Massachusetts" "Michigan" "Minnesota"
## [25] "Mississippi" "Missouri" "Montana"
## [28] "Nebraska" "Nevada" "New Hampshire"
## [31] "New Jersey" "New Mexico" "New York"
## [34] "North Carolina" "North Dakota" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania"
## [40] "Rhode Island" "South Carolina" "South Dakota"
## [43] "Tennessee" "Texas" "Utah"
## [46] "Vermont" "Virginia" "Washington"
## [49] "West Virginia" "Wisconsin" "Wyoming"
county_noDC <- county %>%
  filter(state != "District of Columbia") %>%
  droplevels()
Simple Random Sample: draw 150 counties at random from all counties, so that each county is equally likely to be selected.
county_srs <- county_noDC %>%
  dplyr::sample_n(size = 150)
glimpse(county_srs)
## Observations: 150
## Variables: 10
## $ name <fct> Scott County, Chesterfield County, Camas County,...
## $ state <fct> Mississippi, South Carolina, Idaho, Massachusett...
## $ pop2000 <dbl> 28423, 42768, 991, 1465396, 22893, 4882, 53926, ...
## $ pop2010 <dbl> 28264, 46734, 1117, 1503085, 22363, 4812, 53227,...
## $ fed_spend <dbl> 8.212603, 7.418453, 6.500448, 14.242306, 8.50986...
## $ poverty <dbl> 22.2, 22.7, 16.3, 7.6, 11.3, 9.1, 17.5, 22.9, 23...
## $ homeownership <dbl> 80.4, 73.6, 72.6, 63.9, 79.5, 64.1, 76.6, 69.8, ...
## $ multiunit <dbl> 4.5, 7.5, 0.0, 44.7, 6.4, 12.8, 5.8, 11.4, 17.8,...
## $ income <dbl> 16608, 17162, 19659, 40139, 21891, 25269, 18905,...
## $ med_income <dbl> 35765, 32979, 44145, 77377, 44627, 42469, 36312,...
county_srs %>%
  group_by(state) %>%
  count()
## # A tibble: 41 x 2
## # Groups: state [41]
## state n
## <fct> <int>
## 1 Alabama 3
## 2 Alaska 1
## 3 Arizona 1
## 4 Arkansas 2
## 5 California 1
## 6 Colorado 4
## 7 Florida 7
## 8 Georgia 8
## 9 Idaho 5
## 10 Illinois 5
## # ... with 31 more rows
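Because `sample_n()` draws at random, the per-state counts above will differ from run to run. One common way to make a draw repeatable is to call `set.seed()` immediately before sampling. The sketch below illustrates this on a small made-up data frame; `toy` and the seed value 2024 are assumptions, not part of the original analysis.

```r
library(dplyr)

# Toy data standing in for the county table (hypothetical values).
toy <- data.frame(id = 1:100, group = rep(letters[1:4], each = 25))

set.seed(2024)  # arbitrary seed value
first_draw <- sample_n(toy, size = 10)

set.seed(2024)  # same seed again, so the same rows are drawn
second_draw <- sample_n(toy, size = 10)

# TRUE when both draws contain exactly the same rows in the same order.
identical(first_draw, second_draw)
```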
Stratified Sample: sample 2 counties per state, which gives a sample of 100 counties (2 from each of the 50 states).
county_str <- county_noDC %>%
  group_by(state) %>%
  sample_n(size = 2)
glimpse(county_str)
## Observations: 100
## Variables: 10
## Groups: state [50]
## $ name <fct> Pike County, Winston County, Lake and Peninsula ...
## $ state <fct> Alabama, Alabama, Alaska, Alaska, Arizona, Arizo...
## $ pop2000 <dbl> 29605, 24843, 1823, 7208, 51335, 19715, 83529, 5...
## $ pop2010 <dbl> 32899, 24484, 1631, 7523, 53597, 20489, 107118, ...
## $ fed_spend <dbl> 9.223563, 8.743874, 10.545064, 12.917187, 11.541...
## $ poverty <dbl> 28.6, 20.6, 21.4, 19.7, 18.9, 20.3, 9.9, 18.2, 2...
## $ homeownership <dbl> 56.3, 73.8, 75.0, 53.7, 78.3, 75.4, 77.7, 69.7, ...
## $ multiunit <dbl> 18.7, 6.1, 2.5, 19.4, 4.8, 3.6, 7.8, 14.1, 14.6,...
## $ income <dbl> 19013, 18055, 15161, 21278, 19600, 21165, 24584,...
## $ med_income <dbl> 29181, 33685, 40909, 55217, 37580, 32147, 51502,...
### US_regions DataSet
load("C:/blogdown/blogdown/content/post/us_regions.RData")
glimpse(us_regions)
## Observations: 51
## Variables: 2
## $ state <fct> Connecticut, Maine, Massachusetts, New Hampshire, Rhode...
## $ region <fct> Northeast, Northeast, Northeast, Northeast, Northeast, ...
head(us_regions)
## state region
## 1 Connecticut Northeast
## 2 Maine Northeast
## 3 Massachusetts Northeast
## 4 New Hampshire Northeast
## 5 Rhode Island Northeast
## 6 Vermont Northeast
Simple random sample (states_srs): this can result in a different number of states being sampled from each region.
states_srs <- us_regions %>%
  sample_n(size = 8)
# Count states by region
states_srs %>%
  count(region)
## # A tibble: 4 x 2
## region n
## <fct> <int>
## 1 Midwest 3
## 2 Northeast 1
## 3 South 2
## 4 West 2
Stratified sample in R : each stratum (i.e. Region) is represented equally.
levels(us_regions$region)
## [1] "Midwest" "Northeast" "South" "West"
# Stratified sample
states_str <- us_regions %>%
  group_by(region) %>%
  sample_n(size = 2)
# Count states by region
states_str %>%
  count(region)
## # A tibble: 4 x 2
## # Groups: region [4]
## region n
## <fct> <int>
## 1 Midwest 2
## 2 Northeast 2
## 3 South 2
## 4 West 2
In this stratified sample, each stratum (i.e. region) is represented equally, unlike in the simple random sample.
Principles of experimental design: control, randomize, replicate, and block
- Control : Compare treatment of interest to a control group
- Randomize : Randomly assign subjects to treatments
- Replicate: Collect a sufficiently large sample within a study, or replicate the entire study
- Block: Account for the potential effect of known or suspected confounding variables.
- Explanatory variable: Conditions you can impose on the experimental units
- Blocking variable: Characteristics that the experimental units come with and that you would like to control for.
- In random sampling, we stratify to control for a variable.
- In random assignment, we block to control for a variable.
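Blocking can be sketched as random assignment carried out separately within each block. In the example below, the `subjects` data frame, the `experience` blocking variable, and the two arm labels are hypothetical and chosen only to illustrate the idea.

```r
library(dplyr)

set.seed(7)  # arbitrary seed, used only so the sketch is repeatable

# Hypothetical subjects; `experience` is the blocking variable.
subjects <- data.frame(
  id         = 1:12,
  experience = rep(c("novice", "expert"), each = 6)
)

# Within each block, shuffle a balanced set of labels so that
# every block contributes equally to both arms.
assigned <- subjects %>%
  group_by(experience) %>%
  mutate(arm = sample(rep(c("control", "treatment"), length.out = n()))) %>%
  ungroup()

# Cross-tabulate blocks against arms to confirm the balance.
table(assigned$experience, assigned$arm)
```

Assigning within blocks guarantees that the blocking variable is balanced across arms, rather than leaving that balance to chance as plain randomization would.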
Case Study: investigating whether instructors who are viewed as better looking receive higher instructional ratings.
# Inspect evals
glimpse(evals)
## Observations: 463
## Variables: 21
## $ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank <fct> tenure track, tenure track, tenure track, tenure...
## $ ethnicity <fct> minority, minority, minority, minority, not mino...
## $ gender <fct> female, female, female, female, male, male, male...
## $ language <fct> english, english, english, english, english, eng...
## $ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level <fct> upper, upper, upper, upper, upper, upper, upper,...
## $ cls_profs <fct> single, single, single, single, multiple, multip...
## $ cls_credits <fct> multi credit, multi credit, multi credit, multi ...
## $ bty_f1lower <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit <fct> not formal, not formal, not formal, not formal, ...
## $ pic_color <fct> color, color, color, color, color, color, color,...
cls_students: the number of students in the class. Instead of using the exact number, recode cls_students into a new variable, cls_type, with three categories: small (18 or fewer), midsize (19-59), and large (60 or more).
evals_fortified <- evals %>%
  mutate(
    cls_type = case_when(
      cls_students < 19 ~ "small",
      cls_students < 60 ~ "midsize",
      cls_students >= 60 ~ "large"
    )
  )
# Print as a plain data frame so that all columns are shown
print.data.frame(head(evals_fortified, 3))
## score rank ethnicity gender language age cls_perc_eval
## 1 4.7 tenure track minority female english 36 55.81395
## 2 4.1 tenure track minority female english 36 68.80000
## 3 3.9 tenure track minority female english 36 60.80000
## cls_did_eval cls_students cls_level cls_profs cls_credits bty_f1lower
## 1 24 43 upper single multi credit 5
## 2 86 125 upper single multi credit 5
## 3 76 125 upper single multi credit 5
## bty_f1upper bty_f2upper bty_m1lower bty_m1upper bty_m2upper bty_avg
## 1 7 6 2 4 6 5
## 2 7 6 2 4 6 5
## 3 7 6 2 4 6 5
## pic_outfit pic_color cls_type
## 1 not formal color midsize
## 2 not formal color large
## 3 not formal color large
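To check that the cut points match the intended categories, the same `case_when()` logic can be applied to the boundary values. This is only a sketch: the `check` data frame holds hand-picked class sizes, not rows from evals.

```r
library(dplyr)

# Boundary class sizes chosen to probe each cut point (not real data).
check <- data.frame(cls_students = c(18, 19, 59, 60)) %>%
  mutate(
    cls_type = case_when(
      cls_students < 19  ~ "small",    # 18 or fewer
      cls_students < 60  ~ "midsize",  # 19 to 59, since smaller values matched above
      cls_students >= 60 ~ "large"     # 60 or more
    )
  )

check$cls_type
# "small" "midsize" "midsize" "large"
```

`case_when()` uses the first condition that is true for each row, which is why the second condition does not need an explicit lower bound. If a fixed legend order matters in later plots, the result could be converted with `factor(cls_type, levels = c("small", "midsize", "large"))`.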
- bty_avg : the average beauty rating of the professor, given by the six students who were asked to rate the attractiveness of the faculty.
- score : the average professor evaluation score, with 1 being very unsatisfactory and 5 being excellent.
# Scatterplot of score vs. bty_avg
ggplot(evals, aes(x = bty_avg, y = score)) + geom_point()
# Scatterplot of score vs. bty_avg colored by cls_type
ggplot(evals_fortified, aes(x = bty_avg, y = score, color = cls_type)) +
  geom_point()