Sampling Strategies and Principles of Experimental Design

Observational studies and Experiments


Observational studies:

  • Collect data in a way that does not directly interfere with how the data arise.
  • Only an association (correlation) between the explanatory and response variables can be established.
  • Causation cannot be inferred.


Experiment:

  • Randomly assign subjects to various treatments, so that causation can be inferred.
  • Example: A study is designed to evaluate whether people read text faster in Arial or Helvetica font. A group of volunteers who agreed to be part of the study are randomly assigned to two groups: one where they read some text in Arial, and another where they read the same text in Helvetica. At the end, the average reading speeds of the two groups are compared.


Random sampling: Subjects are selected at random from the population, which helps the results generalize to that population.


Random assignment:

  • It occurs only in experimental settings where subjects are being assigned to various treatments.
  • It helps infer causation from results.
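
As a rough illustration of random assignment in the font example, here is a minimal sketch in base R. The 20 volunteer IDs, the seed, and the balanced 10/10 split are assumptions made for illustration; they are not part of the study description.

```r
# Hypothetical volunteers for the font study; the IDs are made up for illustration.
set.seed(42)  # arbitrary seed so the assignment is reproducible
volunteers <- data.frame(id = 1:20)

# Shuffle a balanced vector of treatment labels and attach it to the volunteers.
volunteers$font <- sample(rep(c("Arial", "Helvetica"), each = 10))

# Each font should end up with exactly 10 readers.
table(volunteers$font)
```

Shuffling a balanced label vector, rather than drawing each label independently, keeps the two groups the same size.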



Sampling Strategies: simple random sampling, stratified sampling, and cluster sampling.


- Simple Random Sampling: Randomly select cases from the population, such that each case is equally likely to be selected.
- Stratified Sampling: First divide the population into homogeneous groups, called strata, and then randomly sample from within each stratum.
Here is a good example: A consulting company is planning a pilot study on marketing in Boston. They identify the zip codes that make up the greater Boston area, then sample 50 randomly selected addresses from each zip code and mail a coupon to these addresses. They then track whether the coupon was used in the following month.
- Cluster Sampling: Divide the population into clusters, randomly sample a few clusters, and then sample all observations within these clusters. The clusters, unlike strata in stratified sampling, are heterogeneous within themselves and each cluster is similar to the others, such that we can get away with sampling from just a few of the clusters.
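
Cluster sampling is not demonstrated in code in this post, so here is a minimal base R sketch of the two steps described above. The population (10 clusters of 30 households each) and the choice of 3 clusters are hypothetical.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical population: 10 clusters of 30 households each.
population <- data.frame(
  cluster   = rep(paste0("cluster_", 1:10), each = 30),
  household = 1:300
)

# Step 1: randomly pick 3 of the 10 clusters.
chosen <- sample(unique(population$cluster), size = 3)

# Step 2: keep every observation inside the chosen clusters.
cluster_sample <- population[population$cluster %in% chosen, ]

# Three clusters, each with all 30 of its households.
table(cluster_sample$cluster)
```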

# install.packages("openintro")
library(openintro)
library(tidyverse)
data(county)
glimpse(county)
## Observations: 3,143
## Variables: 10
## $ name          <fct> Autauga County, Baldwin County, Barbour County, ...
## $ state         <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Ala...
## $ pop2000       <dbl> 43671, 140415, 29038, 20826, 51024, 11714, 21399...
## $ pop2010       <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 20947...
## $ fed_spend     <dbl> 6.068095, 6.139862, 8.752158, 7.122016, 5.130910...
## $ poverty       <dbl> 10.6, 12.2, 25.0, 12.6, 13.4, 25.3, 25.0, 19.5, ...
## $ homeownership <dbl> 77.5, 76.7, 68.0, 82.9, 82.0, 76.9, 69.0, 70.7, ...
## $ multiunit     <dbl> 7.2, 22.6, 11.1, 6.6, 3.7, 9.9, 13.7, 14.3, 8.7,...
## $ income        <dbl> 24568, 26469, 15875, 19918, 21070, 20289, 16916,...
## $ med_income    <dbl> 53255, 50147, 33219, 41770, 45549, 31602, 30659,...
levels(county$state)
##  [1] "Alabama"              "Alaska"               "Arizona"             
##  [4] "Arkansas"             "California"           "Colorado"            
##  [7] "Connecticut"          "Delaware"             "District of Columbia"
## [10] "Florida"              "Georgia"              "Hawaii"              
## [13] "Idaho"                "Illinois"             "Indiana"             
## [16] "Iowa"                 "Kansas"               "Kentucky"            
## [19] "Louisiana"            "Maine"                "Maryland"            
## [22] "Massachusetts"        "Michigan"             "Minnesota"           
## [25] "Mississippi"          "Missouri"             "Montana"             
## [28] "Nebraska"             "Nevada"               "New Hampshire"       
## [31] "New Jersey"           "New Mexico"           "New York"            
## [34] "North Carolina"       "North Dakota"         "Ohio"                
## [37] "Oklahoma"             "Oregon"               "Pennsylvania"        
## [40] "Rhode Island"         "South Carolina"       "South Dakota"        
## [43] "Tennessee"            "Texas"                "Utah"                
## [46] "Vermont"              "Virginia"             "Washington"          
## [49] "West Virginia"        "Wisconsin"            "Wyoming"
county_noDC <- county %>% 
  filter(state != "District of Columbia") %>% 
  droplevels()

Simple Random Sample: sampling by random chance

county_srs <- county_noDC %>% 
  dplyr::sample_n(size = 150)
glimpse(county_srs)
## Observations: 150
## Variables: 10
## $ name          <fct> Scott County, Chesterfield County, Camas County,...
## $ state         <fct> Mississippi, South Carolina, Idaho, Massachusett...
## $ pop2000       <dbl> 28423, 42768, 991, 1465396, 22893, 4882, 53926, ...
## $ pop2010       <dbl> 28264, 46734, 1117, 1503085, 22363, 4812, 53227,...
## $ fed_spend     <dbl> 8.212603, 7.418453, 6.500448, 14.242306, 8.50986...
## $ poverty       <dbl> 22.2, 22.7, 16.3, 7.6, 11.3, 9.1, 17.5, 22.9, 23...
## $ homeownership <dbl> 80.4, 73.6, 72.6, 63.9, 79.5, 64.1, 76.6, 69.8, ...
## $ multiunit     <dbl> 4.5, 7.5, 0.0, 44.7, 6.4, 12.8, 5.8, 11.4, 17.8,...
## $ income        <dbl> 16608, 17162, 19659, 40139, 21891, 25269, 18905,...
## $ med_income    <dbl> 35765, 32979, 44145, 77377, 44627, 42469, 36312,...
county_srs %>% 
  group_by(state) %>% 
  count()
## # A tibble: 41 x 2
## # Groups:   state [41]
##    state          n
##    <fct>      <int>
##  1 Alabama        3
##  2 Alaska         1
##  3 Arizona        1
##  4 Arkansas       2
##  5 California     1
##  6 Colorado       4
##  7 Florida        7
##  8 Georgia        8
##  9 Idaho          5
## 10 Illinois       5
## # ... with 31 more rows
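
Because sample_n() draws at random, the per-state counts above change on every run. The following sketch shows one way such a draw could be made reproducible with set.seed(). It uses a made-up data frame rather than the county data, and the seed value is arbitrary.

```r
library(dplyr)

# Hypothetical stand-in for the county data: 300 rows spread over 5 made-up states.
toy <- data.frame(
  state = rep(c("A", "B", "C", "D", "E"), times = c(20, 40, 60, 80, 100)),
  id    = 1:300
)

set.seed(2024)  # arbitrary seed
srs_1 <- toy %>% sample_n(size = 30)

set.seed(2024)  # same seed, so the same rows are drawn again
srs_2 <- toy %>% sample_n(size = 30)

# TRUE when both draws selected the same rows in the same order.
identical(srs_1$id, srs_2$id)

# Larger states tend to contribute more rows to a simple random sample.
srs_1 %>% count(state)
```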

Stratified Sample: sample 2 counties per state, which gives a sample of 100 counties (2 from each of the 50 states).

county_str <- county_noDC %>% 
    group_by(state) %>% 
    sample_n(size = 2)
glimpse(county_str)
## Observations: 100
## Variables: 10
## Groups: state [50]
## $ name          <fct> Pike County, Winston County, Lake and Peninsula ...
## $ state         <fct> Alabama, Alabama, Alaska, Alaska, Arizona, Arizo...
## $ pop2000       <dbl> 29605, 24843, 1823, 7208, 51335, 19715, 83529, 5...
## $ pop2010       <dbl> 32899, 24484, 1631, 7523, 53597, 20489, 107118, ...
## $ fed_spend     <dbl> 9.223563, 8.743874, 10.545064, 12.917187, 11.541...
## $ poverty       <dbl> 28.6, 20.6, 21.4, 19.7, 18.9, 20.3, 9.9, 18.2, 2...
## $ homeownership <dbl> 56.3, 73.8, 75.0, 53.7, 78.3, 75.4, 77.7, 69.7, ...
## $ multiunit     <dbl> 18.7, 6.1, 2.5, 19.4, 4.8, 3.6, 7.8, 14.1, 14.6,...
## $ income        <dbl> 19013, 18055, 15161, 21278, 19600, 21165, 24584,...
## $ med_income    <dbl> 29181, 33685, 40909, 55217, 37580, 32147, 51502,...
### US_regions DataSet
load("C:/blogdown/blogdown/content/post/us_regions.RData")
glimpse(us_regions)
## Observations: 51
## Variables: 2
## $ state  <fct> Connecticut, Maine, Massachusetts, New Hampshire, Rhode...
## $ region <fct> Northeast, Northeast, Northeast, Northeast, Northeast, ...
head(us_regions)
##           state    region
## 1   Connecticut Northeast
## 2         Maine Northeast
## 3 Massachusetts Northeast
## 4 New Hampshire Northeast
## 5  Rhode Island Northeast
## 6       Vermont Northeast

Simple random sample (states_srs): this can result in a different number of states being sampled from each region.

states_srs <- us_regions %>%
  sample_n(size = 8)

# Count states by region
states_srs %>%
  count(region)
## # A tibble: 4 x 2
##   region        n
##   <fct>     <int>
## 1 Midwest       3
## 2 Northeast     1
## 3 South         2
## 4 West          2

Stratified sample in R: each stratum (i.e. region) is represented equally.

levels(us_regions$region)
## [1] "Midwest"   "Northeast" "South"     "West"
# Stratified sample
states_str <- us_regions %>%
  group_by(region) %>%
  sample_n(size = 2)

# Count states by region
states_str %>%
  count(region)
## # A tibble: 4 x 2
## # Groups:   region [4]
##   region        n
##   <fct>     <int>
## 1 Midwest       2
## 2 Northeast     2
## 3 South         2
## 4 West          2

In this stratified sample, each stratum (i.e. region) is represented equally, unlike in the simple random sample.
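
The contrast can be reproduced without the us_regions file. The sketch below builds a hypothetical data frame of 51 units in four regions of unequal size (the sizes and IDs are assumptions) and draws both kinds of sample from it.

```r
library(dplyr)

set.seed(10)  # arbitrary seed

# Hypothetical units in four strata of unequal size.
units <- data.frame(
  region = rep(c("Midwest", "Northeast", "South", "West"), times = c(12, 9, 17, 13)),
  id     = 1:51
)

# Simple random sample of 8 units: counts per region depend on the draw.
srs <- units %>% sample_n(size = 8)
srs %>% count(region)

# Stratified sample: exactly 2 units from every region.
strat <- units %>%
  group_by(region) %>%
  sample_n(size = 2) %>%
  ungroup()
strat %>% count(region)
```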



Principles of experimental design: control, randomize, replicate, and block


- Control: Compare the treatment of interest to a control group.
- Randomize: Randomly assign subjects to treatments.
- Replicate: Collect a sufficiently large sample within a study, or replicate the entire study.
- Block: Account for the potential effect of known or suspected confounding variables.
- Explanatory variable: Conditions you can impose on the experimental units.
- Blocking variable: Characteristics that the experimental units come with and that you would like to control for.
- In random sampling, we use stratifying to control for a variable.
- In random assignment, we use blocking to control for a variable.
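
Blocking can be sketched in the same group_by() style used for stratified sampling earlier: group by the blocking variable, then randomize treatments within each block. The subjects, the block labels, and the seed below are hypothetical.

```r
library(dplyr)

set.seed(7)  # arbitrary seed

# Hypothetical experimental units with a blocking variable (made-up values).
subjects <- data.frame(
  id    = 1:24,
  block = rep(c("beginner", "intermediate", "advanced"), each = 8)
)

# Within each block, shuffle a balanced set of treatment labels.
assigned <- subjects %>%
  group_by(block) %>%
  mutate(treatment = sample(rep(c("control", "treatment"), length.out = n()))) %>%
  ungroup()

# Every block should contain 4 control and 4 treatment units.
assigned %>% count(block, treatment)
```

Randomizing inside each block means the blocking variable is balanced across treatments by construction, so it cannot explain a difference between the groups.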



Case study: investigate whether instructors who are perceived as better looking receive higher instructional ratings.

Inspect evals

glimpse(evals)
## Observations: 463
## Variables: 21
## $ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank          <fct> tenure track, tenure track, tenure track, tenure...
## $ ethnicity     <fct> minority, minority, minority, minority, not mino...
## $ gender        <fct> female, female, female, female, male, male, male...
## $ language      <fct> english, english, english, english, english, eng...
## $ age           <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval  <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students  <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level     <fct> upper, upper, upper, upper, upper, upper, upper,...
## $ cls_profs     <fct> single, single, single, single, multiple, multip...
## $ cls_credits   <fct> multi credit, multi credit, multi credit, multi ...
## $ bty_f1lower   <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper   <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper   <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper   <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper   <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit    <fct> not formal, not formal, not formal, not formal, ...
## $ pic_color     <fct> color, color, color, color, color, color, color,...

cls_students: the number of students in the class. Instead of using the exact number of students, recode cls_students into a new variable, cls_type, with three categories: small (18 or fewer), midsize (19 to 59), and large (60 or more).

evals_fortified <- evals %>%
  mutate(
    cls_type = case_when(
      cls_students < 19  ~ "small",
      cls_students < 60  ~ "midsize",
      cls_students >= 60 ~ "large"
    )
  )
# Print as a plain data frame so that all columns are shown (a tibble would truncate them).
print.data.frame(head(evals_fortified, 3))
##   score         rank ethnicity gender language age cls_perc_eval
## 1   4.7 tenure track  minority female  english  36      55.81395
## 2   4.1 tenure track  minority female  english  36      68.80000
## 3   3.9 tenure track  minority female  english  36      60.80000
##   cls_did_eval cls_students cls_level cls_profs  cls_credits bty_f1lower
## 1           24           43     upper    single multi credit           5
## 2           86          125     upper    single multi credit           5
## 3           76          125     upper    single multi credit           5
##   bty_f1upper bty_f2upper bty_m1lower bty_m1upper bty_m2upper bty_avg
## 1           7           6           2           4           6       5
## 2           7           6           2           4           6       5
## 3           7           6           2           4           6       5
##   pic_outfit pic_color cls_type
## 1 not formal     color  midsize
## 2 not formal     color    large
## 3 not formal     color    large
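
case_when() checks its conditions in order and uses the first one that matches, which is why the second condition does not need a lower bound. A small sketch with made-up class sizes on either side of each cut-off (the values are assumptions chosen to test the boundaries):

```r
library(dplyr)

# Made-up class sizes chosen to sit on either side of each cut-off.
sizes <- data.frame(cls_students = c(10, 18, 19, 59, 60, 125))

recoded <- sizes %>%
  mutate(
    cls_type = case_when(
      cls_students < 19  ~ "small",    # 18 or fewer
      cls_students < 60  ~ "midsize",  # 19 to 59, since smaller values matched above
      cls_students >= 60 ~ "large"     # 60 or more
    )
  )

recoded
```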
  • bty_avg: the professor's average beauty rating, given by six students who were asked to rate the attractiveness of the faculty.
  • score : the average professor evaluation score, with 1 being very unsatisfactory and 5 being excellent.
# Scatterplot of score vs. bty_avg
ggplot(evals, aes(x = bty_avg, y = score)) + geom_point()

# Scatterplot of score vs. bty_avg colored by cls_type
ggplot(evals_fortified, aes(x = bty_avg, y = score, color = cls_type)) +
  geom_point()

Shawn Kim
Actively seeking for full-time opportunities | Analytics Position
