Completely Randomized Design (CRD)

Today’s goals

Explore key concepts in CRD:

  1. Homogeneous experimental unit
  2. Treatment randomization
  3. The effects model
  4. The ANOVA table
  5. The linear model assumptions

Motivational example - Treatment design

  • 2-way factorial
  • N fertilizer rates: 0, 100, 200 kg N/ha
  • K fertilizer rates: 0, 30, 60 kg K/ha
  • 3 x 3 = 9 treatment combinations
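As a quick sketch (using the N and K rates listed above), the full treatment design can be enumerated in R:

# All 3 x 3 = 9 N x K treatment combinations
trt_design <- expand.grid(nrate_kgha = c(0, 100, 200),
                          krate_kgha = c(0, 30, 60))

trt_design
nrow(trt_design)   # 9 combinations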

Experimental design

Assuming we have homogeneous experimental material (e.g., same soil type, topography, etc.)

  • Thus, we can use a completely randomized design (CRD)

Treatment randomization

  • Randomization guards against unknown or uncontrollable sources of bias

  • Avoid systematic patterns

  • Eliminate selective assignment of experimental units (EUs) to treatments (conscious or unconscious!)

  • Randomization allows for valid inference on CAUSATION

Treatment randomization - CRD

  • Randomization of a treatment to an EU is unrestricted.

  • That means that replications of the same treatment could, by random chance, fall right next to each other.

In our motivational example:

  • 4 replicates

  • Total observations: 9 x 4 = 36 EUs
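A minimal sketch of this unrestricted randomization (the plot numbering 1 to 36 is hypothetical; only the 9 treatments and 4 replicates come from the example):

# 9 treatments x 4 replicates = 36 EUs, assigned completely at random
set.seed(123)                                        # for a reproducible layout
crd_layout <- data.frame(plot = 1:36,
                         trt  = sample(rep(1:9, times = 4)))

head(crd_layout)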

Homogeneous Experimental Material - CRD

In the plot layout here, all treatments (1 through 9) were randomly assigned to any experimental unit (plot) in the study area.

Homogeneous Experimental Material - CRD

  • Treatment 1 and its replicates are highlighted.

  • Note how, due to the unrestricted randomization, treatment 1 appears twice in the first column and does not appear in the third column. The same happened with other treatments.

Homogeneous Experimental Material - CRD

Because the experimental material is homogeneous (e.g., same soil texture class), this should not be an issue when estimating treatment means and performing comparisons. 👍

The effects model

\[ y_{ijk} = \mu + \alpha_{i} + \beta_{j} + \alpha\beta_{ij} + e_{ijk} \]

  • \(y_{ijk}\) is the observation on the kth replicate of the ith N rate and jth K rate
  • \(\mu\) is the overall mean
  • \(\alpha_{i}\) is the differential effect of the ith N rate
  • \(\beta_{j}\) is the differential effect of the jth K rate
  • \(\alpha\beta_{ij}\) is the differential effect of the combination of the ith N rate and jth K rate
  • \(e_{ijk}\) is the residual corresponding to the kth replicate of N rate i and K rate j.
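In R, this effects model corresponds to a two-way factorial lm() formula. A minimal sketch (the same formula is used for the lm2w model fit later in these slides; the df_all data frame, with factor columns nrate_kgha and krate_kgha, is also created later):

# mu + alpha_i (N rate) + beta_j (K rate) + interaction + residual
# "*" expands to both main effects plus their interaction
lm(yield_kg_ha ~ nrate_kgha * krate_kgha, data = df_all)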

The ANOVA table

In the following ANOVA table…

  • n is number of levels in N rate = 3
  • k is number of levels in K rate = 3
  • r is number of replicates = 4
  • N is total number of observations or EUs = 3 x 3 x 4 = 36

The ANOVA table

library(dplyr)
library(flextable)

tribble(~`Source of variation`,~`df`,~SS,~MS,~`F ratio`,
        #"Overall mean", "1", NA_character_, NA, NA,
        "N rate", "dfn =\nn - 1 =\n3 - 1 = 2", "SSn", "MSn =\nSSn / dfn ", "MSn / MSe",
        "K rate", "dfk =\nk - 1 =\n3 - 1 = 2", "SSk", "MSk =\nSSk / dfk", "MSk / MSe",
        "N x K", "dfnk =\n(n - 1) x (k - 1) =\n(3-1) x (3-1) = 4", "SSnk", "MSnk =\nSSnk / dfnk", "MSnk / MSe",
        "Error", "dfe =\nnk(r - 1) =\n3x3(4-1) =\n9x3 = 27", "SSe", "MSe =\nSSe / dfe", "",
        "TOTAL", "dft =\nN -1 = 35", "SSt", "", ""
        ) %>%
  regulartable() %>%
  fontsize(size = 16, part = "all") %>%
  border_inner_h(part="all") %>%
  autofit()

| Source of variation | df | SS | MS | F ratio |
|---|---|---|---|---|
| N rate | dfn = n - 1 = 3 - 1 = 2 | SSn | MSn = SSn / dfn | MSn / MSe |
| K rate | dfk = k - 1 = 3 - 1 = 2 | SSk | MSk = SSk / dfk | MSk / MSe |
| N x K | dfnk = (n - 1) x (k - 1) = (3 - 1) x (3 - 1) = 4 | SSnk | MSnk = SSnk / dfnk | MSnk / MSe |
| Error | dfe = nk(r - 1) = 3 x 3 x (4 - 1) = 27 | SSe | MSe = SSe / dfe | |
| TOTAL | dft = N - 1 = 35 | SSt | | |

  • Each component of the effects model has a row in the ANOVA table

  • The larger the F ratio, the stronger the evidence that the corresponding effect is significant.

The ANOVA table - let’s think about it

  • What can I do to increase the chances of finding a significant effect? In other words, how can I \(\uparrow\) F ratios?
  1. Have more variability explained by treatments (\(\uparrow\) SSn, and thus \(\uparrow\) MSn)
  2. Have less variability explained by error (\(\downarrow\) MSe), through either:
     • less noise in the error, or
     • more error degrees of freedom (dfe) (see the worked example below)
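  • For example, keeping the same 3 x 3 treatment design but increasing replication from r = 4 to r = 6 would raise the error degrees of freedom from \(df_e = nk(r-1) = 3 \times 3 \times 3 = 27\) to \(3 \times 3 \times 5 = 45\); all else equal, a larger \(df_e\) gives a more precise estimate of MSe and lowers the critical F value needed to declare significance.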

ANOVA table - motivational example

library(car)
library(emmeans)
library(multcomp)
# Treatment structure: map trt numbers 1-16 to N and K rate codes and amounts (kg/ha)
trt <- data.frame(trt = 1:16,
                  n_code = rep(c("N1", "N2", "N3", "N4"),
                               each = 4
                               ),
                  # the study had P, I'm replacing for K
                  k_code = rep(c("K1", "K2", "K3", "K4"),
                               4
                               )
                  ) %>%
  mutate(nrate_kgha = case_when(
    n_code == "N1" ~ 0,
    n_code == "N2" ~ 100,#65,
    n_code == "N3" ~ 200,#95,
    n_code == "N4" ~ 120
  )) %>%
  mutate(krate_kgha = case_when(
    k_code == "K1" ~ 0,
    k_code == "K2" ~ 30,#50,
    k_code == "K3" ~ 60,#70,
    k_code == "K4" ~ 90
  ))  

# Data: read the wheat trial, join the treatment info, and keep the 3 x 3 N x K subset
df <- read_csv("files/On-Station_Trial_on_Nitrogen_and_Phosphate_Fertilization_of_Wheat__Balkh__Afghanistan__2018-19.csv") %>%
  janitor::clean_names() %>%
  dplyr::select(rep, trt, yield_kg_ha) %>%
  left_join(trt) %>%
  filter(nrate_kgha != 120) %>%
  filter(krate_kgha != 90) %>%
    mutate(rep = factor(rep),
         nrate_kgha = factor(nrate_kgha),
         krate_kgha = factor(krate_kgha)
         )  %>%
  dplyr::select(rep, trt, nrate_kgha, krate_kgha, yield_kg_ha)
# Simulate a 4th replicate by adding random noise to the treatment means
set.seed(298)
df_rep4 <- df %>%
  group_by(trt, nrate_kgha, krate_kgha) %>%
  summarise(yield_kg_ha = mean(yield_kg_ha)) %>%
  ungroup() %>%
  mutate(e = rnorm(nrow(.), 0,300)) %>%
  mutate(yield_kg_ha = yield_kg_ha + e) %>%
  mutate(rep = "4") %>%
  dplyr::select(rep, trt, nrate_kgha, krate_kgha, yield_kg_ha)
df_all <- df %>%
  bind_rows(df_rep4)

# Fit the two-way factorial linear model (CRD)
# sum-to-zero contrasts so that the Type-3 SS from Anova() are appropriate
options(contrasts = c("contr.sum", "contr.poly"))

lm2w <- lm(yield_kg_ha ~ nrate_kgha*krate_kgha,
           data = df_all)

Anova(lm2w, type = 3) %>%
  as.data.frame() %>%
  rownames_to_column(var = "Source of variation") %>%
  mutate(`Pr(>F)` = ifelse(`Pr(>F)` < 0.001, 
                           "<0.001",
                           round(`Pr(>F)`,3))) %>%
  regulartable() %>%
  fontsize(size = 20, part = "all") %>%
  #border_inner_h(part="all") %>%
  autofit()

| Source of variation | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
| (Intercept) | 836,785,505.5 | 1 | 1,917.9470762 | <0.001 |
| nrate_kgha | 1,490,936.1 | 2 | 1.7086437 | 0.2 |
| krate_kgha | 470,561.8 | 2 | 0.5392735 | 0.589 |
| nrate_kgha:krate_kgha | 11,108,159.3 | 4 | 6.3650904 | <0.001 |
| Residuals | 11,779,891.6 | 27 | | |

  • What is significant here (say at \(\alpha = 0.05\))?

  • But wait: before making inference, we need to check the model assumptions (based on residuals)!

The linear model residual(s)

\[ \hat{e}_{ijk} = y_{ijk} - \hat{y}_{ijk} \] Or, in other words…

\[ residual = observed - predicted \]
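A minimal sketch of computing these raw residuals directly (assuming the lm2w model and df_all data frame defined above, with no missing yield values):

# Raw residuals: observed minus predicted, computed manually and from the model
df_all %>%
  mutate(predicted = fitted(lm2w),              # model-predicted yield per EU
         residual = yield_kg_ha - predicted,    # observed - predicted
         residual_lm = resid(lm2w)) %>%         # same values, extracted from lm2w
  head()

The plot below (from the original slide code) shows the observed yields together with the overall mean and the treatment means.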

df_all %>%
  mutate(trt = factor(trt)) %>%
  ggplot(aes(x = trt, y = yield_kg_ha))+
  # observed yields (purple circles)
  geom_point(shape = 21,
             fill = "purple", 
             size = 3,
             alpha = .7)+
  # overall mean (blue horizontal line)
  geom_hline(yintercept = mean(df_all$yield_kg_ha),
             color = "blue")+
  # treatment means (green squares)
  geom_point(data = df_all %>%
               mutate(trt = factor(trt)) %>%
               group_by(trt) %>%
               summarise(mean = mean(yield_kg_ha)),
             aes(y = mean),
             shape = 22,
             fill = "green", 
             size = 3,
             alpha = .7
               )+
  theme_bw()

The linear model assumptions are based on the residuals!

\[ e_{ijk} \sim iidN(0, \sigma^2) \] \(e_{ijk}\) is the residual corresponding to the kth replicate of N rate i and K rate j.

Random experimental errors (residuals) are assumed to be:

  • Mutually independent
  • Normally distributed with mean 0 and common (homogeneous) variance \(\sigma^2\) across treatments
  • Free of extreme observations (outliers)

Studentized residuals

  • There are different types of residuals, including the raw residual (the previous formula).

  • Raw residuals are not useful for assessing all of the assumptions (e.g., detecting outliers).

  • That limitation is addressed if we calculate the studentized residuals.

Studentized residuals

Studentized residuals are the raw residuals divided by an estimate of their standard deviation:

  • ~95% of studentized residuals within ± 2 sd
  • ~99% of studentized residuals within ± 3 sd
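A minimal sketch (using the lm2w model from the previous slides) of extracting studentized residuals with rstudent() and checking these rules of thumb:

stud_resid <- rstudent(lm2w)        # (externally) studentized residuals

mean(abs(stud_resid) <= 2)          # expect roughly 95% within +/- 2 sd
mean(abs(stud_resid) <= 3)          # expect roughly 99% within +/- 3 sd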

Valid analyses depend on valid assumptions!

  • If model assumptions are not met, inference is meaningless

  • Validity and repeatability of results is QUESTIONABLE

  • Always check residuals before making inference! If an assumption is violated, it needs to be addressed before proceeding!

Checking assumptions - residual independence

library(broom)
lm2w_aug <- augment(lm2w) %>%
  mutate(.studresid=rstudent(lm2w))

ggplot(lm2w_aug, aes(x=.fitted, y=.studresid))+
  geom_hline(yintercept = 0, color="red")+
  geom_point(shape = 21,
             fill = "purple", 
             size = 3,
             alpha = .7)+
  geom_smooth()+
  geom_hline(yintercept = c(-3,3), color = "red")+
  theme_bw()
  • No clear pattern (evidenced by geom_smooth)

Checking assumptions - residual homoscedasticity

ggplot(lm2w_aug, aes(x=.fitted, y=.studresid))+
  geom_hline(yintercept = 0, color="red")+
  geom_point(shape = 21,
             fill = "purple", 
             size = 3,
             alpha = .7)+
  geom_smooth()+
  geom_hline(yintercept = c(-3,3), color = "red")+
  theme_bw()
  • Data spread is even and centered around zero (evidenced by geom_smooth)
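If a formal test is desired to complement the visual check (an optional addition, not part of the original slides), Levene's test from the already-loaded car package can be applied across the treatment combinations:

# Levene's test: H0 is homogeneous residual variance across the N x K combinations
leveneTest(yield_kg_ha ~ nrate_kgha * krate_kgha, data = df_all)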

Checking assumptions - residual normality

ggplot(lm2w_aug, aes(sample=.studresid))+
  stat_qq(  shape = 21,
            fill = "purple", 
            size = 3,
            alpha = .7
  )+
  stat_qq_line()+
  labs(x = "Theoretical quantile",
       y = "Sample quantile")+
  theme_bw()

ggplot(lm2w_aug, aes(x=.studresid))+
  geom_density(color = "black",
               fill = "purple",
               alpha = .7)+
  scale_x_continuous(breaks = c(-3,0,3))+
  theme_bw()

  • It’s common for some residuals in the tails to be a bit off, especially with a low number of observations (N = 36). Nothing to worry about here.
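Similarly, a Shapiro-Wilk test on the studentized residuals can complement the visual checks (an optional addition, not shown in the original slides):

# H0: residuals come from a normal distribution; p > 0.05 means no strong evidence against normality
shapiro.test(lm2w_aug$.studresid)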

Checking assumptions - residual outliers

ggplot(lm2w_aug, aes(x=.fitted, y=.studresid))+
  geom_hline(yintercept = 0, color="red")+
  geom_point(shape = 21,
             fill = "purple", 
             size = 3,
             alpha = .7)+
  geom_smooth()+
  geom_hline(yintercept = c(-3,3), color = "red")+
  theme_bw()
  • A couple of residuals fall beyond -3 and 3, but not too many and not too far out. I wouldn’t worry about them.
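A short sketch (using the lm2w_aug data frame built above) to list the specific observations flagged by the ±3 rule of thumb:

lm2w_aug %>%
  filter(abs(.studresid) > 3) %>%                                  # EUs beyond +/- 3 sd
  dplyr::select(nrate_kgha, krate_kgha, yield_kg_ha, .studresid)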

Checking assumptions - summary

We confirmed that residuals were:

  • Independent

  • Homoscedastic (homogeneous variance)

  • Normally distributed

  • Free of outliers

  • Therefore, the model can be used for inference!

Summary

  • CRD is randomized without restrictions
  • Each term of the effects model has a row in the ANOVA table
  • Raw vs. Studentized residuals
  • If model assumptions are not met, inference is meaningless
  • Model residual assumptions: iidN(0,\(\sigma^2\))