Completely Randomized Design (CRD)

Today’s goals

Explore key concepts in CRD:

  1. Homogeneous experimental unit
  2. Treatment randomization
  3. The effects model
  4. The ANOVA table
  5. The linear model assumptions

Motivational example - Treatment design

  • 2-way factorial
  • N fertilizer rates: 0, 100, 200 kg N/ha
  • K fertilizer rates: 0, 30, 60 kg K/ha
  • 3 x 3 = 9 treatment combinations
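As a quick sketch (using the N and K rates listed above), the full treatment design can be enumerated in R:

# All 3 x 3 = 9 N x K treatment combinations
trt_design <- expand.grid(nrate_kgha = c(0, 100, 200),
                          krate_kgha = c(0, 30, 60))

trt_design
nrow(trt_design)   # 9 combinations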

Experimental design

Assuming we have homogeneous experimental material (e.g., same soil type, topography, etc.)

  • Thus, we can use a completely randomized design (CRD)

Treatment randomization

  • Randomization guards against unknown or uncontrollable sources of bias

  • Avoid systematic patterns

  • Eliminate selective assignment of experimental units (EUs) to treatments (conscious or unconscious!)

  • Randomization allows for valid inference on CAUSATION

Treatment randomization - CRD

  • Randomization of a treatment to an EU is unrestricted.

  • That means that replications of the same treatment could, by random chance, fall right next to each other.

In our motivational example:

  • 4 replicates

  • Total observations: 9 x 4 = 36 EUs
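A minimal sketch of this unrestricted randomization (the plot numbering 1 to 36 is hypothetical; only the 9 treatments and 4 replicates come from the example):

# 9 treatments x 4 replicates = 36 EUs, assigned completely at random
set.seed(123)                                        # for a reproducible layout
crd_layout <- data.frame(plot = 1:36,
                         trt  = sample(rep(1:9, times = 4)))

head(crd_layout)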

Homogeneous Experimental Material - CRD

In the plot layout here, all treatments (1 through 9) were randomly assigned to any experimental unit (plot) in the study area.

Homogeneous Experimental Material - CRD

  • Treatment 1 and its replicates are highlighted.

  • Note how, due to the unrestricted randomization, treatment 1 appears twice in the first column and does not appear in the third column. The same happened with other treatments.

Homogeneous Experimental Material - CRD

Because the experimental material is homogeneous (e.g., same soil texture class), this should not be an issue when estimating treatment means and performing comparisons. 👍

The effects model

\[ y_{ijk} = \mu + \alpha_{i} + \beta_{j} + \alpha\beta_{ij} + e_{ijk} \]

  • \(y_{ijk}\) is the observation on the kth replicate of the ith N rate and jth K rate
  • \(\mu\) is the overall mean
  • \(\alpha_{i}\) is the differential effect of the ith N rate
  • \(\beta_{j}\) is the differential effect of the jth K rate
  • \(\alpha\beta_{ij}\) is the differential effect of the combination of the ith N rate and jth K rate
  • \(e_{ijk}\) is the residual corresponding to the kth replicate of N rate i and K rate j.
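In R, this effects model corresponds to a two-way factorial lm() formula. A minimal sketch (the same formula is used for the lm2w model fit later in these slides; the df_all data frame, with factor columns nrate_kgha and krate_kgha, is also created later):

# mu + alpha_i (N rate) + beta_j (K rate) + interaction + residual
# "*" expands to both main effects plus their interaction
lm(yield_kg_ha ~ nrate_kgha * krate_kgha, data = df_all)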

The ANOVA table

In the following ANOVA table…

  • n is number of levels in N rate = 3
  • k is number of levels in K rate = 3
  • r is number of replicates = 4
  • N is total number of observations or EUs = 3 x 3 x 4 = 36

The ANOVA table

library(dplyr)
library(flextable)

tribble(~`Source of variation`,~`df`,~SS,~MS,~`F ratio`,
        #"Overall mean", "1", NA_character_, NA, NA,
        "N rate", "dfn =\nn - 1 =\n3 - 1 = 2", "SSn", "MSn =\nSSn / dfn ", "MSn / MSe",
        "K rate", "dfk =\nk - 1 =\n3 - 1 = 2", "SSk", "MSk =\nSSk / dfk", "MSk / MSe",
        "N x K", "dfnk =\n(n - 1) x (k - 1) =\n(3-1) x (3-1) = 4", "SSnk", "MSnk =\nSSnk / dfnk", "MSnk / MSe",
        "Error", "dfe =\nnk(r - 1) =\n3x3(4-1) =\n9x3 = 27", "SSe", "MSe =\nSSe / dfe", "",
        "TOTAL", "dft =\nN -1 = 35", "SSt", "", ""
        ) %>%
  regulartable() %>%
  fontsize(size = 16, part = "all") %>%
  border_inner_h(part="all") %>%
  autofit()

| Source of variation | df | SS | MS | F ratio |
|---|---|---|---|---|
| N rate | dfn = n - 1 = 3 - 1 = 2 | SSn | MSn = SSn / dfn | MSn / MSe |
| K rate | dfk = k - 1 = 3 - 1 = 2 | SSk | MSk = SSk / dfk | MSk / MSe |
| N x K | dfnk = (n - 1) x (k - 1) = (3 - 1) x (3 - 1) = 4 | SSnk | MSnk = SSnk / dfnk | MSnk / MSe |
| Error | dfe = nk(r - 1) = 3 x 3 x (4 - 1) = 27 | SSe | MSe = SSe / dfe | |
| TOTAL | dft = N - 1 = 35 | SSt | | |

  • Each component of the effects model has a row in the ANOVA table

  • The larger the F ratio, the stronger the evidence that the corresponding effect is significant.

The ANOVA table - let’s think about it

  • What can I do to increase the chances of finding a significant effect? In other words, how can I \(\uparrow\) F ratios?
  1. Have more variability explained by treatments (\(\uparrow\) SSn, and thus \(\uparrow\) MSn)
  2. Have less variability explained by error (\(\downarrow\) MSe), through either:
     • less noise in the error, or
     • more error degrees of freedom (dfe) (see the worked example below)
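  • For example, keeping the same 3 x 3 treatment design but increasing replication from r = 4 to r = 6 would raise the error degrees of freedom from \(df_e = nk(r-1) = 3 \times 3 \times 3 = 27\) to \(3 \times 3 \times 5 = 45\); all else equal, a larger \(df_e\) gives a more precise estimate of MSe and lowers the critical F value needed to declare significance.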

ANOVA table - motivational example

library(car)
library(emmeans)
library(multcomp)
# Treatment structure: map trt numbers 1-16 to N and K rate codes and amounts (kg/ha)
trt <- data.frame(trt = 1:16,
                  n_code = rep(c("N1", "N2", "N3", "N4"),
                               each = 4
                               ),
                  # the study had P, I'm replacing for K
                  k_code = rep(c("K1", "K2", "K3", "K4"),
                               4
                               )
                  ) %>%
  mutate(nrate_kgha = case_when(
    n_code == "N1" ~ 0,
    n_code == "N2" ~ 100,#65,
    n_code == "N3" ~ 200,#95,
    n_code == "N4" ~ 120
  )) %>%
  mutate(krate_kgha = case_when(
    k_code == "K1" ~ 0,
    k_code == "K2" ~ 30,#50,
    k_code == "K3" ~ 60,#70,
    k_code == "K4" ~ 90
  ))  

# Data: read the wheat trial, join the treatment info, and keep the 3 x 3 N x K subset
df <- read_csv("files/On-Station_Trial_on_Nitrogen_and_Phosphate_Fertilization_of_Wheat__Balkh__Afghanistan__2018-19.csv") %>%
  janitor::clean_names() %>%
  dplyr::select(rep, trt, yield_kg_ha) %>%
  left_join(trt) %>%
  filter(nrate_kgha != 120) %>%
  filter(krate_kgha != 90) %>%
    mutate(rep = factor(rep),
         nrate_kgha = factor(nrate_kgha),
         krate_kgha = factor(krate_kgha)
         )  %>%
  dplyr::select(rep, trt, nrate_kgha, krate_kgha, yield_kg_ha)
# Simulate a 4th replicate by adding random noise to the treatment means
set.seed(298)
df_rep4 <- df %>%
  group_by(trt, nrate_kgha, krate_kgha) %>%
  summarise(yield_kg_ha = mean(yield_kg_ha)) %>%
  ungroup() %>%
  mutate(e = rnorm(nrow(.), 0,300)) %>%
  mutate(yield_kg_ha = yield_kg_ha + e) %>%
  mutate(rep = "4") %>%
  dplyr::select(rep, trt, nrate_kgha, krate_kgha, yield_kg_ha)
df_all <- df %>%
  bind_rows(df_rep4)

# Fit the two-way factorial linear model (CRD)
# sum-to-zero contrasts so that the Type-3 SS from Anova() are appropriate
options(contrasts = c("contr.sum", "contr.poly"))

lm2w <- lm(yield_kg_ha ~ nrate_kgha*krate_kgha,
           data = df_all)

Anova(lm2w, type = 3) %>%
  as.data.frame() %>%
  rownames_to_column(var = "Source of variation") %>%
  mutate(`Pr(>F)` = ifelse(`Pr(>F)` < 0.001, 
                           "<0.001",
                           round(`Pr(>F)`,3))) %>%
  regulartable() %>%
  fontsize(size = 20, part = "all") %>%
  #border_inner_h(part="all") %>%
  autofit()

| Source of variation | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
| (Intercept) | 836,785,505.5 | 1 | 1,917.9470762 | <0.001 |
| nrate_kgha | 1,490,936.1 | 2 | 1.7086437 | 0.2 |
| krate_kgha | 470,561.8 | 2 | 0.5392735 | 0.589 |
| nrate_kgha:krate_kgha | 11,108,159.3 | 4 | 6.3650904 | <0.001 |
| Residuals | 11,779,891.6 | 27 | | |

  • What is significant here (say at \(\alpha = 0.05\))?

  • But wait: before making inference, we need to check the model assumptions (based on residuals)!

The linear model residual(s)

\[ \hat{e}_{ijk} = y_{ijk} - \hat{y}_{ijk} \] Or, in other words…

\[ residual = observed - predicted \]
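A minimal sketch of computing these raw residuals directly (assuming the lm2w model and df_all data frame defined above, with no missing yield values):

# Raw residuals: observed minus predicted, computed manually and from the model
df_all %>%
  mutate(predicted = fitted(lm2w),              # model-predicted yield per EU
         residual = yield_kg_ha - predicted,    # observed - predicted
         residual_lm = resid(lm2w)) %>%         # same values, extracted from lm2w
  head()

The plot below (from the original slide code) shows the observed yields together with the overall mean and the treatment means.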

df_all %>%
  mutate(trt = factor(trt)) %>%
  ggplot(aes(x = trt, y = yield_kg_ha))+
  # observed yields (purple circles)
  geom_point(shape = 21,
             fill = "purple", 
             size = 3,
             alpha = .7)+
  # overall mean (blue horizontal line)
  geom_hline(yintercept = mean(df_all$yield_kg_ha),
             color = "blue")+
  # treatment means (green squares)
  geom_point(data = df_all %>%
               mutate(trt = factor(trt)) %>%
               group_by(trt) %>%
               summarise(mean = mean(yield_kg_ha)),
             aes(y = mean),
             shape = 22,
             fill = "green", 
             size = 3,
             alpha = .7
               )+
  theme_bw()

The linear model assumptions are based on the residuals!

\[ e_{ijk} \sim iidN(0, \sigma^2) \] \(e_{ijk}\) is the residual corresponding to the kth replicate of N rate i and K rate j.

Random experimental errors (residuals) are assumed to be:

  • Mutually independent
  • Normally distributed with mean 0 and common (homogeneous) variance \(\sigma^2\) across treatments
  • Free of extreme observations (outliers)

Studentized residuals

  • There are different types of residuals, including the raw residual (the previous formula).

  • Raw residuals are not useful for assessing all of the assumptions (e.g., detecting outliers).

  • That limitation is addressed if we calculate the studentized residuals.

Studentized residuals

Studentized residuals are the raw residuals divided by an estimate of their standard deviation:

  • ~95% of studentized residuals within ± 2 sd
  • ~99% of studentized residuals within ± 3 sd
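A minimal sketch (using the lm2w model from the previous slides) of extracting studentized residuals with rstudent() and checking these rules of thumb:

stud_resid <- rstudent(lm2w)        # (externally) studentized residuals

mean(abs(stud_resid) <= 2)          # expect roughly 95% within +/- 2 sd
mean(abs(stud_resid) <= 3)          # expect roughly 99% within +/- 3 sd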

Valid analyses depend on valid assumptions!

  • If model assumptions are not met, inference is meaningless

  • Validity and repeatability of results is QUESTIONABLE

  • Always check residuals before making inference! If an assumption is violated, it needs to be addressed before proceeding!

Checking assumptions - residual independence

library(broom)
lm2w_aug <- augment(lm2w) %>%
  mutate(.studresid=rstudent(lm2w))

ggplot(lm2w_aug, aes(x=.fitted, y=.studresid))+
  geom_hline(yintercept = 0, color="red")+
  geom_point(shape = 21,
             fill = "purple", 
             size = 3,
             alpha = .7)+
  geom_smooth()+
  geom_hline(yintercept = c(-3,3), color = "red")+
  theme_bw()
  • No clear pattern (evidenced by geom_smooth)

Checking assumptions - residual homoscedasticity

ggplot(lm2w_aug, aes(x=.fitted, y=.studresid))+
  geom_hline(yintercept = 0, color="red")+
  geom_point(shape = 21,
             fill = "purple", 
             size = 3,
             alpha = .7)+
  geom_smooth()+
  geom_hline(yintercept = c(-3,3), color = "red")+
  theme_bw()
  • Data spread is even and centered around zero (evidenced by geom_smooth)
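If a formal test is desired to complement the visual check (an optional addition, not part of the original slides), Levene's test from the already-loaded car package can be applied across the treatment combinations:

# Levene's test: H0 is homogeneous residual variance across the N x K combinations
leveneTest(yield_kg_ha ~ nrate_kgha * krate_kgha, data = df_all)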

Checking assumptions - residual normality

ggplot(lm2w_aug, aes(sample=.studresid))+
  stat_qq(  shape = 21,
            fill = "purple", 
            size = 3,
            alpha = .7
  )+
  stat_qq_line()+
  labs(x = "Theoretical quantile",
       y = "Sample quantile")+
  theme_bw()

ggplot(lm2w_aug, aes(x=.studresid))+
  geom_density(color = "black",
               fill = "purple",
               alpha = .7)+
  scale_x_continuous(breaks = c(-3,0,3))+
  theme_bw()

  • It’s common for some residuals in the tails to be a bit off, especially with a low number of observations (N = 36). Nothing to worry about here.
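Similarly, a Shapiro-Wilk test on the studentized residuals can complement the visual checks (an optional addition, not shown in the original slides):

# H0: residuals come from a normal distribution; p > 0.05 means no strong evidence against normality
shapiro.test(lm2w_aug$.studresid)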

Checking assumptions - residual outliers

ggplot(lm2w_aug, aes(x=.fitted, y=.studresid))+
  geom_hline(yintercept = 0, color="red")+
  geom_point(shape = 21,
             fill = "purple", 
             size = 3,
             alpha = .7)+
  geom_smooth()+
  geom_hline(yintercept = c(-3,3), color = "red")+
  theme_bw()
  • A couple of residuals fall beyond -3 and 3, but not too many and not too far out. I wouldn’t worry about them.
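A short sketch (using the lm2w_aug data frame built above) to list the specific observations flagged by the ±3 rule of thumb:

lm2w_aug %>%
  filter(abs(.studresid) > 3) %>%                                  # EUs beyond +/- 3 sd
  dplyr::select(nrate_kgha, krate_kgha, yield_kg_ha, .studresid)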

Checking assumptions - summary

We confirmed that residuals were:

  • Independent

  • Homoscedastic (homogeneous variance)

  • Normally distributed

  • Free of outliers

  • Therefore, the model can be used for inference!

Summary

  • CRD is randomized without restrictions
  • Each term of the effects model has a row in the ANOVA table
  • Raw vs. Studentized residuals
  • If model assumptions are not met, inference is meaningless
  • Model residual assumptions: iidN(0,\(\sigma^2\))