#install.packages("easypackages")
# Loading packages
library(easypackages)
packages("dplyr") # for data wrangling
packages("tidyr") # for data wrangling
packages("ggplot2") # for plotting
packages("car") # for Anova function
packages("lme4") # for mixed-effect ANOVA model
packages("broom.mixed") # for model residuals extraction
packages("emmeans") # for model mean extraction
packages("multcomp") # for pairwise comparison letter display2024 ACS Workshop - Designing and Analyzing Agricultural Studies: intro and example in R
Notes
This script was developed by Dr. Leonardo Bastos for the 2024 ACS Workshop - Designing and Analyzing Agricultural Studies: intro and example in R, conducted on May 10th 2024.
The slides can be found here: https://leombastos.github.io/bastoslab/teaching/2024-acs/2024-acs-deck.html#/title-slide
The rendered version of this script can be found here: https://leombastos.github.io/bastoslab/teaching/2024-acs/2024-acsworkshop-rcbd-Bastos.html
A few points before we start:
- Use the Outline to navigate
- This is not an intro to R workshop.
- This workshop assumes you are familiar with R, RStudio, quarto scripts, chunks, data wrangling with
dplyr/tidyr, plotting withggplot2.
- If this is your first time using R, you will likely run into issues.
- If you are unable to code along, feel free to just follow along and don’t worry about the code part.
Introduction
The goals of this script are to:
Create an analytical workflow for an RCBD with random blocks, from data import through publication-ready plot
Understand each of its components
Study details
Objectives: Our objective is to assess the effect of different N and K fertilizer rates on crop grain yield.
Treatment design: 3 N rate x 3 K rate.
Experimental design: randomized complete block design with four blocks.
a) Setup
Here is where we load the packages we will use.
Let’s import the data.
rcbd_df <- read.csv("../data/wheat_nk_bamyan.csv")
rcbd_df trt rep nrate_kgha krate_kgha yield_kgha
1 1 1 0 0 1350
2 2 1 0 30 2110
3 3 1 0 60 5725
4 4 1 100 0 2988
5 5 1 100 30 3438
6 6 1 100 60 6525
7 7 1 200 0 2825
8 8 1 200 30 4400
9 9 1 200 60 7025
10 1 2 0 0 2250
11 2 2 0 30 4250
12 3 2 0 60 3150
13 4 2 100 0 3575
14 5 2 100 30 6075
15 6 2 100 60 3925
16 7 2 200 0 4075
17 8 2 200 30 7232
18 9 2 200 60 3412
19 1 3 0 0 875
20 2 3 0 30 2250
21 3 3 0 60 4250
22 4 3 100 0 812
23 5 3 100 30 2962
24 6 3 100 60 5200
25 7 3 200 0 1910
26 8 3 200 30 5488
27 9 3 200 60 5870
28 1 4 0 0 1537
29 2 4 0 30 2791
30 3 4 0 60 4578
31 4 4 100 0 2001
32 5 4 100 30 4419
33 6 4 100 60 5754
34 7 4 200 0 2756
35 8 4 200 30 5354
36 9 4 200 60 5029
We have 36 rows (9 treatments x 4 blocks).
b) Explortory data analysis (EDA) - tables
summary(rcbd_df) trt rep nrate_kgha krate_kgha yield_kgha
Min. :1 Min. :1.00 Min. : 0 Min. : 0 Min. : 812
1st Qu.:3 1st Qu.:1.75 1st Qu.: 0 1st Qu.: 0 1st Qu.:2630
Median :5 Median :2.50 Median :100 Median :30 Median :3750
Mean :5 Mean :2.50 Mean :100 Mean :30 Mean :3838
3rd Qu.:7 3rd Qu.:3.25 3rd Qu.:200 3rd Qu.:60 3rd Qu.:5238
Max. :9 Max. :4.00 Max. :200 Max. :60 Max. :7232
Checking variable classes.
glimpse(rcbd_df)Rows: 36
Columns: 5
$ trt <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2,…
$ rep <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3,…
$ nrate_kgha <int> 0, 0, 0, 100, 100, 100, 200, 200, 200, 0, 0, 0, 100, 100, 1…
$ krate_kgha <int> 0, 30, 60, 0, 30, 60, 0, 30, 60, 0, 30, 60, 0, 30, 60, 0, 3…
$ yield_kgha <int> 1350, 2110, 5725, 2988, 3438, 6525, 2825, 4400, 7025, 2250,…
c) Wrangling
Need to transform rep, N rate and K rate into factor (for analysis of variance).
rcbd_dfw <- rcbd_df %>%
mutate(rep = factor(rep),
nrate_kgha = factor(nrate_kgha),
krate_kgha = factor(krate_kgha)
)
rcbd_dfw trt rep nrate_kgha krate_kgha yield_kgha
1 1 1 0 0 1350
2 2 1 0 30 2110
3 3 1 0 60 5725
4 4 1 100 0 2988
5 5 1 100 30 3438
6 6 1 100 60 6525
7 7 1 200 0 2825
8 8 1 200 30 4400
9 9 1 200 60 7025
10 1 2 0 0 2250
11 2 2 0 30 4250
12 3 2 0 60 3150
13 4 2 100 0 3575
14 5 2 100 30 6075
15 6 2 100 60 3925
16 7 2 200 0 4075
17 8 2 200 30 7232
18 9 2 200 60 3412
19 1 3 0 0 875
20 2 3 0 30 2250
21 3 3 0 60 4250
22 4 3 100 0 812
23 5 3 100 30 2962
24 6 3 100 60 5200
25 7 3 200 0 1910
26 8 3 200 30 5488
27 9 3 200 60 5870
28 1 4 0 0 1537
29 2 4 0 30 2791
30 3 4 0 60 4578
31 4 4 100 0 2001
32 5 4 100 30 4419
33 6 4 100 60 5754
34 7 4 200 0 2756
35 8 4 200 30 5354
36 9 4 200 60 5029
Checking summary.
summary(rcbd_dfw) trt rep nrate_kgha krate_kgha yield_kgha
Min. :1 1:9 0 :12 0 :12 Min. : 812
1st Qu.:3 2:9 100:12 30:12 1st Qu.:2630
Median :5 3:9 200:12 60:12 Median :3750
Mean :5 4:9 Mean :3838
3rd Qu.:7 3rd Qu.:5238
Max. :9 Max. :7232
Number of replicates: 4
Number o treatments: 3 N rates x 3 K rates = 9
Number of observations: 4 x 9 = 36
Yield: from 912 to 7232 kg/ha
d) EDA plots
N rate boxplots.
ggplot(rcbd_dfw, aes(x = nrate_kgha,
y = yield_kgha,
color = nrate_kgha)) +
geom_boxplot() +
geom_jitter() +
theme(legend.position = "none")K rate boxplots.
ggplot(rcbd_dfw, aes(x = krate_kgha,
y = yield_kgha,
color = krate_kgha)) +
geom_boxplot() +
geom_jitter() +
theme(legend.position = "none")N x K interaction boxplots
ggplot(rcbd_dfw, aes(x = nrate_kgha,
y = yield_kgha,
color = nrate_kgha)) +
geom_boxplot() +
geom_jitter() +
facet_grid(.~krate_kgha) +
theme(legend.position = "none")e) Statistical model
To treat blocks as random effect, we will need to use a function that accomodates fixed and random effects.
To account for more than one variance component (i.e., random effects and error), we can use function lmer() from package lme4.
# Changing to sum-to-zero contrast
options(contrasts = c("contr.sum", "contr.poly"))
# Model fitting
rcbd_mix_mod <- lmer(yield_kgha ~ (1|rep) + nrate_kgha*krate_kgha,
data = rcbd_dfw)
rcbd_mix_modLinear mixed model fit by REML ['lmerMod']
Formula: yield_kgha ~ (1 | rep) + nrate_kgha * krate_kgha
Data: rcbd_dfw
REML criterion at convergence: 481.7932
Random effects:
Groups Name Std.Dev.
rep (Intercept) 151.4
Residual 1118.2
Number of obs: 36, groups: rep, 4
Fixed Effects:
(Intercept) nrate_kgha1 nrate_kgha2
3837.94 -911.61 134.89
krate_kgha1 krate_kgha2 nrate_kgha1:krate_kgha1
-1591.78 392.81 168.44
nrate_kgha2:krate_kgha1 nrate_kgha1:krate_kgha2 nrate_kgha2:krate_kgha2
-37.06 -468.89 -142.14
f) ANOVA table
Anova(rcbd_mix_mod, type = 3)Analysis of Deviance Table (Type III Wald chisquare tests)
Response: yield_kgha
Chisq Df Pr(>Chisq)
(Intercept) 363.9961 1 < 2.2e-16 ***
nrate_kgha 13.9399 2 0.0009397 ***
krate_kgha 39.5932 2 2.526e-09 ***
nrate_kgha:krate_kgha 3.2401 4 0.5184758
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice how rep (i.e., block in this case) does not appear in the ANOVA table above. That’s because it only displays fixed effects.
Since only the main effects (N and K) are significant, and not the interaction, we should extract means and perform pairwise comparisons for the main effects only.
Before we do that, let’s check our model assumptions.
A model is only valid for inference (i.e., means and pwc) IF it fulfills the linear model assumptions.
g) Linear model assumptions
Extracting residuals
First, let’s extract our model residuals, and also create studentized residuals.
rcbd_mix_resid <- augment(rcbd_mix_mod) %>%
mutate(.studresid = rstudent(rcbd_mix_mod))
rcbd_mix_resid# A tibble: 36 × 16
yield_kgha rep nrate_kgha krate_kgha .fitted .resid .hat .cooksd .fixed
<int> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1350 1 0 0 1532. -182. 0.262 0.00141 1503.
2 2110 1 0 30 2879. -769. 0.262 0.0253 2850.
3 5725 1 0 60 4455. 1270. 0.262 0.0689 4426.
4 2988 1 100 0 2373. 615. 0.262 0.0161 2344
5 3438 1 100 30 4253. -815. 0.262 0.0283 4224.
6 6525 1 100 60 5380. 1145. 0.262 0.0560 5351
7 2825 1 200 0 2921. -95.5 0.262 0.000390 2892.
8 4400 1 200 30 5648. -1248. 0.262 0.0664 5618.
9 7025 1 200 60 5363. 1662. 0.262 0.118 5334
10 2250 2 0 0 1557. 693. 0.262 0.0205 1503.
# ℹ 26 more rows
# ℹ 7 more variables: .mu <dbl>, .offset <dbl>, .sqrtXwt <dbl>, .sqrtrwt <dbl>,
# .weights <dbl>, .wtres <dbl>, .studresid <dbl>
Now, let’s recap the linear model assumptions:
- Independence (no pattern)
- Variance homogeneity (homoscedasticity)
- Normality
- Outlier detection (< -3 or > 3)
One difference in mixed models is that the 3 first assumptions are also applied to the random effects, so we need to check it for them as well.
Random effects are iid ~ N(0,var_a)
randeff_rep <- ranef(rcbd_mix_mod)[[1]]
randeff_rep (Intercept)
1 29.038733
2 53.566978
3 -77.528459
4 -5.077252
For random effects with so few levels (i.e., 4 since that’s the number of blocks), the QQ plot is one of the only ways to check for assumptions on random effects.
ggplot(randeff_rep, aes(sample = `(Intercept)`)) +
stat_qq(shape = 21,
fill = "purple",
size = 3,
alpha = .7
) +
stat_qq_line() +
labs(x = "Theoretical quantile",
y = "Sample quantile") +
theme_bw()Nothing to worry here, especially since we only have 4 points.
Within-group errors are iid ~ N(0, var_e)
ggplot(rcbd_mix_resid,
aes(x = .fitted,
y = .studresid)) +
geom_hline(yintercept = 0,
color = "red") +
geom_point(shape = 21,
fill = "purple",
size = 3,
alpha = .7) +
geom_smooth() +
geom_hline(yintercept = c(-3,3),
color = "red") +
theme_bw()`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Independence: no pattern observed, points appear to be random, looks good.
Variance homoscedastic: seems constant across the x-axis (blue smooth line and confidence error ribbon touching 0 on the y-axis throughout the x-axis range), looks good.
Outliers: nothing outside the -3,3 boundaries, looks good.
ggplot(rcbd_mix_resid,
aes(sample = .studresid)) +
stat_qq(shape = 21,
fill = "purple",
size = 3,
alpha = .7) +
stat_qq_line() +
labs(x = "Theoretical quantile",
y = "Sample quantile") +
theme_bw()ggplot(rcbd_mix_resid,
aes(x = .studresid)) +
geom_density(color = "black",
fill = "purple",
alpha = .7) +
scale_x_continuous(breaks = c(-3,0,3),
limits = c(-4,4)) +
theme_bw()- Normality: residuals seem normal.
h) Model means
The next step in the workflow is extracting the model means.
Since our main effects were the only significant terms, we’ll extract the means for N and K separately.
N rate means:
rcbd_mix_means_n <- emmeans(rcbd_mix_mod, ~nrate_kgha)NOTE: Results may be misleading due to involvement in interactions
rcbd_mix_means_n nrate_kgha emmean SE df lower.CL upper.CL
0 2926 332 16.2 2224 3629
100 3973 332 16.2 3271 4675
200 4615 332 16.2 3912 5317
Results are averaged over the levels of: krate_kgha
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
K rate means:
rcbd_mix_means_k <- emmeans(rcbd_mix_mod, ~krate_kgha)NOTE: Results may be misleading due to involvement in interactions
rcbd_mix_means_k krate_kgha emmean SE df lower.CL upper.CL
0 2246 332 16.2 1544 2948
30 4231 332 16.2 3529 4933
60 5037 332 16.2 4335 5739
Results are averaged over the levels of: nrate_kgha
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
i) Pairwise comparisons
Now that we extracted means, let’s perform pairwise comparisons among them.
N pairwise comparison:
rcbd_mix_pwc_n <- cld(rcbd_mix_means_n,
reversed = T,
adjust = "none",
Letters = letters) %>%
as.data.frame() %>%
mutate(letter = trimws(.group))
rcbd_mix_pwc_n nrate_kgha emmean SE df lower.CL upper.CL .group letter
1 200 4614.667 331.5619 16.18014 3912.422 5316.911 a a
2 100 3972.833 331.5619 16.18014 3270.589 4675.078 a a
3 0 2926.333 331.5619 16.18014 2224.089 3628.578 b b
K pairwise comparison:
rcbd_mix_pwc_k <- cld(rcbd_mix_means_k,
reversed = T,
adjust = "none",
Letters = letters) %>%
as.data.frame() %>%
mutate(letter = trimws(.group))
rcbd_mix_pwc_k krate_kgha emmean SE df lower.CL upper.CL .group letter
1 60 5036.917 331.5619 16.18014 4334.672 5739.161 a a
2 30 4230.750 331.5619 16.18014 3528.506 4932.994 a a
3 0 2246.167 331.5619 16.18014 1543.922 2948.411 b b
g) Final plots
Let’s plot our results, including both raw data (for allowing our audience to inspect data distribution) and statistical model summary (i.e., letter separation) for inference purposes.
Let’s make these plots publication ready.
For N rates:
ggplot(mapping = aes(fill = nrate_kgha)) +
# Raw data and boxplots
geom_boxplot(data = rcbd_dfw,
aes(x = nrate_kgha,
y = yield_kgha),
alpha = .8) +
geom_jitter(data = rcbd_dfw,
aes(x = nrate_kgha,
y = yield_kgha),
shape = 21,
size = 3,
alpha = .6) +
# Adding letters
geom_label(data = rcbd_mix_pwc_n,
aes(x = nrate_kgha,
y = emmean,
label = letter),
fill = "white") +
labs(x = "N rate (kg/ha)",
y = "Yield (kg/ha)") +
scale_fill_viridis_d() +
theme(panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.position = "none")For K rates:
ggplot(mapping = aes(fill = krate_kgha))+
# Raw data and boxplots
geom_boxplot(data = rcbd_dfw,
aes(x = krate_kgha,
y = yield_kgha),
alpha = .8) +
geom_jitter(data = rcbd_dfw,
aes(x = krate_kgha,
y = yield_kgha),
shape = 21,
size = 3,
alpha = .6) +
# Adding letters
geom_label(data = rcbd_mix_pwc_k,
aes(x = krate_kgha,
y = emmean,
label = letter),
fill = "white") +
labs(x = "K rate (kg/ha)",
y = "Yield (kg/ha)") +
scale_fill_viridis_d(option = "A") +
theme(panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.position = "none")