2024 ACS Workshop - Designing and Analyzing Agricultural Studies: intro and example in R

Notes

This script was developed by Dr. Leonardo Bastos for the 2024 ACS Workshop - Designing and Analyzing Agricultural Studies: intro and example in R, conducted on May 10th 2024.

The slides can be found here: https://leombastos.github.io/bastoslab/teaching/2024-acs/2024-acs-deck.html#/title-slide
The rendered version of this script can be found here: https://leombastos.github.io/bastoslab/teaching/2024-acs/2024-acsworkshop-rcbd-Bastos.html

Important

A few points before we start:

Use the Outline to navigate
This is not an intro to R workshop.
This workshop assumes you are familiar with R, RStudio, quarto scripts, chunks, data wrangling with dplyr/tidyr, plotting with ggplot2.
If this is your first time using R, you will likely run into issues.
If you are unable to code along, feel free to just follow along and don’t worry about the code part.

Introduction

The goals of this script are to:

Create an analytical workflow for an RCBD with random blocks, from data import through publication-ready plot
Understand each of its components

Study details

Objectives: Our objective is to assess the effect of different N and K fertilizer rates on crop grain yield.

Treatment design: 3 N rate x 3 K rate.

Experimental design: randomized complete block design with four blocks.

a) Setup

Here is where we load the packages we will use.

#install.packages("easypackages")

# Loading packages
library(easypackages)
packages("dplyr") # for data wrangling
packages("tidyr") # for data wrangling 
packages("ggplot2") # for plotting
packages("car") # for Anova function
packages("lme4") # for mixed-effect ANOVA model  
packages("broom.mixed") # for model residuals extraction
packages("emmeans") # for model mean extraction
packages("multcomp") # for pairwise comparison letter display

Let’s import the data.

rcbd_df <- read.csv("../data/wheat_nk_bamyan.csv")

rcbd_df

   trt rep nrate_kgha krate_kgha yield_kgha
1    1   1          0          0       1350
2    2   1          0         30       2110
3    3   1          0         60       5725
4    4   1        100          0       2988
5    5   1        100         30       3438
6    6   1        100         60       6525
7    7   1        200          0       2825
8    8   1        200         30       4400
9    9   1        200         60       7025
10   1   2          0          0       2250
11   2   2          0         30       4250
12   3   2          0         60       3150
13   4   2        100          0       3575
14   5   2        100         30       6075
15   6   2        100         60       3925
16   7   2        200          0       4075
17   8   2        200         30       7232
18   9   2        200         60       3412
19   1   3          0          0        875
20   2   3          0         30       2250
21   3   3          0         60       4250
22   4   3        100          0        812
23   5   3        100         30       2962
24   6   3        100         60       5200
25   7   3        200          0       1910
26   8   3        200         30       5488
27   9   3        200         60       5870
28   1   4          0          0       1537
29   2   4          0         30       2791
30   3   4          0         60       4578
31   4   4        100          0       2001
32   5   4        100         30       4419
33   6   4        100         60       5754
34   7   4        200          0       2756
35   8   4        200         30       5354
36   9   4        200         60       5029

We have 36 rows (9 treatments x 4 blocks).

b) Explortory data analysis (EDA) - tables

summary(rcbd_df)

      trt         rep         nrate_kgha    krate_kgha   yield_kgha  
 Min.   :1   Min.   :1.00   Min.   :  0   Min.   : 0   Min.   : 812  
 1st Qu.:3   1st Qu.:1.75   1st Qu.:  0   1st Qu.: 0   1st Qu.:2630  
 Median :5   Median :2.50   Median :100   Median :30   Median :3750  
 Mean   :5   Mean   :2.50   Mean   :100   Mean   :30   Mean   :3838  
 3rd Qu.:7   3rd Qu.:3.25   3rd Qu.:200   3rd Qu.:60   3rd Qu.:5238  
 Max.   :9   Max.   :4.00   Max.   :200   Max.   :60   Max.   :7232

Checking variable classes.

glimpse(rcbd_df)

Rows: 36
Columns: 5
$ trt        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2,…
$ rep        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3,…
$ nrate_kgha <int> 0, 0, 0, 100, 100, 100, 200, 200, 200, 0, 0, 0, 100, 100, 1…
$ krate_kgha <int> 0, 30, 60, 0, 30, 60, 0, 30, 60, 0, 30, 60, 0, 30, 60, 0, 3…
$ yield_kgha <int> 1350, 2110, 5725, 2988, 3438, 6525, 2825, 4400, 7025, 2250,…

c) Wrangling

Need to transform rep, N rate and K rate into factor (for analysis of variance).

rcbd_dfw <- rcbd_df %>%
  mutate(rep = factor(rep),
         nrate_kgha = factor(nrate_kgha),
         krate_kgha = factor(krate_kgha) 
         ) 

rcbd_dfw

   trt rep nrate_kgha krate_kgha yield_kgha
1    1   1          0          0       1350
2    2   1          0         30       2110
3    3   1          0         60       5725
4    4   1        100          0       2988
5    5   1        100         30       3438
6    6   1        100         60       6525
7    7   1        200          0       2825
8    8   1        200         30       4400
9    9   1        200         60       7025
10   1   2          0          0       2250
11   2   2          0         30       4250
12   3   2          0         60       3150
13   4   2        100          0       3575
14   5   2        100         30       6075
15   6   2        100         60       3925
16   7   2        200          0       4075
17   8   2        200         30       7232
18   9   2        200         60       3412
19   1   3          0          0        875
20   2   3          0         30       2250
21   3   3          0         60       4250
22   4   3        100          0        812
23   5   3        100         30       2962
24   6   3        100         60       5200
25   7   3        200          0       1910
26   8   3        200         30       5488
27   9   3        200         60       5870
28   1   4          0          0       1537
29   2   4          0         30       2791
30   3   4          0         60       4578
31   4   4        100          0       2001
32   5   4        100         30       4419
33   6   4        100         60       5754
34   7   4        200          0       2756
35   8   4        200         30       5354
36   9   4        200         60       5029

Checking summary.

summary(rcbd_dfw)

      trt    rep   nrate_kgha krate_kgha   yield_kgha  
 Min.   :1   1:9   0  :12     0 :12      Min.   : 812  
 1st Qu.:3   2:9   100:12     30:12      1st Qu.:2630  
 Median :5   3:9   200:12     60:12      Median :3750  
 Mean   :5   4:9                         Mean   :3838  
 3rd Qu.:7                               3rd Qu.:5238  
 Max.   :9                               Max.   :7232

Number of replicates: 4
Number o treatments: 3 N rates x 3 K rates = 9
Number of observations: 4 x 9 = 36
Yield: from 912 to 7232 kg/ha

d) EDA plots

N rate boxplots.

ggplot(rcbd_dfw, aes(x = nrate_kgha, 
                    y = yield_kgha,
                    color = nrate_kgha)) +
  geom_boxplot() +
  geom_jitter() +
  theme(legend.position = "none")

K rate boxplots.

ggplot(rcbd_dfw, aes(x = krate_kgha, 
                    y = yield_kgha,
                    color = krate_kgha)) +
  geom_boxplot() +
  geom_jitter() +
  theme(legend.position = "none")

N x K interaction boxplots

ggplot(rcbd_dfw, aes(x = nrate_kgha, 
                    y = yield_kgha,
                    color = nrate_kgha)) +
  geom_boxplot() +
  geom_jitter() +
  facet_grid(.~krate_kgha) +
  theme(legend.position = "none")

e) Statistical model

To treat blocks as random effect, we will need to use a function that accomodates fixed and random effects.

To account for more than one variance component (i.e., random effects and error), we can use function lmer() from package lme4.

# Changing to sum-to-zero contrast
options(contrasts = c("contr.sum", "contr.poly"))

# Model fitting
rcbd_mix_mod <- lmer(yield_kgha ~ (1|rep) + nrate_kgha*krate_kgha, 
              data = rcbd_dfw)

rcbd_mix_mod

Linear mixed model fit by REML ['lmerMod']
Formula: yield_kgha ~ (1 | rep) + nrate_kgha * krate_kgha
   Data: rcbd_dfw
REML criterion at convergence: 481.7932
Random effects:
 Groups   Name        Std.Dev.
 rep      (Intercept)  151.4  
 Residual             1118.2  
Number of obs: 36, groups:  rep, 4
Fixed Effects:
            (Intercept)              nrate_kgha1              nrate_kgha2  
                3837.94                  -911.61                   134.89  
            krate_kgha1              krate_kgha2  nrate_kgha1:krate_kgha1  
               -1591.78                   392.81                   168.44  
nrate_kgha2:krate_kgha1  nrate_kgha1:krate_kgha2  nrate_kgha2:krate_kgha2  
                 -37.06                  -468.89                  -142.14

f) ANOVA table

Anova(rcbd_mix_mod, type = 3)

Analysis of Deviance Table (Type III Wald chisquare tests)

Response: yield_kgha
                         Chisq Df Pr(>Chisq)    
(Intercept)           363.9961  1  < 2.2e-16 ***
nrate_kgha             13.9399  2  0.0009397 ***
krate_kgha             39.5932  2  2.526e-09 ***
nrate_kgha:krate_kgha   3.2401  4  0.5184758    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice how rep (i.e., block in this case) does not appear in the ANOVA table above. That’s because it only displays fixed effects.

Since only the main effects (N and K) are significant, and not the interaction, we should extract means and perform pairwise comparisons for the main effects only.

Before we do that, let’s check our model assumptions.

A model is only valid for inference (i.e., means and pwc) IF it fulfills the linear model assumptions.

g) Linear model assumptions

Extracting residuals

First, let’s extract our model residuals, and also create studentized residuals.

rcbd_mix_resid <- augment(rcbd_mix_mod) %>%
  mutate(.studresid = rstudent(rcbd_mix_mod))

rcbd_mix_resid

# A tibble: 36 × 16
   yield_kgha rep   nrate_kgha krate_kgha .fitted  .resid  .hat  .cooksd .fixed
        <int> <fct> <fct>      <fct>        <dbl>   <dbl> <dbl>    <dbl>  <dbl>
 1       1350 1     0          0            1532.  -182.  0.262 0.00141   1503.
 2       2110 1     0          30           2879.  -769.  0.262 0.0253    2850.
 3       5725 1     0          60           4455.  1270.  0.262 0.0689    4426.
 4       2988 1     100        0            2373.   615.  0.262 0.0161    2344 
 5       3438 1     100        30           4253.  -815.  0.262 0.0283    4224.
 6       6525 1     100        60           5380.  1145.  0.262 0.0560    5351 
 7       2825 1     200        0            2921.   -95.5 0.262 0.000390  2892.
 8       4400 1     200        30           5648. -1248.  0.262 0.0664    5618.
 9       7025 1     200        60           5363.  1662.  0.262 0.118     5334 
10       2250 2     0          0            1557.   693.  0.262 0.0205    1503.
# ℹ 26 more rows
# ℹ 7 more variables: .mu <dbl>, .offset <dbl>, .sqrtXwt <dbl>, .sqrtrwt <dbl>,
#   .weights <dbl>, .wtres <dbl>, .studresid <dbl>

Now, let’s recap the linear model assumptions:

Independence (no pattern)
Variance homogeneity (homoscedasticity)
Normality
Outlier detection (< -3 or > 3)

One difference in mixed models is that the 3 first assumptions are also applied to the random effects, so we need to check it for them as well.

Random effects are iid ~ N(0,var_a)

randeff_rep <- ranef(rcbd_mix_mod)[[1]]

randeff_rep

  (Intercept)
1   29.038733
2   53.566978
3  -77.528459
4   -5.077252

For random effects with so few levels (i.e., 4 since that’s the number of blocks), the QQ plot is one of the only ways to check for assumptions on random effects.

ggplot(randeff_rep, aes(sample = `(Intercept)`)) +
  stat_qq(shape = 21,
          fill = "purple", 
          size = 3,
          alpha = .7
  ) +
  stat_qq_line() +
  labs(x = "Theoretical quantile",
       y = "Sample quantile") +
  theme_bw()

Nothing to worry here, especially since we only have 4 points.

Within-group errors are iid ~ N(0, var_e)

ggplot(rcbd_mix_resid, 
       aes(x = .fitted, 
           y = .studresid)) +
  geom_hline(yintercept = 0, 
             color = "red") +
  geom_point(shape = 21,
             fill = "purple", 
             size = 3,
             alpha = .7) +
  geom_smooth() +
  geom_hline(yintercept = c(-3,3), 
             color = "red") +
  theme_bw()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Independence: no pattern observed, points appear to be random, looks good.
Variance homoscedastic: seems constant across the x-axis (blue smooth line and confidence error ribbon touching 0 on the y-axis throughout the x-axis range), looks good.
Outliers: nothing outside the -3,3 boundaries, looks good.

ggplot(rcbd_mix_resid, 
       aes(sample = .studresid)) +
  stat_qq(shape = 21,
          fill = "purple", 
          size = 3,
          alpha = .7) +
  stat_qq_line() +
  labs(x = "Theoretical quantile",
       y = "Sample quantile") +
  theme_bw()

ggplot(rcbd_mix_resid, 
       aes(x = .studresid)) +
  geom_density(color = "black",
               fill = "purple",
               alpha = .7) +
  scale_x_continuous(breaks = c(-3,0,3), 
                     limits = c(-4,4)) +
  theme_bw()

Normality: residuals seem normal.

h) Model means

The next step in the workflow is extracting the model means.

Since our main effects were the only significant terms, we’ll extract the means for N and K separately.

N rate means:

rcbd_mix_means_n <- emmeans(rcbd_mix_mod, ~nrate_kgha)

NOTE: Results may be misleading due to involvement in interactions

rcbd_mix_means_n

 nrate_kgha emmean  SE   df lower.CL upper.CL
 0            2926 332 16.2     2224     3629
 100          3973 332 16.2     3271     4675
 200          4615 332 16.2     3912     5317

Results are averaged over the levels of: krate_kgha 
Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95

K rate means:

rcbd_mix_means_k <- emmeans(rcbd_mix_mod, ~krate_kgha)

NOTE: Results may be misleading due to involvement in interactions

rcbd_mix_means_k

 krate_kgha emmean  SE   df lower.CL upper.CL
 0            2246 332 16.2     1544     2948
 30           4231 332 16.2     3529     4933
 60           5037 332 16.2     4335     5739

Results are averaged over the levels of: nrate_kgha 
Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95

i) Pairwise comparisons

Now that we extracted means, let’s perform pairwise comparisons among them.

N pairwise comparison:

rcbd_mix_pwc_n <- cld(rcbd_mix_means_n, 
                   reversed = T, 
                   adjust = "none",
               Letters = letters) %>%
  as.data.frame() %>%
  mutate(letter = trimws(.group))

rcbd_mix_pwc_n

  nrate_kgha   emmean       SE       df lower.CL upper.CL .group letter
1        200 4614.667 331.5619 16.18014 3912.422 5316.911     a       a
2        100 3972.833 331.5619 16.18014 3270.589 4675.078     a       a
3          0 2926.333 331.5619 16.18014 2224.089 3628.578      b      b

K pairwise comparison:

rcbd_mix_pwc_k <- cld(rcbd_mix_means_k, 
                   reversed = T, 
                   adjust = "none",
               Letters = letters) %>%
  as.data.frame() %>%
  mutate(letter = trimws(.group))

rcbd_mix_pwc_k

  krate_kgha   emmean       SE       df lower.CL upper.CL .group letter
1         60 5036.917 331.5619 16.18014 4334.672 5739.161     a       a
2         30 4230.750 331.5619 16.18014 3528.506 4932.994     a       a
3          0 2246.167 331.5619 16.18014 1543.922 2948.411      b      b

g) Final plots

Let’s plot our results, including both raw data (for allowing our audience to inspect data distribution) and statistical model summary (i.e., letter separation) for inference purposes.

Let’s make these plots publication ready.

For N rates:

ggplot(mapping = aes(fill = nrate_kgha)) +
  # Raw data and boxplots  
  geom_boxplot(data = rcbd_dfw,
               aes(x = nrate_kgha, 
                   y = yield_kgha),
               alpha = .8) +
  geom_jitter(data = rcbd_dfw,
               aes(x = nrate_kgha, 
                   y = yield_kgha),
              shape = 21,
              size = 3,
              alpha = .6) +
  # Adding letters
  geom_label(data = rcbd_mix_pwc_n,
            aes(x = nrate_kgha, 
                y = emmean, 
                label = letter),
            fill = "white") +
  labs(x = "N rate (kg/ha)",
       y = "Yield (kg/ha)") +
  scale_fill_viridis_d() +
  theme(panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        legend.position = "none")

For K rates:

ggplot(mapping = aes(fill = krate_kgha))+
  # Raw data and boxplots  
  geom_boxplot(data = rcbd_dfw,
               aes(x = krate_kgha, 
                   y = yield_kgha),
               alpha = .8) +
  geom_jitter(data = rcbd_dfw,
               aes(x = krate_kgha, 
                   y = yield_kgha),
              shape = 21,
              size = 3,
              alpha = .6) +
  # Adding letters
  geom_label(data = rcbd_mix_pwc_k,
            aes(x = krate_kgha, 
                y = emmean, 
                label = letter),
            fill = "white") +
  labs(x = "K rate (kg/ha)",
       y = "Yield (kg/ha)") +
  scale_fill_viridis_d(option = "A") +
  theme(panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        legend.position = "none")