Leonardo M. Bastos
Assistant Professor
Integrative Precision Agriculture
“Research is reproducible when others can reproduce the results of a scientific study given only the original data, code, and documentation”
This applies to:
A Nature survey of ~1,600 researchers found that:
More than 70% of researchers have tried and failed to reproduce another scientist’s experiments
More than 50% have failed to reproduce their own experiments
Main causes: selective reporting, weak stats, code/data unavailability, etc.
Starting my M.Sc., this is what my file management system looked like:
1. Machine readable
❌ sas cumulative flux 2 years.xlsx
✅ 2013-2014_N2O-cumulative.csv
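A minimal sketch in R of why the second name is easier for a machine to handle. It uses the two file names above. Splitting on the underscore is only one assumed convention, and the variable names are illustrative.

```r
# The two file names from the slide
files <- c("sas cumulative flux 2 years.xlsx",
           "2013-2014_N2O-cumulative.csv")

# Spaces in file names often need quoting or escaping in scripts and shells
has_space <- grepl(" ", files, fixed = TRUE)

# With consistent delimiters, a name can be split into meaningful pieces
stem  <- tools::file_path_sans_ext(files[2])   # drop the ".csv" extension
parts <- strsplit(stem, "_", fixed = TRUE)[[1]] # split on the underscore

print(has_space)
print(parts)
```

Here the underscore separates the period from the description, so each piece can be recovered with code instead of by eye.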
2. Human readable
❌
daily1.csv
analysisyear1.qmd
figure.png
data2.csv
anova2.qmd
figure2test.png
✅
2013-N2O-daily.csv
2013-N2O-daily-anova.qmd
2013-N2O-daily-plot.png
2014-N2O-daily.csv
2014-N2O-daily-anova.qmd
2014-N2O-daily-plot.png
Which set of files do you want at 3 am before a deadline?
3. Plays well with default ordering
❌
N2O_daily_1-10-2013.csv
10-23-2013_N2O_daily.csv
2-15-2013_N2O_daily.csv
✅
2013-01-10_N2O_daily.csv
2013-02-15_N2O_daily.csv
2013-10-23_N2O_daily.csv
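The effect of the date format on default ordering can be checked with a short R sketch. It assumes the three sampling dates from the names above and, for simplicity, puts the date first in both naming schemes. `method = "radix"` is used only so the sort does not depend on the computer's locale.

```r
# Sampling dates from the slide, written month-day-year
dates_mdy <- c("1-10-2013", "10-23-2013", "2-15-2013")
mdy_names <- paste0(dates_mdy, "_N2O_daily.csv")

# Same dates rewritten as year-month-day (ISO 8601) with zero padding
parsed    <- as.Date(dates_mdy, format = "%m-%d-%Y")
iso_names <- paste0(format(parsed, "%Y-%m-%d"), "_N2O_daily.csv")

# Alphabetical sorting, as a file browser would do by default
sorted_mdy <- sort(mdy_names, method = "radix")
sorted_iso <- sort(iso_names, method = "radix")

print(sorted_mdy) # January, October, February: not chronological
print(sorted_iso) # January, February, October: chronological
```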
From this:
To this:
Keep data files in data, script files in code, and tables and figures in output
In RStudio, use RStudio Projects!
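As a rough sketch, the folder layout above could be created from R as follows. The project name `my-project` is hypothetical, and the project is placed in a temporary directory here only so the example can be run without touching real files.

```r
# Hypothetical project folder, created under a temporary directory
project <- file.path(tempdir(), "my-project")

# One folder each for data, scripts, and results
for (folder in c("data", "code", "output")) {
  dir.create(file.path(project, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Build paths from their parts instead of typing absolute paths
daily_csv <- file.path("data", "2013-N2O-daily.csv")

print(list.files(project))
print(daily_csv)
```

Inside an RStudio Project the working directory is the project folder, so a relative path like the one stored in `daily_csv` points to the same file on any collaborator's computer.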
Also during my M.Sc., I was using multiple point-and-click, proprietary software:
Warning
Proprietary software and file extensions hamper reproducibility by imposing a paywall.
Point-and-click software is prone to human error, and we humans are usually not good at documenting each step we take.
Think about the last time you organized data in Excel. Do you remember each step you took when filtering or deleting cells, or when creating new columns? What was your decision-making process?
Or maybe each step you took to create a complicated figure. How easy would it be for you to replicate it?
Whenever possible and available, opt for free, open source software.
Instead of this:
Use this:
plot n2o_ppm
1 101 0.7965260
2 102 1.1163717
3 103 1.7185601
4 104 2.7246234
5 105 0.6050458
plot n2o_ppb
1 102 1116.372
2 103 1718.560
3 104 2724.623
Tip
Code is in itself documentation of each step you do. Adding comments with # makes it even more understandable.
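A minimal base R sketch of how commented code could take the first table above to the second. The values are copied from the slide. The filtering rule (keep plots above 1 ppm) is an assumption inferred from which rows remain, not necessarily the rule used originally.

```r
# N2O concentration per plot, in parts per million (values from the slide)
n2o <- data.frame(
  plot    = 101:105,
  n2o_ppm = c(0.7965260, 1.1163717, 1.7185601, 2.7246234, 0.6050458)
)

# Keep only plots with more than 1 ppm (assumed rule)
n2o_high <- n2o[n2o$n2o_ppm > 1, ]

# Convert ppm to parts per billion (1 ppm = 1000 ppb)
n2o_high$n2o_ppb <- n2o_high$n2o_ppm * 1000

# Keep the plot id and the converted column, and renumber the rows
n2o_ppb <- n2o_high[, c("plot", "n2o_ppb")]
rownames(n2o_ppb) <- NULL

print(n2o_ppb)
```

Every decision (which rows were dropped, how units were converted) is written down and can be rerun, which is exactly what gets lost when the same steps are done by hand in a spreadsheet.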
.txt or .R
Mixes code, output, and narrative in the same file
Examples:
Situation: you spent the whole week working on an analysis, only to find out it didn’t work as expected OR you got stuck with multiple bugs.
Situation #2: at some point your script had an important piece of code, but at the time you thought you didn’t need it anymore and deleted those lines.
Wouldn’t it be nice/useful/graduate-school-life-saving if you could simply go back in time and start fresh from your latest working version?
Think of “track changes”, but on any file type
Especially useful for script files (.Rmd, .qmd)
As your code grows and develops, snapshots are saved, allowing you to retrieve different versions
This connects your current self with your past self (what was I thinking when I decided on doing this step?)
Locally (on your own machine), use
GitHub is an online centralized platform that combines git, collaborative tools, and cloud storage, all free 💸
You can choose whether a project hosted on GitHub (i.e., a repository) can be seen by everyone (public) or only by you and invited collaborators (private)
Even if we are working off of the same GitHub repository, our local software versions may differ, which can cause discrepancies and issues that may impact reproducibility.
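One simple way to make such differences visible is to save a record of the software versions used alongside the results. The sketch below uses only base R; the file name `session-info.txt` is hypothetical, and it is written to a temporary directory here so the example runs anywhere.

```r
# Collect the R version, operating system, and loaded package versions
info <- sessionInfo()

# Hypothetical file name; in a real project this could live in output/
version_file <- file.path(tempdir(), "session-info.txt")

# Save the printed session information as plain text
writeLines(capture.output(print(info)), version_file)

# The R version on its own, handy for a quick comparison with a collaborator
print(R.version.string)
```

Comparing two such files is a quick first check when the same script gives different results on two machines.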
This entire presentation was made with Quarto, and its source code is available on my GitHub
You can find more info on my lab’s website (also made with Quarto): Bastos Lab
You can find my data science teaching material on my blog: agRonomy
Wish to learn and apply these concepts to your own research?
Applications of data science in ag research, Spring 2024
Thanks! 🙏 💻