Learn the concepts of machine learning, including data splitting, resampling, hyperparameter optimization, and the bias-variance trade-off.
Machine learning is a subfield of artificial intelligence, which is broadly defined as the capability of a machine to imitate intelligent human behavior.
How is experimental design analysis different from ML?
Experimental design analysis:
Machine learning:
The workflow below applies to any data analysis task.
Now, what does an ML-specific workflow look like?
Pre-processing
Training
Validation
A major goal of the machine learning process is to find an algorithm that most accurately predicts future values based on a set of features.
This is called the generalizability of our algorithm.
How we “spend” our data will help us understand how well our algorithm generalizes to unseen data.
We can split our data into training and test data sets:
Training set: used to develop feature sets, train our algorithms, tune hyperparameters, compare models, etc.
Test set: having chosen a final model, these data are used to estimate an unbiased assessment of the model’s performance, which we refer to as the generalization error.
Given a fixed amount of data, typical recommendations for splitting your data into training-test splits include 60% (training)–40% (testing), 70%–30%, or 80%–20%.
Spending too much in training (e.g., >80%) won't allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well but does not generalize (overfitting).
Spending too much in testing (e.g., >40%) won't allow us to get a good assessment of model parameters.
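As an illustration, here is a minimal sketch of a simple random 80/20 split with the tidymodels/rsample functions; the data frame my_data and its outcome column y are hypothetical placeholders.

```r
library(tidymodels)  # loads rsample, which provides the splitting functions

set.seed(123)  # make the random split reproducible

# Simple random 80% training / 20% testing split
# (my_data and its outcome column y are hypothetical placeholders)
data_split <- initial_split(my_data, prop = 0.80)

train_data <- training(data_split)
test_data  <- testing(data_split)
```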
The two most common ways of splitting data include simple random sampling and stratified sampling.
We want our training and test sets to have similar data distributions.
If they are not, we can use stratified sampling to ensure the same proportions of the predicted variable fall in both training and test sets.
This is important to avoid data shift!
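Continuing the hypothetical my_data / y example, a stratified split can be sketched with rsample's strata argument, which samples within quantiles of the outcome so both sets end up with similar distributions:

```r
library(tidymodels)

set.seed(123)

# Stratified 80/20 split: sampling is done within quantiles of the outcome y,
# so the training and test sets have similar outcome distributions
data_split <- initial_split(my_data, prop = 0.80, strata = y)

train_data <- training(data_split)
test_data  <- testing(data_split)
```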
Now that our data has been split, we can choose a model type (e.g., random forest), and start the training process.
The training process will involve selecting the best hyperparameter values that optimize model performance.
Hyperparameters are parameters in a machine learning model that control how simple or complex the model is allowed to be.
For example, random forest has many hyperparameters that can be changed, including:
mtry: an integer for the number of predictors that will be randomly sampled at each split when creating the tree models (default is 1/3 of number of predictors).
trees: an integer for the number of trees contained in the ensemble (e.g., 10, 100, 1000).
Finding the best combination of mtry and trees helps create an optimal model for prediction.
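As a sketch (assuming the tidymodels packages and the ranger engine), a random forest specification with mtry and trees flagged for tuning could look like this; rf_spec is a hypothetical object name reused in later sketches.

```r
library(tidymodels)  # loads parsnip (model specs) and tune (tune() placeholders)

# Random forest specification with mtry and trees marked for tuning
# rather than fixed to specific values
rf_spec <- rand_forest(mtry = tune(), trees = tune()) %>%
  set_engine("ranger") %>%   # assumes the ranger package is installed
  set_mode("regression")
```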
We may create models that are too simple or too complex and do not perform well when predicting new unseen data.
Prediction errors can be decomposed into two important subcomponents: error due to “bias” and error due to “variance”.
\[ \text{Total error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible error} \]
There is often a trade-off between a model’s ability to minimize bias and variance.
Bias is the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict.
It provides a sense of how well a model can conform to the underlying structure of the data.
Error due to variance is defined as the variability of a model prediction for a given data point.
Many models are very adaptable and offer extreme flexibility in the patterns they can fit.
However, these models offer their own problems as they run the risk of overfitting to the training data.
Although you may achieve very good performance on your training data, the model will not generalize well to unseen data.
To minimize prediction error (on the test set), we need to create a model that balances the trade-off between variance and bias.
Notice how training set error always decreases as complexity increases. It is the test set error that exhibits the variance-bias trade-off (which is what counts for prediction).
In other words, a model needs to be complex enough to capture the signal in the data, but not more complex to the point that it starts capturing noise.
What value of k (a hyperparameter) above do you think provides a good balance between bias and variance?
Since high-variance models are more prone to overfitting, using resampling procedures is critical to reduce this risk.
Many algorithms that can achieve high generalization performance have lots of hyperparameters that control the level of model complexity (i.e., the tradeoff between bias and variance).
So, we need to optimize hyperparameters of a model so it balances the variance-bias trade-off, and is optimum for prediction.
Right now, this is what our data looks like:
We cannot touch the test set until the end. So how can we leverage the training set for hyperparameter optimization? By using resampling methods.
Data resampling methods further split the training set into an analysis set (used for model fitting) and an assessment set (used for validation).
For each resample, models with given hyperparameter values are fit independently on the analysis set and evaluated on the assessment set.
There exist many different resampling methods, including V-fold cross-validation and leave-one-out cross-validation:
Let’s explore a couple of them.
The most common cross-validation method is V-fold cross-validation. The data are randomly partitioned into V sets of roughly equal size (called the folds).
For each resample, V - 1 folds are used for training the model and the remaining fold is used to estimate model performance (e.g., with V = 3, two folds for training and one fold for assessment):
In practice, values of V are most often 5 or 10 (10 is preferred because it is large enough for good results in most situations).
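A minimal sketch of 10-fold cross-validation on the hypothetical train_data from the earlier split, using rsample's vfold_cv():

```r
library(tidymodels)

set.seed(123)

# 10-fold cross-validation on the training set:
# each row of `folds` holds one analysis/assessment split
folds <- vfold_cv(train_data, v = 10)
```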
In leave-one-out (LOO) cross-validation, if there are n training set samples, n models are fit, each using n - 1 rows of the training set.
Each model predicts the single excluded data point. At the end of resampling, the n predictions are pooled to produce a single performance statistic.
Leave-one-out methods are deficient compared to almost any other method. For anything but pathologically small samples, LOO is computationally excessive, and it may not have good statistical properties.
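For completeness, leave-one-out resamples can be created with rsample's loo_cv(); the sketch below reuses the hypothetical train_data and is shown only for illustration, given the caveats above.

```r
library(tidymodels)

# Leave-one-out CV: one resample per row, each holding out a single observation
loo_folds <- loo_cv(train_data)
```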
Now that we have determined the appropriate data splits for hyperparameter optimization, we also need to determine which hyperparameter values to evaluate.
In our previous random forest example, we had two hyperparameters to tune: mtry and trees.
How can we test all possible combinations and find the one that produces the most accurate model?
There exist two general classes of hyperparameter optimization algorithms:
Grid search: a predefined set of parameter values is evaluated.
This can be inefficient, since the number of grid points required to cover the parameter space can become unmanageable due to the curse of dimensionality.
Iterative (sequential) search: new parameter combinations are discovered sequentially based on previous results.
In some cases, an initial set of results for one or more parameter combinations is required to start the optimization process.
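As a hedged sketch of the grid-based approach, a regular grid over the two random forest hyperparameters could be built with dials; the ranges and number of levels below are illustrative choices, not recommendations.

```r
library(tidymodels)  # loads dials, which provides the parameter and grid functions

# Regular grid over mtry and trees (illustrative ranges)
rf_grid <- grid_regular(
  mtry(range = c(2L, 10L)),
  trees(range = c(100L, 1000L)),
  levels = 5   # 5 values per hyperparameter -> 25 combinations to evaluate
)
```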
Now that we have trained models across many resamples using different hyperparameters, we select the hyperparameter values that created the best model, judged based on predictive ability on the assessment set.
Some metrics we can use to select the best model are:
R2: the proportion of the variance in the dependent variable that is predictable from the independent variable(s), larger is better.
MSE: average of the squared error, smaller is better.
RMSE: the square root of the MSE metric so that your error is in the same units as your response variable, smaller is better.
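Putting the pieces together, the sketch below (reusing the hypothetical rf_spec, folds, and rf_grid objects, and the outcome column y) tunes the model across the resamples and picks the hyperparameters with the lowest cross-validated RMSE:

```r
library(tidymodels)

# Bundle the model specification with a model formula
rf_wflow <- workflow() %>%
  add_model(rf_spec) %>%
  add_formula(y ~ .)

# Fit and evaluate every grid combination on every resample
rf_res <- tune_grid(
  rf_wflow,
  resamples = folds,
  grid      = rf_grid,
  metrics   = metric_set(rmse, rsq)
)

# Hyperparameter combination with the lowest cross-validated RMSE
best_params <- select_best(rf_res, metric = "rmse")
```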
Once we have found the best combination of hyperparameters that optimize model performance on the train set (analysis + assessment sets), it is time to use this model to create predictions on the test set.
Because the model training process has not seen the test set yet, these predictions and their performance can be used as a measure of predictive ability/generalizability of the trained ML model.
For continuous predicted variables, a predicted vs. observed plot and its metrics are commonly used:
Models with greater agreement between predicted and observed (greater R2) and lower error metrics (e.g. RMSE, MAE) are better at predicting new observations.
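A final sketch, continuing the hypothetical objects above: finalize the workflow with the selected hyperparameters, fit it on the full training set, and evaluate it once on the held-out test set.

```r
library(tidymodels)

# Plug the selected hyperparameters into the workflow
final_wflow <- finalize_workflow(rf_wflow, best_params)

# last_fit() trains on the full training set and evaluates on the test set
final_fit <- last_fit(final_wflow, split = data_split)

collect_metrics(final_fit)      # test-set RMSE and R2 (generalization estimate)
collect_predictions(final_fit)  # predicted vs. observed values for plotting
```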
In this exercise, we learned:
How to split data into training and test sets, including stratified sampling.
How resampling methods (e.g., V-fold cross-validation) let us tune hyperparameters without touching the test set.
How the bias-variance trade-off relates to model complexity and overfitting.
How to select the best hyperparameters using metrics such as R2, MSE, and RMSE, and evaluate the final model on the test set.