Learn the concepts of machine learning, including data splitting, resampling, hyperparameter optimization, and the bias-variance trade-off.
Machine learning is a subfield of artificial intelligence, which is broadly defined as the capability of a machine to imitate intelligent human behavior.
How is experimental design analysis different from ML?
Experimental design analysis:
Machine learning:
The workflow below applies to any data analysis task.
Now, what does an ML-specific workflow look like?
Pre-processing
Training
Validation
A major goal of the machine learning process is to find an algorithm that most accurately predicts future values based on a set of features.
This is called the generalizability of our algorithm.
How we “spend” our data will help us understand how well our algorithm generalizes to unseen data.
We can split our data into training and test data sets:
Training set: used to develop feature sets, train our algorithms, tune hyperparameters, compare models, etc.
Test set: having chosen a final model, these data are used to estimate an unbiased assessment of the model’s performance, which we refer to as the generalization error.
Given a fixed amount of data, typical recommendations for splitting your data into training-test splits include 60% (training)–40% (testing), 70%–30%, or 80%–20%.
Spending too much in training (e.g., >80%) won't allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well but does not generalize (overfitting).
Spending too much in testing (e.g., >40%) won't allow us to get a good assessment of model parameters.
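As an illustration, here is a minimal sketch of a simple random 80/20 split with the tidymodels/rsample functions; the data frame my_data and its outcome column y are hypothetical placeholders.

```r
library(tidymodels)  # loads rsample, which provides the splitting functions

set.seed(123)  # make the random split reproducible

# Simple random 80% training / 20% testing split
# (my_data and its outcome column y are hypothetical placeholders)
data_split <- initial_split(my_data, prop = 0.80)

train_data <- training(data_split)
test_data  <- testing(data_split)
```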
The two most common ways of splitting data include simple random sampling and stratified sampling.
We want our training and test sets to have similar data distributions.
If they are not, we can use stratified sampling to ensure the same proportions of the predicted variable fall in both training and test sets.
This is important to avoid data shift!
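Continuing the hypothetical my_data / y example, a stratified split can be sketched with rsample's strata argument, which samples within quantiles of the outcome so both sets end up with similar distributions:

```r
library(tidymodels)

set.seed(123)

# Stratified 80/20 split: sampling is done within quantiles of the outcome y,
# so the training and test sets have similar outcome distributions
data_split <- initial_split(my_data, prop = 0.80, strata = y)

train_data <- training(data_split)
test_data  <- testing(data_split)
```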
Now that our data has been split, we can choose a model type (e.g., random forest), and start the training process.
The training process will involve selecting the best hyperparameter values that optimize model performance.
Hyperparameters are parameters in a machine learning model that control how simple or complex the model is allowed to be.
For example, random forest has many hyperparameters that can be changed, including:
mtry: an integer for the number of predictors that will be randomly sampled at each split when creating the tree models (default is 1/3 of number of predictors).
trees: an integer for the number of trees contained in the ensemble (e.g., 10, 100, 1000).
Finding the best combination of mtry and trees helps create an optimal model for prediction.
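As a sketch (assuming the tidymodels packages and the ranger engine), a random forest specification with mtry and trees flagged for tuning could look like this; rf_spec is a hypothetical object name reused in later sketches.

```r
library(tidymodels)  # loads parsnip (model specs) and tune (tune() placeholders)

# Random forest specification with mtry and trees marked for tuning
# rather than fixed to specific values
rf_spec <- rand_forest(mtry = tune(), trees = tune()) %>%
  set_engine("ranger") %>%   # assumes the ranger package is installed
  set_mode("regression")
```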
We may create models that are too simple or too complex and do not perform well when predicting new unseen data.
Prediction errors can be decomposed into two important subcomponents: error due to “bias” and error due to “variance”.
\[ \text{Total error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible error} \]
There is often a trade-off between a model’s ability to minimize bias and variance.
Bias is the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict.
It provides a sense of how well a model can conform to the underlying structure of the data.
Error due to variance is defined as the variability of a model prediction for a given data point.
Many models are very adaptable and offer extreme flexibility in the patterns they can fit.
However, these models offer their own problems as they run the risk of overfitting to the training data.
Although you may achieve very good performance on your training data, the model will not generalize well to unseen data.
To minimize prediction error (on the test set), we need to create a model that balances the trade-off between variance and bias.
Notice how training set error always decreases as complexity increases. It is the test set error that exhibits the variance-bias trade-off (which is what counts for prediction).
In other words, a model needs to be complex enough to capture the signal in the data, but not more complex to the point that it starts capturing noise.
What value of k (a hyperparameter) above do you think provides a good balance between bias and variance?
Since high-variance models are more prone to overfitting, using resampling procedures is critical to reduce this risk.
Many algorithms that can achieve high generalization performance have lots of hyperparameters that control the level of model complexity (i.e., the tradeoff between bias and variance).
So, we need to optimize hyperparameters of a model so it balances the variance-bias trade-off, and is optimum for prediction.
Right now, this is what our data looks like:
We cannot touch the test set until the end. So how can we leverage the training set for hyperparameter optimization? By using resampling methods.
Data resampling methods further split the training set into an analysis set (used for model fitting) and an assessment set (used for validation).
For each resample, models with given hyperparameter values are fit independently on the analysis set and evaluated on the assessment set.
There exist many different resampling methods, including V-fold cross-validation and leave-one-out cross-validation:
Let’s explore a couple of them.
The most common cross-validation method is V-fold cross-validation. The data are randomly partitioned into V sets of roughly equal size (called the folds).
For each resample, V - 1 folds are used for training the model and the remaining fold is used to estimate model performance (e.g., with V = 3, two folds for training and one fold for assessment):
In practice, values of V are most often 5 or 10 (10 is preferred because it is large enough for good results in most situations).
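A minimal sketch of 10-fold cross-validation on the hypothetical train_data from the earlier split, using rsample's vfold_cv():

```r
library(tidymodels)

set.seed(123)

# 10-fold cross-validation on the training set:
# each row of `folds` holds one analysis/assessment split
folds <- vfold_cv(train_data, v = 10)
```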
In leave-one-out (LOO) cross-validation, if there are n training set samples, n models are fit, each using n - 1 rows of the training set.
Each model predicts the single excluded data point. At the end of resampling, the n predictions are pooled to produce a single performance statistic.
Leave-one-out methods are deficient compared to almost any other method. For anything but pathologically small samples, LOO is computationally excessive, and it may not have good statistical properties.
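For completeness, leave-one-out resamples can be created with rsample's loo_cv(); the sketch below reuses the hypothetical train_data and is shown only for illustration, given the caveats above.

```r
library(tidymodels)

# Leave-one-out CV: one resample per row, each holding out a single observation
loo_folds <- loo_cv(train_data)
```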
Now that we have determined the appropriate data splits for hyperparameter optimization, we also need to determine which hyperparameter values to evaluate.
In our previous random forest example, we had two hyperparameters to tune: mtry and trees.
How can we test all possible combinations and find the one that produces the most accurate model?
There exist two general classes of hyperparameter optimization algorithms:
Grid search: a predefined set of parameter values is evaluated.
This can be inefficient, since the number of grid points required to cover the parameter space can become unmanageable due to the curse of dimensionality.
Iterative (sequential) search: new parameter combinations are discovered sequentially based on previous results.
In some cases, an initial set of results for one or more parameter combinations is required to start the optimization process.
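As a hedged sketch of the grid-based approach, a regular grid over the two random forest hyperparameters could be built with dials; the ranges and number of levels below are illustrative choices, not recommendations.

```r
library(tidymodels)  # loads dials, which provides the parameter and grid functions

# Regular grid over mtry and trees (illustrative ranges)
rf_grid <- grid_regular(
  mtry(range = c(2L, 10L)),
  trees(range = c(100L, 1000L)),
  levels = 5   # 5 values per hyperparameter -> 25 combinations to evaluate
)
```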
Now that we have trained models across many resamples using different hyperparameters, we select the hyperparameter values that created the best model, judged based on predictive ability on the assessment set.
Some metrics we can use to select the best model are:
R2: the proportion of the variance in the dependent variable that is predictable from the independent variable(s), larger is better.
MSE: average of the squared error, smaller is better.
RMSE: the square root of the MSE metric so that your error is in the same units as your response variable, smaller is better.
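Putting the pieces together, the sketch below (reusing the hypothetical rf_spec, folds, and rf_grid objects, and the outcome column y) tunes the model across the resamples and picks the hyperparameters with the lowest cross-validated RMSE:

```r
library(tidymodels)

# Bundle the model specification with a model formula
rf_wflow <- workflow() %>%
  add_model(rf_spec) %>%
  add_formula(y ~ .)

# Fit and evaluate every grid combination on every resample
rf_res <- tune_grid(
  rf_wflow,
  resamples = folds,
  grid      = rf_grid,
  metrics   = metric_set(rmse, rsq)
)

# Hyperparameter combination with the lowest cross-validated RMSE
best_params <- select_best(rf_res, metric = "rmse")
```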
Once we have found the best combination of hyperparameters that optimize model performance on the train set (analysis + assessment sets), it is time to use this model to create predictions on the test set.
Because the model training process has not seen the test set yet, these predictions and their performance can be used as a measure of predictive ability/generalizability of the trained ML model.
For continuous predicted variables, a predicted vs. observed plot and its metrics are commonly used:
Models with greater agreement between predicted and observed (greater R2) and lower error metrics (e.g. RMSE, MAE) are better at predicting new observations.
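A final sketch, continuing the hypothetical objects above: finalize the workflow with the selected hyperparameters, fit it on the full training set, and evaluate it once on the held-out test set.

```r
library(tidymodels)

# Plug the selected hyperparameters into the workflow
final_wflow <- finalize_workflow(rf_wflow, best_params)

# last_fit() trains on the full training set and evaluates on the test set
final_fit <- last_fit(final_wflow, split = data_split)

collect_metrics(final_fit)      # test-set RMSE and R2 (generalization estimate)
collect_predictions(final_fit)  # predicted vs. observed values for plotting
```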
In this exercise, we learned:
How to split data into training and test sets, including stratified sampling.
How resampling methods (e.g., V-fold cross-validation) let us tune hyperparameters without touching the test set.
How the bias-variance trade-off relates to model complexity and overfitting.
How to select the best hyperparameters using metrics such as R2, MSE, and RMSE, and evaluate the final model on the test set.