How To Use Testing and Validation Sets
By George Bennett
In machine learning, the proper way to proceed is to split your data into three sets: the training set, the validation set, and the test set. The training set is used to train the model’s parameters, that is, the weights or internal variables the model itself adjusts during training. The validation set is used to find optimal hyper-parameters: variables of the model that are not changed by the model itself during training. Examples of hyper-parameters include the “C” value of logistic regression, the “max depth” value of tree-based algorithms, the architecture of a neural network, and the number of epochs a neural network is trained for. The test set is reserved for last, after all the work has been completed, to estimate how well the model will perform on unseen data. Everything should be finished before using the test set, and nothing should be changed after opening it up.
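The three-way split described above can be sketched with scikit-learn. The article does not specify ratios, so the 60/20/20 split below, and the toy data, are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # feature matrix
y = rng.integers(0, 2, size=100)   # binary labels

# First carve off the test set, then split the remainder into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
# 0.25 of the remaining 80% gives a 60/20/20 split overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The test set is split off first so that nothing downstream (scaling, feature selection, tuning) ever touches it.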
It is okay to use the validation set in exploratory data analysis along with the training set, but it is not okay to use the test set. Doing so is called “data snooping,” and it may bias you, the machine learning engineer, into altering your model in favor of the test set. The whole point of a test set is that its data remains unseen, so keep your exploratory data analysis to the training and validation sets.
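The hyper-parameter tuning role of the validation set can be sketched as a simple loop over candidate values, here the “C” of logistic regression mentioned earlier. The candidate values and toy data are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data; in practice these would be your pre-made train/validation splits.
X, y = make_classification(n_samples=300, random_state=0)
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:          # candidate hyper-parameter values
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)      # evaluate on validation, never test
    if score > best_score:
        best_C, best_score = C, score

print(best_C)
```

Only after a value of C is chosen this way would the test set be opened, once, to report a final score.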
When training your model, examine its metrics of success on both the training and validation sets. If the model performs significantly better on the training set than on the validation set, this is a sign that it is overfitting. Try adding regularization to the model by changing its hyper-parameters, or use a simpler algorithm altogether. Also try plotting the model’s performance against the number of iterations it has run: you can do this by changing the “max iter” parameter in traditional machine learning algorithms, or the “epochs” parameter for neural networks. By examining the metrics of success at different numbers of iterations, you can find the optimal number of iterations at which the model is neither underfit nor overfit. Stopping training there is called early stopping.
So the training set is used to train the model’s parameters, and the validation set is used for all the work, such as tuning hyper-parameters and finding an early stopping point. After all the work is done, the test set is used to estimate how well the model will perform on unseen data.