Why Use Cross Validation?

Datascience George
Nov 16, 2020

Cross validation is an alternative to holding out a single validation set when evaluating machine learning models. Using it instead of a standalone validation set gives you a more reliable picture of how your models perform, which in turn helps you build better ones. Its main drawback is that it can be computationally expensive when working with large amounts of data or with highly complex pipelines.

Cross validation involves splitting your training set into multiple equal-sized pieces, often called folds. You set one piece aside as a validation set and train on all the others. This completes one round of training and validation. What makes it cross validation is that you let each of the original pieces take a turn as the validation set. With five pieces, for example, you train and validate the model five times, which is what makes it computationally expensive.
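To make the rotation concrete, here is a minimal sketch in plain Python. The helper name k_fold_indices is hypothetical and written only for illustration. It assumes the rows are already in random order, since it does no shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k rounds.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]                    # this round's validation piece
        train = indices[:start] + indices[start + size:]     # every other piece
        yield train, val
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for round_number, (train, val) in enumerate(folds, start=1):
    print(f"round {round_number}: train={train} validate={val}")
```

Each row lands in the validation set exactly once and in the training set in every other round. In practice sklearn handles this splitting for you, so you rarely need to write it by hand.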

One of the main benefits of this practice is that it increases the chance of detecting overfitting. Simply compare the performance metrics of each split: you may find that the model does very well on one split and not so well on another, which is a sign that its performance depends heavily on which data it happened to see. You also get a better estimate of model performance because you are validating multiple times and can average the results. Another benefit of cross validation is that it makes the most of your training data: every observation is used for both training and validation, so you don’t have to permanently carve out a validation set.
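One way to do this comparison is sketched below. The synthetic dataset and the choice of an unpruned decision tree are assumptions made for illustration, picked because such a tree tends to memorize its training data. Setting return_train_score=True asks cross_validate to also report the score on the training portion of each split, so the gap is easy to see.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real training set (an assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# An unpruned tree is used here because it tends to overfit.
model = DecisionTreeClassifier(random_state=0)

results = cross_validate(
    model, X, y, cv=5, scoring="accuracy", return_train_score=True
)

# Compare training and validation accuracy split by split.
pairs = zip(results["train_score"], results["test_score"])
for split, (train_acc, val_acc) in enumerate(pairs, start=1):
    gap = train_acc - val_acc
    print(f"split {split}: train={train_acc:.2f} validation={val_acc:.2f} gap={gap:.2f}")

# The mean is the overall estimate; the spread shows how much splits disagree.
print("mean validation accuracy:", results["test_score"].mean())
print("std across splits:", results["test_score"].std())
```

A large gap between training and validation scores, or a wide spread of validation scores across splits, is a hint that the model is overfitting.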

To use cross validation in sklearn, call the cross_validate function and pass in an sklearn model along with your X and y data. Tell the function how to score the model by passing one of sklearn’s scorers (or the name of that scorer as a string) to the “scoring” argument. Some common ones are “neg_mean_squared_error”, “accuracy”, “precision”, “recall”, etc. You can also specify how many splits you want with the “cv” argument. The function returns a dictionary, and the score for each split is stored under the “test_score” key.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
lr = LogisticRegression()
scores = cross_validate(lr, X_train, y_train, cv=4, scoring="precision")
scores["test_score"]  # cross_validate returns a dict; per-split scores live here
>>>
array([0.7, 0.75, 0.67, 0.78])
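If you want to track more than one metric, the “scoring” argument also accepts a list of scorer names. The sketch below is self-contained: the synthetic X_train and y_train and the max_iter setting are assumptions added so the example runs on its own, not part of the snippet above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for real training data (an assumption for this sketch).
X_train, y_train = make_classification(n_samples=400, n_features=10, random_state=42)

lr = LogisticRegression(max_iter=1000)

# With a list of scorers, each metric gets its own "test_<name>" key.
scores = cross_validate(lr, X_train, y_train, cv=4, scoring=["precision", "recall"])

print(sorted(scores.keys()))
print("precision per split:", scores["test_precision"])
print("recall per split:", scores["test_recall"])
```

Note that with several scorers the results are keyed by metric name (“test_precision”, “test_recall”) rather than by “test_score”.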

Written by Datascience George

Data scientist learning at Flat Iron School
