By George Bennett
When using machine learning algorithms, it is often a good idea to scale the data first. Scaling puts all the features on a level playing field. Say you are using a distance-based algorithm and you have one feature with a range in the thousands and another that ranges from 1 to 5. Would you want the first feature to be hundreds of times more important to the model? Scaling puts everything in a similar range. In this post I will cover the two most common and practical ways to scale data.
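To make the problem concrete, here is a small sketch of how a large-range feature can dominate Euclidean distance. The three points and the meaning of the features (an income in dollars and a rating from 1 to 5) are made up for illustration.

```python
import math

# Hypothetical points: (income in dollars, rating from 1 to 5)
a = (52000.0, 1.0)
b = (52500.0, 5.0)  # opposite end of the rating scale, income differs by 500
c = (54000.0, 1.0)  # identical rating, income differs by 2000

dist_ab = math.dist(a, b)  # Euclidean distance between a and b
dist_ac = math.dist(a, c)  # Euclidean distance between a and c

# The rating difference (4 out of a possible 4) barely moves the distance,
# so a distance-based model would treat b as closer to a than c is.
print(f"a to b: {dist_ab:.2f}")
print(f"a to c: {dist_ac:.2f}")
```

Even though b sits at the opposite end of the rating scale, it comes out closer to a than c does, because the income feature contributes almost all of the distance.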
The first way to scale data is to literally squeeze it into the same range. It is called “Min/Max Scaling” or “Normalization”. To do this, take every point in a column and subtract the column minimum. Then divide by the column’s range, that is, the column maximum minus the column minimum. This puts every value between zero and one, which is often a good range for numbers to be in when dealing with neural networks. The drawback of this method is that outliers squeeze the rest of your values into an even smaller range. Say the column minimum is around zero and you had one outlier that was around 10 times higher than the second highest value. Then all but one data point would be squeezed into the range of 0 to 0.1. If your other columns have a mean around 0.5, then this method is far from perfect. That’s where the second method comes in.
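Here is a minimal by-hand sketch of min/max scaling on a single column, including the outlier problem described above. The function name `min_max_scale` and the sample column are assumptions for illustration, not part of any library.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range; returning zeros is an
        # arbitrary choice made for this sketch.
        return [0.0 for _ in values]
    # Subtract the minimum, then divide by the range (max - min).
    return [(v - lo) / (hi - lo) for v in values]

# Sample column where the outlier (100) is 10 times the second highest value.
column = [0, 2, 4, 6, 8, 10, 100]
scaled = min_max_scale(column)

for original, new in zip(column, scaled):
    print(original, "->", round(new, 2))
```

Every value except the outlier lands between 0 and 0.1, which is the squeezing effect that makes this method awkward when a column has extreme values.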
The second method is generally more tolerant of outliers, though extreme values still pull on the mean and standard deviation. It is called standardization. It measures how far each point is from the mean, in units of standard deviation, and uses that as the new data. To perform standardization, take every value in a column and subtract that column’s mean from it. Then divide each result by the column’s standard deviation. The scaled column ends up with a mean of zero and a standard deviation of one.
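The same steps can be sketched by hand with the standard library. The function name `standardize` and the sample column are assumptions for illustration. This sketch uses the population standard deviation, which matches what sklearn's StandardScaler computes.

```python
import statistics

def standardize(values):
    """Shift a list of numbers to mean 0 and scale to standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # A constant column has no spread; returning zeros is an
        # arbitrary choice made for this sketch.
        return [0.0 for _ in values]
    # Subtract the mean, then divide by the standard deviation.
    return [(v - mean) / std for v in values]

# Same sample column as a min/max example might use, outlier included.
column = [0, 2, 4, 6, 8, 10, 100]
standardized = standardize(column)

print([round(z, 2) for z in standardized])
print("mean:", round(statistics.fmean(standardized), 6))
print("std:", round(statistics.pstdev(standardized), 6))
```

Unlike min/max scaling, the output is not confined to a fixed range, so the outlier simply shows up as a large positive value.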
It’s pretty easy to normalize and standardize entire matrices using sklearn. It has built-in transformers, and you can build them right into a pipeline. Simply instantiate a transformer and fit it to the training data only. Then transform both the training and the testing data. The normalizer is called “MinMaxScaler” and the standardizer is called “StandardScaler”. Here is an example.
from sklearn.preprocessing import StandardScaler

# X_train and X_test are assumed to be feature matrices from an earlier train/test split

# instantiate scaler object
sc = StandardScaler()

# fit to training data
sc.fit(X_train)

# transform both training and testing data
X_train_scaled = sc.transform(X_train)
X_test_scaled = sc.transform(X_test)
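Since the scalers can be built right into a pipeline, here is a rough sketch of that approach using MinMaxScaler. The synthetic data, the choice of a k-nearest-neighbors classifier, and all the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (made up for this sketch): one feature in the thousands,
# one feature between 1 and 5.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(1000, 9000, size=200),
    rng.uniform(1, 5, size=200),
])
y = (X[:, 1] > 3).astype(int)  # the label depends only on the small-range feature

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Calling fit on the pipeline fits the scaler on the training data only.
# The same fitted scaler is then reused whenever the pipeline predicts or scores.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("test accuracy:", accuracy)

# make_pipeline names each step after its lowercased class name.
X_train_scaled = model.named_steps["minmaxscaler"].transform(X_train)
print("training column minimums:", X_train_scaled.min(axis=0))
print("training column maximums:", X_train_scaled.max(axis=0))
```

Putting the scaler inside the pipeline makes it harder to accidentally fit it on the testing data, since the pipeline handles the fit and transform steps itself.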
Here are a few things to note when practicing. Scaling is meant for the independent variables (the features); the dependent variable usually does not need it. Also, scaling is not important for tree-based algorithms, because they partition the space in a way that makes differences in scale irrelevant.