The Importance Of Correlation
By George Bennett
Correlation measures how closely two variables are related by a linear relationship. If the relationship between two variables is roughly linear, correlation can be thought of as a measure of the noise in that relationship: the closer the correlation value is to 1 or -1, the less noise is present. Keep in mind that correlation has nothing to do with the steepness of the slope, but if the slope is zero then the correlation is zero. There are several types of correlation; the one I am going to cover in this post is Pearson correlation. Here is a picture of different sets of data and their correlation values. [1]
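As a quick sanity check of the slope point (a minimal sketch of my own using NumPy, not code from this post), two noise-free lines with very different slopes both have a correlation of exactly 1:

# Minimal sketch: correlation ignores the steepness of the slope
import numpy as np

x = np.arange(10, dtype=float)
steep = 5.0 * x     # slope of 5
shallow = 0.1 * x   # slope of 0.1

# np.corrcoef returns a 2 x 2 correlation matrix; [0, 1] is the off-diagonal entry
print(np.corrcoef(x, steep)[0, 1])    # 1.0
print(np.corrcoef(x, shallow)[0, 1])  # 1.0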
The formula for Pearson correlation requires the covariance of the two variables as well as the standard deviation of each variable.
To find the covariance, take each data point's difference from its variable's mean, multiply the paired differences of the two variables together, and then take the mean of those products. If you are using only a sample from the data and not the whole population, make sure to divide by one less than the number of data points rather than the number of data points.
# Covariance formula for a population
# Variables: X (variable 1), Y (variable 2)
cov = sum((X - X.mean()) * (Y - Y.mean())) / len(X)

# Covariance formula for a sample
cov = sum((X - X.mean()) * (Y - Y.mean())) / (len(X) - 1)
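To sanity-check the sample formula, pandas Series have a built-in cov method that uses the same n - 1 denominator (the data below is made up for illustration):

# Checking the sample covariance formula against pandas' built-in
import pandas as pd

X = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
Y = pd.Series([2.0, 4.0, 5.0, 4.0, 6.0])

cov = sum((X - X.mean()) * (Y - Y.mean())) / (len(X) - 1)
print(cov)       # 2.0
print(X.cov(Y))  # also 2.0; pandas uses the sample (n - 1) denominator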
Now that we know how to get the covariance, simply divide it by the product of the two variables' standard deviations.
# Formula for Pearson correlation
# Make sure cov and std use matching population or sample denominators
corr = cov / (X.std() * Y.std())
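Putting the two steps together on some made-up data (my own sketch, not from the post), the result matches pandas' built-in Pearson correlation:

# End-to-end example with made-up data
import pandas as pd

X = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
Y = pd.Series([2.1, 3.9, 6.2, 8.0, 9.8])

# Sample covariance, matching pandas' default sample standard deviation
cov = sum((X - X.mean()) * (Y - Y.mean())) / (len(X) - 1)
corr = cov / (X.std() * Y.std())

print(corr)       # ~0.999
print(X.corr(Y))  # pandas' built-in Pearson correlation, same value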
Correlation can be useful when selecting features for linear machine learning models and when doing exploratory data analysis. It is important to remember, though, that correlation does not necessarily mean causation. Two variables may be completely unrelated on their own but both connected to some third variable, and sometimes it is just random chance that they map out linearly.
Another way to use correlation is to detect multicollinearity. Highly correlated features can cause problems for machine learning algorithms when they try to adjust their parameters; it can be like having two copies of the same feature. A good rule of thumb is not to keep any pair of features with a correlation value over 0.6 or under -0.6. A good way to check for multicollinearity is to use pandas. Simply call the correlation function on a DataFrame, and you will get an N x N DataFrame (N being the number of features) containing the correlations between each pair of features.
# df is the variable name for a DataFrame
df.corr()
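To turn that matrix into something actionable, here is one way (a sketch of my own, with made-up data) to flag the pairs that break the 0.6 rule of thumb:

# Flagging feature pairs whose absolute correlation exceeds 0.6
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],  # nearly a multiple of 'a'
    'c': [5, 3, 8, 1, 9],
})

corr_matrix = df.corr().abs()
# Keep only the upper triangle so each pair is reported once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
flagged = [(r, c) for r in upper.index for c in upper.columns
           if upper.loc[r, c] > 0.6]
print(flagged)  # [('a', 'b')] with this data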
To get rid of multicollinearity I suggest using recursive feature elimination, PCA, or, if there is a small number of features, simply using your best judgement to decide which ones to leave out of the modeling process.
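For reference, here is roughly what recursive feature elimination might look like with scikit-learn (a hedged sketch: the DataFrame df and target y are assumed to exist, and the choice of estimator and feature count are mine, not the post's):

# Sketch of recursive feature elimination with scikit-learn
# Assumes a feature DataFrame df and a target Series y already exist
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Recursively drop the weakest feature until 3 remain
selector = RFE(estimator=LinearRegression(), n_features_to_select=3)
selector.fit(df, y)

print(df.columns[selector.support_])  # the features RFE chose to keep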
That is a summary of correlation, how to find it, and why it is important.