In this post I will talk about the assumptions for linear regression. Some must be checked before modeling begins. Others need to be checked afterward using the residuals obtained from training. It is important to check all of the assumptions for a model because failing to do so can lead to unreliable results in the real world, even if the performance metrics from evaluation on the testing set are great.
The first assumption is that each feature variable has a linear relationship with the target variable. This means that as a feature increases, the target changes at a roughly constant rate, either increasing or decreasing. The second assumption is that there is little to no multicollinearity between the feature variables. Multicollinearity occurs when features have strong linear relationships with one another.
To check for linear relationships you can use scatter plots. Another way is to use a correlation heatmap, which can be made with pandas and seaborn. Pandas has a method that turns a data frame into a matrix of correlation values, and seaborn has a plot that color codes the values to visualize the matrix. It is best to use the absolute values of the correlation coefficients because here only the strength of the correlation matters, not whether it is positive or negative. As a rule of thumb, avoid using features whose correlation with each other is higher than 0.6. Here is a code example.
import pandas as pd
import seaborn as sns

# dataset
df = pd.read_csv('data.csv')

# correlation matrix (absolute values)
matrix = df.corr().abs()

# heatmap
sns.heatmap(matrix)
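A heatmap is easy to read with a handful of features, but with many columns it can help to list the offending pairs directly. The following is a minimal sketch of that idea, not part of the original workflow: the helper name `high_correlation_pairs` and the demo data are made up for illustration, and the 0.6 cutoff is the rule of thumb mentioned above.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(df, threshold=0.6):
    """List (feature_a, feature_b, |r|) for pairs whose absolute correlation exceeds threshold."""
    # Keep numeric columns only, then take absolute correlations.
    corr = df.select_dtypes("number").corr().abs()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if r > threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs

# Made-up data: "b" is almost a copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pairs = high_correlation_pairs(demo)
print(pairs)
```

Any pair that shows up in the list is a candidate for dropping one of the two features before modeling.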
If the data meets the assumptions above, go ahead and fit the model. Save the residuals (errors) from the training set and use them to test the remaining assumptions.
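To make the "save the residuals" step concrete, here is a small sketch that fits an ordinary least squares model with NumPy and keeps the training residuals. The data is randomly generated for illustration, and the names `X_train`, `y_train` and `y_hat` are assumptions chosen to match the plotting code later in this post; any regression library would work just as well.

```python
import numpy as np

# Made-up training data: y depends linearly on two features plus noise.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(300, 2))
noise = rng.normal(scale=0.5, size=300)
y_train = 3.0 + 2.0 * X_train[:, 0] - 1.0 * X_train[:, 1] + noise

# Add a column of ones so the model can learn an intercept.
X_design = np.column_stack([np.ones(len(X_train)), X_train])

# Ordinary least squares fit: coef = [intercept, slope_1, slope_2].
coef, *_ = np.linalg.lstsq(X_design, y_train, rcond=None)

# Predictions on the training set and the residuals to keep for later checks.
y_hat = X_design @ coef
residuals = y_train - y_hat  # true minus predicted

print("coefficients:", np.round(coef, 2))
print("mean residual:", residuals.mean())
```

With an intercept in the model the residuals average out to zero by construction, so the checks that follow look at their shape and spread rather than their mean.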
The first assumption about the errors is that they are normally distributed. To check this you can use a QQ-plot or simply a histogram. If the errors are not normal, the model's performance cannot be predicted reliably, because the confidence intervals and significance tests for a linear model rely on normal errors. Non-normal errors can be caused by bad data. Try removing outliers from the data and removing any features that are imbalanced.
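For readers who want to see what a QQ-plot actually compares, here is a rough sketch that builds the plot's points by hand. It pairs the sorted, standardized residuals with the quantiles a normal distribution would produce, then summarizes how close the points are to a straight line with a correlation. The helper names and the two sample residual sets are made up for illustration, and the correlation summary is an informal shortcut, not a formal normality test.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical, observed) quantiles for a normal QQ-plot."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = len(r)
    # Standardize so the observed values are comparable to a standard normal.
    observed = (r - r.mean()) / r.std(ddof=1)
    # Plotting positions strictly between 0 and 1.
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, observed

def qq_correlation(residuals):
    """Correlation between theoretical and observed quantiles (near 1 suggests normality)."""
    theoretical, observed = qq_points(residuals)
    return float(np.corrcoef(theoretical, observed)[0, 1])

# Made-up residuals: one set is normal, the other is strongly right-skewed.
rng = np.random.default_rng(1)
normal_r = qq_correlation(rng.normal(size=500))
skewed_r = qq_correlation(rng.exponential(size=500))

print("normal residuals:", round(normal_r, 3))
print("skewed residuals:", round(skewed_r, 3))
```

Passing the two arrays from `qq_points` to a scatter plot gives the usual QQ-plot: normal residuals fall along the diagonal, while skewed ones bend away from it at the ends.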
The second assumption about the errors is that they are homoscedastic. This means that the variance of the errors does not change with the value of the target variable. For instance, if a regression model had low error when predicting values between 100 and 200 but very high error for values between 400 and 500, the model would be unreliable. To check for this, make a scatter plot of the residuals against the true target value. Then draw a horizontal line at zero, where the errors should be centered. If the points form a straight band of roughly even width, the errors are homoscedastic. If they fan out into a cone, they are not. To fix this problem, try applying transformations, such as a logarithmic transformation, to different variables. This can improve performance. Here is a code example and a picture showing residuals with homoscedasticity.
import matplotlib.pyplot as plt

residuals = y_train - y_hat  # true minus predicted
plt.scatter(x=y_train, y=residuals)
plt.axhline(y=0)  # horizontal line at zero, where the errors should be centered
plt.show()
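To illustrate how a logarithmic transformation can turn a cone into a band, here is a small sketch on made-up data where the noise grows with the target. The helpers `fit_residuals` and `spread_ratio` are hypothetical names for this example. Note that the sketch compares the residual spread across the predicted values instead of the true target values; that is a choice made for this sketch, not a change to the plot above.

```python
import numpy as np

def fit_residuals(x, y):
    """Fit y = b0 + b1 * x by least squares; return predictions and residuals."""
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ coef
    return y_hat, y - y_hat

def spread_ratio(y_hat, residuals):
    """Residual std for the upper half of predictions divided by the lower half."""
    order = np.argsort(y_hat)
    half = len(order) // 2
    low = residuals[order[:half]].std()
    high = residuals[order[half:]].std()
    return float(high / low)

# Made-up data with multiplicative noise, so errors grow with the target.
rng = np.random.default_rng(7)
x = rng.uniform(0, 4, size=1000)
y = np.exp(1.0 + 0.5 * x + rng.normal(scale=0.3, size=1000))

raw_ratio = spread_ratio(*fit_residuals(x, y))          # model the raw target
log_ratio = spread_ratio(*fit_residuals(x, np.log(y)))  # model the log of the target

print("spread ratio, raw target:", round(raw_ratio, 2))
print("spread ratio, log target:", round(log_ratio, 2))
```

A ratio close to 1 means the residuals are about as spread out for large predictions as for small ones, which is what the straight band in the scatter plot shows. A ratio well above 1 corresponds to the cone shape.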