An Introduction To Exploratory Data Analysis
By George Bennett
Exploratory data analysis is the practice of looking at data to gain insights, and it often goes hand in hand with cleaning the data. Through it you get to know the data set better and start to form meaningful questions. Some simple questions you may want to ask are: Do any of the features have a visible relationship (linear, polynomial, or otherwise) with each other or with the target variable? Is it possible to hard code rules that make predictions without using any machine learning? Does anything stick out about the data? Answering questions like these is a good way to prepare for statistical testing and for machine learning. Exploratory data analysis ideas can vary widely depending on the data set, but here I will show you two techniques to start off with.
Univariate data analysis is simply plotting the distribution of one variable. This can be done with a histogram or a kernel density estimation plot. It is useful because you may discover things about how the data was collected, whether there are outliers, whether the data is skewed, or whether there is some other anomaly in the data. This can inform decisions such as whether to log transform the feature or whether to remove outliers. The distribution of the data also plays a role in statistical testing, since many tests make assumptions about it.
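As a minimal sketch of the idea above, the following Python snippet bins one feature into a histogram and measures its skewness before and after a log transform. The data is synthetic and made up for illustration (a log-normal sample), the helper names are my own, and the text histogram stands in for what you would normally draw with a plotting library, so the example needs only the standard library.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def histogram(values, bins=10):
    """Count values in equal-width bins; returns (bin_edges, counts)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp the index so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


def print_histogram(values, bins=10, max_width=40):
    """Print one row of '#' characters per bin, scaled to max_width."""
    edges, counts = histogram(values, bins)
    scale = max_width / max(counts)
    for left_edge, count in zip(edges, counts):
        print(f"{left_edge:10.2f} | {'#' * round(count * scale)}")


# Synthetic, made-up data: a right-skewed (log-normal) feature.
rng = random.Random(0)
feature = [rng.lognormvariate(0, 1) for _ in range(1000)]
logged = [math.log(v) for v in feature]

print_histogram(feature)
print("skewness before log transform:", round(skewness(feature), 2))
print_histogram(logged)
print("skewness after log transform: ", round(skewness(logged), 2))
```

A long tail of nearly empty bins on the right of the first histogram is the kind of pattern that hints at skew or outliers. After the log transform the bars should look roughly symmetric and the skewness should move close to zero.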
Bivariate data analysis is plotting variables against one another in a scatter plot. Doing this, you can see whether features have relationships with one another. This can give you insight into multicollinearity, and you may also find non-linear relationships between the variables. Perhaps one variable has a polynomial relationship to another. This can cause problems if you are using a polynomial machine learning model, because the polynomial terms generated from one feature may then be nearly collinear with the other. If the task at hand is binary classification, try splitting each feature into two sets, one for each class, then plot two kernel density estimation plots, one for each set, and see whether the data points belonging to one class tend to have higher or lower values than the data points belonging to the other class. A consistent shift in one direction suggests a relationship, possibly a linear one, between that feature and the class variable.
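The two checks described above can be sketched numerically in Python. This is only an illustration under stated assumptions: all data is synthetic and made up, the helper names are my own, and the numbers are a stand-in for the scatter plots and density plots you would actually look at. The first part shows why plotting matters: the Pearson correlation picks up a linear relationship but can be close to zero for a symmetric quadratic one. The second part splits a feature by class and compares the two groups.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def split_by_class(values, labels):
    """Split feature values into (class 0 values, class 1 values)."""
    groups = {0: [], 1: []}
    for value, label in zip(values, labels):
        groups[label].append(value)
    return groups[0], groups[1]


rng = random.Random(0)

# Part 1: synthetic feature pairs with a linear and a quadratic relationship.
x = [rng.uniform(-1, 1) for _ in range(2000)]
y_linear = [2 * v + rng.gauss(0, 0.1) for v in x]
y_quadratic = [v ** 2 + rng.gauss(0, 0.1) for v in x]

r_linear = pearson(x, y_linear)
r_quadratic = pearson(x, y_quadratic)
print("correlation, linear pair:   ", round(r_linear, 3))
print("correlation, quadratic pair:", round(r_quadratic, 3))

# Part 2: a synthetic binary classification feature whose values are
# shifted upward for class 1 (an assumption made for this example).
labels = [i % 2 for i in range(1000)]
feature = [rng.gauss(1.5 if label == 1 else 0.0, 1.0) for label in labels]

class0, class1 = split_by_class(feature, labels)
mean0, mean1 = statistics.fmean(class0), statistics.fmean(class1)
print("class 0 mean / median:", round(mean0, 2), round(statistics.median(class0), 2))
print("class 1 mean / median:", round(mean1, 2), round(statistics.median(class1), 2))
```

The quadratic pair is strongly related yet its correlation should come out near zero, which is why a scatter plot is worth drawing even when a correlation table looks unremarkable. For the class split, a clear gap between the two group means or medians is the numeric counterpart of one density curve sitting to the right of the other.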
These are just some basic tips to get your feet wet in exploratory data analysis. The more complex your data set is, the more you can do with it. These methods should be a good starting point.