A brief introduction to creating machine learning models for classification in Python using sklearn
Machine learning classification is useful in many fields: screening patients for diseases in medicine, classifying cosmic objects in astronomy, detecting financial fraud in banking, and much more. Big names like YouTube, Netflix, Amazon, and Google use classification in their recommendation systems for advertisements or suggested videos. The number of applications for classification tasks is endless. In this post I will go over a simple process for creating machine learning models for classification tasks, primarily using sklearn. The process is iterative and consists of seven steps: business understanding, data mining, data cleaning, data exploration, feature engineering, predictive modeling, and data visualization.
To start the process, the first step is to develop a business understanding of the problem you are solving with machine learning. This tells you things such as which features do and don't matter, how to deal with null values, and much more. The most important part of this step is deciding what your metrics of performance should be. If you want low false positives use precision, if you want low false negatives use recall, and if you want both low false positives and low false negatives use the F1-score. You can also use more than one metric if needed. For example, if you are building a model to screen people for cancer you will want a low false negative rate: it would not be good to tell someone they are healthy when they actually have cancer.
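To make the metric trade-off concrete, here is a small sketch using sklearn's metric functions on made-up labels and predictions (the arrays are illustrative, not real screening data):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical ground-truth labels and model predictions (1 = positive case)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # one false negative, one false positive

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) -- penalizes false positives
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) -- penalizes false negatives
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(precision, recall, f1)
```

In a cancer-screening setting you would watch recall most closely, since it drops whenever the model misses a sick patient.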
The second step is data mining. This can be as easy as loading a .csv file into a pandas dataframe or querying a database with SQL to get a sample of data. It can also be more involved, like calling an API, web-scraping multiple websites, or combining all four techniques. The important end goal is that the data is collected and put into a usable form, such as a pandas dataframe.
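As a minimal sketch of the .csv route, the snippet below loads data into a dataframe; an inline string stands in for a real file, and the column names are made up for illustration:

```python
import io
import pandas as pd

# In practice you would pass a real file path (pd.read_csv("data.csv"));
# here a small inline CSV stands in for the mined data.
csv_text = """age,income,defaulted
25,40000,0
52,98000,0
37,31000,1
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # rows x columns of the collected sample
```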
The third step is data cleaning. In this step null values need to be dealt with, and you may need to change the data type of some of the columns in your dataframe (one example is changing a string value to a datetime timestamp). Also check for duplicate data (this can be done in pandas with dataframe.duplicated()). In some cases you may want to check for outliers (box plots are useful here). Finally, check the data's features for multicollinearity and decide how to deal with it: perhaps drop a feature now, or use principal component analysis in a later stage.
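A short sketch of those cleaning steps on a toy dataframe (the values and column names are invented to show each fix):

```python
import numpy as np
import pandas as pd

# Toy dataframe with the usual problems: an exact duplicate row,
# a null value, and a date stored as a string.
df = pd.DataFrame({
    "signup": ["2021-01-05", "2021-02-10", "2021-02-10", "2021-03-01"],
    "spend": [120.0, 55.0, 55.0, np.nan],
})

df = df.drop_duplicates()                               # remove exact duplicate rows
df["spend"] = df["spend"].fillna(df["spend"].median())  # impute nulls with the median
df["signup"] = pd.to_datetime(df["signup"])             # string -> datetime timestamp
```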
The fourth step is data exploration. Ask questions about the data and get a good feel for what it looks like when plotted. The goal of this step is to gain as many insights as possible about the data you are working with. Try creating a scatter matrix with pandas.plotting.scatter_matrix(), and use dataframe.describe() to get some basic statistics about your data. Make sure to check whether your data is imbalanced; if one class is much more prevalent than another, it can cause issues in your machine learning model.
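A quick sketch of that exploration on a small made-up dataset, including the imbalance check:

```python
import pandas as pd

# Small invented dataset with an imbalanced target column "label"
df = pd.DataFrame({
    "feature": [1.2, 3.4, 2.2, 5.1, 4.8, 0.9, 2.7, 3.3],
    "label":   [0,   0,   0,   0,   0,   0,   1,   0],
})

print(df.describe())                              # basic statistics per numeric column
balance = df["label"].value_counts(normalize=True)
print(balance)                                    # fraction of each class

# For a quick visual overview of pairwise relationships (opens a plot window):
# pd.plotting.scatter_matrix(df)
```

Here the value_counts() output would show one class making up the vast majority of rows, a warning sign to address in the next step.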
The fifth step is feature engineering. Begin by splitting the data into train and test sets (which can be done with sklearn.model_selection.train_test_split()), or, if needed, perform principal component analysis first and then split the data. If you used principal component analysis, make sure to test models on the original data as well to confirm that principal component analysis was not detrimental. Once the data is split, if your dataset is imbalanced you may want to use an imblearn.over_sampling.SMOTE() object to upsample the less prevalent classes, or, if your dataset is large enough, an imblearn.under_sampling.NearMiss() object to downsample the most prevalent classes.
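A minimal sketch of the split on synthetic data (the feature matrix is random and purely illustrative); note stratify=y, which keeps the class ratio the same in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic feature matrix with an imbalanced label vector (90/10 split)
X = np.random.RandomState(0).rand(100, 4)
y = np.array([0] * 90 + [1] * 10)

# stratify=y preserves the 90/10 class ratio in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# If imbalanced-learn is installed, you could then upsample the minority
# class in the *training* data only:
# from imblearn.over_sampling import SMOTE
# X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)
```

Resampling only the training set matters: the test set should keep the real-world class ratio so your metrics stay honest.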
The sixth step is predictive modeling. There are many classifiers you can use, such as sklearn.tree.DecisionTreeClassifier(), sklearn.ensemble.RandomForestClassifier(), sklearn.linear_model.LogisticRegression(), and many more. I suggest creating several models and seeing which one gives the best scores. The next part of the process is to take each model and use a sklearn.model_selection.GridSearchCV() object. To use a GridSearchCV object you pass in a machine learning object such as sklearn.tree.DecisionTreeClassifier(), a scoring option such as precision or recall, and what is called a parameter grid: a dictionary containing different hyper-parameter options for your classifier. While fitting to the training data, the GridSearchCV object tries all possible combinations of the hyper-parameters you pass it and selects the one with the highest score. You can then use the object's .best_params_ attribute to find the best hyper-parameters for the selected model, and use the fitted GridSearchCV object to make predictions and calculate precision scores, f1-scores, and so on (with sklearn.metrics.precision_score(), sklearn.metrics.f1_score(), sklearn.metrics.recall_score(), and sklearn.metrics.accuracy_score()). Finally, take the best parameters you found and plug them into a new model, and there you go: you have an optimized model for classification. Remember to repeat the process with several different machine learning algorithms to find the best model for your specific problem.
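Here is a sketch of GridSearchCV in use, with a decision tree on sklearn's built-in breast cancer dataset (the grid values are arbitrary choices for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Parameter grid: every combination of these values will be tried
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10],
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="recall",  # optimize for low false negatives
    cv=5,
)
grid.fit(X_train, y_train)

print(grid.best_params_)  # the winning hyper-parameter combination
test_recall = recall_score(y_test, grid.predict(X_test))
print(test_recall)
```

You could then build a fresh DecisionTreeClassifier(**grid.best_params_) or swap in another estimator and repeat.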
The seventh step is data visualization. You should start off by making a confusion matrix, which can be computed with sklearn.metrics.confusion_matrix(). For a binary problem this returns a two-by-two array of numbers: the number in the top left is the count of true negatives, the top right is false positives, the bottom left is false negatives, and the bottom right is true positives. You can use matplotlib techniques to make this into a beautiful display.
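A small sketch with made-up labels and predictions, unpacking the matrix in that top-left-to-bottom-right order:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions for a binary classifier
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # row order: [[TN, FP], [FN, TP]]
print(cm)

# To render it as a labeled matplotlib plot:
# from sklearn.metrics import ConfusionMatrixDisplay
# ConfusionMatrixDisplay(cm).plot()
```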
There you have it. I hope this was a helpful post for those learning about machine learning in Python with sklearn.