A Brief Introduction to Binary Classification Tasks with Sklearn

Datascience George
4 min read · Apr 29, 2020

Machine learning classification is useful in many ways, including screening people for diseases in medicine, classifying cosmic objects in astronomy, detecting financial fraud in banking, and much more. Big names like YouTube, Netflix, Amazon, and Google use classification to aid their recommendation systems for advertisements or suggested videos. The number of applications for classification tasks is endless. In this post I will go over a simple process for creating machine learning models for classification tasks, primarily using the Python library scikit-learn (sklearn). The process is iterative, consisting of seven steps: business understanding, data mining, data cleaning, data exploration, feature engineering, predictive modeling, and data visualization.

The first step is to have a business understanding of the problem you are solving with machine learning. This can tell you things such as which features do and don’t matter, how to deal with null values, and much more. The most important part of this step is to know what the metrics of performance should be. If it is important to have few false positives, use precision; if it is important to have few false negatives, use recall; and if you want both few false positives and few false negatives, use the f1-score. You can also use more than one metric if needed. For example, if you are making a model to screen people for cancer, you will want a low false negative rate. It would not be good to tell someone they are healthy when they actually have cancer.
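
As a quick illustration, here is a minimal sketch of how these metrics can be computed with sklearn.metrics. The labels and predictions below are made up purely for demonstration.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up true labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # penalizes false positives
print("recall:   ", recall_score(y_true, y_pred))     # penalizes false negatives
print("f1-score: ", f1_score(y_true, y_pred))         # balances the two
```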

The second step is data mining. This can be as easy as loading a .csv file into a pandas dataframe, or querying a database with SQL to get a sample of data. It could also be more involved, like using an API, web-scraping multiple websites, or using several of these techniques together. The important end goal is that the data is collected and put into a usable form such as a pandas dataframe.
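
For example, here is a minimal sketch of the simplest case, loading a .csv file into pandas. The file name and table name are placeholders; swap in your own data source.

```python
import pandas as pd

# "data.csv" is a placeholder file name -- replace it with your own source
df = pd.read_csv("data.csv")

# The same end goal applies to other sources, e.g. a SQL query:
# df = pd.read_sql("SELECT * FROM some_table", connection)

print(df.head())   # quick sanity check that the data loaded as expected
print(df.shape)    # number of rows and columns
```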

The third step is data cleaning. In this step null values need to be dealt with, and it may be necessary to change the data type of some of the columns in the dataframe (one example is changing a string value to a datetime timestamp). It is also best practice to check for duplicate data (this can be done in pandas using dataframe.duplicated()). Make sure to check for outliers (box plots can be useful for this). Finally, check the data’s features for multicollinearity and decide how to deal with it (this can be done with a heatmap in seaborn along with pandas’ dataframe.corr() method).
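
Here is a rough sketch of those cleaning checks, assuming df is the dataframe from the previous step. The dropna/drop_duplicates choices are just one possible strategy, and the date column name is hypothetical.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# df is assumed to be the dataframe from the data mining step
print(df.isna().sum())        # count null values per column
df = df.dropna()              # one option; filling values may fit your problem better

print(df.duplicated().sum())  # count duplicate rows
df = df.drop_duplicates()

# Example type conversion: string column to datetime (column name is hypothetical)
# df["signup_date"] = pd.to_datetime(df["signup_date"])

# Outliers: box plots of the numeric columns
df.select_dtypes("number").boxplot()
plt.show()

# Multicollinearity: correlation heatmap of the numeric columns
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()
```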

The fourth step is data exploration. Ask questions about the data, and get a good feel for what it looks like when plotted. The goal of this step is to gain as many insights as possible about the data you are working with. Try creating a scatter matrix using pandas.plotting.scatter_matrix(). Also use dataframe.describe() to get some basic statistics about your data. Make sure to check whether your data is imbalanced using the pandas method dataframe.column.value_counts(); if one class is much more prevalent than another, it can cause issues in your machine learning model.
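
A short sketch of those exploration calls, assuming df is the cleaned dataframe and "target" is a hypothetical name for the label column.

```python
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

print(df.describe())                # basic statistics for each numeric column

# Check the class balance -- a heavily skewed split can hurt the model later on
print(df["target"].value_counts())

# Pairwise scatter plots to spot relationships between features
scatter_matrix(df.select_dtypes("number"), figsize=(10, 10), alpha=0.5)
plt.show()
```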

The fifth step is feature engineering. You should begin by splitting the data into train and test sets (which can be done with sklearn.model_selection.train_test_split()). Next, make sure to scale the data so that all the features are in the same “units” of measurement. This can be done with sklearn.preprocessing’s MinMaxScaler() or StandardScaler(). If your dataset is imbalanced you may want to consider creating an imblearn.over_sampling.SMOTE() object to upsample the less prevalent classes, or, if your dataset is large enough, try using an imblearn.under_sampling.NearMiss() object to downsample the most prevalent classes.
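
Here is one way that might look in code. The "target" column name is again a placeholder, and SMOTE is shown as just one of the resampling options mentioned above; note that the scaler and the resampler are fit on the training data only.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# X holds the feature columns, y the label column ("target" is a placeholder name)
X = df.drop(columns=["target"])
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Optional: upsample the minority class in the training data only
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)
```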

The sixth step is predictive modeling. There are many classifiers you can use, such as sklearn.tree.DecisionTreeClassifier(), sklearn.ensemble.RandomForestClassifier(), sklearn.linear_model.LogisticRegression(), and many more. I would suggest that you create several models and see which one gives the best scores. The next part of the process is to take each model and use a sklearn.model_selection.GridSearchCV() object to tune the hyperparameters. To use a GridSearchCV object you must pass in a machine learning object such as a sklearn.tree.DecisionTreeClassifier(), a scoring option such as precision or recall, and what is called a parameter grid. The parameter grid is a dictionary containing different hyperparameter options for your classifier. While fitting to the training data, the GridSearchCV object will try all possible combinations of the hyperparameters you pass it and select the one with the highest cross-validated score. Then you can use the object’s .best_params_ attribute to find the best parameters for the selected model. You can use the fitted GridSearchCV object to make predictions and calculate precision scores, f1-scores, and so on (use sklearn.metrics.precision_score(), sklearn.metrics.f1_score(), sklearn.metrics.recall_score(), and sklearn.metrics.accuracy_score()). Finally, take the best parameters you received and plug them into a new model, and there you go: you have an optimized model for classification. Remember to repeat the process with several different machine learning algorithms to find the model that best solves your specific problem.
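
Putting that together, here is a sketch using a RandomForestClassifier with a small, hypothetical parameter grid; it assumes the scaled train/test splits from the feature engineering step.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# A small, hypothetical parameter grid -- tune the ranges to your own problem
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="recall",  # chosen because the cancer-screening example penalizes false negatives
    cv=5,
)
grid.fit(X_train_scaled, y_train)

print(grid.best_params_)               # best hyperparameter combination found

# The fitted GridSearchCV object predicts with the best model it found
y_pred = grid.predict(X_test_scaled)
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("f1-score: ", f1_score(y_test, y_pred))
print("accuracy: ", accuracy_score(y_test, y_pred))
```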

The seventh step is data visualization. You should start off by making a confusion matrix. You can calculate a simple confusion matrix using sklearn.metrics.confusion_matrix(). This will return a two by two array of numbers. The number in the top left is the number of true negatives, the number in the top right is the false positives, the number in the bottom left is the false negatives, and the number in the bottom right is the true positives. You can use matplotlib techniques to make this into a beautiful display.
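
A minimal sketch of that confusion matrix plot, assuming y_test and y_pred come from the modeling step above; seaborn’s heatmap is used here as one simple way to dress up the matplotlib figure.

```python
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)
# cm layout: [[true negatives, false positives],
#             [false negatives, true positives]]

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["predicted 0", "predicted 1"],
            yticklabels=["actual 0", "actual 1"])
plt.title("Confusion matrix")
plt.show()
```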
