The Machine Learning Modeling Process

By George Bennett

Once the data has been cleaned, exploratory data analysis is done, and perhaps a few new features have been created, it is time to begin modeling. Before you start, make sure you have split the data into separate training, validation, and test sets.
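As a rough illustration of the three-way split, here is a minimal sketch using sklearn’s train_test_split twice. The synthetic dataset and the 60/20/20 proportions are assumptions made for the example, not a recommendation from this post.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for your cleaned dataset (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# First hold out 20% of the rows as the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Then split the remaining 80% into 75% train and 25% validation,
# which works out to 60% / 20% / 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying on y keeps the class balance similar across the three sets, which tends to matter more when one class is rare.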

First, select a few machine learning algorithms to try. I would use regression (linear, logistic, lasso, or ridge), a decision tree, a random forest, XGBoost, and also a support vector machine if your dataset is relatively small.
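One simple way to keep the candidates organized is a dictionary of untrained estimators that you can loop over later. This sketch assumes a classification problem, and the dictionary keys are just illustrative labels. XGBoost is left out only so the example runs with sklearn alone; if the xgboost package is installed, its XGBClassifier could be added in the same way.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Candidate models for a classification task (names are illustrative).
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    "svm": SVC(),  # practical mainly when the dataset is relatively small
}

for name, model in candidates.items():
    print(name, type(model).__name__)
```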

For each algorithm, first scale the data using sklearn’s MinMaxScaler, or StandardScaler if there are a few outliers. Fit the scaler on the training set only and reuse it to transform the validation and test sets. Then begin hyperparameter tuning with sklearn’s GridSearchCV. Pass it a parameter grid listing the values to try for each hyperparameter. A few worth tuning are the maximum depth for the tree-based algorithms and the regularization strength for the linear ones (C for logistic regression and support vector machines, alpha for lasso and ridge). Then fit the grid search object on the training set.
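The scaling and tuning steps can be combined by wrapping the scaler and the model in an sklearn Pipeline and handing that to GridSearchCV. The sketch below assumes a logistic regression on synthetic data, and the grid values for C are arbitrary starting points.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Putting the scaler inside the pipeline means it is re-fit on the
# training portion of every cross-validation fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as "<step name>__<parameter>".
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

For a tree model, the same pattern applies with a grid such as {"clf__max_depth": [3, 5, 10, None]}.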

After hyperparameter tuning is done, check the performance metrics of the best model on both the training and validation sets. A large gap between the two suggests the model is overfitting. Then check the feature importances (tree models) or the coefficients (linear models) to see whether any features contribute little. It is worth removing those features and repeating the process, since this may give better results and will also simplify the model when it is put into production. Once the right features are selected (which may very well be all of them), record the final performance metrics for each model.
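Here is a minimal sketch of that check for a random forest: compare training and validation accuracy, rank the features by importance, then refit on the features that clear a cutoff. The synthetic data, the fixed max_depth, and the 0.05 threshold are all assumptions chosen for illustration. In practice the cutoff is a judgment call.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data in which only some features carry signal (an assumption).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, n_redundant=0, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

# Compare training and validation performance to spot overfitting.
train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")

# Rank features from most to least important.
importances = model.feature_importances_
for idx in np.argsort(importances)[::-1]:
    print(f"feature {idx}: {importances[idx]:.3f}")

# Keep features above an arbitrary threshold and refit on the reduced set.
threshold = 0.05
keep = np.where(importances >= threshold)[0]
reduced = RandomForestClassifier(max_depth=5, random_state=0)
reduced.fit(X_train[:, keep], y_train)
reduced_val_acc = accuracy_score(y_val, reduced.predict(X_val[:, keep]))
print(f"validation accuracy with {len(keep)} features: {reduced_val_acc:.3f}")
```

For a linear model, the analogous quantity is the fitted coef_ attribute, which is only comparable across features when the inputs have been scaled.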

Lastly, take the model that performed best on the validation set and evaluate it on the test set. This should give you a good idea of how the model will perform in production.
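The final selection step might look like the sketch below: score every candidate on the validation set, pick the winner, and only then touch the test set, once. The two models, the synthetic data, and the use of accuracy as the metric are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a 60/20/20 split (an assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=1
)

# In a real project these would be the tuned models from the earlier steps.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=1),
}

# Score each model on the validation set.
val_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))

# Choose the best model by validation score, then evaluate it once on the test set.
best_name = max(val_scores, key=val_scores.get)
test_acc = accuracy_score(y_test, models[best_name].predict(X_test))

print("Validation scores:", val_scores)
print(f"Selected {best_name}, test accuracy: {test_acc:.3f}")
```

Because the test set plays no part in tuning or selection, its score is the least biased estimate of the three.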

Data scientist learning at Flatiron School