Using TF-IDF for document classification
By George Bennett
Bodies of text can be classified by looking at the frequencies of the words they contain, and in this post we will do exactly that. This technique can be used to classify emails as spam or ham, to classify news as real or fake, and for a myriad of other tasks.
One intuitive way to classify text is to create a matrix with a column for each unique word in the whole collection of texts and a row for each article of text, then fill in each cell with how many times that word appeared in that article. This is called count vectorization, and TF-IDF is simply a more sophisticated form of it.
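As a quick sketch of count vectorization using scikit-learn's CountVectorizer (the toy sentences and variable names here are my own, not from the original post):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat in the hat", "the cat wearing a hat"]  # toy corpus

cv = CountVectorizer()
counts = cv.fit_transform(docs)    # rows: documents, columns: unique words

print(cv.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())            # raw word counts per document

Note that CountVectorizer's default tokenizer only keeps words of two or more characters, so the word "a" is dropped from the vocabulary.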
The first part of TF-IDF, the “TF” part, is term frequency. To calculate term frequency, simply count how many times the word shows up in the article of text and divide that by the total number of words in the text. This normalization prevents exceedingly long articles from confusing the machine learning algorithms.
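For instance, in “the cat in the hat” the word “the” appears 2 times out of 5 words, giving a term frequency of 0.4. A hand-rolled sketch of that calculation (for illustration only, not scikit-learn's internal implementation):

# Term frequency: count of the word / total words in the document
def term_frequency(word, document):
    words = document.split()
    return words.count(word) / len(words)

print(term_frequency("the", "the cat in the hat"))  # 2 / 5 = 0.4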
The second part is the inverse document frequency. This can be calculated by dividing the total number of documents by the number of documents that contain the given term (in practice, the logarithm of this ratio is usually taken). This penalizes words that occur often across all documents, like “he”, “she”, “the”, “said”, “it”, etc.
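Continuing the hand-rolled sketch, here is the IDF calculation over a toy corpus; scikit-learn's TfidfVectorizer uses a smoothed variant of this formula, so its exact numbers will differ:

import math

def inverse_document_frequency(word, documents):
    # Number of documents that contain the word at least once
    containing = sum(1 for doc in documents if word in doc.split())
    # Log of (total documents / documents containing the word);
    # words appearing in every document score 0
    return math.log(len(documents) / containing)

docs = ["the cat in the hat", "the cat wearing a hat"]
print(inverse_document_frequency("the", docs))  # log(2/2) = 0.0
print(inverse_document_frequency("in", docs))   # log(2/1) ≈ 0.693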
To obtain the overall TF-IDF score, simply multiply the term frequency values by the inverse document frequency values. To do this in scikit-learn, create an instance of the TfidfVectorizer class from sklearn.feature_extraction.text, then call fit_transform on the training data and transform on the testing data. Before transformation the data should just be a sequence of strings, one string per article of text.
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training texts and labels
X = ["the cat in the hat", "the cat wearing a hat"]
y = [1, 0]

# Validation text and label
X_val = ["the cat is my hat"]
y_val = [1]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)      # learn vocabulary and IDF weights, then transform
X_val = vectorizer.transform(X_val)  # reuse the fitted vocabulary and IDF weights
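If you want to peek at what the vectorizer learned, the fitted vocabulary and the matrix shapes can be inspected like this (an optional check, not part of the original snippet):

print(vectorizer.get_feature_names_out())  # unique words from the training texts
print(X.shape)      # (2 training documents, number of unique words)
print(X_val.shape)  # (1 validation document, same number of columns)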
Finally, after obtaining the TF-IDF matrices, simply use any machine learning classifier to classify the text, such as a Naive Bayes classifier or logistic regression.
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()   # instantiate the classifier (note the parentheses)
lr.fit(X, y)                # train on the TF-IDF training matrix
y_pred = lr.predict(X_val)  # predict labels for the validation texts
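Since Naive Bayes was mentioned as an alternative, here is a sketch of the same step with scikit-learn's MultinomialNB, along with a simple accuracy check against y_val:

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

nb = MultinomialNB()
nb.fit(X, y)                           # train on the same TF-IDF matrix
nb_pred = nb.predict(X_val)
print(accuracy_score(y_val, nb_pred))  # fraction of correct predictions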