Multinomial Naive Bayes for Text Classification
In this project I implement a Multinomial Naive Bayes model to classify text data. Multinomial Naive Bayes works well for the following reasons:
- Handles high-dimensional data
- Is robust to small datasets
- Trains and predicts quickly
- Works well with sparse data
- Produces an interpretable model
Import Statements
import spacy
import numpy as np
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, roc_curve, auc
I use the BBC news articles dataset: 2,225 articles sorted into five topic folders (business, entertainment, politics, sport, and tech).
# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
Note:
- Arrays are homogeneous (all elements are the same type), while lists are heterogeneous (elements can be different types).
- Arrays have a fixed size, whereas lists are dynamic.
- Lists in Python have more built-in functions.
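A minimal illustration of the difference (an added example, not from the original notebook):
items = [1, 'two', 3.0]      # a list can mix element types
items.append(4)              # and grow in place
arr = np.array([1, 2, 3])    # a NumPy array is homogeneous (int dtype here)
arr = np.append(arr, 4)      # "appending" actually allocates a new array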
# documents come in 5 topic folders, put them all together into one list
files = sorted(Path('bbc').glob('**/*.txt'))
doc_list = []
for file in files:
    # the parent folder name is the topic label
    topic = file.parts[-2]
    article = file.read_text(encoding='latin1').split('\n')
    heading = article[0].strip()
    body = ' '.join(line.strip() for line in article[1:])
    doc_list.append([topic, heading, body])
# create dataframe
docs = pd.DataFrame(doc_list, columns=['topic', 'heading', 'body'])
docs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   topic    2225 non-null   object
 1   heading  2225 non-null   object
 2   body     2225 non-null   object
dtypes: object(3)
memory usage: 52.3+ KB
Here is a look at a sample of the data:
docs.sample(10)
|      | topic    | heading                           | body                                             |
| ---- | -------- | --------------------------------- | ------------------------------------------------ |
| 1123 | politics | Labour pig poster 'anti-Semitic'  | The Labour Party has been accused of anti-Sem... |
| 325  | business | Senior Fannie Mae bosses resign   | The two most senior executives at US mortgage... |
| 1500 | sport    | Campbell lifts lid on United feud | Arsenal's Sol Campbell has called the rivalry... |
| 1454 | sport    | Owen determined to stay in Madrid | England forward Michael Owen has told the BBC... |
| 1298 | politics | Voters 'don't trust politicians'  | Eight out of 10 voters do not trust politicia... |
| 1633 | sport    | Woodward eyes Brennan for Lions   | Toulouse's former Irish international Trevor ... |
| 1006 | politics | Kilroy launches 'Veritas' party   | Ex-BBC chat show host and East Midlands MEP R... |
| 1972 | tech     | Microsoft gets the blogging bug   | Software giant Microsoft is taking the plunge... |
| 302  | business | Brazil plays down Varig rescue    | The Brazilian government has played down clai... |
| 454  | business | Qantas considers offshore option  | Australian airline Qantas could transfer as m... |
We are going to classify articles into these five categories:
docs.topic.value_counts(normalize=True).to_frame('count').style.format({'count': '{:,.2%}'.format})
| topic         | count  |
| ------------- | ------ |
| sport         | 22.97% |
| business      | 22.92% |
| politics      | 18.74% |
| tech          | 18.02% |
| entertainment | 17.35% |
The parameter stratify=y ensures that when the data is split into training and testing sets, the proportion of classes is preserved. Without stratify=y, there may be an unbalanced distribution of classes between the training and testing sets, especially if some classes are less frequent in the original dataset. With stratify=y, the split respects the distribution of the different classes in the dataset. For example, if topic 1 represents 20% of the original dataset, it will also represent approximately 20% of both the training and testing sets.
# classify news articles
# create integer class values
y = pd.factorize(docs.topic)[0]
x = docs.body
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, stratify=y)
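As a quick sanity check (my addition, not in the original notebook), the class proportions of the two splits can be compared; with stratify=y they should match to within rounding:
# compare class proportions across the stratified splits
train_props = np.bincount(y_train) / len(y_train)
test_props = np.bincount(y_test) / len(y_test)
print(train_props.round(3))
print(test_props.round(3))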
Vectorize Text Data
CountVectorizer() converts a collection of text documents into a matrix of token (word) counts. First, the data is tokenized by being split into individual words. Then, a vocabulary of unique tokens across the entire corpus is built. Finally, a word-count matrix is created where each row corresponds to a document and each column corresponds to the count of a unique word in the vocabulary.
vectorizer = CountVectorizer(stop_words=None)
x_train_dtm = vectorizer.fit_transform(x_train)
x_test_dtm = vectorizer.transform(x_test)
x_train_dtm.shape, x_test_dtm.shape
((1668, 25951), (557, 25951))
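To make the document-term matrix concrete, here is what CountVectorizer produces on a toy corpus (an illustration only, separate from the BBC pipeline):
toy_corpus = ["the cat sat", "the cat and the dog"]
toy_cv = CountVectorizer()
toy_dtm = toy_cv.fit_transform(toy_corpus)
print(toy_cv.get_feature_names_out())  # ['and' 'cat' 'dog' 'sat' 'the']
print(toy_dtm.toarray())
# [[0 1 0 1 1]
#  [1 1 1 0 2]]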
Train Multi-Class Naive Bayes Model
Naive Bayes is based on Bayes' theorem:
\({P(C|X) = \frac{P(X|C)P(C)}{P(X)}}\)
where C is the class and X is the feature vector.
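Under the "naive" conditional-independence assumption, and with word counts as features, the posterior factorizes into the multinomial form (a standard derivation, added here for context):
\({P(C|X) \propto P(C)\prod_{i} P(w_i|C)^{x_i}}\)
where \(w_i\) is the i-th vocabulary word and \(x_i\) is its count in the document.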
Some limitations are its feature-independence assumption and the zero-frequency problem. There is a strong assumption that features are conditionally independent, which does not always hold, depending on the data. Also, if a class has zero probability for a given feature, the entire product becomes zero.
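In practice the zero-frequency problem is handled with additive (Laplace) smoothing, which scikit-learn's MultinomialNB applies by default (alpha=1.0):
\({\hat{P}(w_i|C) = \frac{N_{iC} + \alpha}{N_C + \alpha|V|}}\)
where \(N_{iC}\) is the count of word \(w_i\) in class C, \(N_C\) is the total word count in class C, and \(|V|\) is the vocabulary size.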
Multinomial Naive Bayes handles frequency-based features, making it effective for text classification when the number of times a word appears in an article is meaningful.
My one concern while building this was that stop words would be frequent enough across all documents to influence the classification of topics. However, stop words are generally not topic-specific and do not provide much discriminatory power between the classes. For example, the word "the" can appear in articles about both sports and politics.
Naive Bayes is robust by nature here: it naturally down-weights stop words because of its probabilistic design. Stop words, being common across all classes, will have similar probabilities P(word|class) for every class. Thus, these words have limited impact on the overall classification decision.
nb = MultinomialNB()
nb.fit(x_train_dtm, y_train)
y_pred_class = nb.predict(x_test_dtm)
score = accuracy_score(y_test, y_pred_class)
print(score)
0.9712746858168761
Our model produces an accuracy score of 97% – pretty good!
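To check the stop-word intuition from earlier, we can compare the learned per-class log-probabilities of a common word against a topical one (an illustrative inspection of my own; the two words are arbitrary choices):
vocab = vectorizer.vocabulary_
for word in ['the', 'election']:
    if word in vocab:
        # feature_log_prob_[c, j] is log P(word j | class c)
        print(word, nb.feature_log_prob_[:, vocab[word]].round(2))
A stop word like "the" should show nearly identical values across all five classes, while a topical word like "election" should stand out for one of them.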
Create and plot confusion matrix
# calculate the confusion matrix
cm = confusion_matrix(y_true=y_test, y_pred=y_pred_class)
# display the confusion matrix as a heatmap
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=np.unique(y_test))
disp.plot(cmap='Blues', xticks_rotation=45)
# customize plot
plt.title("Confusion Matrix", fontsize=16)
plt.xlabel("Predicted Labels", fontsize=14)
plt.ylabel("True Labels", fontsize=14)
plt.grid(False)
plt.tight_layout()
plt.show()
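Because pd.factorize produced integer labels, the matrix axes show 0-4. If topic names are preferred, the second value returned by factorize holds them in label order (a small variation on the code above, not in the original):
# pd.factorize returns (codes, uniques); uniques[i] is the topic for code i
y, topics = pd.factorize(docs.topic)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=topics)
disp.plot(cmap='Blues', xticks_rotation=45)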
Create ROC plot for all classes
y_pred_probs = nb.predict_proba(x_test_dtm)
# get the unique class labels
classes = np.unique(y_test)
n_classes = len(classes)
# binarize the output for multi-class (one-vs-rest) ROC
y_test_binarized = label_binarize(y_test, classes=classes)
# initialize plot
plt.figure(figsize=(10, 8))
colors = plt.get_cmap('tab10', n_classes)
# plot ROC for each class
for i in range(n_classes):
    fpr, tpr, _ = roc_curve(y_test_binarized[:, i], y_pred_probs[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"Class {classes[i]} (AUC = {roc_auc:.2f})", color=colors(i))
# add diagonal reference line
plt.plot([0, 1], [0, 1], 'k--', label="Random Guessing")
# plot settings
plt.title("One-vs-Rest ROC Curves", fontsize=16)
plt.xlabel("False Positive Rate", fontsize=14)
plt.ylabel("True Positive Rate", fontsize=14)
plt.legend(loc="lower right", fontsize=12)
plt.grid(alpha=0.5)
plt.tight_layout()
plt.show()
The ROC curves look good! Every one-vs-rest curve sits well above the random-guessing diagonal, consistent with the 97% accuracy score.