Text Classification Using NLTK

Text classification is a fundamental task in Natural Language Processing (NLP) that involves assigning categories or labels to text based on its content. One of the most accessible tools for text classification in Python is the Natural Language Toolkit (NLTK). NLTK provides a comprehensive suite of tools for text processing, including tokenization, stemming, stopword removal and built-in classifiers like Naive Bayes. In this walkthrough, NLTK handles the preprocessing, while scikit-learn handles TF-IDF vectorization and the Naive Bayes classifier.

Implementation

Step 1: Install necessary libraries

This code imports the libraries used for text preprocessing, model training and evaluation. If any of them are missing, install them first with pip install pandas nltk scikit-learn.
pandas handles the dataset, while nltk provides text processing tools such as tokenization and stopword removal. The sklearn imports cover splitting the data, converting text to TF-IDF vectors, training a Naive Bayes classifier and evaluating the model's performance.
The three nltk.download() calls fetch the resources the later steps rely on: the tokenizer models (punkt, plus punkt_tab, which newer NLTK releases look for) and the stopword list.

Python

import pandas as pd
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')
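
Before running the imports above, it can help to confirm that the three third-party packages are installed. The following is a minimal sketch using only the standard library; the helper name missing_packages and the REQUIRED mapping are illustrative and not part of the original code.

```python
import importlib.util

# Import name -> pip package name (sklearn is distributed as scikit-learn).
REQUIRED = {"pandas": "pandas", "nltk": "nltk", "sklearn": "scikit-learn"}

def missing_packages(required):
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name for import_name, pip_name in required.items()
            if importlib.util.find_spec(import_name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Install first with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

The check only looks the packages up; it does not import them, so it stays fast even when everything is installed.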

Step 2: Load the dataset

This block loads the dataset from a CSV file (the Emotion dataset) into a DataFrame called df using pandas. It then calls dropna() to remove any rows with missing values, which avoids errors during processing.
After that, the columns are renamed to text and label for consistency and easier reference in the rest of the code. This assignment assumes the file has exactly two columns, in that order. Finally, the first five rows are printed to give a quick look at the loaded data.

Python

df = pd.read_csv("Emotion_classify_Data (2).csv")
df.dropna(inplace=True)

df.columns = ['text', 'label']
print(df.head())
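
To see what dropna() and the rename do without the real file, here is a small sketch on an in-memory CSV. The header names "Comment" and "Emotion" and the three rows are assumptions made up for illustration; the actual dataset's headers may differ. It uses rename() with an explicit mapping, a variation on the positional assignment above that fails loudly if an expected column is absent.

```python
import io
import pandas as pd

# Hypothetical two-column CSV standing in for the emotion dataset;
# the header names "Comment" and "Emotion" are assumptions.
raw_csv = io.StringIO(
    "Comment,Emotion\n"
    "i feel great today,joy\n"
    ",anger\n"
    "i am scared of the dark,fear\n"
)

sample_df = pd.read_csv(raw_csv)
sample_df = sample_df.dropna()  # drops the row whose text is missing
sample_df = sample_df.rename(
    columns={"Comment": "text", "Emotion": "label"}, errors="raise"
)
sample_df = sample_df.reset_index(drop=True)

print(sample_df)
print(sample_df["label"].value_counts())
```

Printing value_counts() at this stage is a cheap way to spot class imbalance before training.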

Step 3: Preprocessing the text

This block defines a preprocessing function to clean the text data. First, it builds a set of English stopwords from NLTK's built-in list. Inside the preprocess() function, each text is converted to lowercase, tokenized into words using word_tokenize() and filtered to keep only alphabetic tokens that are not in the stopword list.
The cleaned tokens are joined back into a single string. The function is then applied to the text column and the result is stored in a new column called clean_text. Finally, the original and cleaned text are printed side by side for the first few rows.

Python

stop_words = set(stopwords.words('english'))

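def preprocess(text):
    # Lowercase so that "Happy" and "happy" map to the same token
    text = text.lower()
    tokens = word_tokenize(text)
    # Keep alphabetic tokens only and drop stopwords
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    return " ".join(tokens)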

df['clean_text'] = df['text'].apply(preprocess)
print(df[['text', 'clean_text']].head())
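
The filtering logic can be traced on a single sentence without downloading any NLTK data. The sketch below is a stand-in, not the NLTK pipeline itself: the regular expression only roughly imitates word_tokenize(), and DEMO_STOP_WORDS is a tiny made-up subset of a stopword list.

```python
import re

# Tiny illustrative stopword set; NLTK's English list is much longer.
DEMO_STOP_WORDS = {"i", "am", "so", "the", "a", "and", "of", "to"}

def demo_preprocess(text):
    lowered = text.lower()
    # Rough stand-in for word_tokenize(): runs of word characters,
    # or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", lowered)
    # Same filter as preprocess(): alphabetic tokens that are not stopwords.
    kept = [t for t in tokens if t.isalpha() and t not in DEMO_STOP_WORDS]
    return " ".join(kept)

print(demo_preprocess("I am SO happy: 2 tickets to the show!"))
```

Here the punctuation and the number are dropped by isalpha(), and the remaining filler words are dropped by the stopword check, leaving "happy tickets show".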

Output:

[Screenshot: output for preprocessed text]

Step 4: TF-IDF Vectorization

This block initializes a TfidfVectorizer, which converts the cleaned text into numerical features that weight each word by its importance (TF-IDF: Term Frequency-Inverse Document Frequency). Words that occur often in a document but rarely across the dataset receive higher weights.
The fit_transform() method learns the vocabulary from the clean_text column and transforms the text into a sparse TF-IDF matrix X.
The corresponding target labels are stored in y from the label column. Finally, the shape of the TF-IDF matrix is printed, showing the number of documents (rows) and the number of unique words (columns) used as features.

Python

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['clean_text'])
y = df['label']
print("TF-IDF matrix shape:", X.shape)
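
To make the weighting concrete, the sketch below computes TF-IDF by hand for three toy documents. It follows the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, and the L2 row normalization that the scikit-learn documentation describes as defaults; treat it as an illustration of the idea rather than a reimplementation of TfidfVectorizer. The documents and the helper tfidf_row are made up for this example.

```python
import math
from collections import Counter

docs = ["happy joyful day", "angry sad day", "scared dark night"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf_row(tokens):
    counts = Counter(tokens)
    # Term count times smoothed IDF, ln((1 + n) / (1 + df)) + 1.
    weights = {w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
               for w, c in counts.items()}
    # L2-normalize so every document vector has length 1.
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
vocabulary = sorted(doc_freq)
print("Matrix shape:", (n_docs, len(vocabulary)))
for word, weight in sorted(rows[0].items()):
    print(f"{word}: {weight:.3f}")
```

In the first document, "day" gets a lower weight than "happy" because it also appears in the second document, which is exactly the down-weighting of common words that TF-IDF is meant to provide.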

Output:

[Screenshot: output for vectorization]

Step 5: Train-Test Split

This line splits the dataset into training and testing sets. train_test_split() takes the TF-IDF feature matrix X and the corresponding labels y and randomly assigns 80% of the rows to training and 20% to testing.
Setting random_state=42 makes the split reproducible, so the same rows are selected each time the code is run.

Python

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
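
A purely random split can leave the emotion classes in different proportions in the two sets. If that matters for your data, train_test_split() accepts a stratify argument that preserves the class mix. The sketch below shows the effect on made-up toy labels; it is an optional variation, not what the code above does.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 20 sample ids with an uneven mix of three emotion labels.
toy_X = list(range(20))
toy_y = ["joy"] * 10 + ["fear"] * 5 + ["anger"] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Both sets keep the original 2:1:1 ratio of joy, fear and anger.
print("Train:", Counter(y_tr))
print("Test: ", Counter(y_te))
```

In the tutorial's code, the equivalent change would be adding stratify=y to the existing call.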

Step 6: Train Classifier

This block initializes a Multinomial Naive Bayes classifier, which is well suited for text classification tasks using word frequencies or TF-IDF features.
The .fit() method trains the model on the training data, allowing it to learn the relationship between the input features and their corresponding emotion labels.

Python

model = MultinomialNB()
model.fit(X_train, y_train)
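
To see what the fitted model learns, it can help to train it on a matrix small enough to check by hand. The sketch below uses two made-up count features and four toy rows; the names toy_X, toy_y and toy_model are illustrative only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count features: columns are hypothetical counts of
# the words "happy" and "scared" in four short texts.
toy_X = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
toy_y = np.array(["joy", "joy", "fear", "fear"])

# alpha=1.0 (the default) applies Laplace smoothing to the word counts.
toy_model = MultinomialNB(alpha=1.0)
toy_model.fit(toy_X, toy_y)

# classes_ lists the labels in sorted order; predict_proba() columns
# follow that same order.
print(toy_model.classes_)
probs = toy_model.predict_proba(np.array([[2, 0]]))
print(probs)
```

With smoothing, "happy" has probability 0.75 under joy and 0.25 under fear, so a text containing "happy" twice comes out at 0.9 for joy and 0.1 for fear given equal class priors.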

Step 7: Evaluate the Model

This block uses the trained Naive Bayes model to make predictions on the test set X_test, storing the results in y_pred. The accuracy_score() function then compares the predicted labels with the actual labels (y_test) to calculate the overall accuracy of the model.
classification_report() provides a detailed performance summary, including precision, recall and F1 score for each emotion class, which helps you understand how well the model performs on each label.

Python

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
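
The numbers in the report are easier to read once you have computed them by hand. The sketch below derives accuracy, precision, recall and F1 for made-up toy labels in plain Python; the helper per_class_metrics is hypothetical and only meant to show the definitions behind the report.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

true_labels = ["joy", "joy", "fear", "fear", "anger", "anger"]
pred_labels = ["joy", "fear", "fear", "fear", "anger", "joy"]

# Accuracy is the share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(true_labels, pred_labels)) / len(true_labels)
print(f"accuracy: {accuracy:.2f}")
for label in sorted(set(true_labels)):
    p, r, f = per_class_metrics(true_labels, pred_labels, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy case, fear has perfect recall but lower precision because one joy text was also labelled fear, a pattern that overall accuracy alone would hide.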

Output:

[Screenshot: output of accuracy of the model]

Step 8: Make Predictions

This function wraps the full prediction path for a new sentence. It cleans the input with the same preprocess() function used for training, converts it with the already fitted vectorizer using transform() (not fit_transform(), so the learned vocabulary stays unchanged) and returns the label predicted by the model.

Python

def predict_emotion(text):
    clean = preprocess(text)
    vector = vectorizer.transform([clean])
    return model.predict(vector)[0]

# Example:
print(predict_emotion("I feel amazing and joyful today!"))

Output:

Joy
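
Because prediction has to repeat the vectorization step exactly as in training, one option is to bundle the vectorizer and the classifier into a single scikit-learn pipeline. The sketch below is an alternative arrangement, not the tutorial's own code, and its four training sentences are made up so that it runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set, used only to make the sketch runnable.
train_texts = [
    "happy joyful smile",
    "joyful happy day",
    "scared afraid dark",
    "afraid scared night",
]
train_labels = ["joy", "joy", "fear", "fear"]

# The pipeline fits the vectorizer and the classifier together,
# using only the texts passed to fit().
emotion_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
emotion_pipeline.fit(train_texts, train_labels)

# predict() vectorizes the new texts with the fitted vocabulary first.
predictions = emotion_pipeline.predict(["happy smile", "scared night"])
print(predictions)
```

With this arrangement you would split the cleaned text itself into training and test sets and fit the pipeline on the training portion, so the vocabulary and IDF weights are learned from training data only. The preprocess() function would still need to be applied to the text beforehand.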

