Personality Prediction Project using ML

Last Updated : 29 Aug, 2025

The Myers-Briggs Type Indicator (MBTI) classifies personalities into 16 distinct types based on four dimensions of how people perceive the world and make decisions. In this project we predict a person's MBTI type from their writing and their answers to an MBTI-style survey. We will build a machine learning model that:

  • Learns from a dataset of social media posts labeled with MBTI types.
  • Converts the text into numerical features using TF-IDF vectorization, which captures the importance of words.
  • Combines the text features with simulated or collected questionnaire answers representing preferences in social behavior, information processing, decision making, work style and values.
  • Trains a Random Forest classifier on this hybrid data to predict the personality type.
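
As a quick illustration of where the 16 types come from, the sketch below enumerates every combination of the four letter pairs. The names DIMENSIONS and mbti_types are chosen for this example only.

```python
from itertools import product

# Each MBTI dimension contributes one letter to the four-letter type code.
DIMENSIONS = [
    ("E", "I"),  # Extraversion vs. Introversion
    ("S", "N"),  # Sensing vs. Intuition
    ("T", "F"),  # Thinking vs. Feeling
    ("J", "P"),  # Judging vs. Perceiving
]

# The Cartesian product of the four letter pairs gives 2**4 = 16 types.
mbti_types = ["".join(letters) for letters in product(*DIMENSIONS)]

print(len(mbti_types))
print(mbti_types[:4])
```

These 16 strings are the class labels the classifier has to choose between.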

Step-by-Step Implementation

Let's build our prediction model step by step and use it to predict our personality type:

Step 1: Install dependencies

We will install the required packages:

Python
!pip install scikit-learn scipy chromadb joblib numpy pandas

Step 2: Import Libraries and Load Data

We will import the required libraries for our model and load the MBTI dataset, which contains user posts and their MBTI labels:

  • pandas: Used for data manipulation and loading CSV files.
  • LabelEncoder: Converts MBTI personality type labels (strings) into numeric codes for classification.
  • train_test_split: Splits dataset into training and testing subsets.
  • TfidfVectorizer: Converts user text data (posts) into numerical vectors using TF-IDF vectorization.

The MBTI dataset can be downloaded from here.

Python
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy

data = pd.read_csv('mbti_1.csv')

Step 3: Encode Personality Labels and Split Dataset

We will encode the labels and split the dataset for training and testing:

  • LabelEncoder transforms MBTI labels into integers assigned in alphabetical order (e.g., 'INFP' -> 9 when all 16 types are present).
  • Posts (X_text) and label codes (y) are separated.
  • The data is split 80% for training and 20% for testing to evaluate how well the model generalizes.
Python
le = LabelEncoder()
data['type_code'] = le.fit_transform(data['type'])

X_text = data['posts']
y = data['type_code']
X_train_text, X_test_text, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, random_state=42)
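
To see how the label codes are assigned, the following sketch runs LabelEncoder on a small made-up table. The toy DataFrame and the names le_demo and infp_code are assumptions for illustration. It also passes stratify to train_test_split, an optional variation on the split above that keeps each type's share equal in both subsets.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Toy stand-in for the MBTI dataset: 16 types, 5 rows each (hypothetical data).
types = ["ENFJ", "ENFP", "ENTJ", "ENTP", "ESFJ", "ESFP", "ESTJ", "ESTP",
         "INFJ", "INFP", "INTJ", "INTP", "ISFJ", "ISFP", "ISTJ", "ISTP"]
toy = pd.DataFrame({
    "type": [t for t in types for _ in range(5)],
    "posts": [f"sample post {i}" for i in range(len(types) * 5)],
})

le_demo = LabelEncoder()
toy["type_code"] = le_demo.fit_transform(toy["type"])

# LabelEncoder sorts class names, so codes follow alphabetical order.
infp_code = int(le_demo.transform(["INFP"])[0])
print("INFP ->", infp_code)

# stratify keeps each type's share the same in the train and test subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["posts"], toy["type_code"],
    test_size=0.2, random_state=42, stratify=toy["type_code"])
print(y_te.value_counts().to_dict())
```

Stratifying can matter when some types are rare, because a purely random split may leave very few examples of them in the test set.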

Step 4: TF-IDF Vectorization of Text Data

Now we:

  • Convert raw text posts into sparse matrices of TF-IDF features.
  • Limit the vocabulary to the 3000 most frequent terms for tractability.
  • Remove common English stop words to reduce noise.
Python
vectorizer = TfidfVectorizer(max_features=3000, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train_text)
X_test_tfidf = vectorizer.transform(X_test_text)

Step 5: Simulate Questionnaire Data for Training

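The dataset contains no questionnaire responses, so we generate random binary answers to five questions as placeholder features. Because these values are independent of the MBTI labels, they add no predictive signal during training; they only keep the feature layout consistent with the questionnaire used at prediction time.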

Python
import numpy as np

num_train = X_train_tfidf.shape[0]
num_test = X_test_tfidf.shape[0]
num_questions = 5

np.random.seed(42)
X_train_q = np.random.randint(0, 2, size=(num_train, num_questions))
X_test_q = np.random.randint(0, 2, size=(num_test, num_questions))
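
If simulated answers should at least resemble what a person of a given type might choose, one option is to derive them from the type's letters and flip each answer with some probability. The sketch below is an assumption, not part of the pipeline above: the mapping QUESTION_LETTERS, the function simulate_answers and the flip_prob parameter are all hypothetical. Answers generated this way leak the label into the features, so accuracy measured with them would be inflated; really collected answers are needed for a meaningful evaluation.

```python
import numpy as np

# Assumed mapping (hypothetical): each question probes one letter position
# of the type; the first letter maps to answer 0, the second to answer 1.
QUESTION_LETTERS = [
    (0, "E", "I"),  # social behavior
    (1, "S", "N"),  # information processing
    (2, "T", "F"),  # decision making
    (3, "J", "P"),  # work style
    (2, "T", "F"),  # values (reuses the T/F letter)
]

def simulate_answers(mbti_type, flip_prob=0.1, rng=None):
    """Return five 0/1 answers implied by the type, each flipped with flip_prob."""
    rng = np.random.default_rng() if rng is None else rng
    ideal = np.array([0 if mbti_type[pos] == first else 1
                      for pos, first, _second in QUESTION_LETTERS])
    flips = rng.random(len(ideal)) < flip_prob
    return np.where(flips, 1 - ideal, ideal)

rng = np.random.default_rng(42)
print(simulate_answers("INFP", flip_prob=0.0, rng=rng))
print(simulate_answers("ESTJ", flip_prob=0.2, rng=rng))
```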

Step 6: Combine Text and Questionnaire Features

Now we:

  • Horizontally stack the TF-IDF vectors and the questionnaire answer vectors.
  • Combine text content and survey responses into one feature matrix.
  • Use hstack, which efficiently handles sparse text vectors combined with dense questionnaire data.
Python
from scipy.sparse import hstack

X_train_combined = hstack([X_train_tfidf, X_train_q])
X_test_combined = hstack([X_test_tfidf, X_test_q])
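
The effect of hstack can be checked on a two-row example. The matrices text_part and answer_part are made-up stand-ins for the TF-IDF and questionnaire blocks.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, issparse

text_part = csr_matrix(np.array([[0.0, 0.5, 0.0],
                                 [0.7, 0.0, 0.0]]))  # stands in for TF-IDF rows
answer_part = np.array([[1, 0],
                        [0, 1]])                     # stands in for 0/1 answers

# Both blocks need the same number of rows; their columns are concatenated.
combined = hstack([text_part, answer_part]).tocsr()

print(combined.shape)
print(issparse(combined))
print(combined.toarray())
```

The result stays sparse, so the questionnaire columns are appended without expanding the text features into a dense array.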

Step 7: Train Random Forest Model and Evaluate Performance

  • RandomForestClassifier: An ensemble tree-based model that combines many decision trees to improve accuracy and reduce overfitting.
  • n_estimators=100 specifies 100 trees in the forest.
  • random_state=42 ensures results can be reproduced.
  • After training on both text features and questionnaire answers, it predicts on the unseen test set.
  • accuracy_score: Shows overall proportion of correctly predicted instances.
  • classification_report: Provides detailed metrics per MBTI category for a nuanced evaluation.
Python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_combined, y_train)

y_pred = model.predict(X_test_combined)

print("Accuracy on test set:", accuracy_score(y_test, y_pred))
print("Classification report:\n", classification_report(
    y_test, y_pred, target_names=le.classes_))

Output:

[Screenshot: Training and Testing]
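
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below uses scikit-learn's DummyClassifier on made-up labels; if the type distribution in the real data is skewed, always predicting the most common type can already score well. The arrays y_demo and X_demo are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: class 0 makes up 60% of the data.
y_demo = np.array([0] * 60 + [1] * 25 + [2] * 15)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))

print("Majority-class baseline accuracy:", baseline_acc)
```

Comparing the Random Forest's test accuracy and per-class recall against such a baseline shows how much the model has actually learned.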

Step 8: Save Trained Model and Vectorizer for Use

Now we save the trained Random Forest model and all encoders/vectorizers to disk. These files are loaded later for interactive prediction after deployment.

To learn more about saving and reusing models, see: Save and Load Machine Learning Models.

Python
import joblib

joblib.dump(model, "hybrid_personality_model.joblib")
joblib.dump(vectorizer, "tfidf_vectorizer.joblib")
joblib.dump(le, "label_encoder.joblib")
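
A simple way to gain confidence in the saved files is a round trip: dump, load, and compare predictions. This is a minimal sketch on synthetic data; the file name demo_model.joblib and the variables clf and restored are assumptions for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic problem, used only to have a fitted model to save.
rng = np.random.default_rng(0)
X_demo = rng.random((40, 4))
y_demo = (X_demo[:, 0] > 0.5).astype(int)
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(clf, path)
    restored = joblib.load(path)

# A restored model should reproduce the original predictions exactly.
same_predictions = bool(np.array_equal(clf.predict(X_demo), restored.predict(X_demo)))
print("Predictions match after reload:", same_predictions)
```

Saved models are generally best loaded with the same library versions that produced them.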

Step 9: Load Saved Models and Personality Description File

Here we:

  • Load the trained classifier, vectorizer and label encoder for inference.
  • Load a JSON file with textual personality descriptions for each MBTI type.
  • Use these descriptions to show detailed feedback on predictions.

The JSON file with personality descriptions can be downloaded from here.

Python
import numpy as np
from scipy.sparse import hstack
import joblib
import chromadb
import json

model = joblib.load("hybrid_personality_model.joblib")
vectorizer = joblib.load("tfidf_vectorizer.joblib")
le = joblib.load("label_encoder.joblib")

with open("personality_descriptions.json", "r") as f:
    personality_descriptions = json.load(f)
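
The exact layout of personality_descriptions.json is not shown here. The sketch below assumes a flat object that maps each four-letter type to one description string; the sample_json text and its placeholder descriptions are hypothetical.

```python
import json

# Hypothetical file content: one description string per MBTI type.
sample_json = """
{
  "INFP": "Placeholder description for INFP.",
  "ESTJ": "Placeholder description for ESTJ."
}
"""
descriptions_demo = json.loads(sample_json)

# .get() with a default avoids a KeyError for types missing from the file.
print(descriptions_demo.get("INFP", "Description not available."))
print(descriptions_demo.get("ENTP", "Description not available."))
```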

Step 10: Questionnaire Setup and Interactive User Input

Now we:

  • Define the 5 MBTI survey questions with two answer options each.
  • Get a freeform self-description from the user.
  • Ask each MBTI question in turn and collect the responses as binary 0/1 values.
Python
questions = [
    ("At social events, I usually:", "Meet and talk with many new people",
     "Stick with a small group of close friends"),
    ("When focusing on information, I prefer:", "Concrete facts and practical details",
     "Abstract ideas and imaginative concepts"),
    ("When making decisions, I rely on:",
     "Logic and objective analysis", "Feelings and harmony"),
    ("My work style tends to be:", "Organized and planned", "Flexible and spontaneous"),
    ("I value:", "Fairness and impartiality", "Harmony and kindness"),
]

print("Please enter a brief description about yourself:")
user_text = input("> ").strip()

answers = []
for idx, (q, a, b) in enumerate(questions):
    print(f"\nQ{idx+1}: {q}")
    print(f"  1. {a}")
    print(f"  2. {b}")
    while True:
        inp = input("Choose 1 or 2: ").strip()
        if inp in ("1", "2"):
            answers.append(int(inp) - 1)
            break
        else:
            print("Invalid choice, please enter 1 or 2.")

Output:

[Screenshot: Questions]
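
For testing or for a non-interactive front end, the validation logic can be separated from input(). The helpers parse_choice and collect_answers below are hypothetical names; they apply the same rule as the loop above, mapping "1" to 0 and "2" to 1.

```python
def parse_choice(raw):
    """Map the text '1' or '2' to the feature value 0 or 1; return None otherwise."""
    cleaned = raw.strip()
    if cleaned in ("1", "2"):
        return int(cleaned) - 1
    return None

def collect_answers(raw_inputs, num_questions=5):
    """Convert a sequence of raw replies into a list of 0/1 answers."""
    answers_demo = [parse_choice(raw) for raw in raw_inputs]
    if len(answers_demo) != num_questions or None in answers_demo:
        raise ValueError("Expected exactly %d answers of '1' or '2'." % num_questions)
    return answers_demo

print(collect_answers(["1", "2", " 1", "2 ", "2"]))
```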

Step 11: Vectorize Input and Combine Features

  • Converts the user’s text into a TF-IDF vector (same space as training).
  • Formats questionnaire answers as a numeric feature vector.
  • Stacks both into one hybrid vector for prediction.
Python
text_vec = vectorizer.transform([user_text])
answer_vec = np.array(answers).reshape(1, -1)
hybrid_vec = hstack([text_vec, answer_vec])
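
The hybrid vector must have exactly as many columns as the model saw during training, in the same order. The sketch below shows one way to check this on toy data using the n_features_in_ attribute of a fitted scikit-learn estimator; docs, answers_train, vec_demo and clf are made up for the example.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training data (hypothetical) standing in for posts and answers.
docs = ["quiet book evening", "party people energy",
        "plan schedule list", "music travel fun"]
answers_train = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
labels = [0, 1, 0, 1]

vec_demo = TfidfVectorizer()
X_demo = hstack([vec_demo.fit_transform(docs), answers_train]).tocsr()
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, labels)

# Inference: the same fitted vectorizer, then answers in the same column order.
user_vec = hstack([vec_demo.transform(["quiet evening with music"]),
                   np.array([[1, 0]])]).tocsr()

expected = clf.n_features_in_
if user_vec.shape[1] != expected:
    raise ValueError(f"Got {user_vec.shape[1]} features, model expects {expected}.")
print("Feature count matches:", user_vec.shape[1])
```

Words the vectorizer never saw during fitting, such as "with" here, are simply ignored rather than adding columns.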

Step 12: Make Personality Prediction and Output Description

Now we:

  • Pass the combined features through the trained model to predict the MBTI label code.
  • Convert the numeric MBTI code back to its string label.
  • Retrieve and print the detailed MBTI description for the user.
Python
pred_code = model.predict(hybrid_vec)[0]
pred_type = le.inverse_transform([pred_code])[0]
description = personality_descriptions.get(
    pred_type, "Description not available.")

print(f"\nYour MBTI personality type is: {pred_type}")
print(description)

Output:

[Screenshot: Personality Predicted by Model]

The model predicts a personality type from the user's self-description combined with their questionnaire answers.
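
A single predicted label hides how confident the model is. As a possible extension, predict_proba can rank several candidate types. This is a sketch on synthetic data; the function top_k_types and the three-type toy problem are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Synthetic three-class problem (hypothetical) to have a fitted model.
rng = np.random.default_rng(0)
X_demo = rng.random((90, 5))
types_demo = np.array(["INFP", "ENTJ", "ISTP"] * 30)
le_demo = LabelEncoder()
y_demo = le_demo.fit_transform(types_demo)
clf = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)

def top_k_types(model, encoder, features, k=3):
    """Return the k most probable type labels with their probabilities."""
    proba = model.predict_proba(features)[0]
    order = np.argsort(proba)[::-1][:k]          # indices of the largest probabilities
    labels = encoder.inverse_transform(model.classes_[order])
    return [(str(label), float(proba[i])) for label, i in zip(labels, order)]

ranking = top_k_types(clf, le_demo, X_demo[:1], k=3)
for label, p in ranking:
    print(f"{label}: {p:.2f}")
```

For a Random Forest these probabilities are averages over the trees' predictions, so they are best read as a rough ranking rather than calibrated confidence.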

Step 13: Store the Profile in ChromaDB Vector Database

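Here we:

  • Connect to ChromaDB (a local vector database) to store the user's profile embedding.
  • Attach metadata containing the MBTI type, answers and user text for rich querying.
  • Use a UUID string as a unique identifier for each stored profile.
  • Keep the profile for future user comparisons, recommendations or analytics. Note that chromadb.Client() holds data in memory only; chromadb.PersistentClient(path=...) can be used instead if profiles must survive a restart.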
Python
import uuid
client = chromadb.Client()
collection = client.get_or_create_collection(name='personality_profiles')

metadata = {
    "mbti_type": pred_type,
    "answers": json.dumps(answers),
    "user_text": user_text
}

unique_id = str(uuid.uuid4())

collection.add(
    embeddings=hybrid_vec.toarray().tolist(),
    metadatas=[metadata],
    ids=[unique_id]
)

print("\nYour profile has been saved to the personality database.")

Output:

Your profile has been saved to the personality database.
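
Comparing users comes down to measuring how close their stored vectors are. The sketch below computes cosine similarity with NumPy on made-up five-dimensional vectors to show the idea; it does not call ChromaDB, and the distance metric a given collection uses may differ. The names cosine_similarity, stored and query are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical stored profile vectors keyed by profile id.
stored = {
    "profile-a": [0.2, 0.0, 0.5, 1.0, 0.0],
    "profile-b": [0.0, 0.9, 0.0, 0.0, 1.0],
}
query = [0.1, 0.0, 0.6, 1.0, 0.0]

scores = {pid: cosine_similarity(query, vec) for pid, vec in stored.items()}
best_match = max(scores, key=scores.get)
print(best_match, round(scores[best_match], 3))
```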

Step 14: Access the Database

We can query the ChromaDB collection to:

  • Retrieve the IDs of all stored profiles.
  • Retrieve their metadata (user text, answers and MBTI type).
Python
client = chromadb.Client()
collection = client.get_collection(name='personality_profiles')

results = collection.get()

print("Stored profile IDs:", results['ids'])
print("Stored metadata example:", results['metadatas'])

Output:

Stored profile IDs: ['ff6ea2d8-0b78-47ea-b125-0d9baec116a2', '3665925b-1b07-489b-9108-7f4ad3914618']
Stored metadata example: [{'user_text': 'I am a calm person and an extrovert. I love to to explore things', 'mbti_type': 'INFP', 'answers': '[0, 1, 0, 1, 1]'},
{'mbti_type': 'INFP', 'answers': '[1, 0, 1, 0, 1]', 'user_text': 'I am a sad person'}]

The complete notebook can be downloaded from here.
