The Myers-Briggs Type Indicator (MBTI) classifies personalities into 16 distinct types based on four dimensions that describe how people perceive the world and make decisions. In this article, we will build a machine learning model that predicts a person's MBTI type from a short self-description and answers to an MBTI-style survey. The model:
- Learns from a dataset of social media posts labeled with MBTI types.
- Converts the text into numerical features using TF-IDF vectorization, which captures the importance of words.
- Combines the text features with simulated or collected questionnaire answers representing preferences in social behavior, information processing, decision making, work style and values.
- Trains a Random Forest classifier on this hybrid data to predict the personality type.
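To make the "16 types from four dimensions" idea concrete, here is a small sketch that enumerates the type codes. The letter pairs follow the usual MBTI convention (E/I, S/N, T/F, J/P); the variable names are only illustrative.

```python
from itertools import product

# Each of the four MBTI dimensions has two poles, written as one letter each.
DIMENSIONS = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]

# Picking one letter per dimension gives 2 * 2 * 2 * 2 = 16 type codes.
mbti_types = ["".join(letters) for letters in product(*DIMENSIONS)]

print(len(mbti_types))
print(mbti_types[:4])
```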
Step-by-Step Implementation
Let's build our prediction model step by step and use it to predict our personality type:
Step 1: Install dependencies
We will install the required packages:
- sentence-transformers to generate embeddings for semantic similarity and search.
- chromadb for vector database storage of user profiles.
- joblib for saving and loading models.
- pandas and numpy for data manipulation and numerical operations.
- scikit-learn and scipy for the ML modules and sparse matrices.
!pip install sentence-transformers chromadb joblib numpy pandas scikit-learn scipy
Step 2: Import Libraries and Load Data
We will import the required libraries for our model and load the MBTI dataset, which contains user posts and their MBTI labels:
- pandas: Used for data manipulation and loading CSV files.
- LabelEncoder: Converts MBTI personality type labels (strings) into numeric codes for classification.
- train_test_split: Splits dataset into training and testing subsets.
- TfidfVectorizer: Converts user text data (posts) into numerical vectors using TF-IDF vectorization.
The MBTI dataset can be downloaded from here.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy
data = pd.read_csv('mbti_1.csv')
Step 3: Encode Personality Labels and Split Dataset
We will encode the labels and split the dataset for training and testing:
- LabelEncoder transforms MBTI labels into integers. Classes are sorted alphabetically, so with all 16 types present 'INFP' -> 9.
- The posts (X_text) and label codes (y) are separated.
- The data is split 80% for training and 20% for testing to evaluate model generalization.
le = LabelEncoder()
data['type_code'] = le.fit_transform(data['type'])
X_text = data['posts']
y = data['type_code']
X_train_text, X_test_text, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, random_state=42)
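MBTI datasets are often imbalanced, with some types appearing far more than others. As an optional variation that is not part of the original pipeline, train_test_split accepts a stratify argument that keeps class proportions similar in both subsets. The sketch below uses made-up toy labels to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy, made-up data: 80 samples of class 0 and 20 of class 1 (imbalanced).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the 80/20 class ratio in both train and test subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

print("Train class counts:", np.bincount(y_tr))
print("Test class counts:", np.bincount(y_te))
```

In the tutorial's code this would mean adding stratify=y to the existing call. Stratification requires at least two samples of every class.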
Step 4: TF-IDF Vectorization of Text Data
Now we:
- Convert raw text posts into sparse matrices of TF-IDF features.
- Limit the vocabulary to the 3000 most frequent terms for tractability.
- Remove common English stop words to reduce noise.
vectorizer = TfidfVectorizer(max_features=3000, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train_text)
X_test_tfidf = vectorizer.transform(X_test_text)
Step 5: Simulate Questionnaire Data for Training
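We simulate questionnaire answers for training the model: each of the 5 questions gets a random binary answer (0 or 1), drawn independently of the MBTI labels.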
import numpy as np
num_train = X_train_tfidf.shape[0]
num_test = X_test_tfidf.shape[0]
num_questions = 5
np.random.seed(42)
X_train_q = np.random.randint(0, 2, size=(num_train, num_questions))
X_test_q = np.random.randint(0, 2, size=(num_test, num_questions))
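Answers drawn at random carry no information about the label, so the classifier cannot learn from them. One possible alternative, which is an assumption of this sketch and not what the code above does, is to derive each simulated answer from the corresponding letter of the MBTI label and flip it with some probability to mimic inconsistent respondents. The question-to-letter mapping assumes the five questions used later in Step 10; simulate_answers, QUESTION_MAP and the noise parameter are hypothetical names.

```python
import numpy as np

# Assumed mapping: (position of the letter in the type code, letter encoded as 1).
# Answer 0 = first option, 1 = second option, matching the Step 10 questions:
# social (E/I), information (S/N), decisions (T/F), work style (J/P), values (T/F).
QUESTION_MAP = [(0, "I"), (1, "N"), (2, "F"), (3, "P"), (2, "F")]


def simulate_answers(mbti_type, noise=0.1, rng=None):
    """Return 5 binary answers consistent with mbti_type, each flipped with probability noise."""
    rng = np.random.default_rng() if rng is None else rng
    answers = np.array(
        [int(mbti_type[pos] == letter) for pos, letter in QUESTION_MAP])
    flips = rng.random(len(answers)) < noise
    return np.where(flips, 1 - answers, answers)


rng = np.random.default_rng(42)
print(simulate_answers("INFP", noise=0.0, rng=rng))
print(simulate_answers("ESTJ", noise=0.0, rng=rng))
```

Features built this way are derived from the label itself, so test accuracy would mostly reflect the chosen noise level rather than real-world performance. Real survey responses are still needed for an honest evaluation.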
Step 6: Combine Text and Questionnaire Features
Now we:
- Horizontally stack the TF-IDF vectors and the questionnaire answer vectors.
- Combine text content and survey responses into one feature matrix.
- Use scipy's hstack, which efficiently combines the sparse text vectors with the dense questionnaire data.
from scipy.sparse import hstack
X_train_combined = hstack([X_train_tfidf, X_train_q])
X_test_combined = hstack([X_test_tfidf, X_test_q])
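The stacking step is easier to follow on a tiny example. The sketch below uses made-up numbers: a 2 x 4 sparse block standing in for TF-IDF features and a 2 x 5 dense block standing in for questionnaire answers.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy sparse "text" features: 2 samples x 4 terms.
text_part = csr_matrix(np.array([[0.0, 0.5, 0.0, 0.2],
                                 [0.7, 0.0, 0.0, 0.0]]))
# Toy dense questionnaire answers: 2 samples x 5 questions.
answer_part = np.array([[0, 1, 0, 1, 1],
                        [1, 0, 1, 0, 1]])

# hstack needs the same number of rows in every block.
# tocsr() converts the result to a row-indexable sparse format.
combined = hstack([text_part, answer_part]).tocsr()

print(combined.shape)
print(combined.toarray())
```

The answer columns always come after the text columns, so the same order has to be used at prediction time.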
Step 7: Train Random Forest Model and Evaluate Performance
- RandomForestClassifier: Random Forest classifier is an ensemble tree-based model that combines many decision trees to improve accuracy and reduce overfitting.
- n_estimators=100 specifies 100 trees in the forest.
- random_state=42 ensures results can be reproduced.
- After training on both text features and questionnaire answers, it predicts on the unseen test set.
- accuracy_score: Shows overall proportion of correctly predicted instances.
- classification_report: Provides detailed metrics per MBTI category for a nuanced evaluation.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_combined, y_train)
y_pred = model.predict(X_test_combined)
print("Accuracy on test set:", accuracy_score(y_test, y_pred))
print("Classification report:\n", classification_report(
    y_test, y_pred, target_names=le.classes_))
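Accuracy on a 16-class, possibly imbalanced problem is easier to interpret next to a trivial baseline. The sketch below, which is an addition and not part of the original notebook, uses scikit-learn's DummyClassifier on made-up toy labels to show what always predicting the most frequent class would score.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy, made-up labels: class 0 dominates, as a common type might in real data.
y_train_toy = np.array([0] * 70 + [1] * 20 + [2] * 10)
y_test_toy = np.array([0] * 7 + [1] * 2 + [2] * 1)
# The baseline ignores features, so placeholder zeros are enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)
baseline_acc = accuracy_score(y_test_toy, baseline.predict(X_test_toy))

print("Baseline accuracy:", baseline_acc)
```

With the real data, fitting the same baseline on X_train_combined and y_train gives a reference point for the Random Forest's accuracy.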
Step 8: Save Trained Model and Vectorizer for Use
Now we save the trained Random Forest model, the TF-IDF vectorizer and the label encoder to disk. These files are loaded later for interactive prediction after deployment.
To learn more about saving and reusing models, refer to: Save and Load Machine Learning Models.
import joblib
joblib.dump(model, "hybrid_personality_model.joblib")
joblib.dump(vectorizer, "tfidf_vectorizer.joblib")
joblib.dump(le, "label_encoder.joblib")
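A quick way to gain confidence in the saved files is a round trip: dump, load, and compare predictions. The sketch below does this with a toy model in a temporary directory; the data and file name are made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy, made-up training data just to have a fitted model to save.
rng = np.random.default_rng(0)
X_toy = rng.random((40, 6))
y_toy = (X_toy[:, 0] > 0.5).astype(int)
toy_model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(toy_model.predict(X_toy), restored.predict(X_toy))
print("Predictions match after reload:", same)
```

joblib files are pickle-based, so they should only be loaded from trusted sources, ideally with the same scikit-learn version that created them.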
Step 9: Load Saved Models and Personality Description File
Here we:
- Load the trained classifier, vectorizer and label encoder for inference.
- Load a JSON file with textual personality descriptions for each MBTI type.
- Use these descriptions to show detailed feedback on predictions.
The JSON file with personality descriptions can be downloaded from here.
import numpy as np
from scipy.sparse import hstack
import joblib
import chromadb
import json
model = joblib.load("hybrid_personality_model.joblib")
vectorizer = joblib.load("tfidf_vectorizer.joblib")
le = joblib.load("label_encoder.joblib")
with open("personality_descriptions.json", "r") as f:
    personality_descriptions = json.load(f)
Step 10: Questionnaire Setup and Interactive User Input
Now we:
- Define the 5 MBTI survey questions with two answer options each.
- Get a free-form self-description from the user.
- Ask each MBTI question in turn and collect the responses as binary 0/1 values.
questions = [
    ("At social events, I usually:", "Meet and talk with many new people",
     "Stick with a small group of close friends"),
    ("When focusing on information, I prefer:", "Concrete facts and practical details",
     "Abstract ideas and imaginative concepts"),
    ("When making decisions, I rely on:",
     "Logic and objective analysis", "Feelings and harmony"),
    ("My work style tends to be:", "Organized and planned", "Flexible and spontaneous"),
    ("I value:", "Fairness and impartiality", "Harmony and kindness"),
]
print("Please enter a brief description about yourself:")
user_text = input("> ").strip()
answers = []
for idx, (q, a, b) in enumerate(questions):
    print(f"\nQ{idx+1}: {q}")
    print(f" 1. {a}")
    print(f" 2. {b}")
    while True:
        inp = input("Choose 1 or 2: ").strip()
        if inp in ("1", "2"):
            # Store option 1 as 0 and option 2 as 1
            answers.append(int(inp) - 1)
            break
        else:
            print("Invalid choice, please enter 1 or 2.")
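The loop above reads directly from the keyboard, which makes it hard to test. One way to restructure it, shown here as a sketch with a hypothetical helper name ask_choice, is to pass the input and output functions in as parameters so scripted answers can stand in for a real user.

```python
def ask_choice(prompt, read=input, write=print):
    """Ask until the reply is '1' or '2'; return 0 for option 1 and 1 for option 2."""
    while True:
        reply = read(prompt).strip()
        if reply in ("1", "2"):
            return int(reply) - 1
        write("Invalid choice, please enter 1 or 2.")


# Scripted replies stand in for a user: one invalid entry, then a valid one.
scripted = iter(["3", "2"])
messages = []
result = ask_choice("Choose 1 or 2: ", read=lambda _: next(scripted),
                    write=messages.append)
print(result, messages)
```

In interactive use, calling ask_choice("Choose 1 or 2: ") with the defaults behaves like the original inner loop.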
Step 11: Vectorize Input and Combine Features
- Converts the user’s text into a TF-IDF vector (same space as training).
- Formats questionnaire answers as a numeric feature vector.
- Stacks both into one hybrid vector for prediction.
text_vec = vectorizer.transform([user_text])
answer_vec = np.array(answers).reshape(1, -1)
hybrid_vec = hstack([text_vec, answer_vec])
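The hybrid vector must have exactly as many columns as the matrix the model was trained on (TF-IDF vocabulary size plus the number of questions). scikit-learn estimators record this in the n_features_in_ attribute after fitting, so a quick check can catch a mismatch, for example when the question list changes. The toy model and the helper name check_width below are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier


def check_width(fitted_model, features):
    """Raise a clear error if features has a different column count than the training data."""
    expected = fitted_model.n_features_in_
    actual = features.shape[1]
    if actual != expected:
        raise ValueError(f"Expected {expected} features, got {actual}")
    return True


# Toy model trained on 4 "text" columns + 5 "answer" columns = 9 features.
rng = np.random.default_rng(0)
toy_model = RandomForestClassifier(n_estimators=5, random_state=42)
toy_model.fit(rng.random((20, 9)), rng.integers(0, 2, size=20))

good = hstack([csr_matrix(np.zeros((1, 4))), np.array([[0, 1, 0, 1, 1]])])
bad = hstack([csr_matrix(np.zeros((1, 4))), np.array([[0, 1, 0]])])

print(check_width(toy_model, good))
try:
    check_width(toy_model, bad)
except ValueError as err:
    print(err)
```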
Step 12: Make Personality Prediction and Output Description
Now we:
- Pass the combined features through the trained model to predict the MBTI label code.
- Convert the numeric MBTI code back to its string label.
- Retrieve and print the detailed MBTI description for user clarity.
pred_code = model.predict(hybrid_vec)[0]
pred_type = le.inverse_transform([pred_code])[0]
description = personality_descriptions.get(
    pred_type, "Description not available.")
print(f"\nYour MBTI personality type is: {pred_type}")
print(description)
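A single predicted label hides how confident the model is. RandomForestClassifier also provides predict_proba, which returns one probability per class. The sketch below trains a toy model on made-up data and lists the most likely types; top_k_types is a hypothetical helper, and the three labels are only placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder


def top_k_types(fitted_model, encoder, features, k=3):
    """Return the k most probable labels with their probabilities for one sample."""
    probs = fitted_model.predict_proba(features)[0]
    order = np.argsort(probs)[::-1][:k]
    labels = encoder.inverse_transform(fitted_model.classes_[order])
    return list(zip(labels, probs[order]))


# Toy, made-up data with three placeholder types.
rng = np.random.default_rng(0)
toy_le = LabelEncoder()
y_toy = toy_le.fit_transform(rng.choice(["INFP", "ENTJ", "ISTP"], size=60))
X_toy = rng.random((60, 4))
toy_model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_toy, y_toy)

ranked = top_k_types(toy_model, toy_le, X_toy[:1], k=3)
for label, prob in ranked:
    print(f"{label}: {prob:.2f}")
```

With the real model, the same helper could be called as top_k_types(model, le, hybrid_vec) to show the runner-up types alongside the top prediction.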
As we can see, the model predicts a personality type from the user's self-description and questionnaire answers.
Step 13: Store the Profile in ChromaDB Vector Database
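Here we:
- Connect to ChromaDB (a local vector DB) and store the user's hybrid feature vector as the profile embedding.
- Store the MBTI type, answers and user text as metadata for rich querying.
- Use a unique UUID string as the identifier for each stored profile.
- Keep the profile for future user comparisons, recommendations or analytics. Note that chromadb.Client() is in-memory by default, so profiles last only for the current session; chromadb.PersistentClient(path=...) writes them to disk instead.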
import uuid
client = chromadb.Client()
collection = client.get_or_create_collection(name='personality_profiles')
metadata = {
    "mbti_type": pred_type,
    "answers": json.dumps(answers),
    "user_text": user_text
}
unique_id = str(uuid.uuid4())
collection.add(
    embeddings=hybrid_vec.toarray().tolist(),
    metadatas=[metadata],
    ids=[unique_id]
)
print("\nYour profile has been saved to the personality database.")
Output:
Your profile has been saved to the personality database.
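Stored profile vectors become useful once they are compared with each other. A vector database does this with a distance metric; as a plain NumPy illustration of the idea (not ChromaDB's own API), the sketch below ranks made-up profile vectors by cosine similarity to a query vector.

```python
import numpy as np


def cosine_similarity(query, profiles):
    """Cosine similarity between one query vector and each row of profiles."""
    query_norm = np.linalg.norm(query)
    profile_norms = np.linalg.norm(profiles, axis=1)
    # Guard against division by zero for all-zero vectors.
    denom = np.maximum(query_norm * profile_norms, 1e-12)
    return profiles @ query / denom


# Toy, made-up hybrid vectors (a few TF-IDF weights followed by binary answers).
profiles = np.array([
    [0.2, 0.0, 0.5, 1.0, 0.0, 1.0],
    [0.0, 0.9, 0.0, 0.0, 1.0, 0.0],
    [0.2, 0.0, 0.4, 1.0, 0.0, 1.0],
])
query = np.array([0.2, 0.0, 0.5, 1.0, 0.0, 1.0])

scores = cosine_similarity(query, profiles)
ranking = np.argsort(scores)[::-1]
print(ranking, np.round(scores, 3))
```

In ChromaDB itself, collection.query with query_embeddings and n_results performs this kind of nearest-neighbour lookup using the collection's configured distance.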
Step 14: Access the Database
We can read back from the ChromaDB collection to retrieve all stored IDs and metadata (the user texts, answers and MBTI types):
client = chromadb.Client()
collection = client.get_collection(name='personality_profiles')
results = collection.get()
print("Stored profile IDs:", results['ids'])
print("Stored metadata example:", results['metadatas'])
Output:
Stored profile IDs: ['ff6ea2d8-0b78-47ea-b125-0d9baec116a2', '3665925b-1b07-489b-9108-7f4ad3914618']
Stored metadata example: [{'user_text': 'I am a calm person and an extrovert. I love to to explore things', 'mbti_type': 'INFP', 'answers': '[0, 1, 0, 1, 1]'},
{'mbti_type': 'INFP', 'answers': '[1, 0, 1, 0, 1]', 'user_text': 'I am a sad person'}]
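Since the answers were stored as a JSON string, they need decoding before further use. The sketch below works on a dictionary shaped like the output above (with placeholder IDs), restores the answers to Python lists and counts the stored types.

```python
import json
from collections import Counter

# Same shape as the collection.get() output shown above; IDs are placeholders.
results = {
    "ids": ["id-1", "id-2"],
    "metadatas": [
        {"mbti_type": "INFP", "answers": "[0, 1, 0, 1, 1]",
         "user_text": "I am a calm person and an extrovert. I love to to explore things"},
        {"mbti_type": "INFP", "answers": "[1, 0, 1, 0, 1]",
         "user_text": "I am a sad person"},
    ],
}

# Decode the JSON-encoded answers back into lists of integers.
profiles = [
    {"id": pid, "mbti_type": meta["mbti_type"],
     "answers": json.loads(meta["answers"])}
    for pid, meta in zip(results["ids"], results["metadatas"])
]
type_counts = Counter(p["mbti_type"] for p in profiles)

print(profiles[0]["answers"])
print(type_counts)
```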
The complete notebook can be downloaded from here.