Create a Movie Recommendation Engine with Milvus and Python
Use open source tools to help people find movies they might want to watch.
Nov 17th, 2023 9:32am by
Featured image by Tima Miroshnichenko on Pexels.
Zilliz sponsored this post.
Setting Up the Environment
For this article, you’ll need the following requirements installed:- Python 3.x
- Python Package Manager (PIP) to install packages
- Jupyter Notebook for writing code
- Docker
- A system with at least 32GB of RAM or a Zilliz Cloud account
Python Requirements
You also need to install a set of libraries that will be needed throughout this tutorial. Install the libraries using PIP:
$ python -m pip install pymilvus pandas sentence_transformers Kaggle
Vectors Data Store (Milvus)
You’ll use the Milvus vector database to store the embeddings that you’ll generate using movie descriptions. The dataset is relatively large, at least for a server running on a personal computer. So you may want to use a Zilliz Cloud instance to store these vectors. If you prefer to stay with a local instance, you can download a docker-compose configuration and run it:
$ wget https://github.com/milvus-io/milvus/releases/download/v2.3.0/milvus-standalone-docker-compose.yml -O docker-compose.yml
$ docker-compose up -d
Data Collection and Preprocessing
For this project, you’ll use the Movies Dataset from Kaggle which contains metadata for 45,000 movies. You can download this dataset directly or use the Kaggle API to download the dataset using Python. To do it with Python, you need to download the kaggle.json file from the profile section of Kaggle.com and put it in a location where the API will find it. Next, set up some environment variables for Kaggle authentication. For this, you can open the Jupyter Notebook and write the following lines of code:
%env KAGGLE_USERNAME=username
%env KAGGLE_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
%env TOKENIZERS_PARALLELISM=true
# import kaggle dependency
import kaggle
kaggle.api.authenticate()
kaggle.api.dataset_download_files('rounakbanik/the-movies-dataset', path='dataset', unzip=True)
# import pandas
import pandas as pd
# read csv data
movies=pd.read_csv('dataset/movies_metadata.csv',low_memory=False)
# check shape of data
movies.shape
Database records and columns
# check column names
movies.columns

Dataset columns
# filter required columns
trimmed_movies = movies[["id", "title", "overview", "release_date", "genres"]]
trimmed_movies.head(5)

Required columns
unclean_movies_dict = trimmed_movies.to_dict('records')
print('{} movies'.format(len(unclean_movies_dict)))
movies_dict = []
for movie in unclean_movies_dict:
if movie["overview"] == movie["overview"] and movie["release_date"] == movie["release_date"] and movie["genres"] == movie["genres"] and movie["title"] == movie["title"]:
movies_dict.append(movie)
Connect to Milvus
Now that you have all the required columns, connect to Milvus to start uploading the data. To connect to the Milvus cloud instance, you’ll need the Uniform Resource Identifier (URI) and token, which you can download from your Zilliz Cloud dashboard.
Zilliz Cloud dashboard
# import milvus dependency
from pymilvus import *
# connect to milvus
milvus_uri="YOUR_URI"
token="YOUR_API_TOKEN"
connections.connect("default", uri=milvus_uri, token=token)
print("Connected!")
Generate Embeddings for Movies
Now it’s time to calculate the embeddings for the text data in the movie dataset. First, create a collection object that will store the movie ID and embeddings for the text data. Also, create an index field to make searches more efficient:
COLLECTION_NAME = 'film_vectors'
PARTITION_NAME = 'Movie'
# Here's our record schema
"""
"title": Film title,
"overview": description,
"release_date": film release date,
"genres": film generes,
"embedding": embedding
"""
id = FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=500, is_primary=True)
field = FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=384)
schema = CollectionSchema(fields=[id, field], description="movie recommender: film vectors", enable_dynamic_field=True)
if utility.has_collection(COLLECTION_NAME): # drop the same collection created before
collection = Collection(COLLECTION_NAME)
collection.drop()
collection = Collection(name=COLLECTION_NAME, schema=schema)
print("Collection created.")
index_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 128},
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
print("Collection indexed!")
from sentence_transformers import SentenceTransformer
import ast
# function to extract the text from genre column
def build_genres(data):
genres = data['genres']
genre_list = ""
entries= ast.literal_eval(genres)
genres = ""
for entry in entries:
genre_list = genre_list + entry["name"] + ", "
genres += genre_list
genres = "".join(genres.rsplit(",", 1))
return genres
# create an object of SentenceTransformer
transformer = SentenceTransformer('all-MiniLM-L6-v2')
# function to generate embeddings
def embed_movie(data):
embed = "{} Released on {}. Genres are {}.".format(data["overview"], data["release_date"], build_genres(data))
embeddings = transformer.encode(embed)
return embeddings
Send Embeddings to Milvus
Now you can create the embeddings using the embed_movie() method. This dataset is too large to send to Milvus in a single insert statement, but sending data rows one at a time would create unnecessary network traffic and add too much time. So, instead create batches (e.g., 5,000 rows) of data to send to Milvus:
# Loop counter for batching and showing progress
j = 0
batch = []
for movie_dict in movies_dict:
try:
movie_dict["embedding"] = embed_movie(movie_dict)
batch.append(movie_dict)
j += 1
if j % 5 == 0:
print("Embedded {} records".format(j))
collection.insert(batch)
print("Batch insert completed")
batch=[]
except Exception as e:
print("Error inserting record {}".format(e))
pprint(batch)
break
collection.insert(movie_dict)
print("Final batch completed")
print("Finished with {} embeddings".format(j))

Sending embeddings to Milvus
Recommend New Movies Using Milvus
Now you can leverage Milvus’ near real-time vector search functionality to get a close match of movies that meet the viewer’s criteria. For this, create two different functions:- embed_search(): You need a transformer to convert the user’s search string to an embedding. This function takes the viewer’s criteria and passes it to the same transformer you used to populate Milvus.
- search_for_movies(): This function performs the actual vector search using the other function for support.
# load collection memory before search
collection.load()
# Set search parameters
topK = 5
SEARCH_PARAM = {
"metric_type":"L2",
"params":{"nprobe": 20},
}
# convert search string to embeddings
def embed_search(search_string):
search_embeddings = transformer.encode(search_string)
return search_embeddings
# search similar embeddings for user's query
def search_for_movies(search_string):
user_vector = embed_search(search_string)
return collection.search([user_vector],"embedding",param=SEARCH_PARAM, limit=topK, expr=None, output_fields=['title', 'overview'])
from pprint import pprint
search_string = "A comedy from the 1990s set in a hospital. The main characters are in their 20s and are trying to stop a vampire."
results = search_for_movies(search_string)
# check results
for hits in iter(results):
for hit in hits:
print(hit.entity.get('title'))
print(hit.entity.get('overview'))
print("-------------------------------")

Movie recommendation output
Conclusion
After reading this article, you know what a vector-based recommendation system is and how to create a movie recommender system with Milvus. Milvus helps build an efficient and scalable movie recommendation system. Leveraging vector storage and similarity search, Milvus has great potential for enabling personalized recommendations, enhancing user engagement and showcasing the role of advanced vector-based models in modern recommendation systems. You can learn more about it on the Milvus website.
YOUTUBE.COM/THENEWSTACK
Tech moves fast, don't miss an episode. Subscribe to our YouTube
channel to stream all our podcasts, interviews, demos, and more.