Using a Vector Database to Search White House Speeches

Semantic search allows us to look for text that is semantically similar so that we can search for speeches using a general description.

Apr 28th, 2023 7:41am by Yujian Tang


Campaign season for the next U.S. presidential election is coming up. It’s a good time to look back at some of the speeches given by the Biden administration during its first two years in office. Wouldn’t it be great to search through some speech transcripts to learn more about the White House’s messaging about certain topics so far?

Let’s say we want to search the content of a speech. How would we do that? We could use semantic search. Semantic search is one of the hottest topics in artificial intelligence (AI) right now. It has become more important as we’ve seen the rise in popularity of natural language processing (NLP) applications like ChatGPT. Instead of repeatedly pinging GPT, which is both economically and ecologically expensive, we can use a vector database to cache the results (such as with GPTCache).

In this tutorial, we will spin up a vector database locally so we can search Biden’s speeches from 2021 to 2022 by content. We use “The White House (Speeches and Remarks) 12/10/2022” dataset, which we found on Kaggle and made available to download via Google Drive for this example. A walkthrough notebook of this tutorial is available on GitHub.

Before we dive into the code, please make sure to install the prerequisites. We need four libraries: PyMilvus, Milvus, Sentence-Transformers and gdown. You can get them from PyPI by running: `pip3 install pymilvus==2.2.5 sentence-transformers gdown milvus`.

Preparing the White House Speech Dataset

As with almost any artificial intelligence/machine learning project based on real-world datasets, we first need to prepare the data. We use `gdown` to download the dataset and `zipfile` to extract it into a local folder. After running the code below, we expect to see a file titled “The white house speeches.csv” in a folder titled “white_house_2021_2022”.

import zipfile

import gdown

# Download the zipped dataset from Google Drive
url = 'https://drive.google.com/uc?id=10_sVL0UmEog7mczLedK5s1pnlDOz3Ukf'
output = './white_house_2021_2022.zip'
gdown.download(url, output)

# Extract the archive into a local folder
with zipfile.ZipFile(output, "r") as zip_ref:
   zip_ref.extractall("./white_house_2021_2022")

We use `pandas` to load and inspect the CSV data.

import pandas as pd
df = pd.read_csv("./white_house_2021_2022/The white house speeches.csv")
df.head()

What do you notice when you look at the `head` of the data? The first thing I notice is that the data has four columns: title, date and time, location, and speech. The second thing is that there are null values. Null values aren’t always a problem, but they are for our data.

Cleaning the Dataset

Speeches without any substance (null values in the “Speech” column) are entirely useless to us. Let’s drop our null values and reexamine the data.

df = df.dropna()
df

Now we see that there is actually a second problem that wasn’t immediately obvious from looking at just the `head` of the data. If you look at the last entry, you’ll see that it is just a time; “12:18 P.M. EST” is hardly a speech. It doesn’t make sense to save this entry, because we can’t derive any value from its vector embedding. Let’s get rid of all the speeches that are shorter than a certain number of characters. For this example, I’ve chosen 50, but you can choose whatever makes sense to you. I chose 50 after exploring many different numbers. If you look for speech transcripts between 20 and 50 characters, you’ll see many are locations or times with a few random sentences thrown in.

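# Keep only transcripts longer than 50 characters; copy so later column edits don't hit a view of df
cleaned_df = df.loc[df["Speech"].str.len() > 50].copy()
cleaned_df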
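
To see why a threshold around 50 characters is reasonable, it can help to look at what falls just under it. The sketch below uses plain Python on a few made-up transcripts; the sample strings and the helper name `in_length_range` are illustrative assumptions, not values from the dataset. With the real data, the same check would be applied to the “Speech” column.

```python
# Hypothetical sample transcripts; the real values come from the "Speech" column.
samples = [
    "12:18 P.M. EST",
    "East Room 2:05 P.M. EDT Thank you all.",
    "THE PRESIDENT: Good afternoon, everyone. Today I want to talk about jobs and the economy.",
    None,  # a missing transcript, like the null values dropped earlier
]


def in_length_range(text, low, high):
    """Return True if text is a string whose length is between low and high, inclusive."""
    return isinstance(text, str) and low <= len(text) <= high


# Transcripts that are 20 to 50 characters long: candidates for "not really a speech".
borderline = [s for s in samples if in_length_range(s, 20, 50)]

# Transcripts that survive the filter used in the tutorial (more than 50 characters).
kept = [s for s in samples if isinstance(s, str) and len(s) > 50]

print(borderline)
print(len(kept))
```

Reading through the borderline entries by hand is what tells you whether the cutoff is too aggressive or too lenient for your data.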

With the short, no-substance speeches taken care of, we once again look at our data and notice another issue. Many of the speeches contain `\r\n` values — newlines and returns. These characters are used for formatting, but don’t contain any semantic value. The next step in our data-cleaning process is to get rid of these.

cleaned_df["Speech"] = cleaned_df["Speech"].str.replace("\r\n", "")
cleaned_df

That’s looking way better. The final step is to convert the “Date_time” column into a format that is easier to store in our vector database and to compare with other dates. We use the pandas `to_datetime` function to convert the original strings into datetime values, which print in the universal YYYY-MM-DD format.

# Convert the "Date_time" column from strings like "Month DD, YYYY" to datetime objects
cleaned_df["Date_time"] = pd.to_datetime(cleaned_df["Date_time"], format="%B %d, %Y")

cleaned_df
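
The format string `%B %d, %Y` reads as: full month name, day of the month, a comma, then a four-digit year. As a minimal sketch of the same conversion on a single value, here is the standard library equivalent. The helper name `to_iso_date` and the sample dates are made up for illustration, and month-name parsing assumes an English locale.

```python
from datetime import datetime


def to_iso_date(raw: str) -> str:
    """Parse a date like 'December 09, 2022' and return it as 'YYYY-MM-DD'."""
    parsed = datetime.strptime(raw, "%B %d, %Y")
    return parsed.date().isoformat()


print(to_iso_date("December 09, 2022"))  # → 2022-12-09

# ISO-formatted dates sort chronologically even when compared as plain strings.
dates = [to_iso_date(d) for d in ["March 01, 2022", "December 09, 2021"]]
print(sorted(dates))
```

That last property is handy here because the schema later in this tutorial stores the date as a string field.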

Setting Up a Vector Database for Semantic Search

Our data is now clean and ready to use. The next step is to spin up a vector database to actually search the speeches by their content. For this example, we use Milvus Lite, a lightweight version of Milvus that you can get running without Docker, Kubernetes or dealing with any sort of YAML file. The first thing we do is define some of our constants. We need a collection name (for the vector database), the number of dimensions in our embedding vectors, a batch size and a number that defines how many results we want to receive when we search. This example uses the MiniLM L6 v2 sentence transformer, which produces 384-dimensional embedding vectors.

COLLECTION_NAME = "white_house_2021_2022"
DIMENSION = 384
BATCH_SIZE = 128
TOPK = 3

We use the `default_server` from Milvus. Then we use the PyMilvus SDK to connect to our local Milvus server. If there is a collection in our vector database that has the same name as the collection name we defined earlier, we drop that collection to ensure we start with a blank slate.

from milvus import default_server
from pymilvus import connections, utility


default_server.start()
connections.connect(host="127.0.0.1", port=default_server.listen_port)


if utility.has_collection(COLLECTION_NAME):
   utility.drop_collection(COLLECTION_NAME)

Like most other databases, we need a schema to load data into the Milvus vector database. First, we define the data fields that we want each object to have. Good thing we looked at the data earlier. We use five data fields, the four columns we had earlier and an ID column. But this time, we use the vector embedding of the speech instead of the actual text.

from pymilvus import FieldSchema, CollectionSchema, DataType, Collection


# object should be inserted in the format of (title, date, location, speech embedding)
fields = [
   FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
   FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=500),
   FieldSchema(name="date", dtype=DataType.VARCHAR, max_length=100),
   FieldSchema(name="location", dtype=DataType.VARCHAR, max_length=200),
   FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name=COLLECTION_NAME, schema=schema)

The last thing we need to define before we are ready to load data into the vector database is the index. There are many vector indexes and patterns, but for this example we use the `IVF_FLAT` index with 128 clusters. Larger applications usually use more than 128 clusters, but we only have slightly more than 600 entries anyway. For our distance, we measure using the L2 norm. Once we define our index parameters, we create the index in our collection and load it for use.

index_params = {
   "index_type": "IVF_FLAT",
   "metric_type": "L2",
   "params": {"nlist": 128},
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
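
To build some intuition for what `nlist` controls, here is a toy, pure-Python sketch of the idea behind an inverted-file (IVF) index: vectors are grouped by their nearest centroid, and a query scans only the closest groups. This is an illustration under simplifying assumptions, not how Milvus implements it. The function names, the 2-D data and the hand-picked centroids are all made up; a real index learns its centroids from the data, typically with k-means.

```python
import math


def l2(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Assign each vector id to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        buckets[nearest].append(vec_id)
    return buckets


def ivf_search(query, vectors, centroids, buckets, nprobe, top_k):
    """Scan only the nprobe closest buckets and return the top_k (id, distance) pairs."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [vec_id for c in probe for vec_id in buckets[c]]
    scored = sorted((l2(query, vectors[i]), i) for i in candidates)
    return [(i, d) for d, i in scored[:top_k]]


# Toy 2-D data: two obvious groups, with hand-picked centroids (so nlist = 2).
vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(0.5, 0.5), (10.0, 10.5)]
buckets = build_ivf(vectors, centroids)

hits = ivf_search((0.1, 0.1), vectors, centroids, buckets, nprobe=1, top_k=2)
print(buckets)
print(hits)
```

With `nprobe=1` only one bucket is scanned, which is fast but can miss neighbors that landed in another bucket; probing more buckets trades speed for recall. With roughly 600 vectors spread over 128 clusters, each cluster holds only about five vectors on average, which is why a smaller `nlist` would also be a reasonable choice at this scale.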

Getting Vector Embeddings from Speeches

Much of what we’ve gone over so far applies when working with almost any database. We cleaned some data, spun up a database instance and defined a schema for our database. Other than defining an index, another thing we need to do for vector databases in particular is get the embeddings. First, we get the sentence transformer model MiniLM L6 v2, as mentioned above. Then we create a function that performs a transformation on the data and inserts it into the collection. This function takes a batch of data, gets the embeddings for the speech transcripts, creates an object to insert and inserts it into the collection. For context, this function performs a batch update. In this example, we are batch-inserting 128 entries at once. The only data transformation we do in our insert is turn the speech text into an embedding.

from sentence_transformers import SentenceTransformer


transformer = SentenceTransformer('all-MiniLM-L6-v2')


# expects four parallel lists: [titles, dates, locations, speeches]
def embed_insert(data: list):
   embeddings = transformer.encode(data[3])
   ins = [
       data[0],
       data[1],
       data[2],
       [x for x in embeddings]
   ]
   collection.insert(ins)
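
Note that `embed_insert` receives its data column by column (all titles, then all dates, and so on) rather than row by row, and passes the same column-oriented layout to `insert`. If your records start out as rows, a minimal sketch of the reshaping looks like this; the sample rows are made up for illustration.

```python
# Made-up rows of (title, date, location, speech)
rows = [
    ("Remarks A", "2022-01-01", "Washington, D.C.", "First speech text"),
    ("Remarks B", "2022-02-02", "Golden, Colorado", "Second speech text"),
]

# Transpose rows into columns: one list per field, in the same order as the schema
# (the auto-generated id field is left out).
titles, dates, locations, speeches = (list(col) for col in zip(*rows))
columns = [titles, dates, locations, speeches]

print(columns[0])  # → ['Remarks A', 'Remarks B']
print(len(columns))  # → 4
```

Every column must have the same length, since position `i` in each list describes the same speech.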

Populating Your Vector Database

With a function that embeds and inserts a batch in place, we are ready to populate the database. For this example, we loop through each row in our dataframe and append to a list of lists we use for batching our data. Once we hit our batch size, we call the `embed_insert` function and reset our batch. If there’s any leftover data in the data batch after we finish looping, we embed and insert the remaining data. Finally, to finish off populating our vector database, we call `flush` to ensure the database is updated and indexed.

data_batch = [[], [], [], []]


for index, row in cleaned_df.iterrows():
   data_batch[0].append(row["Title"])
   data_batch[1].append(str(row["Date_time"]))
   data_batch[2].append(row["Location"])
   data_batch[3].append(row["Speech"])
   if len(data_batch[0]) % BATCH_SIZE == 0:
       embed_insert(data_batch)
       data_batch = [[], [], [], []]


# Embed and insert the remainder
if len(data_batch[0]) != 0:
   embed_insert(data_batch)


# Call a flush to index any unsealed segments.
collection.flush()
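
To check the batching logic without a database or an embedding model, the same loop can be run against a stand-in for `embed_insert`. This is only a sketch: `fake_embed_insert`, the 300 made-up rows and the `inserted_sizes` list are assumptions introduced for illustration.

```python
BATCH_SIZE = 128  # same value as in the tutorial

inserted_sizes = []  # records how many rows each "insert" received


def fake_embed_insert(batch):
    """Stand-in for embed_insert: only records the batch size, no embedding or database."""
    inserted_sizes.append(len(batch[0]))


# 300 made-up rows of (title, date, location, speech)
rows = [(f"Title {i}", "2022-01-01", "Somewhere", f"Speech text {i}") for i in range(300)]

data_batch = [[], [], [], []]
for title, date, location, speech in rows:
    data_batch[0].append(title)
    data_batch[1].append(date)
    data_batch[2].append(location)
    data_batch[3].append(speech)
    if len(data_batch[0]) == BATCH_SIZE:
        fake_embed_insert(data_batch)
        data_batch = [[], [], [], []]

# Insert whatever is left over after the loop
if data_batch[0]:
    fake_embed_insert(data_batch)

print(inserted_sizes)  # → [128, 128, 44]
```

The final, smaller batch is the reason for the extra insert after the loop: without it, the last 44 rows would never reach the database.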

Semantic Search White House Speeches Based on Descriptions

Let’s say that I’m interested in finding a speech where the president spoke about the impact of renewable energy at the National Renewable Energy Lab (NREL) and a speech where the U.S. vice president and the prime minister of Canada both speak. I can find the titles of the most similar speeches given by members of the White House in 2021-2022 by using the vector database we just created.

To search our vector database for the speeches most similar to our descriptions, all we have to do is convert each description into a vector embedding, using the same model we used to get the embeddings of the speeches, and then call the `search` function on our collection. We pass in the embeddings as the search data, the field to search against, some parameters for how to search, a limit for the number of results and the field that we want returned. In this example, the search parameters we need to pass are the metric type, which has to be the same type we used when creating the index (L2 norm), and the number of clusters we want to search (setting `nprobe` to 10).

import time
search_terms = ["The President speaks about the impact of renewable energy at the National Renewable Energy Lab.", "The Vice President and the Prime Minister of Canada both speak."]


# Search the database based on input text
def embed_search(data):
   embeds = transformer.encode(data)
   return [x for x in embeds]


search_data = embed_search(search_terms)


start = time.time()
res = collection.search(
   data=search_data,  # Embedded search values
   anns_field="embedding",  # Search across embeddings
   param={"metric_type": "L2", "params": {"nprobe": 10}},
   limit=TOPK,  # Limit to top_k results per search
   output_fields=["title"]  # Include title field in result
)
end = time.time()

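# Print the matching titles and their L2 distances for each description
for hits_i, hits in enumerate(res):
   print("Description:", search_terms[hits_i])
   print("Search time:", end - start)
   print("Results:")
   for hit in hits:
       print(hit.entity.get("title"), "----", hit.distance)
   print()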

When we search for the sentences in this example, we expect to see output like the image below. This was a successful search because the titles are what we expect to see. The first description returns the title of a speech given by President Biden at NREL, and the second description returns a title reflective of a speech given by Vice President Kamala Harris and Prime Minister Justin Trudeau.

Summary

In this tutorial, we learned how to use a vector database to semantically search through speeches given by the Biden administration in 2021 and 2022. Semantic search allows us to take a blurb and look for semantically similar text, not just syntactically similar text. This allows us to search for a general description of a speech instead of searching for a speech based on specific sentences or quotes. For most of us, that makes finding speeches we would be interested in way easier.

Yujian Tang is a developer advocate at Zilliz. He has a background as a software engineer working on AutoML at Amazon. Yujian studied computer science, statistics, and neuroscience with research papers published to conferences including IEEE Big Data. He enjoys...