Ranking for Relevance with BM25

6 min read · Jul 4, 2024

BM25, Milvus, Vector Database, Ranking, Documents, SPLADE Sparse Vectors, Keyword Search, Semantic Search, Fastest Vector Database

BM25 (Combining full text search and semantic search with Milvus)

We’ll be doing this with Milvus, the first production vector database and the fastest for the majority of large use cases. Milvus is the first vector database to offer production-grade hybrid search, and it does so fully open source.

BM25 (Best Matching 25) is a text-matching algorithm that does some math to score how relevant each document is to a query. Who knew AI would bring back all the math you thought you finished in college? I didn’t. The good news is that you are not doing these calculations yourself, so put away your TI-84. You may have to tune it a little, but that’s not rocket science.

Zilliz Diagram from zilliz.com

You can experiment with values of k (usually written k1) from 0.5 to 2; 1.2 is a common default, so try that first. For b, you can start with 0.75; the optimal value usually falls between 0.3 and 0.9. The right settings depend on a lot of factors, including the number and size of your documents and what kind of documents they are. If you have a consistent type of document, say long Vector Database Medium posts, you can stick with your optimal values once you find them. If your documents vary from thesis papers to blogs to emails to children’s notes to magazine articles, then you are going to have to keep watching those outputs.
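
To make the roles of those two knobs concrete, here is a minimal pure-Python sketch of one common form of the Okapi BM25 formula. It is not how Milvus computes scores internally; the whitespace tokenizer, the sample documents, and the function name `bm25_scores` are all assumptions for illustration. The k parameter from the text is named `k1` here.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a common Okapi BM25 variant."""
    # Naive whitespace tokenizer (an assumption; real pipelines use an analyzer).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # no exact match, so this term contributes nothing
            # Smoothed IDF; the "+ 1" keeps the value non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # b controls length normalization; k1 controls term-frequency saturation.
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Made-up snippets in the style of travel advisory summaries.
docs = [
    "exercise normal precautions in bhutan",
    "reconsider travel due to crime and civil unrest",
    "do not travel due to armed conflict",
]
scores = bm25_scores("travel crime", docs)
```

The first document shares no terms with the query, so it scores zero; the second matches both terms and outranks the third, which matches only one. Rerunning with a different `k1` or `b` is a quick way to see how much the ranking moves for your own corpus.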

You can try out this notebook:

Now everything seems pretty good, but:

“This highlights a drawback of traditional sparse vectors. When the exact match of the query keyword cannot be found in a document, the sparse vector of BM25 fails to capture the importance of the keyword, even though the document may discuss a similar topic.”

Source: https://medium.com/@zilliz_learn/comparing-splade-sparse-vectors-with-bm25-53368877359f
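
The exact-match limitation is easy to see with a toy example. The sketch below treats sparse vectors as plain `{term: weight}` dictionaries and scores by inner product; the weights are made up for illustration and do not come from a real BM25 or SPLADE model, and `sparse_dot` is a hypothetical helper name.

```python
def sparse_dot(query_vec, doc_vec):
    """Inner product of two sparse vectors stored as {term: weight} dicts."""
    return sum(
        weight * doc_vec[term]
        for term, weight in query_vec.items()
        if term in doc_vec
    )


# Hypothetical term weights for one advisory about civil unrest.
doc_vec = {"civil": 1.1, "unrest": 1.4, "crime": 0.9}

exact_query = {"unrest": 1.0}   # shares a term with the document
synonym_query = {"riots": 1.0}  # same topic, but no shared term

exact_score = sparse_dot(exact_query, doc_vec)
synonym_score = sparse_dot(synonym_query, doc_vec)
```

The query for "riots" scores zero even though the document is clearly about the same topic, because no dimension overlaps. That gap is what learned sparse models try to close by expanding terms.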

We can also use SPLADE, which generates a more advanced sparse vector by using BERT’s Masked Language Modeling (MLM) head to weight terms and expand them with related ones.

Fortunately, we are not limited to BM25 for our sparse vector. We can use both BM25 and SPLADE and store two vectors in the same collection. The flexibility to use different algorithms for transforming text into sparse embeddings really supercharges the Milvus database.

Let’s take a look at some Travel Advisories from the US government.

RSS Endpoint

Example Row

<item>
<title>Bhutan - Level 1: Exercise Normal Precautions</title>
<pubDate>Wed, 26 Jun 2024</pubDate>
<link>http://travel.state.gov/content/travel/en/traveladvisories/traveladvisories/bhutan-travel-advisory.html</link>
<guid>http://travel.state.gov/content/travel/en/traveladvisories/traveladvisories/bhutan-travel-advisory.html</guid>
<category domain="Threat-Level">Level 1: Exercise Normal Precautions</category>
<category domain="Country-Tag">BT</category>
<category domain="Keyword">advisory</category>
<dc:identifier> BT,advisory</dc:identifier>
<description>
<![CDATA[ <p><b><i>Reissued after periodic review without changes.</i></b></p> <p>Exercise normal precautions in Bhutan.</p> <p>Read the&nbsp;<a href="https://travel.state.gov/content/travel/en/international-travel/International-Travel-Country-Information-Pages/Bhutan.html">country information page</a>&nbsp;for additional information on travel to Bhutan.</p> <p>If you decide to travel to Bhutan:</p> <ul> <li>Enroll in the&nbsp;<a href="https://step.state.gov/step/">Smart Traveler Enrollment Program</a>&nbsp;(<a href="https://step.state.gov/step/">STEP</a>) to receive Alerts and make it easier to locate you in an emergency.</li> <li>Follow the Department of State on&nbsp;<a href="https://www.facebook.com/travelgov/">Facebook</a>&nbsp;and&nbsp;<a href="https://twitter.com/StateDept?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor">Twitter</a>.</li> <li>Review the&nbsp;<a href="https://www.osac.gov/Content/Browse/Report?subContentTypes=Country%20Security%20Report">Country Security Report</a>&nbsp;for Bhutan.</li> <li>Visit the CDC page for the latest&nbsp;<a href="https://wwwnc.cdc.gov/travel/destinations/list">Travel Health Information</a>&nbsp;related to your travel.</li> <li>Prepare a contingency plan for emergency situations. Review the&nbsp;<a href="https://travel.state.gov/content/travel/en/international-travel/before-you-go/travelers-checklist.html">Traveler&#8217;s Checklist</a>.</li> </ul> ]]>
</description>
</item>
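
To show which pieces of that row matter for the rest of this post, here is a small standard-library sketch that pulls the fields out of a trimmed copy of the item. The application itself uses feedparser; `xml.etree.ElementTree` is used here only so the snippet runs without extra installs. The `<rss>` wrapper, the `parse_items` name, and the `threat_level` and `country` keys are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example row. The <rss>/<channel> wrapper is added here
# so the fragment parses on its own.
RSS_SNIPPET = """<rss version="2.0"><channel>
<item>
<title>Bhutan - Level 1: Exercise Normal Precautions</title>
<pubDate>Wed, 26 Jun 2024</pubDate>
<link>http://travel.state.gov/content/travel/en/traveladvisories/traveladvisories/bhutan-travel-advisory.html</link>
<category domain="Threat-Level">Level 1: Exercise Normal Precautions</category>
<category domain="Country-Tag">BT</category>
<description><![CDATA[ <p>Exercise normal precautions in Bhutan.</p> ]]></description>
</item>
</channel></rss>"""


def parse_items(xml_text):
    """Return one dict per <item>, keyed by the fields used later in this post."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        # Index the <category> values by their "domain" attribute.
        categories = {c.get("domain"): c.text for c in item.findall("category")}
        rows.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "publisheddate": item.findtext("pubDate"),
            "summary": item.findtext("description").strip(),  # still contains HTML
            "threat_level": categories.get("Threat-Level"),
            "country": categories.get("Country-Tag"),
        })
    return rows


rows = parse_items(RSS_SNIPPET)
```

Note that the summary still carries its HTML tags at this point, which is why a cleaning step is needed before encoding.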

Quick Query

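# Assumes milvus_client and COLLECTION_NAME were created earlier in the application.
# Fetch up to three advisories whose title contains "Level 3".
query_results = milvus_client.query(
    collection_name=COLLECTION_NAME,
    filter='title like "%Level 3%"',
    output_fields=["pk", "title", "link", "summary", "publisheddate"],
    limit=3,
)
print(query_results)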

LIBRARIES

  • pymilvus
  • feedparser
  • beautifulsoup4 (Beautiful Soup)

SOURCE CODE

In this application we create a collection with our sparse vector as well as some scalar fields for PK, Title, Link, Summary (the actual text) and Published Date that come from the RSS feed. We build our corpus for BM25 embedding from some snippets of Travel Advisory summaries.

We pull down the RSS feed using the feedparser library.

We grab the summary field and have Beautiful Soup clean out the HTML tags.
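
As a rough idea of what that cleaning step does, the sketch below strips tags from part of the Bhutan summary. It uses the standard library's `html.parser` as a stand-in for Beautiful Soup so it runs without extra installs; the `TextExtractor` class and `strip_html` function are hypothetical names, and this is not the code from the application.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html_text):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    # split() treats non-breaking spaces as whitespace, so they collapse too.
    return " ".join("".join(parser.parts).split())


# Part of the example summary from the RSS item above.
summary_html = (
    "<p><b><i>Reissued after periodic review without changes.</i></b></p> "
    "<p>Exercise normal precautions in Bhutan.</p> "
    '<li>Enroll in the&nbsp;<a href="https://step.state.gov/step/">'
    "Smart Traveler Enrollment Program</a></li>"
)
clean_summary = strip_html(summary_html)
```

The result is plain text with no markup left in it, which is what the text encoder should see. Otherwise tag names and attribute values would end up as terms in the sparse vector.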

Then we iterate through our posts from the parsed RSS feed and build lists of strings for all the fields.

We then run our encoding on the summaries array.

Finally, we iterate through the summaries array and, for each row, add the todok() value of its sparse embedding along with the values of our scalars, ending with an easy insert into our collection.
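
The row-building step can be sketched without a running database. In the snippet below, plain `{dimension_index: weight}` dicts stand in for the todok() output, and the `pk` scheme, the `sparse_vector` field name, and the `build_rows` helper are assumptions for illustration; the other field names follow the Quick Query above.

```python
def build_rows(titles, links, summaries, published_dates, sparse_vectors):
    """Zip parallel lists into one dict per row, ready for an insert call."""
    lengths = {
        len(titles), len(links), len(summaries),
        len(published_dates), len(sparse_vectors),
    }
    if len(lengths) != 1:
        raise ValueError("all field lists must have the same length")

    rows = []
    for i, (title, link, summary, published, vector) in enumerate(
        zip(titles, links, summaries, published_dates, sparse_vectors)
    ):
        rows.append({
            "pk": f"advisory-{i}",    # hypothetical primary key scheme
            "title": title,
            "link": link,
            "summary": summary,
            "publisheddate": published,
            "sparse_vector": vector,  # hypothetical vector field name
        })
    return rows


rows = build_rows(
    titles=["Bhutan - Level 1: Exercise Normal Precautions"],
    links=["http://travel.state.gov/content/travel/en/traveladvisories/traveladvisories/bhutan-travel-advisory.html"],
    summaries=["Exercise normal precautions in Bhutan."],
    published_dates=["Wed, 26 Jun 2024"],
    sparse_vectors=[{3: 0.42, 17: 1.05}],  # made-up weights for illustration
)
```

A list shaped like this is what you would hand to something like `milvus_client.insert(collection_name=COLLECTION_NAME, data=rows)`, assuming the collection schema uses the same field names.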

Once complete we add our Sparse Inverted Index for fast search. We can search immediately if we turn on eventual consistency.

That’s it, BM25, not bad.


Written by Tim Spann

Snowflake Senior Solutions Engineer | Data Engineer: GenAI, IoT, Deep Learning, Streaming, AI, Iceberg, Snowflake, NiFi, Kafka. https://www.datainmotion.dev