Ranking for Relevance with BM25

6 min read · Jul 4, 2024

BM25, Milvus, Vector Database, Ranking, Documents, SPLADE Sparse Vectors, Keyword Search, Semantic Search, Fastest Vector Database

BM25 (Combining full text search and semantic search with Milvus)

We’ll be doing this with Milvus, the first production vector database and the fastest for the majority of large use cases. Milvus is the first vector database to offer production-grade hybrid search, and it does so fully open source.

Comparing SPLADE Sparse Vectors with BM25 (medium.com)

BM25 (Best Matching 25) is a text-matching algorithm that scores each document against a query using how often the query terms appear, how rare those terms are across the corpus, and how long the document is. Who knew AI would bring back all the math you thought you finished in college? I didn’t. The good news is that you are not doing these calculations by hand, so put away your TI-84. You may have to tune it a little, but that’s not rocket science.
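For reference, here is the commonly cited Okapi BM25 scoring function. This is the textbook form, not necessarily the exact variant any given library implements (the IDF term in particular varies between implementations). Here f(q_i, D) is how often query term q_i appears in document D, |D| is the document length in tokens, avgdl is the average document length in the corpus, N is the number of documents, and n(q_i) is the number of documents containing q_i. The tuning parameters are k_1 (often just called k) and b.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\,
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

\mathrm{IDF}(q_i) = \ln\!\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)
```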

BM25 has two tuning parameters. The first, k1 (the k here), controls how quickly repeated occurrences of a term stop adding to the score; you can experiment with values from 0.5 to 2, and 1.2 is a common default, so try that first. The second, b, controls how strongly scores are normalized by document length; you can start with 0.75, and the best value usually falls between 0.3 and 0.9. The right settings depend on a lot of factors, including the number and size of your documents and what kind of documents they are. If you have a consistent type of document, say long Vector Database Medium posts, once you find your optimal value you can stick with it. If they vary from thesis papers to blogs to emails to children’s notes to magazine articles, then you are going to have to keep watching those outputs.
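To get a feel for what tuning does, here is a minimal pure-Python sketch of textbook BM25 scoring. It is not the code Milvus or PyMilvus runs; the function name `bm25_scores`, the whitespace tokenization, and the three tiny documents are all assumptions made up for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query (textbook BM25)."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # a term missing from the document adds nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalization: b=0 ignores length, b=1 applies it fully.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical mini-corpus, tokenized by whitespace.
docs = [
    "exercise normal precautions in bhutan".split(),
    "reconsider travel due to crime and civil unrest reconsider travel plans".split(),
    "do not travel due to armed conflict".split(),
]
query = "reconsider travel".split()

# Print the scores for a few values of b to see the effect of tuning.
for b in (0.3, 0.75, 0.9):
    print(b, [round(s, 3) for s in bm25_scores(query, docs, k1=1.2, b=b)])
```

Raising b penalizes the longer second document more heavily, while raising k1 lets repeated terms keep adding to the score for longer.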

You can try out this notebook:

Google Colab (colab.research.google.com)

Now everything seems pretty good, but:

"This highlights a drawback of traditional sparse vectors. When the exact match of the query keyword cannot be found in a document, the sparse vector of BM25 fails to capture the importance of the keyword, even though the document may discuss a similar topic."

Source: https://medium.com/@zilliz_learn/comparing-splade-sparse-vectors-with-bm25-53368877359f
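The drawback in that quote is easy to reproduce with a toy example. The sketch below uses plain word counts as a stand-in for BM25 weights (an assumption to keep it short), and the sentence and query words are made up. The point is only that a sparse keyword vector scores zero when the query and document share no terms.

```python
from collections import Counter

def sparse_vector(text):
    """Bag-of-words sparse vector: term -> count (stand-in for BM25 weights)."""
    return Counter(text.lower().split())

def dot(a, b):
    """Inner product over the terms that both sparse vectors share."""
    return sum(weight * b[term] for term, weight in a.items() if term in b)

doc = sparse_vector("Reconsider travel due to violent crime and kidnapping")
exact = sparse_vector("crime")      # keyword appears in the document
related = sparse_vector("robbery")  # similar topic, different word

print(dot(exact, doc))    # nonzero: the shared term contributes to the score
print(dot(related, doc))  # zero: no shared term, so the document is invisible
```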

Fortunately we don’t have to use only BM25 for our vector. We can also use SPLADE, which generates a learned sparse vector by using BERT’s Masked Language Modeling (MLM) head to weight and expand terms, so related words that never appear in the text can still receive weight.

We can even use both and store two vectors in our collection. The flexibility to use different algorithms for transforming text into sparse embeddings really supercharges the Milvus database.

Let’s take a look at some Travel Advisories from the US government.

RSS Endpoint

Example Row

<item>
<title>Bhutan - Level 1: Exercise Normal Precautions</title>
<pubDate>Wed, 26 Jun 2024</pubDate>
<link>http://travel.state.gov/content/travel/en/traveladvisories/traveladvisories/bhutan-travel-advisory.html</link>
<guid>http://travel.state.gov/content/travel/en/traveladvisories/traveladvisories/bhutan-travel-advisory.html</guid>
<category domain="Threat-Level">Level 1: Exercise Normal Precautions</category>
<category domain="Country-Tag">BT</category>
<category domain="Keyword">advisory</category>
<dc:identifier> BT,advisory</dc:identifier>
<description>
<![CDATA[ <p><b><i>Reissued after periodic review without changes.</i></b></p> <p>Exercise normal precautions in Bhutan.</p> <p>Read the&nbsp;<a href="https://travel.state.gov/content/travel/en/international-travel/International-Travel-Country-Information-Pages/Bhutan.html">country information page</a>&nbsp;for additional information on travel to Bhutan.</p> <p>If you decide to travel to Bhutan:</p> <ul> <li>Enroll in the&nbsp;<a href="https://step.state.gov/step/">Smart Traveler Enrollment Program</a>&nbsp;(<a href="https://step.state.gov/step/">STEP</a>) to receive Alerts and make it easier to locate you in an emergency.</li> <li>Follow the Department of State on&nbsp;<a href="https://www.facebook.com/travelgov/">Facebook</a>&nbsp;and&nbsp;<a href="https://twitter.com/StateDept?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor">Twitter</a>.</li> <li>Review the&nbsp;<a href="https://www.osac.gov/Content/Browse/Report?subContentTypes=Country%20Security%20Report">Country Security Report</a>&nbsp;for Bhutan.</li> <li>Visit the CDC page for the latest&nbsp;<a href="https://wwwnc.cdc.gov/travel/destinations/list">Travel Health Information</a>&nbsp;related to your travel.</li> <li>Prepare a contingency plan for emergency situations. Review the&nbsp;<a href="https://travel.state.gov/content/travel/en/international-travel/before-you-go/travelers-checklist.html">Traveler&#8217;s Checklist</a>.</li> </ul> ]]>
</description>
</item>

Quick Query

query_results = milvus_client.query(
    collection_name=COLLECTION_NAME,
    filter='title like "%Level 3%"',
    output_fields=["pk","title","link","summary","publisheddate"],
    limit=3
)
print(query_results)

LIBRARIES

pymilvus
feedparser
beautifulsoup4

SOURCE CODE

GitHub - tspannhw/AIM-TravelAdvisories: BM25 + Travel Advisories (github.com/tspannhw/AIM-TravelAdvisories)

In this application we create a collection with our sparse vector as well as some scalar fields for PK, Title, Link, Summary (the actual text) and Published Date that come from the RSS feed. We build our corpus for BM25 embedding from some snippets of Travel Advisory summaries.

We pull down the RSS feed using the feedparser library.
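Here is a minimal sketch of that step, with one swap stated plainly: the application uses feedparser, while this example uses Python's built-in xml.etree.ElementTree so it runs with no extra installs and no network access. The RSS string is a trimmed copy of the example row above, the surrounding rss/channel envelope is an assumption, and the dictionary keys simply mirror the output fields in the quick query.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example row, wrapped in a minimal (assumed) RSS envelope.
RSS = """<rss version="2.0"><channel>
<item>
<title>Bhutan - Level 1: Exercise Normal Precautions</title>
<pubDate>Wed, 26 Jun 2024</pubDate>
<link>http://travel.state.gov/content/travel/en/traveladvisories/traveladvisories/bhutan-travel-advisory.html</link>
<description><![CDATA[<p>Exercise normal precautions in Bhutan.</p>]]></description>
</item>
</channel></rss>"""

root = ET.fromstring(RSS)

# One dictionary per <item>; the summary still contains HTML at this point.
posts = [
    {
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "publisheddate": item.findtext("pubDate"),
        "summary": item.findtext("description"),
    }
    for item in root.iter("item")
]

print(posts[0]["title"])
print(posts[0]["summary"])
```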

We grab the summary field and have Beautiful Soup clean out the HTML tags.
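The cleaning step can be sketched as follows. Again this is a swap, not the application's code: it uses the standard library's html.parser instead of Beautiful Soup so the example has no dependencies. The class name `TextExtractor` and the helper `strip_html` are hypothetical names, and the sample string is a shortened piece of the example row's description.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes, dropping every tag."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # split() also treats non-breaking spaces as whitespace.
    return " ".join("".join(parser.parts).split())

sample = (
    "<p>Exercise normal precautions in Bhutan.</p> "
    '<p>Read the&nbsp;<a href="https://travel.state.gov/content/travel/en/'
    "international-travel/International-Travel-Country-Information-Pages/"
    'Bhutan.html">country information page</a>&nbsp;for additional '
    "information on travel to Bhutan.</p>"
)

print(strip_html(sample))
```

Plain text like this is what should go to the BM25 encoder, since leftover tag names and attribute values would otherwise be counted as terms.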

Then we iterate through our posts from the parsed RSS and build string arrays for all the fields.

We then run our encoding on the summaries array.

Finally we iterate through the summaries array and, for each row, add the sparse embedding converted with todok() (SciPy’s dictionary-of-keys sparse format) along with the values of our scalars, ending with an easy insert into our collection.

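Once complete we add our Sparse Inverted Index for fast search. If we turn on eventual consistency, searches run without waiting for the latest writes to sync, so we can search immediately, though the newest rows may briefly be missing from the results.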

That’s it, BM25, not bad.

RESOURCES

BM25 | Milvus Documentation (milvus.io)

Mastering BM25: A Deep Dive into the Algorithm and Its Application in Milvus - Zilliz blog (zilliz.com)

Milvus Hybrid Search | 🦜️🔗 LangChain (python.langchain.com)

Okapi BM25 - Wikipedia (en.wikipedia.org)

Enhancing Information Retrieval with Sparse Embeddings | Zilliz Learn - Zilliz blog (zilliz.com)

bootcamp/bootcamp/RAG/advanced_rag/hybrid_and_rerank_with_langchain.ipynb at… (github.com)

Introducing PyMilvus Integration with Embedding Models (milvus.io)

Integrate Milvus with Jina | Milvus Documentation (milvus.io)

Jina AI - Your Search Foundation, Supercharged. (jina.ai)

A Review of Hybrid Search in Milvus (medium.com)

Remove all style, scripts, and HTML tags using BeautifulSoup - GeeksforGeeks (www.geeksforgeeks.org)

GitHub - tspannhw/FLaNK-TravelAdvisory: Travel Advisory - RSS Processing - Apache NiFi - Apache Kafka - Apache Flink - SQL (github.com)

Travel (travel.state.gov)

Star Us On GitHub and Join Our Discord!

If you liked this blog post, consider starring Milvus on GitHub, and feel free to join our Discord! 💙

GitHub - milvus-io/milvus: A cloud-native vector database, storage for next generation AI…

A cloud-native vector database, storage for next generation AI applications - milvus-io/milvus

github.com

Get Milvused!

Vector database — Milvus

Milvus is a powerful vector database tailored for processing and searching extensive vector data. It stands out for its…

milvus.io

Read my Newsletter every week!

AIM Weekly 17 June 2024

17-June-2024

medium.com

For more cool Unstructured Data, AI and Vector Database videos check out the Milvus vector database videos here:

Zilliz

Zilliz is a leading vector database company for production-ready AI. Built by the engineers who created Milvus, the…

www.youtube.com

https://www.linkedin.com/company/zilliz/

https://www.linkedin.com/in/timothyspann/

Join the Milvus Discord Server!

Check out the Milvus community on Discord — hang out with 1734 other members and enjoy free voice and text chat.

discord.com

https://milvusio.medium.com

Open Source Vector Databases (www.opensourcevectordb.cloud)

Written by Tim Spann