Solving the RAG vs. Long Context Model Dilemma

Long context models are great for reducing hallucination for certain use cases that warrant longer context but are not ideal for all situations.
Jan 21st, 2025 6:37am by
Featured image for: Solving the RAG vs. Long Context Model Dilemma
Image from Krot_Studio on Shutterstock.

Many developers have been using retrieval-augmented generation (RAG) over large context corpora to build GenAI applications and tame problems such as the hallucinations that general-purpose large language models (LLMs) are prone to.

Now long context models are emerging, such as Gemini with a context window of 2 million tokens, and their potential benefits may make you wonder whether you should ditch RAG altogether. The key to resolving this dilemma is to understand the pros and cons of long context models and make an informed decision about their suitability for your use case.

Benefits and Limitations of RAG vs. Long Context Models

Traditionally, LLMs have had small context windows that limit the amount of text, measured in tokens, that can be processed at once. RAG has been an effective way to address this limitation. By retrieving the most relevant chunks of text, augmenting the user prompt with them and then passing the result to the LLM, RAG works effectively with data sets far larger than the context window would normally support.
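The retrieve-augment-generate flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the bag-of-words cosine similarity stands in for a real embedding model and vector index, the sample chunks are invented, and the final call to an LLM is left out. All function names here (`tokenize`, `retrieve`, `build_prompt`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercased bag of words; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Retrieval step: return the k chunks most similar to the question."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Augmentation step: prepend only the retrieved chunks to the question."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Invented sample corpus, already split into chunks.
CHUNKS = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "Shipping to Europe takes five to seven business days.",
]

question = "How long do refunds take?"
top_chunks = retrieve(question, CHUNKS, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)  # this string is what would be sent to the LLM
```

Only the refund chunk reaches the prompt, so the model sees a handful of tokens instead of the whole corpus.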

A long context model such as Gemini, by contrast, can process the provided context directly without a separate RAG system, which simplifies the application workflow and potentially reduces latency. To put a context window of 1 million tokens into perspective, it is equivalent to eight average-length English novels or the transcripts of over 200 average-length podcast episodes. However, a long context window is by no means a panacea for hallucinations, and it has its share of limitations.

First, long context models can lose focus on the relevant information, which can degrade answer quality, according to research from NVIDIA.

Second, for use cases such as QA chatbots, it’s not so much about the quantity of the information in the context but rather the quality. Higher-quality context is achieved via highly selective granular searches specific to the question asked, which is what RAG enables.

Finally, long context models require more GPU resources to process the long context, which leads to longer processing times and a higher cost per query. You may be able to mitigate this by using the key-value (KV) cache to store the processed input tokens for reuse across requests, but the cache itself has significant GPU memory requirements and so drives up the associated costs. The key is to achieve high answer quality with fewer input tokens.
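A back-of-the-envelope calculation makes the cost argument concrete. The sketch below is illustrative only: the price per million tokens, the 4,000-token RAG prompt and the model dimensions are all hypothetical and do not describe Gemini or any specific model. The KV cache estimate uses the standard sizing for transformer attention, where each token stores one key and one value vector per layer and per KV head.

```python
def input_cost(tokens: int, price_per_million: float) -> float:
    """Input-token cost of one query, in the same currency as the price."""
    return tokens / 1_000_000 * price_per_million


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for one sequence.

    The factor of 2 accounts for storing both a key and a value vector
    per layer, per KV head, per token. bytes_per_value=2 assumes 16-bit values.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical numbers, chosen only for illustration.
PRICE_PER_MILLION_INPUT_TOKENS = 1.00   # assumed price in USD
LONG_CONTEXT_TOKENS = 1_000_000         # whole corpus in the prompt
RAG_TOKENS = 4_000                      # a few retrieved chunks plus the question

long_cost = input_cost(LONG_CONTEXT_TOKENS, PRICE_PER_MILLION_INPUT_TOKENS)
rag_cost = input_cost(RAG_TOKENS, PRICE_PER_MILLION_INPUT_TOKENS)
ratio = LONG_CONTEXT_TOKENS / RAG_TOKENS  # 250.0

# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
per_token = kv_cache_bytes(1, layers=32, kv_heads=8, head_dim=128)
full_context = kv_cache_bytes(LONG_CONTEXT_TOKENS, layers=32, kv_heads=8, head_dim=128)

print(f"long context: ${long_cost:.3f} per query, RAG: ${rag_cost:.3f} per query")
print(f"input tokens per query differ by a factor of {ratio:.0f}")
print(f"KV cache: {per_token} bytes per token, "
      f"{full_context / 2**30:.1f} GiB for the full context")
```

Under these assumptions, caching a single 1-million-token context takes more than 100 GiB of GPU memory, which shows how caching trades per-query compute for memory cost.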

Despite their limitations, long context models support a few compelling use cases that genuinely require longer context, such as translation and summarization. One example is translating documents from English into Sanskrit (one of the least-spoken of India's officially recognized languages) for educational purposes. LLMs struggle with translation into Sanskrit because of the language's complex grammatical structure and the limited availability of training data compared to more widely spoken languages. Providing a sufficiently large number of translation examples as context therefore helps boost the accuracy of the translation. Other use cases include summarizing and comparing multiple large documents at once to generate insights, for example, comparing the 10-K reports of multiple companies to create financial benchmarks.
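As a rough sketch of the translation use case, the snippet below packs as many example pairs as fit into a token budget before appending the text to translate. Everything here is an assumption for illustration: the helper names, the four-characters-per-token estimate (a real application would use the model's own tokenizer) and the placeholder example pairs, which stand in for a real English-Sanskrit parallel corpus.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption: roughly 4 characters per token)."""
    return max(1, len(text) // 4)


def build_few_shot_prompt(examples: list[tuple[str, str]],
                          source_text: str,
                          token_budget: int) -> tuple[str, int]:
    """Add example pairs in order until the estimated budget is reached.

    Returns the prompt and the number of examples that were included.
    """
    header = "Translate from English to Sanskrit. Follow the examples.\n\n"
    footer = f"English: {source_text}\nSanskrit:"
    used = estimate_tokens(header) + estimate_tokens(footer)

    parts = []
    for english, sanskrit in examples:
        block = f"English: {english}\nSanskrit: {sanskrit}\n\n"
        cost = estimate_tokens(block)
        if used + cost > token_budget:
            break  # the next example would exceed the context budget
        parts.append(block)
        used += cost

    return header + "".join(parts) + footer, len(parts)


# Placeholder pairs; a real corpus would hold actual translations.
examples = [
    (f"example sentence {i}", f"<sanskrit translation {i}>")
    for i in range(100)
]

prompt_all, n_all = build_few_shot_prompt(examples, "Knowledge is power.", 1_000_000)
prompt_small, n_small = build_few_shot_prompt(examples, "Knowledge is power.", 200)
print(f"large budget: {n_all} examples, small budget: {n_small} examples")
```

With a long context window the budget can be set high enough to include the entire example set, which is exactly the situation where such a model has an edge over a small retrieved context.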

Long context models are great for reducing hallucination in the use cases that warrant longer context. For all other use cases, however, we recommend using RAG to retrieve the context relevant to the user's question, which delivers high accuracy cost-effectively. If RAG alone does not meet the desired accuracy, we suggest combining it with fine-tuning to increase domain specificity.

Couchbase’s Capella AI Services helps developers like you build performant RAG and agentic applications quickly. Feel free to sign up for our private preview to get started with your AI project.
