Context Engineering: The Foundation for Reliable AI Agents
Context is king in the agentic world. Pairing a performant reasoning model like Claude, DeepSeek or GPT-5 with the right context drives efficient planning and tool usage and improves multistep reasoning, leading to personalized conversations, higher task accuracy and relevant responses. In this article, we present the need for context engineering and associated benefits, identify challenges developers face as they use it for developing AI agents, and propose a high-level architecture to help address them.
Solving the Context Dilemma: Too Much vs. Too Little
Enterprises have access to vast amounts of structured and unstructured data. However, feeding this data directly to agents as context introduces noise that confuses task comprehension, and once the limited context window is exceeded, important information is lost, which hurts a large language model’s (LLM) situational awareness. Using a long context is not always the solution to this problem, as I’ve written before. On the other hand, sending too little context can cause agents to hallucinate. Simply put: garbage in, garbage out.
Context engineering refers to a collection of techniques and tools used to ensure an AI agent has only the information necessary to complete assigned tasks successfully. Based on the concept described by Harrison Chase of LangChain, context engineering consists of the following:
- Tool selection means ensuring the agent has access to the right tools for retrieving the information needed to accomplish the specified task. For example, consider a scenario where an agent is asked to plan a trip to Maui for a family with two kids and a dog. It should be able to retrieve all the tools required to answer the user’s question and execute tasks reliably.
- Memory use is also a factor. It’s important to equip the agent with short-term memory that provides context for personalizing the ongoing session between the user and the agent, as well as long-term memory that offers context across multiple sessions to make the interactions cohesive, factual and even more personalized. This spans various memory types such as profile, semantic, episodic, conversational and procedural. It also includes working memory, which is used for sharing context for seamless task coordination among agents in a multiagent system.
- Another component is prompt engineering. This ensures the agent has access to the right prompt, which is clearly defined in terms of the agent’s behavior, including specific instructions and constraints.
- Finally, there’s retrieval. Dynamically retrieving relevant data based on the user’s question and inserting it into the prompt before sending it to the LLM helps the model produce accurate, relevant responses. This is achieved by using Retrieval-Augmented Generation (RAG) and direct database calls. Enterprises generally have a polyglot environment with multiple sources of truth. In such cases, the Model Context Protocol (MCP) allows developers to retrieve context from numerous data sources in a standardized manner.
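To make the four components above concrete, here is a minimal Python sketch of how they might be combined into a single prompt under a token budget. Everything in it is an assumption made for illustration: `ContextBundle`, `estimate_tokens` and `build_prompt` are hypothetical names, the word-count token estimate stands in for a real tokenizer, and the priority order (tools, then memories, then documents) is just one possible choice.

```python
from dataclasses import dataclass, field


@dataclass
class ContextBundle:
    """Hypothetical container for the four kinds of context listed above."""
    system_prompt: str                                   # prompt engineering
    tools: list[str] = field(default_factory=list)       # tool selection
    memories: list[str] = field(default_factory=list)    # short- and long-term memory
    documents: list[str] = field(default_factory=list)   # retrieval (RAG, database calls)


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def build_prompt(bundle: ContextBundle, question: str, budget: int) -> str:
    """Assemble a prompt, skipping any context item that would exceed the budget."""
    header = bundle.system_prompt
    footer = f"Question: {question}"
    used = estimate_tokens(header) + estimate_tokens(footer)
    body = []
    # Assumed priority order: tools first, then memories, then retrieved documents.
    for label, items in (("Tool", bundle.tools),
                         ("Memory", bundle.memories),
                         ("Document", bundle.documents)):
        for item in items:
            line = f"{label}: {item}"
            cost = estimate_tokens(line)
            if used + cost > budget:
                continue  # leave out anything that does not fit
            body.append(line)
            used += cost
    return "\n".join([header, *body, footer])
```

A production agent would likely rank items by relevance before applying the budget, rather than relying on a fixed order as this sketch does.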

Figure 1: Conceptual view of the architecture for context engineering (source: Couchbase).
Extracting Context From Unstructured Data at Scale
Eighty percent of enterprise data is unstructured and is largely unusable as context. Therefore, to extract the context required to power important use cases, developers currently write extract, transform, load (ETL) jobs for Spark, Flink or other data processing engines. These jobs read unstructured data from a source database, process it and write back results for subsequent consumption by agents. These DIY solutions, albeit performant, slow developers down and create operational and maintenance overhead. A few example use cases include:
- Summarizing the details of the “support_ticket_desc” field in a document so that the customer support AI agent can easily understand and take action.
- Extracting medical terms (diseases, medications, symptoms) from the “patient_diagnosis” field so that a triaging agent can come up with an initial diagnosis for the patient.
- Labeling whether text in the “email_content” field is “irrelevant,” “promotional spam,” “potentially a scam” or “phishing attempt” so an email assistant can reason whether to automatically respond to an email.
AI functions allow developers to invoke LLMs from within SQL statements, with the ability to write prompts that control the format, tone and other aspects of the LLM output. Here’s an example: A developer augments product reviews stored in a database with sentiment and summary using AI functions. A retail AI agent later reads them via a tool call and reasons whether to provide a compelling offer to a dissatisfied user to improve the Customer Satisfaction Score (CSAT) based on the severity of the issues they reported. This agent also creates a product feature request to drive.
Consider the following product review left by a customer who was disappointed with the performance and durability of a blender: “I had high hopes for this blender based on the product description and reviews, but it’s been a let-down from day one. The motor struggles even with soft fruits, and it overheats after just a couple of minutes of use. I’ve had to stop mid-smoothie several times to let it cool down, which completely defeats the purpose of having a ‘high-speed’ blender.”
Here’s a no-code analysis using SQL:
| SQL Statement | Response |
| --- | --- |
| SELECT review_id, SUMMARIZE(review_text) AS summary, SENTIMENT("review_text", prompt = "Evaluate the sentiment of the 'customer_review' field on a 5-point scale: very negative, negative, neutral, positive, very positive") AS sentiment FROM customer_reviews WHERE review_text IS NOT NULL; | "sentiment": very negative "summary": The blender needs a stronger motor to handle frozen fruits and ice without overheating, sharper blades for smoother blends and a better-sealed lid to prevent leaks. Durability should be improved to eliminate loud grinding noises and burning smells after short-term use. |
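For intuition, the row-by-row enrichment that the SQL statement expresses can be sketched in Python. This is only an illustration and not how any database implements AI functions: `enrich_reviews`, the `summarize` and `classify` callables, and the keyword-based stand-ins are all hypothetical, and a real pipeline would call an LLM in their place.

```python
from typing import Callable, Iterable

SENTIMENT_SCALE = ["very negative", "negative", "neutral", "positive", "very positive"]


def enrich_reviews(
    rows: Iterable[dict],
    summarize: Callable[[str], str],
    classify: Callable[[str], str],
) -> list[dict]:
    """Add 'summary' and 'sentiment' to every row that has review text."""
    enriched = []
    for row in rows:
        text = row.get("review_text")
        if text is None:  # mirrors WHERE review_text IS NOT NULL
            continue
        label = classify(text)
        if label not in SENTIMENT_SCALE:
            raise ValueError(f"unexpected sentiment label: {label!r}")
        enriched.append({
            "review_id": row["review_id"],
            "summary": summarize(text),
            "sentiment": label,
        })
    return enriched


# Placeholder callables so the sketch runs without a model; swap in LLM calls.
def fake_summarize(text: str) -> str:
    """Return the first sentence as a stand-in summary."""
    return text.split(".")[0].strip() + "."


def fake_classify(text: str) -> str:
    """Count a few negative cue words as a stand-in for model-based sentiment."""
    negative_cues = ("let-down", "struggles", "overheats")
    hits = sum(cue in text.lower() for cue in negative_cues)
    if hits >= 2:
        return "very negative"
    return "negative" if hits == 1 else "neutral"
```

Validating the label against the fixed scale is one way to keep free-form model output from leaking into a field that a downstream agent treats as a category.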
Fitting Context Into a Limited Context Window
When it comes to context, less (but relevant) is more! A 1-million+ token limit does not mean you can treat the context like unlimited memory. Each additional token has cost, latency and performance implications. Instead of stuffing the prompt with long, unnecessary context, which causes important details to get lost (especially in the middle of the prompt), consider using techniques like RAG to keep the context lean and highly relevant. Listing every available tool the LLM could use leads to prompt bloat and can confuse the agent when different tools have similar names or specs. Further, the proliferation of tools, caused primarily by a lack of tool reusability and governance, increases the likelihood of agent failure. Cataloging all tools in a centralized location, however, not only supports reusability but also makes it possible to retrieve only the tools that are relevant to answering the user’s question. This can be used in conjunction with well-written tool descriptions and tool routing to boost tool call accuracy. For example, the API calls below could retrieve only the tools and the prompt within the agent application that are relevant to the user’s query:
catalog.find_tools(query="Plan a trip to Maui")
catalog.find_prompt(query="Plan a trip to Maui")
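The idea behind a call like `find_tools` can be illustrated with a small in-memory sketch. It is not the implementation of any real catalog product: `Tool`, `ToolCatalog`, the stop-word list and the sample tools are hypothetical, and naive word overlap stands in for the semantic (embedding-based) search a real catalog would more likely use.

```python
import re
from dataclasses import dataclass

# Small, assumed stop-word list so common words do not count as matches.
STOPWORDS = {"a", "an", "the", "to", "for", "of", "or", "that"}


@dataclass
class Tool:
    name: str
    description: str


def _words(text: str) -> set[str]:
    """Lowercase word set without stop words, used for a naive overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


class ToolCatalog:
    """Hypothetical in-memory catalog of reusable tools."""

    def __init__(self, tools: list[Tool]) -> None:
        self._tools = tools

    def find_tools(self, query: str, limit: int = 3) -> list[Tool]:
        """Return up to `limit` tools whose descriptions share words with the query."""
        query_words = _words(query)
        scored = []
        for tool in self._tools:
            score = len(query_words & _words(tool.description))
            if score > 0:
                scored.append((score, tool))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [tool for _, tool in scored[:limit]]


catalog = ToolCatalog([
    Tool("search_flights", "Find flights for a trip to a destination"),
    Tool("find_pet_friendly_hotels", "Find hotels that allow a dog or other pets"),
    Tool("create_invoice", "Generate billing invoice for customer"),
])
relevant = catalog.find_tools(query="Plan a trip to Maui")
```

Only the tools returned in `relevant` would be listed in the prompt, which keeps unrelated tools such as the invoicing one out of the context. Keyword matching also misses the hotel tool here, which is why well-written tool descriptions and semantic search matter in practice.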