Retrieval Augmented Generation (RAG) Explained
- Priank Ravichandar
- Jan 20, 2025
- 3 min read
- Updated: Oct 23, 2025
An overview of Retrieval Augmented Generation (RAG) for product team members and stakeholders.

Key Takeaways
- RAG retrieves relevant information from a knowledge base to enhance LLMs’ responses.
- It overcomes context window limits by passing only the most relevant context to LLMs.
- It delivers more accurate answers, scales to large data sources, and reduces costs and latency.
- It requires preprocessing, robust retrieval mechanisms, and initial setup, which can be tedious.
- It should be used when context-aware responses, cost or performance efficiency, scalability, or access to large bodies of data is required.
- While traditional RAG retrieves relevant context only once, agentic RAG iteratively refines the retrieved context, leading to improved accuracy.
Overview
Note: Generated with NotebookLM using content from Hugging Face | Building Agentic RAG Systems, Anthropic | Claude With The Anthropic API: Retrieval Augmented Generation, and Nvidia | Traditional RAG vs. Agentic RAG.
FAQs
What is Retrieval Augmented Generation (RAG)?
RAG enables AI to retrieve relevant information from a knowledge base before responding.
Retrieval Augmented Generation (RAG) is a powerful technique designed to enhance how large language models (LLMs) work with large information repositories (documents, knowledge bases, etc.). Instead of attempting to include all the information in a single prompt, RAG intelligently breaks information into smaller, manageable chunks. When a user asks a question, RAG identifies the chunks most relevant to that question and includes only those pertinent pieces in the prompt sent to the LLM. This allows the LLM to focus on the most relevant content, process requests more efficiently, and provide more accurate and precise responses.
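To make the idea concrete, here is a minimal Python sketch of that final step: combining a user's question with only the chunks judged relevant. The function name, chunk text, and prompt wording are illustrative assumptions, not any particular library's API; how the relevant chunks are found is covered under "How does RAG work?" below.

```python
# Minimal sketch: build a focused prompt from only the retrieved chunks,
# instead of stuffing the entire knowledge base into the prompt.
# All names and text here are illustrative.

def build_rag_prompt(question: str, relevant_chunks: list[str]) -> str:
    """Combine the user's question with only the chunks retrieved for it."""
    context = "\n\n".join(relevant_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Example: two chunks deemed relevant out of a much larger knowledge base.
retrieved_chunks = [
    "Refunds are issued to the original payment method within 5 business days.",
    "Our return policy allows returns within 30 days of purchase.",
]
print(build_rag_prompt("How long do refunds take?", retrieved_chunks))
```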
What problem does RAG solve?
RAG addresses the challenge of providing relevant information to the LLM.
Most LLMs have a limited context window, which means they can only process a certain amount of information at once. This can be a challenge when we need to provide a large amount of context to the LLM. RAG helps us solve this problem. It breaks down information into smaller, manageable chunks, retrieves only the most relevant pieces, and uses those pieces of information to generate a response to the user’s query. This addresses the critical limitations of traditional prompting, such as prompt length limits, reduced effectiveness with long prompts, and increased processing costs and time.
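As a rough illustration of this preprocessing step, the sketch below splits a long document into smaller, overlapping chunks. The chunk size and overlap are arbitrary example values, and real systems often chunk by sentences, paragraphs, or tokens rather than characters.

```python
# Illustrative chunking sketch: split a long document into smaller,
# overlapping pieces so only the relevant ones need to be sent to the LLM.
# The sizes below are arbitrary example values, not recommendations.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with a small overlap,
    so content cut at a boundary still appears intact in one chunk."""
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

long_document = "RAG breaks large documents into manageable chunks. " * 200
print(f"{len(chunk_text(long_document))} chunks created from {len(long_document)} characters")
```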
How does RAG work?
RAG employs a smarter, multi-step approach (a code sketch follows the list):
1. Preprocessing (Chunking): Information is broken down into smaller, manageable "chunks".
2. Embedding Generation: Each chunk is converted into a numerical representation called an "embedding," which captures its meaning and context. These embeddings are stored in a vector database.
3. Query Processing: When a user asks a question, their query is also converted into an embedding.
4. Retrieval (Semantic Search): The user's query embedding is compared against the embeddings in the database to identify the most semantically similar chunks (the most relevant chunks).
5. Augmented Generation: The most relevant retrieved chunks are combined with the user's original question to form a new, focused prompt, which is then sent to the LLM to generate a more accurate and context-aware response.
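Putting the steps together, here is a small end-to-end sketch in Python. To keep it runnable without any dependencies, the "embedding" is a toy bag-of-words vector and the vector database is a plain list; in a real system these would be an embedding model and a proper vector store, and the final prompt would be sent to an LLM.

```python
# End-to-end sketch of the steps above. The "embedding" here is a toy
# word-count vector standing in for a real embedding model, and the vector
# database is just a Python list; the final LLM call is left as a comment.

from collections import Counter
import math
import re

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count (a real system calls an embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Steps 1-2. Preprocessing and embedding generation: chunk the knowledge base
# and store an embedding for each chunk.
chunks = [
    "Refunds are issued to the original payment method within 5 business days.",
    "Shipping is free on orders over $50 within the continental US.",
    "Customer support is available Monday through Friday, 9am to 5pm.",
]
chunk_embeddings = [embed(chunk) for chunk in chunks]

# Step 3. Query processing: convert the user's question into an embedding too.
question = "How many days until a refund is issued?"
query_embedding = embed(question)

# Step 4. Retrieval (semantic search): rank chunks by similarity to the query.
ranked = sorted(zip(chunks, chunk_embeddings),
                key=lambda pair: similarity(query_embedding, pair[1]),
                reverse=True)
top_chunk = ranked[0][0]

# Step 5. Augmented generation: combine the best chunk(s) with the question
# into a focused prompt, which would then be sent to the LLM.
prompt = f"Context:\n{top_chunk}\n\nQuestion: {question}"
print(prompt)
```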



What are the benefits of RAG?
- It helps models focus on the most relevant content, leading to more accurate responses.
- It scales up to handle large data sources (e.g., multiple lengthy documents).
- It leads to smaller, more focused prompts, which cost less and run faster.
What are the challenges with RAG?
- It requires a preprocessing step to chunk documents.
- It needs a robust search mechanism to find "relevant" chunks.
- Retrieved chunks may not always contain all the context the LLM needs.
- It requires more upfront work compared to simply stuffing everything into a prompt.
When should we use RAG?
RAG systems are especially helpful in the following scenarios:
- You are dealing with complex queries that require context-aware responses.
- You need to optimize the cost, performance, or accuracy of the AI application.
- You need to work with large documents, collections of documents, or knowledge bases.
- You need to generate responses with specific, up-to-date, or proprietary data that was not part of the LLM's initial training data.
What’s the difference between traditional RAG and agentic RAG?
Traditional RAG systems use an LLM to answer queries based on retrieved data. They rely on a single retrieval step to gather information relevant to a user’s query, which can lead to vital details being overlooked. Agentic RAG systems intelligently control both retrieval and generation processes, improving efficiency and accuracy. Unlike traditional RAG, agentic RAG systems can autonomously formulate search queries, critique retrieved results, and conduct multiple retrieval attempts to generate more tailored and comprehensive output.
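As a rough sketch of the difference, the loop below shows the agentic pattern: retrieve, let the model critique what came back, and retry with a refined query before answering. The `retrieve` and `llm` callables are placeholders standing in for a real retrieval step (like the semantic search sketch above) and a real LLM call; the prompts and the "SUFFICIENT" convention are assumptions made for illustration.

```python
# Sketch of an agentic RAG loop: formulate a search query, retrieve, critique
# the results, and retry with a refined query before generating the answer.
# `retrieve` and `llm` are placeholder callables, not a specific library's API.

def agentic_rag(question: str, retrieve, llm, max_attempts: int = 3) -> str:
    """Iteratively refine retrieval until the model judges the context
    sufficient (or attempts run out), then answer from that context."""
    search_query = question
    context = []
    for _ in range(max_attempts):
        context = retrieve(search_query)  # retrieval attempt with the current query
        verdict = llm(                    # critique step: is this context enough?
            "Context:\n" + "\n".join(context) + "\n\n"
            f"Can this context fully answer the question: {question}?\n"
            "Reply SUFFICIENT, or reply with a better search query."
        )
        if verdict.strip().upper().startswith("SUFFICIENT"):
            break
        search_query = verdict.strip()    # use the refined query on the next attempt
    return llm("Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}")
```

In contrast, traditional RAG performs the retrieval step once and answers directly, which is simpler and cheaper but cannot recover when the first retrieval misses important details.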

