Retrieval-Augmented Generation, commonly abbreviated RAG, is one of the most widely adopted patterns in enterprise AI systems. It addresses a fundamental limitation of large language models: they know only what they were trained on, and their knowledge has a cutoff date. RAG gives LLMs access to fresh, organization-specific, or domain-specific information at inference time — without retraining the model.
RAG connects a retrieval system to a language model, so the model can ground its responses in retrieved facts rather than relying solely on trained parameters.
A raw LLM generates responses based on patterns learned from its pre-training corpus. This creates several problems in production:
- Stale knowledge: the model knows nothing after its training cutoff date.
- Hallucination: when the model lacks a fact, it may produce a plausible but incorrect answer.
- No private data: organization-specific or domain-specific documents were never in the training corpus.
- Weak attribution: responses cannot cite the sources they are based on.
RAG addresses all of these by retrieving relevant documents at query time and providing them to the model as context for generating a response.
A RAG system has two distinct phases: indexing (done ahead of time) and retrieval + generation (done at query time).
Indexing phase:
Source Documents (PDFs, web pages, databases, etc.)
↓
Chunking (split documents into manageable pieces)
↓
Embedding Model (convert chunks to dense vectors)
↓
Vector Store (index and store embeddings for similarity search)
Query phase:
User Query
↓
Query Embedding (same embedding model as indexing)
↓
Vector Similarity Search (retrieve top-k most similar chunks)
↓
Optional: Reranking (refine and reorder retrieved chunks)
↓
Prompt Assembly (combine query + retrieved context + system prompt)
↓
LLM Generation (produce grounded response)
↓
Response to User
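The prompt-assembly step in the diagram above can be sketched as follows. This is a minimal illustration; the function name, template wording, and citation-style numbering are assumptions, not a fixed convention:

```python
# Hypothetical prompt-assembly step: combine system instructions,
# retrieved chunks, and the user query into a single LLM prompt.
def assemble_prompt(query, chunks, system_prompt="Answer using only the context below."):
    # Number each chunk so the model (and the user) can reference sources.
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = assemble_prompt(
    "What is RAG?",
    ["RAG grounds LLM answers in retrieved documents."],
)
```

The assembled string would then be sent to the generation model as the final step of the pipeline.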
Embedding models are the component that converts text to vectors. The quality of embeddings directly determines retrieval accuracy — if the embedding model does not capture the semantic similarity between a query and a relevant document, that document will not be retrieved.
Key characteristics to evaluate in embedding models:
- Embedding dimension, which affects storage cost and search speed.
- Maximum input length, which constrains chunk size.
- Retrieval accuracy on relevant benchmarks and in-domain data.
- Domain and language coverage.
- Inference latency and cost, since every query and every chunk must be embedded.
NVIDIA provides embedding models as part of its NeMo Retriever NIM offering, including specialized enterprise-grade embedding models for retrieval tasks.
A vector database is optimized for storing and querying embedding vectors at scale. The core operation is approximate nearest neighbor (ANN) search: given a query vector, find the stored vectors that are most similar.
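The core similarity-search operation can be illustrated with an exact brute-force scan over cosine similarity; ANN indexes exist precisely to approximate this result without comparing against every stored vector. A minimal pure-Python sketch (toy vectors, illustrative names):

```python
import math

def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=2):
    # index: list of (doc_id, vector) pairs.
    # Exact linear scan; an ANN index approximates this ranking at scale.
    scored = sorted(index, key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```

For example, with a toy two-dimensional index, the query vector retrieves the documents closest to it in direction, regardless of magnitude.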
Common vector databases used in RAG systems include Milvus, Pinecone, Weaviate, Qdrant, and pgvector (a PostgreSQL extension), along with libraries such as FAISS for embedding search without a full database.
Choosing a vector database involves trade-offs in scalability, operational complexity, hybrid search support (combining vector and keyword search), metadata filtering, and NVIDIA GPU acceleration support.
Embedding-similarity retrieval returns semantically related documents, but similarity in embedding space does not always align with relevance to the specific query. A reranker model takes the retrieved candidates and scores each one against the original query more precisely.
Rerankers are typically cross-encoder models (they process the query and document together, not independently), which yields higher accuracy than the embedding-based retrieval stage alone. The trade-off is cost: cross-encoders are slower than embedding lookups, so they are applied only to the top-k retrieved candidates.
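The retrieve-then-rerank pattern can be sketched as a generic second-stage sort. The scoring function below is a toy term-overlap stand-in, not a real cross-encoder; in practice it would be replaced by a model that scores each (query, document) pair jointly:

```python
def rerank(query, candidates, score_fn, top_n=3):
    # score_fn stands in for a cross-encoder that scores (query, doc) jointly.
    scored = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return scored[:top_n]

def toy_score(query, doc):
    # Toy stand-in: fraction of query terms that appear in the document.
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms)
```

Because only the already-retrieved top-k candidates are rescored, the expensive per-pair scoring stays bounded regardless of corpus size.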
NVIDIA provides reranker models as NeMo Retriever NIMs for use in production RAG pipelines.
Chunking is more nuanced than it first appears. Common strategies:
- Fixed-size chunks with overlap, so content split at a boundary still appears intact in at least one chunk.
- Sentence- or paragraph-boundary chunking, which keeps natural units of meaning together.
- Structure-aware chunking that follows document sections and headings.
- Semantic chunking, which splits where the topic shifts.
There is no universally best chunking strategy — the right approach depends on document type, query patterns, and the embedding model’s optimal input length.
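The simplest of the strategies above, fixed-size chunking with overlap, can be sketched in a few lines; the character-based sizes here are illustrative, and production systems often count tokens instead:

```python
def chunk_text(text, size=200, overlap=50):
    # Fixed-size character chunks with overlap, so a sentence split at a
    # chunk boundary still appears whole in at least one chunk.
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

Each chunk starts `size - overlap` characters after the previous one, so consecutive chunks share `overlap` characters.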
Pure vector search misses exact keyword matches that do not appear as semantically similar in embedding space — for example, rare terms, codes, identifiers, or specialized vocabulary. Hybrid search combines:
- Dense vector search, which captures semantic similarity.
- Sparse keyword search (such as BM25), which matches exact terms.
The results are merged with a score fusion strategy (such as Reciprocal Rank Fusion). Hybrid search consistently outperforms pure vector search for enterprise knowledge base queries because real user queries often contain both semantic intent and specific terms.
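Reciprocal Rank Fusion merges ranked lists by summing, for each document, the reciprocal of its rank in each list (offset by a constant, conventionally 60, to dampen the influence of top ranks). A minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: ranked doc-id lists, e.g. one from vector search, one from BM25.
    # Each list contributes 1 / (k + rank) to a document's fused score.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Return doc ids ordered by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)
```

A document that ranks well in both lists (like "a" below) outscores documents that appear in only one.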
The most common RAG failure: the correct information exists in the knowledge base but is not retrieved. Causes include:
- Chunking that splits the answer across chunk boundaries.
- Vocabulary mismatch between how users phrase queries and how documents phrase the content.
- An embedding model that does not capture the domain's semantics.
- A top-k setting too small to surface the relevant chunk.
Retrieving too many chunks (or chunks that are too long) fills the LLM’s context window with irrelevant material. This dilutes the signal and increases generation cost. Good reranking and top-k selection are important mitigations.
Some LLMs, especially smaller ones, do not reliably attend to long retrieved contexts. Prompt engineering techniques (placing context before the query, using explicit grounding instructions) and model selection both affect this behavior.
Each RAG query involves at least one embedding computation, one vector search, one LLM call, and optionally a reranker pass. This adds latency compared to a direct LLM call. Latency budgets need to be designed with the full pipeline in mind.
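A latency budget is simply the sum of the per-stage latencies; the numbers below are illustrative assumptions, not measurements, since real figures depend on models, hardware, and network:

```python
# Illustrative per-stage latencies in milliseconds (assumed values).
stages = {
    "query_embedding": 15,
    "vector_search": 10,
    "reranking": 40,       # optional cross-encoder pass over top-k candidates
    "llm_generation": 800, # typically dominates the budget
}

total_ms = sum(stages.values())
```

Even in this rough sketch, generation dominates, but the retrieval stages still add fixed overhead on every query that a direct LLM call would not incur.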
NVIDIA provides several components for building production RAG systems:
- NeMo Retriever embedding NIMs for converting text to vectors.
- NeMo Retriever reranking NIMs for refining retrieved candidates.
- NIM microservices for serving the generation LLM.
- GPU-accelerated vector search support for the retrieval stage.
The NeMo Retriever NIMs are specifically designed for enterprise retrieval pipelines, with models trained and evaluated for retrieval-specific tasks rather than general text similarity.
For complex or multi-part queries, a pre-processing step rewrites or decomposes the original query before retrieval. This improves recall for questions that have multiple sub-questions or that require clarification of ambiguous terms.
Instead of embedding the raw query, the LLM generates a hypothetical answer to the query, and the embedding of that hypothetical answer is used for retrieval; this technique is known as HyDE (Hypothetical Document Embeddings). This often retrieves documents more similar to what a correct answer would look like, improving retrieval quality for abstractive questions.
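The hypothetical-answer technique can be sketched with stand-in components; both `generate_hypothetical_answer` and `embed` below are placeholders (a real system would call an LLM and an embedding model), so only the control flow is meaningful:

```python
def generate_hypothetical_answer(query):
    # Placeholder: a real system would call an LLM here.
    return f"A plausible answer to the question: {query}"

def embed(text):
    # Placeholder embedding: character-frequency vector over a-z.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def hypothetical_answer_query_vector(query):
    # Embed the hypothetical answer instead of the raw query,
    # then use this vector for the vector-store similarity search.
    return embed(generate_hypothetical_answer(query))
```

The key point is the indirection: the vector sent to the vector store is derived from answer-shaped text, which tends to sit closer in embedding space to the documents that contain the real answer.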
In agentic AI systems, RAG retrieval is one tool among several that an LLM agent can invoke. The agent decides when to retrieve, what to retrieve, and how to combine retrieved information with other tool outputs. This is more flexible than fixed-pipeline RAG but requires careful system design.
Evaluation of RAG pipelines is typically split into retrieval evaluation and generation evaluation:
Retrieval evaluation: does the retriever surface the relevant chunks? Typical metrics include recall@k, precision@k, mean reciprocal rank (MRR), and NDCG against a labeled set of query-document pairs.
Generation evaluation: given the retrieved context, is the response faithful to it (no hallucination), relevant to the question, and correct?
Frameworks like RAGAS provide automated metrics for RAG evaluation. Human evaluation remains important for nuanced quality assessment.
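Two of the standard retrieval metrics are small enough to implement directly; a minimal sketch (function names are illustrative):

```python
def recall_at_k(retrieved, relevant, k):
    # Fraction of relevant documents that appear in the top-k retrieved.
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def mrr(retrieved, relevant):
    # Reciprocal rank of the first relevant document; 0 if none retrieved.
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```

Computed over a labeled evaluation set, these metrics isolate retrieval quality from generation quality, which makes it possible to tune chunking, embeddings, and reranking independently of the LLM.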