What Is RAG?
RAG (Retrieval-Augmented Generation) is a pattern where, instead of asking a language model to answer from its training weights alone, you first retrieve relevant text from your own data and pass it into the prompt as context. The model then generates an answer grounded in those retrieved passages. The retrieval step is what separates RAG from a plain LLM call: the model sees the facts at query time rather than recalling them from training.
This matters because an LLM’s weights are frozen at training time. They don’t know your internal docs, your latest pricing, or anything published after the cutoff. RAG closes that gap without retraining the model — you change the data the model reads, not the model itself.
The Retrieval Pipeline
A RAG system has two phases. The first runs once (or whenever data changes); the second runs on every query.
Indexing (offline):
documents → chunk → embed → store in vector DB
Querying (per request):
user query → embed query → retrieve top-k similar chunks (vector search) → augment prompt (chunks + query) → generate (LLM)
Walking through the five steps:
- Chunk. Split source documents into passages — typically 200 to 800 tokens, often with overlap so a sentence isn’t cut mid-thought. Chunk size is a real tuning knob: too large and you dilute relevance, too small and you lose context.
- Embed. Convert each chunk into a vector using an embedding model (for example text-embedding-3-large or an open model like bge). Semantically similar text lands close together in vector space.
- Retrieve. At query time, embed the user’s question and run a similarity search (cosine or dot product) against the index to pull the top-k closest chunks. Many systems add a reranker or hybrid keyword+vector search here to improve precision.
- Augment. Insert the retrieved chunks into the prompt, usually with an instruction like “Answer using only the context below; if it’s not there, say so.”
- Generate. The LLM produces an answer from the augmented prompt and, ideally, cites which chunks it used.
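The first four steps above can be sketched in a few dozen lines of plain Python. This is a minimal illustration, not a production setup: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, chunk sizes are counted in words rather than tokens, and all function names are assumptions made for the example.

```python
import hashlib
import math
import re


def chunk_words(text, size=50, overlap=10):
    """Split text into overlapping word windows (words stand in for tokens)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text, dim=256):
    """Hypothetical stand-in for an embedding model: hashed bag-of-words,
    L2-normalized so a dot product equals cosine similarity."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.sha256(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query (cosine similarity)."""
    q = toy_embed(query)
    scored = []
    for chunk in chunks:
        c = toy_embed(chunk)
        scored.append((sum(a * b for a, b in zip(q, c)), chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


def augment(query, retrieved):
    """Build the prompt that would be sent to the LLM in the generate step."""
    context = "\n\n".join(retrieved)
    return (
        "Answer using only the context below; if it's not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


docs = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "The API rate limit is 100 requests per minute.",
]
question = "What is the API rate limit?"
prompt = augment(question, top_k(question, docs, k=1))
```

In a real system, `toy_embed` would be replaced by a call to an embedding model and the linear scan in `top_k` by a vector store query; the shape of the pipeline stays the same.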
What You Need to Build One
| Component | Purpose | Common choices |
|---|---|---|
| Embedding model | Text → vector | OpenAI embeddings, Cohere, bge, e5 |
| Vector store | Store and search vectors | pgvector, Pinecone, Qdrant, Weaviate |
| Retriever | Find top-k relevant chunks | Vector search, hybrid (BM25 + vector) |
| Reranker | Reorder candidates by relevance | Cohere Rerank, cross-encoder models |
| LLM | Generate the final answer | GPT, Claude, Gemini, Llama, Mistral |
| Orchestration | Wire the steps together | LangChain, LlamaIndex, or custom code |
For many teams pgvector on an existing Postgres database is enough — you don’t need a dedicated vector database until scale or latency demands it.
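The hybrid retriever row in the table combines a keyword ranking with a vector ranking. One common way to merge the two lists is reciprocal rank fusion (RRF), sketched below. RRF is a technique brought in here for illustration, not something this article prescribes; the function name, the chunk ids, and the constant `k=60` (a widely used default) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each id scores 1 / (k + rank) per list it appears in; scores are summed.
    Ties are broken by id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from a BM25 search and a vector search.
keyword_hits = ["c2", "c1", "c3"]
vector_hits = ["c1", "c4", "c2"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Because RRF only looks at ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.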
Why Use RAG
- Current and private data. The model can answer about documents it was never trained on. Update the index, and answers update — no retraining.
- Reduced fabrication. Grounding answers in retrieved text cuts down on invented facts, especially when you instruct the model to refuse if the context doesn’t contain the answer.
- Citations. Because you know which chunks were retrieved, you can show sources, which is often a hard requirement for internal tools and support.
- Cost and control. Swapping or updating data is cheap compared to fine-tuning, and you keep proprietary data out of model weights.
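The citation and refusal points above can be sketched as a prompt builder that numbers each retrieved chunk, plus a helper that maps the `[n]` markers in an answer back to source names. The marker format, the refusal wording, and both function names are assumptions for illustration; no LLM is called here, and the sample answer is hand-written.

```python
import re

REFUSAL = "I don't know based on the provided context."


def build_cited_prompt(question, chunks):
    """chunks is a list of (source_name, text) pairs from the retriever."""
    numbered = "\n".join(
        f"[{i}] ({source}) {text}"
        for i, (source, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the numbered context below and cite sources "
        "as [n].\n"
        f'If the context does not contain the answer, reply: "{REFUSAL}"\n\n'
        f"Context:\n{numbered}\n\nQuestion: {question}"
    )


def cited_sources(answer, chunks):
    """Return source names for valid [n] markers, in order of first use.

    Markers that point outside the retrieved set are ignored, which is a
    cheap check against citations the model made up.
    """
    seen = []
    for marker in re.findall(r"\[(\d+)\]", answer):
        index = int(marker)
        if 1 <= index <= len(chunks):
            source = chunks[index - 1][0]
            if source not in seen:
                seen.append(source)
    return seen


chunks = [
    ("pricing.md", "The Pro plan costs $20 per month."),
    ("faq.md", "Refunds are issued within 14 days."),
]
prompt = build_cited_prompt("How much is the Pro plan?", chunks)
sources = cited_sources("The Pro plan is $20 per month [1]. See also [7].", chunks)
```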
Where RAG Falls Short
RAG is not a cure-all. Retrieval quality caps answer quality: if the right chunk isn’t retrieved, the model can’t use it, and you get a confident answer built on the wrong context. Chunking strategy, embedding choice, and reranking all need tuning against real queries. Long or multi-hop questions that require synthesizing many documents strain top-k retrieval, because the evidence is spread across more chunks than a single search returns. Stuffing more chunks into the prompt raises token cost and can bury the relevant passage. RAG also doesn’t teach the model new behavior or output format; it only changes what facts the model sees.
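Since retrieval quality caps answer quality, it helps to measure retrieval on its own before looking at generated answers. The sketch below computes two standard metrics, recall@k and mean reciprocal rank (MRR), over a small labeled set. The data layout (a list of retrieved ids paired with a set of relevant ids per query) and the function names are assumptions for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=3):
    """runs is a non-empty list of (retrieved_ids, relevant_ids), one per query."""
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


# Two hypothetical labeled queries: one hit at rank 2, one complete miss.
runs = [
    (["a", "b", "c"], {"b"}),
    (["d", "e", "f"], {"x"}),
]
metrics = evaluate(runs, k=3)
```

Tracking these numbers while changing chunk size, embedding model, or reranker shows whether a change actually moved retrieval, independent of how the LLM phrases its answer.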
RAG vs Fine-Tuning
These solve different problems. RAG changes what the model knows at query time; fine-tuning changes how the model behaves by adjusting its weights on example data. Use RAG when the answer depends on a body of facts that changes or that the model never saw. Use fine-tuning when you need a consistent tone, a strict output format, or a task the base model handles poorly — and the two are often combined. For a side-by-side breakdown of cost, freshness, accuracy, and effort, see our RAG vs fine-tuning comparison.
Building a RAG System
The hard parts of a production RAG system are rarely the LLM call — they’re chunking strategy, retrieval precision, evaluation, and keeping the index in sync with changing source data. We build RAG pipelines on your own documents, measure retrieval quality against real queries, and wire in citations and refusal behavior so answers stay grounded. Tell us your data sources and the questions users need answered, and we’ll scope the embedding model, vector store, and retrieval setup end to end.