RAG & Knowledge Bases
Retrieval-augmented generation: how it works, why it beats fine-tuning for company knowledge, and vector databases on AWS.
Retrieval-Augmented Generation (RAG) addresses two LLM weaknesses at once, hallucination and stale knowledge, by fetching relevant, current documents *at question time* and giving them to the model as context. The model's answers are grounded in your data and can cite their sources, which reduces (but does not eliminate) hallucination. No model training is required.
How RAG works
- Ingest: your documents are split into chunks; an embedding model converts each chunk to a vector, stored in a vector database.
- Retrieve: a user's question is embedded too; the database returns the chunks whose vectors are closest (most semantically similar).
- Augment: the retrieved chunks are inserted into the prompt as context.
- Generate: the LLM answers using that context, often with citations.
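The first three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the document names and policy text are made up for the example. The Generate step is left out; in practice the resulting prompt would be sent to an LLM.

```python
import math
import re
from collections import Counter

def chunk(text, max_words=40):
    """Split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy stand-in for an embedding model (assumption): word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def ingest(docs):
    """Ingest: chunk each document and store (source, chunk, vector) rows."""
    return [(source, c, embed(c)) for source, text in docs.items() for c in chunk(text)]

def retrieve(index, question, k=2):
    """Retrieve: embed the question and return the k closest chunks."""
    q = embed(question)
    return sorted(index, key=lambda row: cosine(q, row[2]), reverse=True)[:k]

def augment(question, hits):
    """Augment: insert the retrieved chunks, tagged by source, into the prompt."""
    context = "\n".join(f"[{source}] {text}" for source, text, _ in hits)
    return (
        "Answer using only the context below and cite sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Hypothetical documents, invented for this sketch.
docs = {
    "leave-policy.txt": "Employees accrue 20 days of paid vacation leave per year.",
    "expense-policy.txt": "Travel expenses must be submitted within 30 days of the trip.",
}
index = ingest(docs)
question = "How many vacation days do employees get?"
hits = retrieve(index, question, k=1)
prompt = augment(question, hits)
print(prompt)
```

Tagging each chunk with its source file is what lets the model cite where an answer came from.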
RAG is an open-book exam. Instead of hoping the student memorized everything (fine-tuning), you hand them the right pages of the textbook (retrieval) as they answer each question.
RAG on AWS
Fully managed RAG: point it at S3 documents and it handles chunking, embeddings, vector storage, retrieval, and citations.
Popular vector database option for semantic search.
PostgreSQL as a vector store.
Other AWS options with vector search support.
Managed intelligent search that can also feed retrieval for GenAI apps.
Choose RAG when: answers must come from company/current data, must include citations, data changes frequently, or hallucinations must be reduced, all without training. Choose fine-tuning instead when you need new *behavior/style/format*, not new *facts*. Frequently changing knowledge in a fine-tuned model means constant, expensive retraining; in RAG it's just a document update.
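To make the "just a document update" point concrete, here is a small sketch under simplifying assumptions: `PolicyIndex` is a hypothetical in-memory store, keyword overlap stands in for vector similarity, and the policy text is invented. The idea it illustrates is that replacing a stored document changes what gets retrieved immediately, with no retraining involved.

```python
import re

def tokens(text):
    """Lowercased word set; a crude stand-in for an embedding (assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class PolicyIndex:
    """Hypothetical in-memory document store keyed by source name."""

    def __init__(self):
        self.docs = {}

    def upsert(self, source, text):
        # Updating knowledge is just overwriting the stored document.
        self.docs[source] = text

    def top(self, question):
        # Return the (source, text) pair sharing the most words with the question.
        q = tokens(question)
        return max(self.docs.items(), key=lambda item: len(q & tokens(item[1])))

question = "How many days per week can staff work remotely?"

index = PolicyIndex()
index.upsert("remote-work.txt", "Staff may work remotely two days per week.")
before = index.top(question)

# The policy changes: re-ingest the document, nothing else.
index.upsert("remote-work.txt", "Staff may work remotely three days per week.")
after = index.top(question)

print(before[1])
print(after[1])
```

A fine-tuned model would need another training run to reflect the same change.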
A chatbot must answer questions using the company's internal HR policies, which change monthly, and cite the source document. Which approach fits BEST?